Link Analysis: An Information Science Approach
Recent and Forthcoming Volumes
Leo Egghe
Power Laws in the Information Production Process: Lotkaian Informetrics
Donald Case
Looking for Information
Matthew Locke Saxton and John V. Richardson
Understanding Reference Transactions: Turning Art Into a Science
Robert M. Hayes
Models for Library Management, Decision-Making, and Planning
Charles T. Meadow, Bert R. Boyce, and Donald H. Kraft
Text Information Retrieval Systems, Second Edition
Charles T. Meadow
Text Information Retrieval Systems
A.J. Meadows
Communicating Research
V. Frants, J. Shapiro, & V. Voiskunskii
Automated Information Retrieval: Theory and Methods
Harold Sackman
Biomedical Information Technology: Global Social Responsibilities for the
Democratic Age
Peter Clayton
Implementation of Organizational Innovation: Studies of Academic and Research
Libraries
Bryce L. Allen
Information Tasks: Toward a User-Centered Approach to Information Systems
Library and Information Science
Series Editor: Bert R. Boyce
School of Library & Information Science
Louisiana State University, Baton Rouge
Mike Thelwall
2004
ELSEVIER ACADEMIC PRESS
Amsterdam - Boston - Heidelberg - London - New York - Oxford
Paris - San Diego - San Francisco - Singapore - Sydney - Tokyo
ELSEVIER B.V., Radarweg 29, P.O. Box 211, 1000 AE Amsterdam, The Netherlands
ELSEVIER Inc., 525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
ELSEVIER Ltd., The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
ELSEVIER Ltd., 84 Theobalds Road, London WC1X 8RR, UK
© 2004 Elsevier Inc. All rights reserved.
This work is protected under copyright by Elsevier Inc., and the following terms and conditions apply to its use:
Photocopying
Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission
of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying,
copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available
for educational institutions that wish to make photocopies for non-profit educational classroom use.
Permissions may be sought directly from Elsevier's Rights Department in Oxford, UK: phone (+44) 1865 843830, fax
(+44) 1865 853333, email: [email protected]. Requests may also be completed on-line via the Elsevier
homepage (http://www.elsevier.com/locate/permissions).
In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222
Rosewood Drive, Danvers, MA 01923, USA; phone: (+1) (978) 7508400, fax: (+1) (978) 7504744, and in the UK through
the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK;
phone: (+44) 20 7631 5555; fax: (+44) 20 7631 5500. Other countries may have a local reprographic rights agency for
payments.
Derivative Works
Tables of contents may be reproduced for internal circulation, but permission of the Publisher is required for external
resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including
compilations and translations.
Electronic Storage or Usage
Permission of the Publisher is required to store or use electronically any material contained in this work, including any
chapter or part of a chapter.
Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the
Publisher.
Address permissions requests to: Elsevier's Rights Department, at the fax and e-mail addresses noted above.
Notice
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products
liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained
in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of
diagnoses and drug dosages should be made.
First edition 2004
ISBN: 0-12-088553-0
The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).
Printed in The Netherlands.
Contents
Part I: Theory
1 Introduction
Objectives
Link analysis
Historical overview
What is the information science approach to link analysis?
Contents and structure
Key terminology
Summary
Further reading
References
2 Web crawlers and search engines
Objectives
Introduction
Web crawlers
Finding pages
Content crawling vs. URL crawling
Content crawling
Obscured links
Depth and other arbitrary limitations
Automatically generated pages
Ethical issues and robots.txt
The web page
Web crawling summary
Search engines
Known biases
Search engine ranking
The Internet Archive
Summary
Further reading
References
3 The theoretical perspective for link counting
Objectives
Introduction
The theoretical perspective for link counting
Anomalies
Manual filtering and banned lists
Alternative Document Models
Web sites and web documents
ADMs and standard ADM counting
ADM range counting models
Choosing link counting strategies
Summary
Further reading
References
4 Interpreting link counts: Random samples and correlations
Objectives
Introduction
Interpreting link counts
The pilot feasibility and validity study
Full-scale random sampling
Confidence limits for categories
Correlation testing
Literature review
Summary
Further reading
References
Part II: Web structure
5 Link structures in the web graph
Objectives
Introduction
Power laws in the web
Models of web growth
Link topologies
Power laws and link topologies in academic webs
Summary
Further reading
References
6 The content structure of the web
Objectives
Introduction
The topic structure of the web
A link-content web growth model
Link text
The subject structure of academic webs
Colinks
Summary
Further reading
References
Part III: Academic links
7 Universities: Link types
Objectives
Introduction
Citation analysis
The role of a university web site
National systems of university web sites
Page types
Link types
Summary
Further reading
References
8 Universities: Link models
Objectives
Introduction
The relationship between inlinks and research
Academic linking: Quality vs. quantity
Alternative logical linking models
Mathematical models
The influence of geography
Regional groupings
Summary
References
9 Universities: International links
Objectives
Introduction
National vs. international links
International linking comparisons
Linguistic influences
Summary
Further reading
References
10 Departments and subjects
Objectives
Introduction
Departmental web sites
Disciplinary differences in link types
Issues of scale and correlation tests
Country
Subject
Outcome
Geographic and international factors
Summary
Further reading
References
11 Journals and articles
Objectives
Introduction
Journal Impact Factors
Journal web sites
Journal web site inlinks: Issues
Journal web site inlinks: Case study
Types of links in journal articles
Digital library links
Combined link and log file analysis
Related research topics
Summary
Further reading
References
Part IV: Applications
12 Search engines and web design
Objectives
Introduction
Link structures and crawler coverage
Text in web sites and the Vector Space Model
The PageRank algorithm
Case study: PageRank calculations for a gateway site
HITS
HITS worked example
Summary: Web site design for PageRank and HITS
Further reading
Appendix: the Vector Space Model
References
13 A health check for Spanish universities
Objective
Introduction
Research questions
Methods
Results and discussion
Conclusion
References
14 Personal web pages linking to universities
Objectives
Introduction
Web publishing and personal home pages
Research questions
Methods
Data collection
Data analysis
Results
ISP bias test
ADM fitting
Correlations between links and research ratings
A comparison of university and home page link sources
Individual page categorizations
Conclusion
Meta-conclusions
Acknowledgement
References
15 Academic networks
Objectives
Introduction
Methods
University sitemaps
National academic web maps
Subject maps
Summary
Further reading
References
16 Business web sites
Objectives
Introduction
Site coverage checks
Site indexing and ranking checks
Competitive intelligence
Case study
Center Parcs
Hoseasons
Butlins
Pontins
Haven Holidays
General queries
Summary
Further reading
References
Part V: Tools and techniques
17 Using commercial search engines and the Internet Archive
Objectives
Introduction
Checking results
Dealing with variations in results
Using multiple search engines
Using the Internet Archive
Summary
Online resources
Further reading
References
18 Personal crawlers
Objectives
Introduction
Types of personal crawler
SocSciBot
Web page retrieved
Web page qualification
Web link extraction
URLs from HTTP
Obscured or unspecified URLs
Server-generated pages
Dealing with errors
Human intervention during crawls
SocSciBot tools
Summary
Online resources
Further reading
References
19 Data cleansing
Objectives
Introduction
Overview of data cleansing techniques
Anomaly identification
TLD Spectral Analysis
Summary
Online resources
References
20 Online university link databases
Objective
Introduction
Overview of the link databases
Link structure files
The banned lists
Analyzing the data
Other link structure databases
Summary
Online resources
Further reading
Reference
21 Embedded link analysis methodologies
Objectives
Introduction
Web Sphere Analysis
Virtual ethnography
Summary
Further reading
References
22 Social Network Analysis
Objectives
Introduction
Some SNA metrics
Software
Summary
Further reading
References
23 Network visualizations
Objectives
Introduction
Network diagrams
Large network diagrams
MultiDimensional Scaling
Self-Organizing Maps
Knowledge Domain Visualisation
Summary
Online resources
References
24 Academic link indicators
Objective
Introduction
Web indicators as process indicators
Issues of size and reliability
Benchmarking indicators
Link metrics
Relational indicators
Other metrics
Summary
Further reading
References
Part VI: Summary
25 Summary
Objectives
Introduction
Information science contributions to link analysis
Other link analysis approaches
Future directions
26 Glossary
References
Appendix
A SocSciBot tutorial
Tutorial
Step 1: Installing SocSciBot, SocSciBot Tools and Cyclist
Step 2: Installing Pajek
Step 3: Crawling a first site with SocSciBot
Step 4: Crawling two more sites with SocSciBot
Step 5: Viewing basic reports about the "small test" project with SocSciBot Tools
Step 6: Viewing a network diagram with Pajek
Step 7: Viewing site diagrams with Pajek
Step 8: Using Cyclist
Summary
Index
PART I: THEORY
INTRODUCTION
OBJECTIVES
• To introduce the content and structure of the book and some key terminology.
• To outline the information science approach to link analysis.
LINK ANALYSIS
Link analysis is performed in very diverse subjects, from computer science and theoretical
physics to information science, communication studies and sociology. This is a testament both
to the importance of the web and to a widespread belief that hyperlinks between web pages
can yield useful information of one kind or another. This belief probably stems from several
related factors: the success of Google, which uses a link-based algorithm for identifying the
best pages; analogies with other phenomena, such as journal citations and social connections;
and probably also links being 'in your face' all the time, whether using the web for research,
business or recreation.
In this book, an information science approach to link analysis is set out with the
principal aim of introducing it to a new audience. This new audience will then be able to
critically evaluate existing research and develop their own research projects and methods. It is
a central belief of this book that the information science approach is widely useful to other
researchers, particularly social scientists interested in analyzing phenomena with an online
component. No attempt is made to give comprehensive coverage of all different types of link
analysis: such an enterprise would fall between the detail of the mathematics used in some
areas and the qualitative approach used in others. The information science theme of the book
has resulted in at least half of its content being related to the study of academic web use or
scholarly communication. Readers may therefore also gain additional insights into scholarly
communication.
The book seeks to answer four main questions.
• Which kinds of information can be extracted by analyzing the hyperlinks between a
set of web pages or sites?
• Which techniques should be used?
• What are the likely pitfalls of link analysis?
• How can and should a link analysis be conducted in practice?
HISTORICAL OVERVIEW
The start of published web link analysis research appears to date from 1995-1996, occurring
simultaneously in several disciplines, including computer science for search engine
development (e.g., Weiss, Velez, Sheldon et al., 1996), and mathematics for structure and
complexity analysis (e.g., Abraham, 1996). The first information scientist to publish a
discussion of the potential for transferring information science techniques to the Internet
appears to be the Brazilian Marcia J. Bossy (1995), with an article in a French online journal.
The first published information science link analysis seems to be that of Larson (1996). His
"Bibliometrics of the World Wide Web: An exploratory analysis of the intellectual structure
of cyberspace" presentation at the American Society for Information Science conference
explicitly adapted existing information science techniques from bibliometrics to the web.
Larson's objective was to assess the link structure of a topic on the web (Earth Sciences) and
the characteristics of highly linked-to documents.
Shortly following Larson's presentation, a number of other information scientists also
realized that advanced features of search engines could be used for an information
science-style link analysis. This produced Rousseau's (1997) informetric analysis of the web and
Rodríguez i Gairín's (1997) web citation analysis, the latter describing the search engine
AltaVista as the web's 'citation index'.
Two other important developments occurred in parallel with the genesis of link
analysis: the foundation of a journal and the development of a theoretical orientation for
information science web research. Almind and Ingwersen (1997) coined the term
'webometrics' for the quantitative analysis of web-related phenomena from an information
science perspective. Most webometrics research has, so far, focused on hyperlinks, although
there have also been quantitative analyses of search engine results and longitudinal
investigations into web page changes. The term 'cybermetrics' emerged at the same time as
webometrics and is almost synonymous: the difference being that cybermetrics includes
quantitative analysis of the Internet, not just the web. A key instigator of this term was Isidro
Aguillo, who founded the e-journal Cybermetrics in 1997.
Since 1997, there have been a large number of link analysis studies taking an
information science approach (Thelwall, Vaughan & Björneborn, 2005). These have
collectively produced the developed body of theory and methods that is summarized in this
book.
WHAT IS THE INFORMATION SCIENCE APPROACH TO LINK
ANALYSIS?
The information science approach to link analysis is to adopt and adapt existing information
science techniques for the meta-analysis of documents through investigating inter-document
connections. This set of existing techniques is part of two overlapping fields of study:
bibliometrics, the quantitative analysis of documents; and scientometrics, the quantitative
analysis of science and its outputs. Within the overlap of these two fields a number of
techniques for analyzing scientific publications have been developed, principally for journal
articles and patents, and using citations as the key inter-document connectors. The surface
similarity between hyperlinks and citations is that they are both directional links between
documents, often documents created by different authors. There is an extensive body of
research and theory concerning citations (e.g., Borgman & Furner, 2002) that serves as a
starting point for an information science approach to link analysis. There is a historical
parallel: citation analysis techniques have been adapted from their original information
science home of journal citations to patent citations (Oppenheim, 2000), in response to the
increasing commercialization of research.
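The idea of hyperlinks as directional connections between documents can be made concrete with a small sketch. It is only an illustration under stated assumptions: the URLs are invented, the function names `site_of` and `count_site_inlinks` are hypothetical, and treating the hostname as the "site" is a crude simplification, not a method taken from this book.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical (source page, target page) pairs; each pair is one directed link.
links = [
    ("http://www.a.example/index.html", "http://www.b.example/research.html"),
    ("http://www.a.example/staff.html", "http://www.b.example/research.html"),
    ("http://www.b.example/links.html", "http://www.a.example/index.html"),
    ("http://www.a.example/index.html", "http://www.a.example/staff.html"),
    ("http://www.c.example/home.html", "http://www.b.example/index.html"),
]


def site_of(url):
    """Return the hostname, used here as a rough stand-in for 'site' (an assumption)."""
    return urlsplit(url).hostname


def count_site_inlinks(link_pairs):
    """Count links arriving at each site from a different site.

    Links whose source and target share a hostname are skipped, on the
    assumption that only inter-site links are of interest.
    """
    inlinks = Counter()
    for source, target in link_pairs:
        source_site, target_site = site_of(source), site_of(target)
        if source_site != target_site:
            inlinks[target_site] += 1
    return inlinks


inlink_counts = count_site_inlinks(links)
for site, count in inlink_counts.most_common():
    print(site, count)
```

Because direction is kept in each pair, the same data could equally be counted by source to give outlink totals, much as a citation index can be read by cited or citing document.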
An information science approach to link analysis
1) Formulate an appropriate research question, taking into account
existing knowledge of web structure (see chapters 5, 6, and chapters
7-16 as appropriate).
2) Conduct a pilot study (see chapter 4).
3) Identify web pages or sites that are appropriate to address the
research question.
4) Collect link data from a commercial search engine or a personal
crawler, taking appropriate safeguards to ensure that the results
obtained are accurate (see chapter 17 or 18).
5) Apply data cleansing techniques to the links, if possible, and select
an appropriate counting method (see chapters 3 and 19).
6) Partially validate the link count results through correlation tests
(see chapter 4).
7) Partially validate the interpretation of the results through a link
classification exercise (see chapter 4).
8) Report results with an interpretation consistent with the link
classification exercise, including either a detailed description of the
classification or exemplars to illustrate the categories (see chapter 4).
9) Report the limitations of the study and the parameters used in data
collection and processing (stages 3 to 5) (see chapters 3 and 4).
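Stage 6 of the box can be sketched as a rank correlation between link counts and an independent measure for the same sites. This is a minimal sketch under assumptions: the text at this point says only "correlation tests", so the choice of Spearman's rank correlation is an assumption made here, and the figures and the names `average_ranks`, `pearson` and `spearman` are invented for illustration.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: inlink counts for five sites and an external score for each.
inlink_counts = [120, 45, 300, 80, 10]
external_scores = [2.8, 2.0, 4.5, 3.1, 1.2]

rho = spearman(inlink_counts, external_scores)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based statistic is used in this sketch because it does not assume that the counts are normally distributed. A significance test and a larger sample would be needed before drawing any conclusion from such a coefficient.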
The information science approach to link analysis is outlined in the box above. Those familiar
with citation analysis will see strong parallels, but these are not directly commented upon.
There are two central themes, the first being information. The objective of the link analysis is
to deliver useful information. Other types of link analysis may have different objectives, such
as identifying abstract mathematical patterns or improving the performance of web