Link Analysis

An Information Science Approach

Recent and Forthcoming Volumes

Leo Egghe

Power Laws in the Information Production Process: Lotkaian Informetrics

Donald Case

Looking for Information

Matthew Locke Saxton and John V. Richardson

Understanding Reference Transactions: Turning Art Into a Science

Robert M. Hayes

Models for Library Management, Decision-Making, and Planning

Charles T. Meadow, Bert R. Boyce, and Donald H. Kraft

Text Information Retrieval Systems, Second Edition

Charles T. Meadow

Text Information Retrieval Systems

A.J. Meadows

Communicating Research

V. Frants, J. Shapiro, & V. Voiskunskii

Automated Information Retrieval: Theory and Methods

Harold Sackman

Biomedical Information Technology: Global Social Responsibilities for the Democratic Age

Peter Clayton

Implementation of Organizational Innovation: Studies of Academic and Research Libraries

Bryce L. Allen

Information Tasks: Toward a User-Centered Approach to Information Systems

Library and Information Science

Series Editor: Bert R. Boyce

School of Library & Information Science

Louisiana State University, Baton Rouge

Mike Thelwall

2004

ELSEVIER ACADEMIC PRESS

Amsterdam - Boston - Heidelberg - London - New York - Oxford

Paris - San Diego - San Francisco - Singapore - Sydney - Tokyo

ELSEVIER B.V., Radarweg 29, P.O. Box 211, 1000 AE Amsterdam, The Netherlands

ELSEVIER Inc., 525 B Street, Suite 1900, San Diego, CA 92101-4495, USA

ELSEVIER Ltd., The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK

ELSEVIER Ltd., 84 Theobalds Road, London WC1X 8RR, UK

© 2004 Elsevier Inc. All rights reserved.

This work is protected under copyright by Elsevier Inc., and the following terms and conditions apply to its use:

Photocopying

Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission

of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying,

copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available

for educational institutions that wish to make photocopies for non-profit educational classroom use.

Permissions may be sought directly from Elsevier's Rights Department in Oxford, UK: phone (+44) 1865 843830, fax

(+44) 1865 853333, email: [email protected]. Requests may also be completed on-line via the Elsevier

homepage (http://www.elsevier.com/locate/permissions).

In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222

Rosewood Drive, Danvers, MA 01923, USA; phone: (+1) (978) 7508400, fax: (+1) (978) 7504744, and in the UK through

the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK;

phone: (+44) 20 7631 5555; fax: (+44) 20 7631 5500. Other countries may have a local reprographic rights agency for

payments.

Derivative Works

Tables of contents may be reproduced for internal circulation, but permission of the Publisher is required for external

resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including

compilations and translations.

Electronic Storage or Usage

Permission of the Publisher is required to store or use electronically any material contained in this work, including any

chapter or part of a chapter.

Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form

or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the

Publisher.

Address permissions requests to: Elsevier's Rights Department, at the fax and e-mail addresses noted above.

Notice

No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products

liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained

in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of

diagnoses and drug dosages should be made.

First edition 2004

ISBN: 0-12-088553-0

The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).

Printed in The Netherlands.

Part I: Theory 1

1 1

Introduction 1

Objectives 1

Link analysis 1

Historical overview 2

What is the information science approach to link analysis? 3

Contents and structure 4

Key terminology 5

Summary 6

Further reading 6

References 7

2 9

Web crawlers and search engines 9

Objectives 9

Introduction 9

Web crawlers 9

Finding pages 11

Content crawling vs. URL crawling 11

Content crawling 14

Obscured links 14

Depth and other arbitrary limitations 15

Automatically generated pages 15

Ethical issues and robots.txt 17

The web page 17

Web crawling summary 18

Search engines 18

Known biases 19

Search engine ranking 20

The Internet Archive 20

Summary 20

Further reading 21

References 21

3 23

The theoretical perspective for link counting 23

Objectives 23

Introduction 23

The theoretical perspective for link counting 23

Anomalies 24

Manual filtering and banned lists 26

Alternative Document Models 27

Web sites and web documents 27

ADMs and standard ADM counting 29

ADM range counting models 30

Choosing link counting strategies 31

Summary 32

Further reading 32

References 33

4 35

Interpreting link counts: Random samples and correlations 35

Objectives 35

Introduction 35

Interpreting link counts 35

The pilot feasibility and validity study 37

Full-scale random sampling 38

Confidence limits for categories 40

Correlation testing 41

Literature review 43

Summary 43

Further reading 43

References 44

Part II: Web structure 47

5 47

Link structures in the web graph 47

Objectives 47

Introduction 47

Power laws in the web 48

Models of web growth 50

Link topologies 52

Power laws and link topologies in academic webs 54

Summary 55

Further reading 56

References 56

6 59

The content structure of the web 59

Objectives 59

Introduction 59

The topic structure of the web 60

A link-content web growth model 61

Link text 62

The subject structure of academic webs 62

Colinks 66

Summary 66

Further reading 67

References 67

Part III: Academic links 69

7 69

Universities: Link types 69

Objectives 69

Introduction 69

Citation analysis 69

The role of a university web site 70

National systems of university web sites 71

Page types 72

Link types 75

Summary 77

Further reading 78

References 78

8 81

Universities: Link models 81

Objectives 81

Introduction 81

The relationship between inlinks and research 81

Academic linking: Quality vs. quantity 84

Alternative logical linking models 86

Mathematical models 87

The influence of geography 88

Regional groupings 89

Summary 91

References 91

9 93

Universities: International links 93

Objectives 93

Introduction 93

National vs. international links 94

International linking comparisons 95

Linguistic influences 96

Summary 98

Further reading 99

References 99

10 101

Departments and subjects 101

Objectives 101

Introduction 101

Departmental web sites 102

Disciplinary differences in link types 103

Issues of scale and correlation tests 104

Country 105

Subject 105

Outcome 105

Geographic and international factors 106

Summary 106

Further reading 107

References 107

11 109

Journals and articles 109

Objectives 109

Introduction 109

Journal Impact Factors 109

Journal web sites 110

Journal web site inlinks: Issues 111

Journal web site inlinks: Case study 112

Types of links in journal articles 113

Digital library links 114

Combined link and log file analysis 114

Related research topics 115

Summary 116

Further reading 116

References 116

Part IV: Applications 119

12 119

Search engines and web design 119

Objectives 119

Introduction 119

Link structures and crawler coverage 119

Text in web sites and the Vector Space Model 120

The PageRank algorithm 121

Case study: PageRank calculations for a gateway site 124

HITS 127

HITS worked example 128

Summary: Web site design for PageRank and HITS 131

Further reading 132

Appendix: the Vector Space Model 133

References 134

13 137

A health check for Spanish universities 137

Objective 137

Introduction 137

Research questions 137

Methods 138

Results and discussion 138

Conclusion 144

References 144

14 145

Personal web pages linking to universities 145

Objectives 145

Introduction 145

Web publishing and personal home pages 146

Research questions 147

Methods 148

Data collection 148

Data analysis 149

Results 151

ISP bias test 151

ADM fitting 152

Correlations between links and research ratings 153

A comparison of university and home page link sources 154

Individual page categorizations 155

Conclusion 158

Meta-conclusions 159

Acknowledgement 159

References 160

15 163

Academic networks 163

Objectives 163

Introduction 163

Methods 163

University sitemaps 164

National academic web maps 168

Subject maps 170

Summary 171

Further reading 171

References 172

16 173

Business web sites 173

Objectives 173

Introduction 173

Site coverage checks 173

Site indexing and ranking checks 174

Competitive intelligence 174

Case study 175

Center Parcs 176

Hoseasons 176

Butlins 177

Pontins 178

Haven Holidays 178

General queries 179

Summary 179

Further reading 180

References 180

Part V: Tools and techniques 181

17 181

Using commercial search engines and the Internet Archive 181

Objectives 181

Introduction 181

Checking results 182

Dealing with variations in results 183

Using multiple search engines 184

Using the Internet Archive 184

Summary 185

Online resources 185

Further reading 186

References 186

18 189

Personal crawlers 189

Objectives 189

Introduction 189

Types of personal crawler 189

SocSciBot 190

Web page retrieved 190

Web page qualification 191

Web link extraction 192

URLs from HTTP 192

Obscured or unspecified URLs 193

Server-generated pages 193

Dealing with errors 194

Human intervention during crawls 195

SocSciBot tools 195

Summary 196

Online resources 196

Further reading 196

References 197

19 199

Data cleansing 199

Objectives 199

Introduction 199

Overview of data cleansing techniques 199

Anomaly identification 200

TLD Spectral Analysis 201

Summary 201

Online resources 202

References 202

20 203

Online university link databases 203

Objective 203

Introduction 203

Overview of the link databases 203

Link structure files 204

The banned lists 205

Analyzing the data 206

Other link structure databases 206

Summary 206

Online resources 206

Further reading 206

Reference 208

21 209

Embedded link analysis methodologies 209

Objectives 209

Introduction 209

Web Sphere Analysis 210

Virtual ethnography 210

Summary 211

Further reading 212

References 212

22 213

Social Network Analysis 213

Objectives 213

Introduction 213

Some SNA metrics 214

Software 215

Summary 216

Further reading 216

References 216

23 219

Network visualizations 219

Objectives 219

Introduction 219

Network diagrams 219

Large network diagrams 221

MultiDimensional Scaling 221

Self-Organizing Maps 222

Knowledge Domain Visualisation 223

Summary 223

Online resources 223

References 223

24 227

Academic link indicators 227

Objective 227

Introduction 227

Web indicators as process indicators 228

Issues of size and reliability 228

Benchmarking indicators 230

Link metrics 230

Relational indicators 232

Other metrics 232

Summary 233

Further reading 233

References 234

Part VI: Summary 237

25 237

Summary 237

Objectives 237

Introduction 237

Information science contributions to link analysis 238

Other link analysis approaches 239

Future directions 240

26 241

Glossary 241

References 243

Appendix 245

A SocSciBot tutorial 245

Tutorial 245

Step 1: Installing SocSciBot, SocSciBot Tools and Cyclist 245

Step 2: Installing Pajek 247

Step 3: Crawling a first site with SocSciBot 247

Step 4: Crawling two more sites with SocSciBot 252

Step 5: Viewing basic reports about the "small test" project with SocSciBot Tools 253

Step 6: Viewing a network diagram with Pajek 257

Step 7: Viewing site diagrams with Pajek 261

Step 8: Using Cyclist 263

Summary 264

Index 265

PART I: THEORY

INTRODUCTION

OBJECTIVES

• To introduce the content and structure of the book and some key terminology.

• To outline the information science approach to link analysis.

LINK ANALYSIS

Link analysis is performed in very diverse subjects, from computer science and theoretical

physics to information science, communication studies and sociology. This is a testament both

to the importance of the web and to a widespread belief that hyperlinks between web pages

can yield useful information of one kind or another. This belief probably stems from several

related factors: the success of Google, which uses a link-based algorithm for identifying the

best pages; analogies with other phenomena, such as journal citations and social connections;

and probably also links being 'in your face' all the time, whether using the web for research,

business or recreation.

In this book, an information science approach to link analysis is set out with the

principal aim of introducing it to a new audience. This new audience will then be able to

critically evaluate existing research and develop their own research projects and methods. It is

a central belief of this book that the information science approach is widely useful to other

researchers, particularly social scientists interested in analyzing phenomena with an online

component. No attempt is made to give comprehensive coverage of all different types of link

analysis: such an enterprise would fail between the detail of the mathematics used in some

areas and the qualitative approach used in others. The information science theme of the book

has resulted in at least half of its content being related to the study of academic web use or

scholarly communication. Readers may therefore also gain additional insights into scholarly

communication.

The book seeks to answer four main questions.

• Which kinds of information can be extracted by analyzing the hyperlinks between a

set of web pages or sites?

• Which techniques should be used?

• What are the likely pitfalls of link analysis?

• How can and should a link analysis be conducted in practice?

HISTORICAL OVERVIEW

The start of published web link analysis research appears to date from 1995-1996, occurring

simultaneously in several disciplines, including computer science for search engine

development (e.g., Weiss, Velez, Sheldon et al., 1996), and mathematics for structure and

complexity analysis (e.g., Abraham, 1996). The first information scientist to publish a

discussion of the potential for transferring information science techniques to the Internet

appears to be the Brazilian Marcia J. Bossy (1995), with an article in a French online journal.

The first published information science link analysis seems to be that of Larson (1996). His

"Bibliometrics of the World Wide Web: An exploratory analysis of the intellectual structure

of cyberspace" presentation at the American Society for Information Science conference

explicitly adapted existing information science techniques from bibliometrics to the web.

Larson's objective was to assess the link structure of a topic on the web (Earth Sciences) and

the characteristics of highly linked-to documents.

Shortly following Larson's presentation, a number of other information scientists also

realized that advanced features of search engines could be used for an information science-style

link analysis. This produced Rousseau's (1997) informetric analysis of the web and

Rodríguez i Gairín's (1997) web citation analysis, the latter describing the search engine

AltaVista as the web's 'citation index'.

Two other important developments occurred in parallel with the genesis of link

analysis: the foundation of a journal and the development of a theoretical orientation for

information science web research. Almind and Ingwersen (1997) coined the term

'webometrics' for the quantitative analysis of web-related phenomena from an information

science perspective. Most webometrics research has, so far, focused on hyperlinks, although

there have also been quantitative analyses of search engine results and longitudinal

investigations into web page changes. The term 'cybermetrics' emerged at the same time as

webometrics and is almost synonymous: the difference being that cybermetrics includes

quantitative analysis of the Internet, not just the web. A key instigator of this term was Isidro

Aguillo, who founded the e-journal Cybermetrics in 1997.

Since 1997, there have been a large number of link analysis studies taking an

information science approach (Thelwall, Vaughan & Björneborn, 2005). These have

collectively produced the developed body of theory and methods that is summarized in this

book.

WHAT IS THE INFORMATION SCIENCE APPROACH TO LINK ANALYSIS?

The information science approach to link analysis is to adopt and adapt existing information

science techniques for the meta-analysis of documents through investigating inter-document

connections. This set of existing techniques is part of two overlapping fields of study:

bibliometrics, the quantitative analysis of documents; and scientometrics, the quantitative

analysis of science and its outputs. Within the overlap of these two fields a number of

techniques for analyzing scientific publications have been developed, principally for journal

articles and patents, and using citations as the key inter-document connectors. The surface

similarity between hyperlinks and citations is that they are both directional links between

documents, often documents created by different authors. There is an extensive body of

research and theory concerning citations (e.g., Borgman & Furner, 2002) that serves as a

starting point for an information science approach to link analysis. There is a historical

parallel: citation analysis techniques have been adapted from their original information

science home of journal citations to patent citations (Oppenheim, 2000), in response to the

increasing commercialization of research.

An information science approach to link analysis

1) Formulate an appropriate research question, taking into account existing knowledge of web structure (see chapters 5 and 6, and chapters 7-16 as appropriate).

2) Conduct a pilot study (see chapter 4).

3) Identify web pages or sites that are appropriate to address a research question.

4) Collect link data from a commercial search engine or a personal crawler, taking appropriate safeguards to ensure that the results obtained are accurate (see chapter 17 or 18).

5) Apply data cleansing techniques to the links, if possible, and select an appropriate counting method (see chapters 3 and 19).

6) Partially validate the link count results through correlation tests (see chapter 4).

7) Partially validate the interpretation of the results through a link classification exercise (see chapter 4).

8) Report results with an interpretation consistent with the link classification exercise, including either a detailed description of the classification or exemplars to illustrate the categories (see chapter 4).

9) Report the limitations of the study and the parameters used in data collection and processing (stages 3 to 5) (see chapters 3 and 4).
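
As a rough illustration of stages 5 and 6, the sketch below counts inlinks to a set of target sites and then correlates those counts with an external measure. It is only a sketch under stated assumptions, not the book's own procedure: the function names, the rule of counting each source site at most once per target site, the use of host names as the "site" unit, and the choice of a Spearman rank correlation are all assumptions made for the example, and the `.example` URLs and rating values are invented.

```python
from collections import Counter
from math import sqrt
from urllib.parse import urlsplit


def site_of(url):
    """Return the lower-cased host name of a URL (assumed 'site' unit)."""
    return (urlsplit(url).hostname or "").lower()


def count_inlinks(links, targets):
    """Count inlinks per target site.

    Assumed counting rule: each source site counts at most once per
    target site, and links within a single site are ignored.
    """
    pairs = set()
    for source_url, target_url in links:
        src, dst = site_of(source_url), site_of(target_url)
        if dst in targets and src and src != dst:
            pairs.add((src, dst))
    counts = Counter(dst for _, dst in pairs)
    return {t: counts.get(t, 0) for t in targets}


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Invented data: (source URL, target URL) pairs and a made-up external rating.
links = [
    ("http://a.example/x", "http://u1.example/"),
    ("http://a.example/y", "http://u1.example/p"),   # same source site again
    ("http://b.example/", "http://u1.example/"),
    ("http://u1.example/a", "http://u1.example/b"),  # internal link, ignored
    ("http://a.example/", "http://u2.example/"),
]
targets = ["u1.example", "u2.example", "u3.example"]
ratings = {"u1.example": 5.0, "u2.example": 3.5, "u3.example": 2.0}

inlinks = count_inlinks(links, targets)
rho = spearman([inlinks[t] for t in targets], [ratings[t] for t in targets])
print(inlinks, rho)
```

A different counting rule, such as counting every linking page separately, would change the inlink totals and possibly the correlation, which is one reason the counting method and parameters need to be reported (stage 9).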

The information science approach to link analysis is outlined in the box above. Those familiar

with citation analysis will see strong parallels, but these are not directly commented upon.

There are two central themes, the first being information. The objective of the link analysis is

to deliver useful information. Other types of link analysis may have different objectives, such

as identifying abstract mathematical patterns or improving the performance of web
