Siêu thị PDFTải ngay đi em, trời tối mất

Thư viện tri thức trực tuyến

Kho tài liệu với 50,000+ tài liệu học thuật

© 2023 Siêu thị PDF - Kho tài liệu học thuật hàng đầu Việt Nam

Modeling the internet and the web
PREMIUM
Số trang
306
Kích thước
3.7 MB
Định dạng
PDF
Lượt xem
1587

Modeling the internet and the web

Nội dung xem thử

Mô tả chi tiết

Modeling the Internet and the Web

This Page Intentionally Left Blank

Modeling the Internet and the Web

Probabilistic Methods and Algorithms

Pierre Baldi

School of Information and Computer Science,

University of California, Irvine, USA

Paolo Frasconi

Department of Systems and Computer Science,

University of Florence, Italy

Padhraic Smyth

School of Information and Computer Science,

University of California, Irvine, USA

Copyright © 2003 Pierre Baldi, Paolo Frasconi and Padhraic Smyth

Published by John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester,

West Sussex PO19 8SQ, England

Phone (+44) 1243 779777

Email (for orders and customer service enquiries): [email protected]

Visit our Home Page on www.wileyeurope.com or www.wiley.com

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or

transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or

otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of

a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP,

UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed

to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West

Sussex PO19 8SQ, England, or emailed to [email protected], or faxed to (+44) 1243 770620.

This publication is designed to provide accurate and authoritative information in regard to the subject matter

covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services.

If professional advice or other expert assistance is required, the services of a competent professional should

be sought.

Other Wiley Editorial Offices

John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA

Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA

Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany

John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia

John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809

John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada M9W 1L1

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may

not be available in electronic books.

British Library Cataloguing in Publication Data

A catalogue record for this book is available from the British Library

ISBN 0-470-84906-1

Typeset in 10/12pt Times by T&T Productions Ltd, London.

Printed and bound in Great Britain by Biddles Ltd, Guildford, Surrey.

This book is printed on acid-free paper responsibly manufactured from sustainable forestry

in which at least two trees are planted for each one used for paper production.

To Ezio and José (P.B.), to Neda (P.F.) and

to Seosamh and Bríd Áine (P.S.)

This Page Intentionally Left Blank

Contents

Preface xiii

1 Mathematical Background 1

1.1 Probability and Learning from a Bayesian Perspective 1

1.2 Parameter Estimation from Data 4

1.2.1 Basic principles 4

1.2.2 A simple die example 6

1.3 Mixture Models and the Expectation Maximization Algorithm 10

1.4 Graphical Models 13

1.4.1 Bayesian networks 13

1.4.2 Belief propagation 15

1.4.3 Learning directed graphical models from data 16

1.5 Classification 17

1.6 Clustering 20

1.7 Power-Law Distributions 22

1.7.1 Definition 22

1.7.2 Scale-free properties (80/20 rule) 24

1.7.3 Applications to Languages: Zipf’s and Heaps’ Laws 24

1.7.4 Origin of power-law distributions and Fermi’s model 26

1.8 Exercises 27

2 Basic WWW Technologies 29

2.1 Web Documents 30

2.1.1 SGML and HTML 30

2.1.2 General structure of an HTML document 31

2.1.3 Links 32

2.2 Resource Identifiers: URI, URL, and URN 33

2.3 Protocols 36

2.3.1 Reference models and TCP/IP 36

2.3.2 The domain name system 37

2.3.3 The Hypertext Transfer Protocol 38

2.3.4 Programming examples 40

viii CONTENTS

2.4 Log Files 41

2.5 Search Engines 44

2.5.1 Overview 44

2.5.2 Coverage 45

2.5.3 Basic crawling 46

2.6 Exercises 49

3 Web Graphs 51

3.1 Internet and Web Graphs 51

3.1.1 Power-law size 53

3.1.2 Power-law connectivity 53

3.1.3 Small-world networks 56

3.1.4 Power law of PageRank 57

3.1.5 The bow-tie structure 58

3.2 Generative Models for the Web Graph and Other Networks 60

3.2.1 Web page growth 60

3.2.2 Lattice perturbation models: between order and disorder 61

3.2.3 Preferential attachment models, or the rich get richer 63

3.2.4 Copy models 66

3.2.5 PageRank models 67

3.3 Applications 68

3.3.1 Distributed search algorithms 68

3.3.2 Subgraph patterns and communities 70

3.3.3 Robustness and vulnerability 72

3.4 Notes and Additional Technical References 73

3.5 Exercises 74

4 Text Analysis 77

4.1 Indexing 77

4.1.1 Basic concepts 77

4.1.2 Compression techniques 79

4.2 Lexical Processing 80

4.2.1 Tokenization 80

4.2.2 Text conflation and vocabulary reduction 82

4.3 Content-Based Ranking 82

4.3.1 The vector-space model 82

4.3.2 Document similarity 83

4.3.3 Retrieval and evaluation measures 85

4.4 Probabilistic Retrieval 86

4.5 Latent Semantic Analysis 88

4.5.1 LSI and text documents 89

4.5.2 Probabilistic LSA 89

4.6 Text Categorization 93

CONTENTS ix

4.6.1 k nearest neighbors 93

4.6.2 The Naive Bayes classifier 94

4.6.3 Support vector classifiers 97

4.6.4 Feature selection 102

4.6.5 Measures of performance 104

4.6.6 Applications 106

4.6.7 Supervised learning with unlabeled data 111

4.7 Exploiting Hyperlinks 114

4.7.1 Co-training 114

4.7.2 Relational learning 115

4.8 Document Clustering 116

4.8.1 Background and examples 116

4.8.2 Clustering algorithms for documents 117

4.8.3 Related approaches 119

4.9 Information Extraction 120

4.10 Exercises 122

5 Link Analysis 125

5.1 Early Approaches to Link Analysis 126

5.2 Nonnegative Matrices and Dominant Eigenvectors 128

5.3 Hubs and Authorities: HITS 131

5.4 PageRank 134

5.5 Stability 138

5.5.1 Stability of HITS 139

5.5.2 Stability of PageRank 139

5.6 Probabilistic Link Analysis 140

5.6.1 SALSA 140

5.6.2 PHITS 142

5.7 Limitations of Link Analysis 143

6 Advanced Crawling Techniques 149

6.1 Selective Crawling 149

6.2 Focused Crawling 152

6.2.1 Focused crawling by relevance prediction 152

6.2.2 Context graphs 154

6.2.3 Reinforcement learning 155

6.2.4 Related intelligent Web agents 157

6.3 Distributed Crawling 158

6.4 Web Dynamics 160

6.4.1 Lifetime and aging of documents 161

6.4.2 Other measures of recency 167

6.4.3 Recency and synchronization policies 167

x CONTENTS

7 Modeling and Understanding Human Behavior on the

Web 171

7.1 Introduction 171

7.2 Web Data and Measurement Issues 172

7.2.1 Background 172

7.2.2 Server-side data 174

7.2.3 Client-side data 177

7.3 Empirical Client-Side Studies of Browsing Behavior 179

7.3.1 Early studies from 1995 to 1997 180

7.3.2 The Cockburn and McKenzie study from 2002 181

7.4 Probabilistic Models of Browsing Behavior 184

7.4.1 Markov models for page prediction 184

7.4.2 Fitting Markov models to observed page-request data 186

7.4.3 Bayesian parameter estimation for Markov models 187

7.4.4 Predicting page requests with Markov models 189

7.4.5 Modeling runlengths within states 193

7.4.6 Modeling session lengths 194

7.4.7 A decision-theoretic surfing model 198

7.4.8 Predicting page requests using additional variables 199

7.5 Modeling and Understanding Search Engine Querying 201

7.5.1 Empirical studies of search behavior 202

7.5.2 Models for search strategies 207

7.6 Exercises 208

8 Commerce on the Web: Models and Applications 211

8.1 Introduction 211

8.2 Customer Data on the Web 212

8.3 Automated Recommender Systems 212

8.3.1 Evaluating recommender systems 214

8.3.2 Nearest-neighbor collaborative filtering 215

8.3.3 Model-based collaborative filtering 218

8.3.4 Model-based combining of votes and content 223

8.4 Networks and Recommendations 224

8.4.1 Email-based product recommendations 224

8.4.2 A diffusion model 226

8.5 Web Path Analysis for Purchase Prediction 228

8.6 Exercises 232

Appendix A Mathematical Complements 235

A.1 Graph Theory 235

A.1.1 Basic definitions 235

A.1.2 Connectivity 236

A.1.3 Random graphs 236

CONTENTS xi

A.2 Distributions 237

A.2.1 Expectation, variance, and covariance 237

A.2.2 Discrete distributions 237

A.2.3 Continuous distributions 238

A.2.4 Weibull distribution 240

A.2.5 Exponential family 240

A.2.6 Extreme value distribution 241

A.3 Singular Value Decomposition 241

A.4 Markov Chains 243

A.5 Information Theory 243

A.5.1 Mathematical background 244

A.5.2 Information, surprise, and relevance 247

Appendix B List of Main Symbols and Abbreviations 253

References 257

Index 277

This Page Intentionally Left Blank

Preface

Since its early ARPANET inception during the Cold War, the Internet has grown

by a staggering nine orders of magnitude. Today, the Internet and the World Wide

Web pervade our lives, having fundamentally altered the way we seek, exchange,

distribute, and process information. The Internet has become a powerful social force,

transforming communication, entertainment, commerce, politics, medicine, science,

and more. It mediates an ever growing fraction of human knowledge, forming both

the largest library and the largest marketplace on planet Earth.

Unlike the invention of earlier media such as the press, photography, or even the

radio, which created specialized passive media, the Internet and the Web impact all

information, converting it to a uniform digital format of bits and packets. In addition,

the Internet and the Web form a dynamic medium, allowing software applications

to control, search, modify, and filter information without human intervention. For

example, email messages can carry programs that affect the behavior of the receiving

computer. This active medium also promotes human intervention in sharing, updating,

linking, embellishing, critiquing, corrupting, etc., information to a degree that far

exceeds what could be achieved with printed documents.

In common usage, the words ‘Internet’ and ‘Web’ (or World Wide Web or WWW)

are often used interchangeably. Although they are intimately related, there are of

course some nuances which we have tried to respect. ‘Internet’, in particular, is the

more general term and implicitly includes physical aspects of the underlying networks

as well as mechanisms such as email and peer-to-peer activities that are not directly

associated with the Web. The term ‘Web’, on the other hand, is associated with the

information stored and available on the Internet. It is also a term that points to other

complex networks of information, such as webs of scientific citations, social relations,

or even protein interactions. In this sense, it is fair to say that a predominant fraction

of our book is about theWeb and the information aspects of the Internet.We use ‘Web’

every time we refer to the World Wide Web and ‘web’ when we refer to a broader

class of networks or other kinds of networks, i.e. web of citations.

As the Internet and the Web continue to expand at an exponential rate, it also

evolves in terms of the devices and processors connected to it, e.g. wireless devices

and appliances. Ever more human domains and activities are ensnared by the Web,

thus creating challenging problems of ownership, security, and privacy. For instance,

xiv PREFACE

we are quite far from having solved the security, privacy, and authentication problems

that would allow us to hold national Internet elections.

As scientists, the Web has also become a tool we use on a daily basis for tasks

ranging from the mundane to the intractable, to search and disseminate information,

to exchange views and collaborate, to post job listings, to retrieve and quote (by

Uniform Resource Locator (URL)) bibliographic information, to build Web servers,

and even to compute. There is hardly a branch of computer science that is not affected

by the Internet: not only the most obvious areas such as networking and protocols,

but also security and cryptography; scientific computing; human interfaces, graphics,

and visualization; information retrieval, data mining, machine learning, language/text

modeling and artificial intelligence, to name just a few.

What is perhaps less obvious and central to this book is that not only have the

Web and the Internet become essential tools of scientific enterprise, but they have

also themselves become the objects of active scientific investigation. And not only for

computer scientists and engineers, but also for mathematicians, economists, social

scientists, and even biologists.

There are many reasons why the Internet and the Web are exciting, albeit young,

topics for scientific investigation. These reasons go beyond the need to improve the

underlying technology and to harness the Web for commercial applications. Because

the Internet and the Web can be viewed as dynamic constellations of interconnected

processors and Web pages, respectively, they can be monitored in many ways and

at many different levels of granularity, ranging from packet traffic, to user behavior,

to the graphical structure of Web pages and their hyperlinks. These measurements

provide new types of large-scale data sets that can be scientifically analyzed and

‘mined’ at different levels. Thus researchers enjoy unprecedented opportunities to,

for instance:

• gather, communicate, and exchange ideas, documents, and information;

• monitor a large dynamic network with billions of nodes and one order of mag￾nitude more connections;

• gather large training sets of textual or activity data, for the purposes of modeling

and predicting the behavior of millions of users;

• analyze and understand interests and relationships within society.

The Web, for instance, can be viewed as an example of a very large distributed and

dynamic system with billions of pages resulting from the uncoordinated actions of

millions of individuals. After all, anyone can post a Web page on the Internet and link

it to any other page. In spite of this complete lack of central control, the graphical

structure of the Web is far from random and possesses emergent properties shared

with other complex graphs found in social, technological, and biological systems.

Examples of properties include the power-law distribution of vertex connectivities

and the small-world property – any two Web pages are usually only a few clicks

away from each other. Similarly, predictable patterns of congestion (e.g. traffic jams)

Tải ngay đi em, còn do dự, trời tối mất!