How to Build a Digital Library (The Morgan Kaufmann Series in Multimedia Information and Systems)
How to Build a

Digital Library

The Morgan Kaufmann Series in Multimedia Information and Systems

Series Editor, Edward A. Fox, Virginia Polytechnic University

How to Build a Digital Library

Ian H. Witten and David Bainbridge

Digital Watermarking

Ingemar J. Cox, Matthew L. Miller, and Jeffrey A. Bloom

Readings in Multimedia Computing and Networking

Edited by Kevin Jeffay and HongJiang Zhang

Introduction to Data Compression, Second Edition

Khalid Sayood

Multimedia Servers: Applications, Environments, and Design

Dinkar Sitaram and Asit Dan

Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition

Ian H. Witten, Alistair Moffat, and Timothy C. Bell

Digital Compression for Multimedia: Principles and Standards

Jerry D. Gibson, Toby Berger, Tom Lookabaugh, Dave Lindbergh, and Richard L. Baker

Practical Digital Libraries: Books, Bytes, and Bucks

Michael Lesk

Readings in Information Retrieval

Edited by Karen Sparck Jones and Peter Willett

Documents are the digital library’s building blocks. It is time to step down from our high-level discussion of digital libraries (what they are, how they are organized, and what they look like) to nitty-gritty details of how to represent the documents they contain. To do a thorough job we will have to descend even further and look at the representation of the characters that make up textual documents and the fonts in which those characters are portrayed. For audio, images and video we examine the interplay between signal quantization, sampling rate and internal redundancy that underlies multimedia representations.

How to Build a

Digital Library

Ian H. Witten

Computer Science Department

University of Waikato

David Bainbridge

Computer Science Department

University of Waikato

Publishing Director Diane D. Cerra

Assistant Publishing Services Manager Edward Wade

Senior Developmental Editor Marilyn Uffner Alan

Editorial Assistant Mona Buehler

Project Management Yonie Overton

Cover Design Frances Baca Design

Text Design Mark Ong, Side by Side Studios

Composition Susan Riley, Side by Side Studios

Copyeditor Carol Leyba

Proofreader Ken DellaPenta

Indexer Steve Rath

Printer The Maple-Vail Book Manufacturing Group

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Morgan Kaufmann Publishers

An imprint of Elsevier Science

340 Pine Street, Sixth Floor

San Francisco, CA 94104-3205

www.mkp.com

© 2003 by Elsevier Science (USA)

All rights reserved.

Printed in the United States of America

07 06 05 04 03 5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, or otherwise) without the prior written permission of the publisher.

Library of Congress Control Number: 2002107327

ISBN: 1-55860-790-0

This book is printed on acid-free paper.

Contents

List of figures

List of tables

Foreword by Edward A. Fox

Preface

1. Orientation: The world of digital libraries

Example One: Supporting human development

Example Two: Pushing on the frontiers of science

Example Three: Preserving a traditional culture

Example Four: Exploring popular music

The scope of digital libraries

1.1 Libraries and digital libraries

1.2 The changing face of libraries

In the beginning

The information explosion

The Alexandrian principle

Early technodreams

The library catalog

The changing nature of books

1.3 Digital libraries in developing countries

Disseminating humanitarian information

Disaster relief

Preserving indigenous culture

Locally produced information

The technological infrastructure

1.4 The Greenstone software

1.5 The pen is mighty: Wield it wisely

Copyright

Collecting from the Web

Illegal and harmful material

Cultural sensitivity

1.6 Notes and sources

2. Preliminaries: Sorting out the ingredients

2.1 Sources of material

Ideology

Converting an existing library

Building a new collection

Virtual libraries

2.2 Bibliographic organization

Objectives of a bibliographic system

Bibliographic entities

2.3 Modes of access

2.4 Digitizing documents

Scanning

Optical character recognition

Interactive OCR

Page handling

Planning an image digitization project

Inside an OCR shop

An example project

2.5 Notes and sources

3. Presentation: User interfaces

3.1 Presenting documents

Hierarchically structured documents

Plain, unstructured text documents

Page images

Page images and extracted text

Audio and photographic images

Video

Music

Foreign languages

3.2 Presenting metadata

3.3 Searching

Types of query

Case-folding and stemming

Phrase searching

Different query interfaces

3.4 Browsing

Browsing alphabetical lists

Ordering lists of words in Chinese

Browsing by date

Hierarchical classification structures

3.5 Phrase browsing

A phrase browsing interface

Key phrases

3.6 Browsing using extracted metadata

Acronyms

Language identification

3.7 Notes and sources

Collections

Metadata

Searching

Browsing

4. Documents: The raw material

4.1 Representing characters

Unicode

The Unicode character set

Composite and combining characters

Unicode character encodings

Hindi and related scripts

Using Unicode in a digital library

4.2 Representing documents

Plain text

Indexing

Word segmentation

4.3 Page description languages: PostScript and PDF

PostScript

Fonts

Text extraction

Using PostScript in a digital library

Portable Document Format: PDF

PDF and PostScript

4.4 Word-processor documents

Rich Text Format

Native Word formats

LaTeX format

4.5 Representing images

Lossless image compression: GIF and PNG

Lossy image compression: JPEG

Progressive refinement

4.6 Representing audio and video

Multimedia compression: MPEG

MPEG video

MPEG audio

Mixing media

Other multimedia formats

Using multimedia in a digital library

4.7 Notes and sources

5. Markup and metadata: Elements of organization

5.1 Hypertext markup language: HTML

Basic HTML

Using HTML in a digital library

5.2 Extensible markup language: XML

Development of markup and stylesheet languages

The XML metalanguage

Parsing XML

Using XML in a digital library

5.3 Presenting marked-up documents

Cascading style sheets: CSS

Extensible stylesheet language: XSL

5.4 Bibliographic metadata

MARC

Dublin Core

BibTeX

Refer

5.5 Metadata for images and multimedia

Image metadata: TIFF

Multimedia metadata: MPEG-7

5.6 Extracting metadata

Extracting document metadata

Generic entity extraction

Bibliographic references

Language identification

Acronym extraction

Key-phrase extraction

Phrase hierarchies

5.7 Notes and sources

6. Construction: Building collections with Greenstone

6.1 Why Greenstone?

What it does

How to use it

6.2 Using the Collector

Creating a new collection

Working with existing collections

Document formats

6.3 Building collections manually: A walkthrough

Getting started

Making a framework for the collection

Importing the documents

Building the indexes

Installing the collection

6.4 Importing and building

Files and directories

Object identifiers

Plug-ins

The import process

The build process

6.5 Greenstone archive documents

Document metadata

Inside the documents

6.6 Collection configuration file

Default configuration file

Subcollections and supercollections

6.7 Getting the most out of your documents

Plug-ins

Classifiers

Format statements

6.8 Building collections graphically

6.9 Notes and sources

7. Delivery: How Greenstone works

7.1 Processes and protocols

Processes

The null protocol implementation

The Corba protocol implementation

7.2 Preliminaries

The macro language

The collection information database

7.3 Responding to user requests

Performing a search

Retrieving a document

Browsing a hierarchical classifier

Generating the home page

Using the protocol

Actions

7.4 Operational aspects

Configuring the receptionist

Configuring the site

7.5 Notes and sources

8. Interoperability: Standards and protocols

8.1 More markup

Names

Links

Types

8.2 Resource description

Collection-level metadata

8.3 Document exchange

Open eBook

8.4 Query languages

Common command language

XML Query

8.5 Protocols

Z39.50

Supporting the Z39.50 protocol

The Open Archives Initiative

Supporting the OAI protocol

8.6 Research protocols

Dienst

Simple digital library interoperability protocol

Translating between protocols

Discussion

8.7 Notes and sources

9. Visions: Future, past, and present

9.1 Libraries of the future

Today’s visions

Tomorrow’s visions

Working inside the digital library

9.2 Preserving the past

The problem of preservation

A tale of preservation in the digital era

The digital dark ages

Preservation strategies

9.3 Generalized documents: A challenge for the present

Digital libraries of music

Other media

Generalized documents in Greenstone

Digital libraries for oral cultures

9.4 Notes and sources

Appendix: Installing and operating Greenstone

Glossary

References

Index

About the authors

Figures

Figure 1.1 Kataayi’s information and communication center.

Figure 1.2 The Zia Pueblo village.

Figure 1.3 The New York Public Library.

Figure 1.4 Rubbing from a stele in Xi’an.

Figure 1.5 A page of the original Trinity College Library catalog.

Figure 1.6 The Bibliothèque Nationale de France.

Figure 1.7 Artist’s conception of the Memex, Bush’s automated library.

Figure 1.8 Part of a page from the Book of Kells.

Figure 1.9 Pages from a palm-leaf manuscript in Thanjavur, India.

Figure 1.10 Māori toki or ceremonial adze, emblem of the Greenstone project.

Figure 2.1 Scanning and optical character recognition.

Figure 2.2 (a) Document image containing different types of data; (b) the document image segmented into different regions.

Figure 2.3 (a) Double-page spread of a Māori newspaper; (b) enlarged version; (c) OCR text.

Figure 3.1 Finding a quotation in Alice’s Adventures in Wonderland.

Figure 3.2 Different-looking digital libraries: (a) Kids’ Digital Library (b) School Journal Digital Library.

Figure 3.3 Village-Level Brickmaking: (a) the book; (b) the chapter on Moulding; (c, d) some of the pages.

Figure 3.4 Alice’s Adventures in Wonderland.

Figure 3.5 A story from the School Journal collection: (a) “Never Shout at a Draft Horse!”; (b) with search term highlighted (mock-up).

Figure 3.6 A historic Māori newspaper: (a) page image; (b) extracted text.

Figure 3.7 Listening to a tape from the Oral History collection.

Figure 3.8 Finding Auld Lang Syne in a digital music library.

Figure 3.9 Foreign-language collections: (a) French (b) Portuguese interface to an English collection.

Figure 3.10 Documents from two Chinese collections: (a) rubbings of Tang poetry; (b) classic literature.

Figure 3.11 An Arabic collection: (a) a document; (b) searching.

Figure 3.12 Bibliography display.

Figure 3.13 Metadata examples: (a) bibliography record retrieved from the Library of Congress; (b) description of a BBC television program.

Figure 3.14 Searching for a quotation: (a) query page; (b) query response.

Figure 3.15 Choosing search preferences.

Figure 3.16 Large-query search interface.

Figure 3.17 Query with history.

Figure 3.18 Form search: (a) simple; (b) advanced.

Figure 3.19 Browsing an alphabetical list of titles: (a) plain list; (b) with A–Z tags.

Figure 3.20 Browsing a list of titles in Chinese: (a) stroke-based browsing; (b) Pinyin browsing.

Figure 3.21 Browsing by date.

Figure 3.22 Browsing a classification hierarchy: (a) the beginning; (b) expanding Sustainable development; (c) expanding Organizations, institutions.

Figure 3.23 (a) Browsing for information about locusts; (b) expanding on desert locust; (c) document about desert locusts.

Figure 3.24 (a) Browsing for information on poisson; (b) INFOPECHE Web page.

Figure 3.25 Browsing interfaces based on key phrases: (a) hierarchical browser; (b) document explorer.

Figure 3.26 Browsing based on information mined from the document collection: (a) acronyms; (b) language identification.

Figure 4.1 Unicode excerpt: Basic Latin and Latin-1 Supplement (U+0000–U+00FF).

Figure 4.2 Unicode excerpts: (a) Latin Extended A (U+0100–U+017F); (b) Cyrillic (U+0400–U+045F).
