How to Build a Digital Library (The Morgan Kaufmann Series in Multimedia Information and Systems)
The Morgan Kaufmann Series in Multimedia Information and Systems
Series Editor, Edward A. Fox, Virginia Polytechnic University
How to Build a Digital Library
Ian H. Witten and David Bainbridge
Digital Watermarking
Ingemar J. Cox, Matthew L. Miller, and Jeffrey A. Bloom
Readings in Multimedia Computing and Networking
Edited by Kevin Jeffay and HongJiang Zhang
Introduction to Data Compression, Second Edition
Khalid Sayood
Multimedia Servers: Applications, Environments, and Design
Dinkar Sitaram and Asit Dan
Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition
Ian H. Witten, Alistair Moffat, and Timothy C. Bell
Digital Compression for Multimedia: Principles and Standards
Jerry D. Gibson, Toby Berger, Tom Lookabaugh, Dave Lindbergh, and Richard L. Baker
Practical Digital Libraries: Books, Bytes, and Bucks
Michael Lesk
Readings in Information Retrieval
Edited by Karen Sparck Jones and Peter Willett
Documents are the digital library’s building blocks. It is time to step down from our high-level discussion of digital libraries (what they are, how they are organized, and what they look like) to nitty-gritty details of how to represent the documents they contain. To do a thorough job we will have to descend even further and look at the representation of the characters that make up textual documents and the fonts in which those characters are portrayed. For audio, images, and video we examine the interplay between signal quantization, sampling rate, and internal redundancy that underlies multimedia representations.
How to Build a Digital Library
Ian H. Witten
Computer Science Department
University of Waikato
David Bainbridge
Computer Science Department
University of Waikato
Publishing Director Diane D. Cerra
Assistant Publishing Services Manager Edward Wade
Senior Developmental Editor Marilyn Uffner Alan
Editorial Assistant Mona Buehler
Project Management Yonie Overton
Cover Design Frances Baca Design
Text Design Mark Ong, Side by Side Studios
Composition Susan Riley, Side by Side Studios
Copyeditor Carol Leyba
Proofreader Ken DellaPenta
Indexer Steve Rath
Printer The Maple-Vail Book Manufacturing Group
Designations used by companies to distinguish their products are often claimed as trademarks or registered
trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names
appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for
more complete information regarding trademarks and registration.
Morgan Kaufmann Publishers
An imprint of Elsevier Science
340 Pine Street, Sixth Floor
San Francisco, CA 94104-3205
www.mkp.com
© 2003 by Elsevier Science (USA)
All rights reserved.
Printed in the United States of America
07 06 05 04 03 5 4 3 2 1
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by
any means—electronic, mechanical, photocopying, or otherwise—without the prior written permission of the
publisher.
Library of Congress Control Number: 2002107327
ISBN: 1-55860-790-0
This book is printed on acid-free paper.
Contents
List of figures
List of tables
Foreword by Edward A. Fox
Preface
1. Orientation: The world of digital libraries
Example One: Supporting human development
Example Two: Pushing on the frontiers of science
Example Three: Preserving a traditional culture
Example Four: Exploring popular music
The scope of digital libraries
1.1 Libraries and digital libraries
1.2 The changing face of libraries
In the beginning
The information explosion
The Alexandrian principle
Early technodreams
The library catalog
The changing nature of books
1.3 Digital libraries in developing countries
Disseminating humanitarian information
Disaster relief
Preserving indigenous culture
Locally produced information
The technological infrastructure
1.4 The Greenstone software
1.5 The pen is mighty: Wield it wisely
Copyright
Collecting from the Web
Illegal and harmful material
Cultural sensitivity
1.6 Notes and sources
2. Preliminaries: Sorting out the ingredients
2.1 Sources of material
Ideology
Converting an existing library
Building a new collection
Virtual libraries
2.2 Bibliographic organization
Objectives of a bibliographic system
Bibliographic entities
2.3 Modes of access
2.4 Digitizing documents
Scanning
Optical character recognition
Interactive OCR
Page handling
Planning an image digitization project
Inside an OCR shop
An example project
2.5 Notes and sources
3. Presentation: User interfaces
3.1 Presenting documents
Hierarchically structured documents
Plain, unstructured text documents
Page images
Page images and extracted text
Audio and photographic images
Video
Music
Foreign languages
3.2 Presenting metadata
3.3 Searching
Types of query
Case-folding and stemming
Phrase searching
Different query interfaces
3.4 Browsing
Browsing alphabetical lists
Ordering lists of words in Chinese
Browsing by date
Hierarchical classification structures
3.5 Phrase browsing
A phrase browsing interface
Key phrases
3.6 Browsing using extracted metadata
Acronyms
Language identification
3.7 Notes and sources
Collections
Metadata
Searching
Browsing
4. Documents: The raw material
4.1 Representing characters
Unicode
The Unicode character set
Composite and combining characters
Unicode character encodings
Hindi and related scripts
Using Unicode in a digital library
4.2 Representing documents
Plain text
Indexing
Word segmentation
4.3 Page description languages: PostScript and PDF
PostScript
Fonts
Text extraction
Using PostScript in a digital library
Portable Document Format: PDF
PDF and PostScript
4.4 Word-processor documents
Rich Text Format
Native Word formats
LaTeX format
4.5 Representing images
Lossless image compression: GIF and PNG
Lossy image compression: JPEG
Progressive refinement
4.6 Representing audio and video
Multimedia compression: MPEG
MPEG video
MPEG audio
Mixing media
Other multimedia formats
Using multimedia in a digital library
4.7 Notes and sources
5. Markup and metadata: Elements of organization
5.1 Hypertext markup language: HTML
Basic HTML
Using HTML in a digital library
5.2 Extensible markup language: XML
Development of markup and stylesheet languages
The XML metalanguage
Parsing XML
Using XML in a digital library
5.3 Presenting marked-up documents
Cascading style sheets: CSS
Extensible stylesheet language: XSL
5.4 Bibliographic metadata
MARC
Dublin Core
BibTeX
Refer
5.5 Metadata for images and multimedia
Image metadata: TIFF
Multimedia metadata: MPEG-7
5.6 Extracting metadata
Extracting document metadata
Generic entity extraction
Bibliographic references
Language identification
Acronym extraction
Key-phrase extraction
Phrase hierarchies
5.7 Notes and sources
6. Construction: Building collections with Greenstone
6.1 Why Greenstone?
What it does
How to use it
6.2 Using the Collector
Creating a new collection
Working with existing collections
Document formats
6.3 Building collections manually: A walkthrough
Getting started
Making a framework for the collection
Importing the documents
Building the indexes
Installing the collection
6.4 Importing and building
Files and directories
Object identifiers
Plug-ins
The import process
The build process
6.5 Greenstone archive documents
Document metadata
Inside the documents
6.6 Collection configuration file
Default configuration file
Subcollections and supercollections
6.7 Getting the most out of your documents
Plug-ins
Classifiers
Format statements
6.8 Building collections graphically
6.9 Notes and sources
7. Delivery: How Greenstone works
7.1 Processes and protocols
Processes
The null protocol implementation
The Corba protocol implementation
7.2 Preliminaries
The macro language
The collection information database
7.3 Responding to user requests
Performing a search
Retrieving a document
Browsing a hierarchical classifier
Generating the home page
Using the protocol
Actions
7.4 Operational aspects
Configuring the receptionist
Configuring the site
7.5 Notes and sources
8. Interoperability: Standards and protocols
8.1 More markup
Names
Links
Types
8.2 Resource description
Collection-level metadata
8.3 Document exchange
Open eBook
8.4 Query languages
Common command language
XML Query
8.5 Protocols
Z39.50
Supporting the Z39.50 protocol
The Open Archives Initiative
Supporting the OAI protocol
8.6 Research protocols
Dienst
Simple digital library interoperability protocol
Translating between protocols
Discussion
8.7 Notes and sources
9. Visions: Future, past, and present
9.1 Libraries of the future
Today’s visions
Tomorrow’s visions
Working inside the digital library
9.2 Preserving the past
The problem of preservation
A tale of preservation in the digital era
The digital dark ages
Preservation strategies
9.3 Generalized documents: A challenge for the present
Digital libraries of music
Other media
Generalized documents in Greenstone
Digital libraries for oral cultures
9.4 Notes and sources
Appendix: Installing and operating Greenstone
Glossary
References
Index
About the authors
Figures
Figure 1.1 Kataayi’s information and communication center.
Figure 1.2 The Zia Pueblo village.
Figure 1.3 The New York Public Library.
Figure 1.4 Rubbing from a stele in Xi’an.
Figure 1.5 A page of the original Trinity College Library catalog.
Figure 1.6 The Bibliothèque Nationale de France.
Figure 1.7 Artist’s conception of the Memex, Bush’s automated library.
Figure 1.8 Part of a page from the Book of Kells.
Figure 1.9 Pages from a palm-leaf manuscript in Thanjavur, India.
Figure 1.10 Māori toki or ceremonial adze, emblem of the Greenstone project.
Figure 2.1 Scanning and optical character recognition.
Figure 2.2 (a) Document image containing different types of data; (b) the document image segmented into different regions.
Figure 2.3 (a) Double-page spread of a Māori newspaper; (b) enlarged version; (c) OCR text.
Figure 3.1 Finding a quotation in Alice’s Adventures in Wonderland.
Figure 3.2 Different-looking digital libraries: (a) Kids’ Digital Library; (b) School Journal Digital Library.
Figure 3.3 Village-Level Brickmaking: (a) the book; (b) the chapter on Moulding; (c, d) some of the pages.
Figure 3.4 Alice’s Adventures in Wonderland.
Figure 3.5 A story from the School Journal collection: (a) “Never Shout at a Draft Horse!”; (b) with search term highlighted (mock-up).
Figure 3.6 A historic Māori newspaper: (a) page image; (b) extracted text.
Figure 3.7 Listening to a tape from the Oral History collection.
Figure 3.8 Finding Auld Lang Syne in a digital music library.
Figure 3.9 Foreign-language collections: (a) French; (b) Portuguese interface to an English collection.
Figure 3.10 Documents from two Chinese collections: (a) rubbings of Tang poetry; (b) classic literature.
Figure 3.11 An Arabic collection: (a) a document; (b) searching.
Figure 3.12 Bibliography display.
Figure 3.13 Metadata examples: (a) bibliography record retrieved from the Library of Congress; (b) description of a BBC television program.
Figure 3.14 Searching for a quotation: (a) query page; (b) query response.
Figure 3.15 Choosing search preferences.
Figure 3.16 Large-query search interface.
Figure 3.17 Query with history.
Figure 3.18 Form search: (a) simple; (b) advanced.
Figure 3.19 Browsing an alphabetical list of titles: (a) plain list; (b) with A–Z tags.
Figure 3.20 Browsing a list of titles in Chinese: (a) stroke-based browsing; (b) Pinyin browsing.
Figure 3.21 Browsing by date.
Figure 3.22 Browsing a classification hierarchy: (a) the beginning; (b) expanding Sustainable development; (c) expanding Organizations, institutions.
Figure 3.23 (a) Browsing for information about locusts; (b) expanding on desert locust; (c) document about desert locusts.
Figure 3.24 (a) Browsing for information on poisson; (b) INFOPECHE Web page.
Figure 3.25 Browsing interfaces based on key phrases: (a) hierarchical browser; (b) document explorer.
Figure 3.26 Browsing based on information mined from the document collection: (a) acronyms; (b) language identification.
Figure 4.1 Unicode excerpt: Basic Latin and Latin-1 Supplement (U+0000–U+00FF).
Figure 4.2 Unicode excerpts: (a) Latin Extended A (U+0100–U+017F); (b) Cyrillic (U+0400–U+045F).