Taming Text
HOW TO FIND, ORGANIZE, AND MANIPULATE IT
GRANT S. INGERSOLL
THOMAS S. MORTON
ANDREW L. FARRIS
MANNING
SHELTER ISLAND
For online information and ordering of this and other Manning books, please visit
www.manning.com. The publisher offers discounts on this book when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email: [email protected]
©2013 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by means electronic, mechanical, photocopying, or otherwise, without prior written
permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964

Development editor: Jeff Bleiel
Technical proofreader: Steven Rowe
Copyeditor: Benjamin Berg
Proofreader: Katie Tennant
Typesetter: Dottie Marsico
Cover designer: Marija Tudor
ISBN 9781933988382
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 18 17 16 15 14 13
brief contents
1 ■ Getting started taming text 1
2 ■ Foundations of taming text 16
3 ■ Searching 37
4 ■ Fuzzy string matching 84
5 ■ Identifying people, places, and things 115
6 ■ Clustering text 140
7 ■ Classification, categorization, and tagging 175
8 ■ Building an example question answering system 240
9 ■ Untamed text: exploring the next frontier 260
contents
foreword xiii
preface xiv
acknowledgments xvii
about this book xix
about the cover illustration xxii
1 Getting started taming text 1
1.1 Why taming text is important 2
1.2 Preview: A fact-based question answering system 4
Hello, Dr. Frankenstein 5
1.3 Understanding text is hard 8
1.4 Text, tamed 10
1.5 Text and the intelligent app: search and beyond 11
Searching and matching 12 ■ Extracting information 13
Grouping information 13 ■ An intelligent application 14
1.6 Summary 14
1.7 Resources 14
2 Foundations of taming text 16
2.1 Foundations of language 17
Words and their categories 18 ■ Phrases and clauses 19
Morphology 20
2.2 Common tools for text processing 21
String manipulation tools 21 ■ Tokens and tokenization 22
Part of speech assignment 24 ■ Stemming 25 ■ Sentence
detection 27 ■ Parsing and grammar 28 ■ Sequence
modeling 30
2.3 Preprocessing and extracting content from common file
formats 31
The importance of preprocessing 31 ■ Extracting content using
Apache Tika 33
2.4 Summary 36
2.5 Resources 36
3 Searching 37
3.1 Search and faceting example: Amazon.com 38
3.2 Introduction to search concepts 40
Indexing content 41 ■ User input 43 ■ Ranking documents
with the vector space model 46 ■ Results display 49
3.3 Introducing the Apache Solr search server 52
Running Solr for the first time 52 ■ Understanding Solr
concepts 54
3.4 Indexing content with Apache Solr 57
Indexing using XML 58 ■ Extracting and indexing content
using Solr and Apache Tika 59
3.5 Searching content with Apache Solr 63
Solr query input parameters 64 ■ Faceting on extracted
content 67
3.6 Understanding search performance factors 69
Judging quality 69 ■ Judging quantity 73
3.7 Improving search performance 74
Hardware improvements 74 ■ Analysis improvements 75
Query performance improvements 76 ■ Alternative scoring
models 79 ■ Techniques for improving Solr performance 80
3.8 Search alternatives 82
3.9 Summary 83
3.10 Resources 83
4 Fuzzy string matching 84
4.1 Approaches to fuzzy string matching 86
Character overlap measures 86 ■ Edit distance measures 89
N-gram edit distance 92
4.2 Finding fuzzy string matches 94
Using prefixes for matching with Solr 94 ■ Using a trie for
prefix matching 95 ■ Using n-grams for matching 99
4.3 Building fuzzy string matching applications 100
Adding type-ahead to search 101 ■ Query spell-checking for
search 105 ■ Record matching 109
4.4 Summary 114
4.5 Resources 114
5 Identifying people, places, and things 115
5.1 Approaches to named-entity recognition 117
Using rules to identify names 117 ■ Using statistical
classifiers to identify names 118
5.2 Basic entity identification with OpenNLP 119
Finding names with OpenNLP 120 ■ Interpreting names
identified by OpenNLP 121 ■ Filtering names based on
probability 122
5.3 In-depth entity identification with OpenNLP 123
Identifying multiple entity types with OpenNLP 123
Under the hood: how OpenNLP identifies names 126
5.4 Performance of OpenNLP 128
Quality of results 129 ■ Runtime performance 130
Memory usage in OpenNLP 131
5.5 Customizing OpenNLP entity identification
for a new domain 132
The whys and hows of training a model 132 ■ Training
an OpenNLP model 133 ■ Altering modeling inputs 134
A new way to model names 136
5.6 Summary 138
5.7 Further reading 139
6 Clustering text 140
6.1 Google News document clustering 141
6.2 Clustering foundations 142
Three types of text to cluster 142 ■ Choosing a clustering
algorithm 144 ■ Determining similarity 145 ■ Labeling the
results 146 ■ How to evaluate clustering results 147
6.3 Setting up a simple clustering application 149
6.4 Clustering search results using Carrot2 149
Using the Carrot2 API 150 ■ Clustering Solr search results
using Carrot2 151
6.5 Clustering document collections with Apache
Mahout 154
Preparing the data for clustering 155 ■ K-Means
clustering 158
6.6 Topic modeling using Apache Mahout 162
6.7 Examining clustering performance 164
Feature selection and reduction 164 ■ Carrot2 performance
and quality 167 ■ Mahout clustering benchmarks 168
6.8 Acknowledgments 172
6.9 Summary 173
6.10 References 173
7 Classification, categorization, and tagging 175
7.1 Introduction to classification and categorization 177
7.2 The classification process 180
Choosing a classification scheme 181 ■ Identifying features
for text categorization 182 ■ The importance of training
data 183 ■ Evaluating classifier performance 186
Deploying a classifier into production 188
7.3 Building document categorizers using Apache
Lucene 189
Categorizing text with Lucene 189 ■ Preparing the training
data for the MoreLikeThis categorizer 191 ■ Training the
MoreLikeThis categorizer 193 ■ Categorizing documents
with the MoreLikeThis categorizer 197 ■ Testing the
MoreLikeThis categorizer 199 ■ MoreLikeThis in
production 201
7.4 Training a naive Bayes classifier using Apache
Mahout 202
Categorizing text using naive Bayes classification 202
Preparing the training data 204 ■ Withholding test data 207
Training the classifier 208 ■ Testing the classifier 209
Improving the bootstrapping process 210 ■ Integrating the
Mahout Bayes classifier with Solr 212
7.5 Categorizing documents with OpenNLP 215
Regression models and maximum entropy document
categorization 216 ■ Preparing training data for the maximum
entropy document categorizer 219 ■ Training the maximum
entropy document categorizer 220 ■ Testing the maximum entropy
document classifier 224 ■ Maximum entropy document
categorization in production 225
7.6 Building a tag recommender using Apache Solr 227
Collecting training data for tag recommendations 229
Preparing the training data 231 ■ Training the Solr tag
recommender 232 ■ Creating tag recommendations 234
Evaluating the tag recommender 236
7.7 Summary 238
7.8 References 239
8 Building an example question answering system 240
8.1 Basics of a question answering system 242
8.2 Installing and running the QA code 243
8.3 A sample question answering architecture 245
8.4 Understanding questions and producing answers 248
Training the answer type classifier 248 ■ Chunking the
query 251 ■ Computing the answer type 252 ■ Generating the
query 255 ■ Ranking candidate passages 256
8.5 Steps to improve the system 258
8.6 Summary 259
8.7 Resources 259
9 Untamed text: exploring the next frontier 260
9.1 Semantics, discourse, and pragmatics:
exploring higher levels of NLP 261
Semantics 262 ■ Discourse 263 ■ Pragmatics 264
9.2 Document and collection summarization 266
9.3 Relationship extraction 268
Overview of approaches 270 ■ Evaluation 272 ■ Tools for
relationship extraction 273
9.4 Identifying important content and people 273
Global importance and authoritativeness 274 ■ Personal
importance 275 ■ Resources and pointers on importance 275
9.5 Detecting emotions via sentiment analysis 276
History and review 276 ■ Tools and data needs 278 ■ A basic
polarity algorithm 279 ■ Advanced topics 280 ■ Open source
libraries for sentiment analysis 281
9.6 Cross-language information retrieval 282
9.7 Summary 284
9.8 References 284
index 287
foreword
At a time when the demand for high-quality text processing capabilities continues to
grow at an exponential rate, it’s difficult to think of any sector or business that doesn’t
rely on some type of textual information. The burgeoning web-based economy has
dramatically and swiftly increased this reliance. Simultaneously, the need for talented
technical experts is increasing at a fast pace. Into this environment comes an excellent, very pragmatic book, Taming Text, offering substantive, real-world, tested guidance and instruction.
Grant Ingersoll and Drew Farris, two excellent and highly experienced software
engineers with whom I’ve worked for many years, and Tom Morton, a well-respected
contributor to the natural language processing field, provide a realistic course for
guiding other technical folks who have an interest in joining the highly recruited coterie of text processors, a.k.a. natural language processing (NLP) engineers.
In an approach that equates with what I think of as “learning for the world, in the
world,” Grant, Drew, and Tom take the mystery out of what are, in truth, very complex
processes. They do this by focusing on existing tools, implemented examples, and
well-tested code, versus taking you through the longer path followed in semester-long
NLP courses.
As software engineers, you have the basics that will enable you to latch onto the
examples, the code bases, and the open source tools here referenced, and become true
experts, ready for real-world opportunities, more quickly than you might expect.
LIZ LIDDY
DEAN, ISCHOOL
SYRACUSE UNIVERSITY
preface
Life is full of serendipitous moments, few of which stand out for me (Grant) like the
one that now defines my career. It was the late 90s, and I was a young software developer working on distributed electromagnetics simulations when I happened on an ad
for a developer position at a small company in Syracuse, New York, called TextWise.
Reading the description, I barely thought I was qualified for the job, but decided to
take a chance anyway and sent in my resume. Somehow, I landed the job, and thus
began my career in search and natural language processing. Little did I know that, all
these years later, I would still be doing search and NLP, never mind writing a book on
those subjects.
My first task back then was to work on a cross-language information retrieval
(CLIR) system that allowed users to enter queries in English and find and automatically translate documents in French, Spanish, and Japanese. In retrospect, that first
system I worked on touched on all the hard problems I’ve come to love about working
with text: search, classification, information extraction, machine translation, and all
those peculiar rules about languages that drive every grammar student crazy. After
that first project, I’ve worked on a variety of search and NLP systems, ranging from
rule-based classifiers to question answering (QA) systems. Then, in 2004, a new job at
the Center for Natural Language Processing led me to the use of Apache Lucene, the
de facto open source search library (these days, anyway). I once again found myself
writing a CLIR system, this time to work with English and Arabic. Needing some
Lucene features to complete my task, I started putting up patches for features and bug
fixes. Sometime thereafter, I became a committer. From there, the floodgates opened.
I got more involved in open source, starting the Apache Mahout machine learning