Mastering Data Mining with Python – Find patterns hidden in your data
Learn how to create more powerful data mining
applications with this comprehensive Python guide
to advanced data analytics techniques
Megan Squire
BIRMINGHAM - MUMBAI
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the author(s), nor Packt
Publishing, and its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: August 2016
Production reference: 1240816
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-995-0
www.packtpub.com
Credits
Author: Megan Squire
Reviewers: Sanjeev Jaiswal, Ron Mitsugo Zacharski
Commissioning Editor: Veena Pagare
Acquisition Editor: Lester Frias
Content Development Editor: Mamata Walkar
Technical Editor: Naveenkumar Jain
Copy Editors: Safis Editing, Sneha Singh
Project Coordinator: Shweta H Birwatkar
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Graphics: Kirk D'Penha
Production Coordinator: Shantanu N. Zagade
Cover Work: Shantanu N. Zagade
About the Author
Megan Squire is a professor of computing sciences at Elon University.
Her primary research interest is in collecting, cleaning, and analyzing data
about how free and open source software is made. She is one of the leaders
of the FLOSSmole.org, FLOSSdata.org, and FLOSSpapers.org projects.
About the Reviewers
Sanjeev Jaiswal is a computer graduate with 7 years of industrial experience.
His work involves Perl, Python, and GNU/Linux. He is currently working on
projects involving penetration testing, source code review, and security design
and implementation.
He is very much interested in web and cloud security. He is also learning NodeJS
and cloud security.
Sanjeev loves teaching engineering students and IT professionals. He has been
teaching for the last 8 years in his free time. In 2010, he founded Alien Coders
(http://www.aliencoders.org) for computer science students and IT professionals,
based on the principle of learning through sharing; it became a huge hit among
engineering students in India.
You can follow him on Facebook at http://www.facebook.com/aliencoders,
on Twitter at @aliencoders, and on GitHub at https://github.com/jassics.
Sanjeev wrote Instant PageSpeed Optimization and co-authored Learning Django Web
Development for Packt Publishing. He has reviewed more than 5 books for Packt and
looks forward to more such opportunities.
Ron Mitsugo Zacharski is a computational linguist working in the areas of
information extraction and machine learning (zacharski.org). He has a BFA in
music from the University of Wisconsin at Milwaukee and a PhD in computer
science from the University of Minnesota, and he completed a post doctorate
in linguistics at the University of Edinburgh. He authored the free online book
A Programmer's Guide to Data Mining: The Ancient Art of the Numerati
(www.guidetodatamining.com) and co-edited The Grammar-Pragmatics Interface: Essays
in Honor of Jeanette K. Gundel, published by John Benjamins. For the majority of
his academic life, he has focused on multilingual natural language processing,
particularly with lesser-studied languages. Dr. Zacharski is a Zen monk in the
Sōtō School lineage of Soyu Matsuoka. He lives in New Mexico.
www.PacktPub.com
eBooks, discount offers, and more
Did you know that Packt offers eBook versions of every book published, with PDF
and ePub files available? You can upgrade to the eBook version at www.PacktPub.com
and, as a print book customer, you are entitled to a discount on the eBook copy.
Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign
up for a range of free newsletters and receive exclusive discounts and offers on Packt
books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital
book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser
Table of Contents
Preface
Chapter 1: Expanding Your Data Mining Toolbox
What is data mining?
How do we do data mining?
The Fayyad et al. KDD process
The Han et al. KDD process
The CRISP-DM process
The Six Steps process
Which data mining methodology is the best?
What are the techniques used in data mining?
What techniques are we going to use in this book?
How do we set up our data mining work environment?
Summary
Chapter 2: Association Rule Mining
What are frequent itemsets?
The diapers and beer urban legend
Frequent itemset mining basics
Towards association rules
Support
Confidence
Association rules
An example with data
Added value – fixing a flaw in the plan
Methods for finding frequent itemsets
A project – discovering association rules in software project tags
Summary
Chapter 3: Entity Matching
What is entity matching?
Merging data
Merging datasets vertically
Merging datasets horizontally
Techniques for matching
Attribute-based similarity matching
Be careful of pairwise comparisons
Leverage rare values
Methods for matching attributes
Range-based or distance from target
String edit distance
Hamming distance
Levenshtein distance
Soundex
Leveraging disjoint sets
Context-based similarity matching
Machine learning-based entity matching
Evaluation of entity matching techniques
Efficiency – how long does it take to do the matching?
Effectiveness – how accurate are the matches that we generate?
Usefulness – how practical is the matching procedure to use?
Entity matching project
Difficulties with matching software projects
Two examples
Matching on project names
Matching on people names
Matching on URLs
Matching on topics and description keywords
The dataset
The code
The results
How many entity matches did we find?
How good are the pairs we found?
Summary
Chapter 4: Network Analysis
What is a network?
Measuring a network
Degree of a network
Diameter of a network
Walks, paths, and trails in a network
Components of a network
Centrality of a network
Closeness centrality
Degree centrality
Betweenness centrality
Other measures of centrality
Representing graph data
Adjacency matrix
Edge lists and adjacency lists
Differences between graph data structures
Importing data into a graph structure
Adjacency list format
Edge list format
GEXF and GraphML
GDF
Python pickle
JSON
JSON node and link series
JSON trees
Pajek format
A real project
Exploring the data
Generating the network files
Understanding our data as a network
Generating simple network metrics
Playing with the parameters of a network
Analyzing subgraphs
Analyzing cliques and centrality in the subgraphs
Looking for change over time
Summary
Chapter 5: Sentiment Analysis in Text
What is sentiment analysis?
The basics of sentiment analysis
The structure of an opinion
Document-level and sentence-level analysis
Important features of opinions
Sentiment analysis algorithms
General-purpose data collections
Hu and Liu's sentiment analysis lexicon
SentiWordNet
Vader sentiment
Sentiment mining application
Motivating the project
Data preparation
Data analysis of chat messages
Data analysis of e-mail messages
Summary
Chapter 6: Named Entity Recognition in Text
Why look for named entities?
Techniques for named entity recognition
Tagging parts of speech
Classes of named entities
Building and evaluating NER systems
NER and partial matches
Handling partial matches
Named entity recognition project
A simple NER tool
Apache Board meeting minutes
Django IRC chat
GnuIRC summaries
LKML e-mails
Summary
Chapter 7: Automatic Text Summarization
What is automatic text summarization?
Tools for text summarization
Naive text summarization using NLTK
Text summarization using Gensim
Text summarization using Sumy
Sumy's Luhn summarizer
Sumy's TextRank summarizer
Sumy's LSA summarizer
Sumy's Edmundson summarizer
Summary
Chapter 8: Topic Modeling in Text
What is topic modeling?
Latent Dirichlet Allocation
Gensim for topic modeling
Understanding Gensim LDA topics
Understanding Gensim LDA passes
Applying a Gensim LDA model to new documents
Serializing Gensim LDA objects
Serializing a dictionary
Serializing a corpus
Serializing a model
Gensim LDA for a larger project
Summary
Chapter 9: Mining for Data Anomalies
What are data anomalies?
Missing data
Locating missing data
Zero values
Fixing missing data
Ignore the problem rows
Fix the problem manually
Use a fabricated value
Use a central measure
Use Last Observation Carried Forward
Use a similar value
Use the most likely value
Data errors
Truncated fields
Data type and character set errors
Logic or semantic errors
Outliers
Visual mining for outliers
Statistical detection of outliers
Summary
Index
Preface
Over the past decade, cheaper data storage, faster hardware, and impressive
advances in algorithms have combined to pave the way for a rapid ascendance
of data science as one of the most important opportunities in computing. While
the term data science can include everything from cleaning data and storing data
to visualizing it in graphs and charts, the area that has seen the most significant
gains is the development of intelligent and sophisticated algorithms for analyzing data.
Using computers to find the interesting patterns buried within massive amounts of
data is called data mining, an area that encompasses elements of database systems,
statistics, and machine learning.
Right now there are dozens of great data mining and machine learning books
available for software developers to get up to date on all these advances in the
field. What most of these books have in common is that they all cover a small set
of tried-and-true methods for finding patterns in data: classification, clustering,
decision trees, and regression. Of course, all of these are critically important methods
for any data miner to know and they are popular because they can be effective.
But these same few techniques are not the whole story. Data mining is a rich field
encompassing many dozens of techniques to uncover patterns and make predictions.
A true master of data mining should have many tools in her toolbox, not just a few.
Thus, the mission of this book, Mastering Data Mining with Python, is to introduce
some of the lesser-known data mining concepts that are typically only covered in
academic textbooks.
This book uses the Python programming language and a project-based approach to
introduce diverse and often overlooked data mining concepts, such as association
rules, entity matching, network analysis, text mining, and anomaly detection. Each
chapter thoroughly illustrates the basics of one particular data mining technique,
provides alternatives for evaluating its effectiveness, and then implements the
technique using real-world data.
Our focus on real-world data is another feature of this book that sets it apart from
many other data mining books. The true test of whether we have mastered a concept
is whether we can apply a method to a new, unknown problem. In our case, this
means applying each data mining method to a new problem area or a new data set.
The emphasis on real data also means that our results may not always be as clean
and tidy as results that come from a canned, example data set. For this reason, each
chapter includes a discussion of how to critically evaluate the method. Do the
results make sense? What do the results mean? How can the results be improved?
So, in many ways, this book picks up where some of the other data mining books
leave off. If you want to round out your growing data mining toolbox with a set of
interesting but often overlooked techniques, then read on to learn the specific topics
we will cover and how they will be applied in each chapter.
What this book covers
Chapter 1, Expanding Your Data Mining Toolbox, gives an introduction to the field of
data mining. In this chapter we pay special attention to how data mining relates
to similar topics, such as machine learning and data science. We also review many
different data mining methodologies, and talk about their various strengths and
weaknesses. This foundational knowledge is important as we transition into the
remaining chapters of the book, which are much more technique-oriented and focus
on the application of specific data mining tools.
Chapter 2, Association Rule Mining, introduces our first data mining tool: mining
for co-occurring sets of items, sometimes called frequent itemsets. We extend our
understanding of frequent itemset mining to include mining for association rules,
and we learn how to evaluate whether the rules we have found are helpful or not.
To put our knowledge into practice, at the end of the chapter we implement a small
project wherein we find association rules in the keywords chosen to describe a large
set of software projects.
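As a rough preview of the ideas behind Chapter 2, the following minimal sketch computes support and confidence over a handful of made-up project tag sets. The data, the function names, and the 0.4 support threshold are illustrative assumptions, not the dataset or code used in the chapter's project.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tag sets, one per software project (illustrative data only).
projects = [
    {"python", "web", "django"},
    {"python", "web", "flask"},
    {"python", "data"},
    {"java", "web"},
    {"python", "web", "django", "data"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Count how many projects contain each pair of tags.
pair_counts = Counter(
    pair for tags in projects for pair in combinations(sorted(tags), 2)
)

# Keep only the pairs that appear in at least 40% of projects (assumed threshold).
min_support = 0.4
frequent_pairs = {
    pair: count / len(projects)
    for pair, count in pair_counts.items()
    if count / len(projects) >= min_support
}

print(frequent_pairs)
print(confidence({"django"}, {"python"}, projects))
```

Counting every pair by brute force is only workable for tiny examples like this one; it is shown here to make the definitions concrete, not as a method for large datasets.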
Chapter 3, Entity Matching, focuses on finding matching pairs of data elements that
may look slightly different but are actually the same. We learn how to determine
whether two items are actually the same thing by using the attributes of the data. At
the end of the chapter, we implement an entity matching project where we learn to
find the software projects that have moved from one hosting service to another, even
after changing their names and other important attributes.
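To give a flavor of attribute-based matching, here is a minimal sketch that compares project names from two hypothetical hosting services using Levenshtein edit distance. The project names, the helper functions, and the 0.8 similarity threshold are all assumptions made for illustration; the chapter's project draws on more attributes than names alone.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1.0 means identical names."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


# Hypothetical project names on two hosting services (illustrative only).
hosted_a = ["libfoo", "PyParser", "netwatch"]
hosted_b = ["lib-foo", "pyparser2", "imagetool"]

# Assumed threshold: treat names at least 80% similar as candidate matches.
threshold = 0.8
matches = [
    (a, b)
    for a in hosted_a
    for b in hosted_b
    if name_similarity(a, b) >= threshold
]
print(matches)
```

Comparing every name against every other name grows quadratically with the number of projects, and a name-only comparison can both miss renamed projects and pair up unrelated ones, so the candidate matches still need to be evaluated.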