Mastering Data Mining with Python – Find patterns hidden in your data

Learn how to create more powerful data mining applications with this comprehensive Python guide to advance data analytics techniques

Megan Squire

BIRMINGHAM - MUMBAI

No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: August 2016

Production reference: 1240816

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78588-995-0

www.packtpub.com

OR05236

Credits

Author

Megan Squire

Reviewers

Sanjeev Jaiswal

Ron Mitsugo Zacharski

Commissioning Editor

Veena Pagare

Acquisition Editor

Lester Frias

Content Development Editor

Mamata Walkar

Technical Editor

Naveenkumar Jain

Copy Editors

Safis Editing

Sneha Singh

Project Coordinator

Shweta H Birwatkar

Proofreader

Safis Editing

Indexer

Pratik Shirodkar

Graphics

Kirk D'Penha

Production Coordinator

Shantanu N. Zagade

Cover Work

Shantanu N. Zagade

About the Author

Megan Squire is a professor of computing sciences at Elon University. Her primary research interest is in collecting, cleaning, and analyzing data about how free and open source software is made. She is one of the leaders of the FLOSSmole.org, FLOSSdata.org, and FLOSSpapers.org projects.

About the Reviewers

Sanjeev Jaiswal is a computer graduate with 7 years of industrial experience. His work involves Perl, Python, and GNU/Linux. He is currently working on projects involving penetration testing, source code review, and security design and implementations.

He is very much interested in web and cloud security. He is also learning NodeJS and cloud security.

Sanjeev loves teaching engineering students and IT professionals. He has been teaching for the last 8 years in his free time. In 2010, he founded Alien Coders (http://www.aliencoders.org), based on the learning-through-sharing principle, for computer science students and IT professionals. It became a huge hit among engineering students in India.

You can follow him on Facebook at http://www.facebook.com/aliencoders, on Twitter at @aliencoders, and on GitHub at https://github.com/jassics.

Sanjeev wrote Instant PageSpeed Optimization and co-authored Learning Django Web Development for Packt Publishing. He has reviewed more than 5 books for Packt and looks forward to more such opportunities.

Ron Mitsugo Zacharski is a computational linguist working in the areas of information extraction and machine learning (zacharski.org). He has a BFA in music from the University of Wisconsin at Milwaukee and a PhD in computer science from the University of Minnesota, and he completed a post doctorate in linguistics at the University of Edinburgh. He authored the free online book A Programmer's Guide to Data Mining: The Ancient Art of the Numerati (www.guidetodatamining.com) and co-edited The Grammar-Pragmatics Interface: Essays in Honor of Jeanette K. Gundel, published by John Benjamins. For the majority of his academic life, he has focused on multilingual natural language processing, particularly with lesser-studied languages. Dr. Zacharski is a Zen monk in the Sōtō School lineage of Soyu Matsuoka. He lives in New Mexico.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser

Table of Contents

Preface

Chapter 1: Expanding Your Data Mining Toolbox
What is data mining?
How do we do data mining?
The Fayyad et al. KDD process
The Han et al. KDD process
The CRISP-DM process
The Six Steps process
Which data mining methodology is the best?
What are the techniques used in data mining?
What techniques are we going to use in this book?
How do we set up our data mining work environment?
Summary

Chapter 2: Association Rule Mining
What are frequent itemsets?
The diapers and beer urban legend
Frequent itemset mining basics
Towards association rules
Support
Confidence
Association rules
An example with data
Added value – fixing a flaw in the plan
Methods for finding frequent itemsets
A project – discovering association rules in software project tags
Summary

Chapter 3: Entity Matching
What is entity matching?
Merging data
Merging datasets vertically
Merging datasets horizontally
Techniques for matching
Attribute-based similarity matching
Be careful of pairwise comparisons
Leverage rare values
Methods for matching attributes
Range-based or distance from target
String edit distance
Hamming distance
Levenshtein distance
Soundex
Leveraging disjoint sets
Context-based similarity matching
Machine learning-based entity matching
Evaluation of entity matching techniques
Efficiency – how long does it take to do the matching?
Effectiveness – how accurate are the matches that we generate?
Usefulness – how practical is the matching procedure to use?
Entity matching project
Difficulties with matching software projects
Two examples
Matching on project names
Matching on people names
Matching on URLs
Matching on topics and description keywords
The dataset
The code
The results
How many entity matches did we find?
How good are the pairs we found?
Summary

Chapter 4: Network Analysis
What is a network?
Measuring a network
Degree of a network
Diameter of a network
Walks, paths, and trails in a network
Components of a network
Centrality of a network
Closeness centrality
Degree centrality
Betweenness centrality
Other measures of centrality
Representing graph data
Adjacency matrix
Edge lists and adjacency lists
Differences between graph data structures
Importing data into a graph structure
Adjacency list format
Edge list format
GEXF and GraphML
GDF
Python pickle
JSON
JSON node and link series
JSON trees
Pajek format
A real project
Exploring the data
Generating the network files
Understanding our data as a network
Generating simple network metrics
Playing with the parameters of a network
Analyzing subgraphs
Analyzing cliques and centrality in the subgraphs
Looking for change over time
Summary

Chapter 5: Sentiment Analysis in Text
What is sentiment analysis?
The basics of sentiment analysis
The structure of an opinion
Document-level and sentence-level analysis
Important features of opinions
Sentiment analysis algorithms
General-purpose data collections
Hu and Liu's sentiment analysis lexicon
SentiWordNet
Vader sentiment
Sentiment mining application
Motivating the project
Data preparation
Data analysis of chat messages
Data analysis of e-mail messages
Summary

Chapter 6: Named Entity Recognition in Text
Why look for named entities?
Techniques for named entity recognition
Tagging parts of speech
Classes of named entities
Building and evaluating NER systems
NER and partial matches
Handling partial matches
Named entity recognition project
A simple NER tool
Apache Board meeting minutes
Django IRC chat
GnuIRC summaries
LKML e-mails
Summary

Chapter 7: Automatic Text Summarization
What is automatic text summarization?
Tools for text summarization
Naive text summarization using NLTK
Text summarization using Gensim
Text summarization using Sumy
Sumy's Luhn summarizer
Sumy's TextRank summarizer
Sumy's LSA summarizer
Sumy's Edmundson summarizer
Summary

Chapter 8: Topic Modeling in Text
What is topic modeling?
Latent Dirichlet Allocation
Gensim for topic modeling
Understanding Gensim LDA topics
Understanding Gensim LDA passes
Applying a Gensim LDA model to new documents
Serializing Gensim LDA objects
Serializing a dictionary
Serializing a corpus
Serializing a model
Gensim LDA for a larger project
Summary

Chapter 9: Mining for Data Anomalies
What are data anomalies?
Missing data
Locating missing data
Zero values
Fixing missing data
Ignore the problem rows
Fix the problem manually
Use a fabricated value
Use a central measure
Use Last Observation Carried Forward
Use a similar value
Use the most likely value
Data errors
Truncated fields
Data type and character set errors
Logic or semantic errors
Outliers
Visual mining for outliers
Statistical detection of outliers
Summary

Index

Preface

Over the past decade, cheaper data storage, faster hardware, and impressive advances in algorithms have combined to pave the way for the rapid ascendance of data science as one of the most important opportunities in computing. While the term data science can include everything from cleaning and storing data to visualizing it in graphs and charts, the area that has seen the most significant gains is the development of intelligent and sophisticated algorithms for analyzing data. Using computers to find the interesting patterns buried within massive amounts of data is called data mining, an area that encompasses elements of database systems, statistics, and machine learning.

Right now there are dozens of great data mining and machine learning books available for software developers to get up to date on all these advances in the field. What most of these books have in common is that they all cover a small set of tried-and-true methods for finding patterns in data: classification, clustering, decision trees, and regression. Of course, all of these are critically important methods for any data miner to know and they are popular because they can be effective. But these same few techniques are not the whole story. Data mining is a rich field encompassing many dozens of techniques to uncover patterns and make predictions. A true master of data mining should have many tools in her toolbox, not just a few. Thus, the mission of this book, Mastering Data Mining with Python, is to introduce some of the lesser-known data mining concepts that are typically only covered in academic textbooks.

This book uses the Python programming language and a project-based approach to introduce diverse and often overlooked data mining concepts, such as association rules, entity matching, network analysis, text mining, and anomaly detection. Each chapter thoroughly illustrates the basics of one particular data mining technique, provides alternatives for evaluating its effectiveness, and then implements the technique using real-world data.

Our focus on real-world data is another feature of this book that sets it apart from many other data mining books. The true test of whether we have mastered a concept is whether we can apply a method to a new, unknown problem. In our case, this means applying each data mining method to a new problem area or a new data set. The emphasis on real data also means that our results may not always be as clean and tidy as results that come from a canned, example data set. For this reason, each chapter includes a discussion of how to critically evaluate the method. Do the results make sense? What do the results mean? How can the results be improved?

So, in many ways, this book picks up where some of the other data mining books leave off. If you want to round out your growing data mining toolbox with a set of interesting but often overlooked techniques, then read on to learn the specific topics we will cover and how they will be applied in each chapter.

What this book covers

Chapter 1, Expanding Your Data Mining Toolbox, gives an introduction to the field of data mining. In this chapter we pay special attention to how data mining relates to similar topics, such as machine learning and data science. We also review many different data mining methodologies, and talk about their various strengths and weaknesses. This foundational knowledge is important as we transition into the remaining chapters of the book, which are much more technique-oriented and focus on the application of specific data mining tools.

Chapter 2, Association Rule Mining, introduces our first data mining tool: mining for co-occurring sets of items, sometimes called frequent itemsets. We extend our understanding of frequent itemset mining to include mining for association rules, and we learn how to evaluate whether the rules we have found are helpful or not. To put our knowledge into practice, at the end of the chapter we implement a small project wherein we find association rules in the keywords chosen to describe a large set of software projects.
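As a rough illustration of what frequent itemsets and association rules look like in practice, here is a minimal sketch. It is not the code from the chapter: the `projects` data, the `min_support` threshold, and the helper names `support` and `confidence` are all assumptions made for this example. It uses the usual definitions, where support is the fraction of transactions containing an itemset and confidence is the support of the whole rule divided by the support of its left-hand side.

```python
from itertools import combinations

# Hypothetical toy data: each set holds the tags describing one software project.
projects = [
    {"python", "web", "django"},
    {"python", "web"},
    {"python", "data"},
    {"java", "web"},
    {"python", "web", "django"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the combined itemset divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Brute-force search for frequent pairs (fine for toy data, too slow for large sets).
min_support = 0.4  # assumed threshold, chosen only for this example
items = sorted(set().union(*projects))
frequent_pairs = [
    pair for pair in combinations(items, 2)
    if support(pair, projects) >= min_support
]

for left, right in frequent_pairs:
    print(left, "->", right, confidence({left}, {right}, projects))
    print(right, "->", left, confidence({right}, {left}, projects))
```

With this toy data, the rule django -> python has a confidence of 1.0 while python -> django has only 0.5, which shows why a rule and its reverse have to be evaluated separately.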

Chapter 3, Entity Matching, focuses on finding matching pairs of data elements that may look slightly different but are actually the same. We learn how to determine whether two items are actually the same thing by using the attributes of the data. At the end of the chapter, we implement an entity matching project where we learn to find the software projects that have moved from one hosting service to another, even after changing their names and other important attributes.
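To make attribute-based matching a little more concrete, the sketch below compares two project records by name using Levenshtein edit distance, one of the string measures listed in the table of contents. It is only an illustration under stated assumptions: the record fields (`name`, `url`), the example records, the 0.8 threshold, and the helper names `name_similarity` and `is_match` are hypothetical and are not taken from the chapter's project.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1.0 means identical names."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def is_match(rec1, rec2, threshold=0.8):
    """Treat two records as the same project if their URLs are identical
    or their names are similar enough (threshold is an assumed value)."""
    same_url = rec1["url"] == rec2["url"]
    return same_url or name_similarity(rec1["name"], rec2["name"]) >= threshold


# Hypothetical records for the same project on two hosting services.
old_host = {"name": "PyTools", "url": "http://example.org/pytools"}
new_host = {"name": "py-tools", "url": "http://example.com/py-tools"}
other = {"name": "jquery", "url": "http://example.net/jquery"}

print(is_match(old_host, new_host))
print(is_match(old_host, other))
```

A single name threshold like this is deliberately naive. In practice, several attributes would likely be combined, and the resulting matches would be checked for accuracy.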
