Introducing Data Science

MANNING

Davy Cielen

Arno D. B. Meysman

Mohamed Ali

Big data, machine learning, and more, using Python tools

For online information and ordering of this and other Manning books, please visit

www.manning.com. The publisher offers discounts on this book when ordered in quantity.

For more information, please contact

Special Sales Department

Manning Publications Co.

20 Baldwin Road

PO Box 761

Shelter Island, NY 11964

Email: orders@manning.com

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in

any form or by means electronic, mechanical, photocopying, or otherwise, without prior written

permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are

claimed as trademarks. Where those designations appear in the book, and Manning

Publications was aware of a trademark claim, the designations have been printed in initial caps

or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have

the books we publish printed on acid-free paper, and we exert our best efforts to that end.

Recognizing also our responsibility to conserve the resources of our planet, Manning books

are printed on paper that is at least 15 percent recycled and processed without the use of

elemental chlorine.

Manning Publications Co.

20 Baldwin Road

PO Box 761

Shelter Island, NY 11964

Development editor: Dan Maharry

Technical development editors: Michael Roberts, Jonathan Thoms

Copyeditor: Katie Petito

Proofreader: Alyson Brener

Technical proofreader: Ravishankar Rajagopalan

Typesetter: Dennis Dalinnik

Cover designer: Marija Tudor

ISBN: 9781633430037

Printed in the United States of America

1 2 3 4 5 6 7 8 9 10 – EBM – 21 20 19 18 17 16

brief contents

1 ■ Data science in a big data world

2 ■ The data science process

3 ■ Machine learning

4 ■ Handling large data on a single computer

5 ■ First steps in big data

6 ■ Join the NoSQL movement

7 ■ The rise of graph databases

8 ■ Text mining and text analytics

9 ■ Data visualization to the end user

contents

preface

acknowledgments

about this book

about the authors

about the cover illustration

1 ■ Data science in a big data world

1.1 Benefits and uses of data science and big data

1.2 Facets of data

Structured data ■ Unstructured data ■ Natural language ■ Machine-generated data ■ Graph-based or network data ■ Audio, image, and video ■ Streaming data

1.3 The data science process

Setting the research goal ■ Retrieving data ■ Data preparation ■ Data exploration ■ Data modeling or model building ■ Presentation and automation

1.4 The big data ecosystem and data science

Distributed file systems ■ Distributed programming framework ■ Data integration framework ■ Machine learning frameworks ■ NoSQL databases ■ Scheduling tools ■ Benchmarking tools ■ System deployment ■ Service programming ■ Security

1.5 An introductory working example of Hadoop

1.6 Summary

2 ■ The data science process

2.1 Overview of the data science process

Don’t be a slave to the process

2.2 Step 1: Defining research goals and creating a project charter

Spend time understanding the goals and context of your research ■ Create a project charter

2.3 Step 2: Retrieving data

Start with data stored within the company ■ Don’t be afraid to shop around ■ Do data quality checks now to prevent problems later

2.4 Step 3: Cleansing, integrating, and transforming data

Cleansing data ■ Correct errors as early as possible ■ Combining data from different data sources ■ Transforming data

2.5 Step 4: Exploratory data analysis

2.6 Step 5: Build the models

Model and variable selection ■ Model execution ■ Model diagnostics and model comparison

2.7 Step 6: Presenting findings and building applications on top of them

2.8 Summary

3 ■ Machine learning

3.1 What is machine learning and why should you care about it?

Applications for machine learning in data science ■ Where machine learning is used in the data science process ■ Python tools used in machine learning

3.2 The modeling process

Engineering features and selecting a model ■ Training your model ■ Validating a model ■ Predicting new observations

3.3 Types of machine learning

Supervised learning ■ Unsupervised learning

3.4 Semi-supervised learning

3.5 Summary

4 ■ Handling large data on a single computer

4.1 The problems you face when handling large data

4.2 General techniques for handling large volumes of data

Choosing the right algorithm ■ Choosing the right data structure ■ Selecting the right tools

4.3 General programming tips for dealing with large data sets

Don’t reinvent the wheel ■ Get the most out of your hardware ■ Reduce your computing needs

4.4 Case study 1: Predicting malicious URLs

Step 1: Defining the research goal ■ Step 2: Acquiring the URL data ■ Step 4: Data exploration ■ Step 5: Model building

4.5 Case study 2: Building a recommender system inside a database

Tools and techniques needed ■ Step 1: Research question ■ Step 3: Data preparation ■ Step 5: Model building ■ Step 6: Presentation and automation

4.6 Summary

5 ■ First steps in big data

5.1 Distributing data storage and processing with frameworks

Hadoop: a framework for storing and processing large data sets ■ Spark: replacing MapReduce for better performance

5.2 Case study: Assessing risk when loaning money

Step 1: The research goal ■ Step 2: Data retrieval ■ Step 3: Data preparation ■ Step 4: Data exploration & Step 6: Report building

5.3 Summary

6 ■ Join the NoSQL movement

6.1 Introduction to NoSQL

ACID: the core principle of relational databases ■ CAP Theorem: the problem with DBs on many nodes ■ The BASE principles of NoSQL databases ■ NoSQL database types

6.2 Case study: What disease is that?

Step 1: Setting the research goal ■ Steps 2 and 3: Data retrieval and preparation ■ Step 4: Data exploration ■ Step 3 revisited: Data preparation for disease profiling ■ Step 4 revisited: Data exploration for disease profiling ■ Step 6: Presentation and automation

6.3 Summary

7 ■ The rise of graph databases

7.1 Introducing connected data and graph databases

Why and when should I use a graph database?

7.2 Introducing Neo4j: a graph database

Cypher: a graph query language

7.3 Connected data example: a recipe recommendation engine

Step 1: Setting the research goal ■ Step 2: Data retrieval ■ Step 3: Data preparation ■ Step 4: Data exploration ■ Step 5: Data modeling ■ Step 6: Presentation

7.4 Summary

8 ■ Text mining and text analytics

8.1 Text mining in the real world

8.2 Text mining techniques

Bag of words ■ Stemming and lemmatization ■ Decision tree classifier

8.3 Case study: Classifying Reddit posts

Meet the Natural Language Toolkit ■ Data science process overview and step 1: The research goal ■ Step 2: Data retrieval ■ Step 3: Data preparation ■ Step 4: Data exploration ■ Step 3 revisited: Data preparation adapted ■ Step 5: Data analysis ■ Step 6: Presentation and automation

8.4 Summary

9 ■ Data visualization to the end user

9.1 Data visualization options

9.2 Crossfilter, the JavaScript MapReduce library

Setting up everything ■ Unleashing Crossfilter to filter the medicine data set

9.3 Creating an interactive dashboard with dc.js

9.4 Dashboard development tools

9.5 Summary

appendix A Setting up Elasticsearch

appendix B Setting up Neo4j

appendix C Installing MySQL server

appendix D Setting up Anaconda with a virtual environment

index

preface

It’s in all of us. Data science is what makes us humans what we are today. No, not the computer-driven data science this book will introduce you to, but the ability of our brains to see connections, draw conclusions from facts, and learn from our past experiences. More so than any other species on the planet, we depend on our brains for survival; we went all-in on these features to earn our place in nature. That strategy has worked out for us so far, and we’re unlikely to change it in the near future.

But our brains can only take us so far when it comes to raw computing. Our biology can’t keep up with the amounts of data we can capture now and with the extent of our curiosity. So we turn to machines to do part of the work for us: to recognize patterns, create connections, and supply us with answers to our numerous questions.

The quest for knowledge is in our genes. Relying on computers to do part of the job for us is not, but it is our destiny.

acknowledgments

A big thank you to everyone at Manning involved in making this book, for guiding us all the way through.

Our thanks also go to Ravishankar Rajagopalan for giving the manuscript a full technical proofread, and to Jonathan Thoms and Michael Roberts for their expert comments. There were many other reviewers who provided invaluable feedback throughout the process: Alvin Raj, Arthur Zubarev, Bill Martschenko, Craig Smith, Filip Pravica, Hamideh Iraj, Heather Campbell, Hector Cuesta, Ian Stirk, Jeff Smith, Joel Kotarski, Jonathan Sharley, Jörn Dinkla, Marius Butuc, Matt R. Cole, Matthew Heck, Meredith Godar, Rob Agle, Scott Chaussee, and Steve Rogers.

First and foremost, I want to thank my wife Filipa for being my inspiration and motivation to beat all difficulties and for always standing beside me throughout my career and the writing of this book. She has provided me the necessary time to pursue my goals and ambition, and shouldered all the burdens of taking care of our little daughter in my absence. I dedicate this book to her and really appreciate all the sacrifices she has made in order to build and maintain our little family.

I also want to thank my daughter Eva, and my son to be born, who give me a great sense of joy and keep me smiling. They are the best gifts that God ever gave to my life and also the best children a dad could hope for: fun, loving, and always a joy to be with.

A special thank you goes to my parents for their support over the years. Without the endless love and encouragement from my family, I would not have been able to finish this book and continue the journey of achieving my goals in life.
