Predictive Analytics, Data Mining and Big Data

9781137379276_01_prexii.indd i 5/12/2014 8:52:45 PM

Myths, Misconceptions and Methods

Steven Finlay

All rights reserved. No reproduction, copy or transmission of this publication may be made without written permission.

No portion of this publication may be reproduced, copied or transmitted save with written permission or in accordance with the provisions of the Copyright, Designs and Patents Act 1988, or under the terms of any licence permitting limited copying issued by the Copyright Licensing Agency, Saffron House, 6–10 Kirby Street, London EC1N 8TS.

Any person who does any unauthorized act in relation to this publication may be liable to criminal prosecution and civil claims for damages.

The author has asserted his right to be identified as the author of this work in accordance with the Copyright, Designs and Patents Act 1988.

First published 2014 by PALGRAVE MACMILLAN

Palgrave Macmillan in the UK is an imprint of Macmillan Publishers Limited, registered in England, company number 785998, of Houndmills, Basingstoke, Hampshire RG21 6XS.

Palgrave Macmillan in the US is a division of St Martin’s Press LLC, 175 Fifth Avenue, New York, NY 10010.

Palgrave Macmillan is the global academic imprint of the above companies and has companies and representatives throughout the world.

Palgrave® and Macmillan® are registered trademarks in the United States, the United Kingdom, Europe and other countries.

This book is printed on paper suitable for recycling and made from fully managed and sustained forest sources. Logging, pulping and manufacturing processes are expected to conform to the environmental regulations of the country of origin.

A catalogue record for this book is available from the British Library.

A catalog record for this book is available from the Library of Congress.

Typeset by MPS Limited, Chennai, India.

Softcover reprint of the hardcover 1st edition 2014 978-1-137-37927-6

ISBN 978-1-349-47868-2
ISBN 978-1-137-37928-3 (eBook)

DOI 10.1057/9781137379283

To Ruby and Samantha

Contents

Figures and Tables
Acknowledgments

1 Introduction
1.1 What are data mining and predictive analytics?
1.2 How good are models at predicting behavior?
1.3 What are the benefits of predictive models?
1.4 Applications of predictive analytics
1.5 Reaping the benefits, avoiding the pitfalls
1.6 What is Big Data?
1.7 How much value does Big Data add?
1.8 The rest of the book

2 Using Predictive Models
2.1 What are your objectives?
2.2 Decision making
2.3 The next challenge
2.4 Discussion
2.5 Override rules (business rules)

3 Analytics, Organization and Culture
3.1 Embedded analytics
3.2 Learning from failure
3.3 A lack of motivation
3.4 A slight misunderstanding
3.5 Predictive, but not precise
3.6 Great expectations
3.7 Understanding cultural resistance to predictive analytics
3.8 The impact of predictive analytics
3.9 Combining model-based predictions and human judgment

4 The Value of Data
4.1 What type of data is predictive of behavior?
4.2 Added value is what’s important
4.3 Where does the data to build predictive models come from?
4.4 The right data at the right time
4.5 How much data do I need to build a predictive model?

5 Ethics and Legislation
5.1 A brief introduction to ethics
5.2 Ethics in practice
5.3 The relevance of ethics in a Big Data world
5.4 Privacy and data ownership
5.5 Data security
5.6 Anonymity
5.7 Decision making

6 Types of Predictive Models
6.1 Linear models
6.2 Decision trees (classification and regression trees)
6.3 (Artificial) neural networks
6.4 Support vector machines (SVMs)
6.5 Clustering
6.6 Expert systems (knowledge-based systems)
6.7 What type of model is best?
6.8 Ensemble (fusion or combination) systems
6.9 How much benefit can I expect to get from using an ensemble?
6.10 The prospects for better types of predictive models in the future

7 The Predictive Analytics Process
7.1 Project initiation
7.2 Project requirements
7.3 Is predictive analytics the right tool for the job?
7.4 Model building and business evaluation
7.5 Implementation
7.6 Monitoring and redevelopment
7.7 How long should a predictive analytics project take?

8 How to Build a Predictive Model
8.1 Exploring the data landscape
8.2 Sampling and shaping the development sample
8.3 Data preparation (data cleaning)
8.4 Creating derived data
8.5 Understanding the data
8.6 Preliminary variable selection (data reduction)
8.7 Pre-processing (data transformation)
8.8 Model construction (modeling)
8.9 Validation
8.10 Selling models into the business
8.11 The rise of the regulator

9 Text Mining and Social Network Analysis
9.1 Text mining
9.2 Using text analytics to create predictor variables
9.3 Within document predictors
9.4 Sentiment analysis
9.5 Across document predictors
9.6 Social network analysis
9.7 Mapping a social network

10 Hardware, Software and All that Jazz
10.1 Relational databases
10.2 Hadoop
10.3 The limitations of Hadoop
10.4 Do I need a Big Data solution to do predictive analytics?
10.5 Software for predictive analytics

Appendix A. Glossary of Terms
Appendix B. Further Sources of Information
Appendix C. Lift Charts and Gain Charts
Notes
Index

Figures and Tables

Figures

1.1 Loan application model
1.2 Score distribution
2.1 A decision tree
3.1 The fraud process
3.2 The culture cycle
5.1 A UK data protection statement
5.2 Risk of ethically questionable usage
6.1 A linear model
6.2 Linear and non-linear relationships
6.3 A linear model using indicator variables (a scorecard)
6.4 A decision tree for grocery spend
6.5 A neuron
6.6 A neural network
6.7 Maximizing the margin
6.8 Clusters
7.1 Process for predictive analytics
7.2 Decision engine
8.1 Model construction process
9.1 Family network
10.1 Sales table
10.2 Many to one relationship

Tables

2.1 Key findings for the test campaign
2.2 Score distribution
2.3 Score distribution for the gross profit model
2.4 Using two models in combination
4.1 Data types
4.2 Data sources
8.1 The final data set
9.1 Class labels for documents

Acknowledgments

First and foremost I would like to thank my wife Samantha and my parents Paul and Ann for their support, comments and proofreading services. I would also like to thank my friend Tracy Moore for providing many useful comments and suggestions on early drafts of the manuscript. Thanks also to the staff of the Management Science Department at Lancaster University in the UK for providing access to the university facilities, which proved invaluable to my writing and research. I am also grateful to the members of the UK Government Operational Research Service (GORS), and in particular my former colleagues in Manchester and Liverpool, for many hours spent chewing over the finer points of predictive analytics, Big Data and life in general during the writing of the book.

Chapter 1

Introduction

Retailers, banks, governments, social networking sites, credit reference agencies and telecoms companies, amongst others, hold vast amounts of information about us. They know where we live, what we spend our money on, who our friends and family are, our likes and dislikes, our lifestyles and our opinions. Every year the amount of electronic information about us grows as we increasingly use internet services, social media and smart devices to move more and more of our lives into the online environment.

Until the early 2000s the primary source of individual (consumer) data was the electronic footprints we left behind as we moved through life, such as credit card transactions, online purchases and requests for insurance quotations. This information is required to generate bills, keep accounts up to date and provide an audit of the transactions that have occurred between service providers and their customers. In recent years organizations have become increasingly interested in the spaces between our transactions and the paths that led us to the decisions that we made. As we do more things electronically, information that gives insights about our thought processes and the influences that led us to engage in one activity rather than another has become available. A retailer can gain an understanding of why we purchased their product rather than a rival’s by examining what route we took before we bought it – what websites did we visit? What other products did we consider? Which reviews did we consult? Similarly, social media provides all sorts of information about ourselves (what we think, who we talk to and what we talk about), and our phones and other devices provide information about where we are and where we’ve been.

All this information about people is useful for many different reasons, but one application in particular is predicting future behavior. By using information about people’s lifestyles, movements and past behaviors, organizations can predict what those people are likely to do, when they will do it and where that activity will occur. They then use these predictions to tailor how they interact with people. They do this to influence people’s behavior, in order to maximize the value of the relationships that they have with them.

In this book I explain how predictive analytics is used to forecast what people are likely to do and how those forecasts are used to decide how to treat people. If your organization uses predictive analytics; if you are wondering whether predictive analytics could improve what you do; or if you want to find out more about how predictive models are constructed and used in practical real-world environments, then this is the book for you.

1.1 What are data mining and predictive analytics?

By the 1980s many organizations found themselves with customer databases that had grown to the point where the amount of data they held had become too large for humans to analyze on their own. The term “data mining” was coined to describe a range of automated techniques that could be applied to interrogate these databases and make inferences about what the data meant. If you want a concise definition of data mining, then “The analysis of large and complex data sets” is a good place to start.

Many of the tools used to perform data mining are standard statistical methods that have been around for decades, such as linear regression and clustering. However, data mining also includes a wide range of other techniques for analyzing data that grew out of research into artificial intelligence (machine learning), evolutionary computing and game theory.

Data mining is a very broad topic, used for all sorts of things. Detecting patterns in satellite data, anticipating stock price movements, face recognition and forecasting traffic congestion are just a few examples of where data mining is routinely applied. However, the most widespread use of data mining is to identify relationships in data that give an insight into individual preferences and, most importantly, what someone is likely to do in a given scenario. This is important because if an organization knows what someone is likely to do, then it can tailor its response in order to maximize its own objectives. For commercial organizations the objective is usually to maximize profit.
