
Predictive Analytics, Data Mining and Big Data
9781137379276_01_prexii.indd i 5/12/2014 8:52:45 PM
Predictive Analytics, Data Mining and Big Data
Myths, Misconceptions and Methods
Steven Finlay
© Steven Finlay 2014
All rights reserved. No reproduction, copy or transmission of this
publication may be made without written permission.
No portion of this publication may be reproduced, copied or transmitted
save with written permission or in accordance with the provisions of the
Copyright, Designs and Patents Act 1988, or under the terms of any licence
permitting limited copying issued by the Copyright Licensing Agency,
Saffron House, 6–10 Kirby Street, London EC1N 8TS.
Any person who does any unauthorized act in relation to this publication
may be liable to criminal prosecution and civil claims for damages.
The author has asserted his right to be identified as the author of this
work in accordance with the Copyright, Designs and Patents Act 1988.
First published 2014 by
PALGRAVE MACMILLAN
Palgrave Macmillan in the UK is an imprint of Macmillan Publishers Limited,
registered in England, company number 785998, of Houndmills, Basingstoke,
Hampshire RG21 6XS.
Palgrave Macmillan in the US is a division of St Martin’s Press LLC,
175 Fifth Avenue, New York, NY 10010.
Palgrave Macmillan is the global academic imprint of the above companies
and has companies and representatives throughout the world.
Palgrave® and Macmillan® are registered trademarks in the United States,
the United Kingdom, Europe and other countries.
This book is printed on paper suitable for recycling and made from fully
managed and sustained forest sources. Logging, pulping and manufacturing
processes are expected to conform to the environmental regulations of the
country of origin.
A catalogue record for this book is available from the British Library.
A catalog record for this book is available from the Library of Congress.
Typeset by MPS Limited, Chennai, India.
Softcover reprint of the hardcover 1st edition 2014 978-1-137-37927-6
ISBN 978-1-349-47868-2 ISBN 978-1-137-37928-3 (eBook)
DOI 10.1057/9781137379283
To Ruby and Samantha
Contents
Figures and Tables
Acknowledgments
1 Introduction
1.1 What are data mining and predictive analytics?
1.2 How good are models at predicting behavior?
1.3 What are the benefits of predictive models?
1.4 Applications of predictive analytics
1.5 Reaping the benefits, avoiding the pitfalls
1.6 What is Big Data?
1.7 How much value does Big Data add?
1.8 The rest of the book
2 Using Predictive Models
2.1 What are your objectives?
2.2 Decision making
2.3 The next challenge
2.4 Discussion
2.5 Override rules (business rules)
3 Analytics, Organization and Culture
3.1 Embedded analytics
3.2 Learning from failure
3.3 A lack of motivation
3.4 A slight misunderstanding
3.5 Predictive, but not precise
3.6 Great expectations
3.7 Understanding cultural resistance to predictive analytics
3.8 The impact of predictive analytics
3.9 Combining model-based predictions and human judgment
4 The Value of Data
4.1 What type of data is predictive of behavior?
4.2 Added value is what’s important
4.3 Where does the data to build predictive models come from?
4.4 The right data at the right time
4.5 How much data do I need to build a predictive model?
5 Ethics and Legislation
5.1 A brief introduction to ethics
5.2 Ethics in practice
5.3 The relevance of ethics in a Big Data world
5.4 Privacy and data ownership
5.5 Data security
5.6 Anonymity
5.7 Decision making
6 Types of Predictive Models
6.1 Linear models
6.2 Decision trees (classification and regression trees)
6.3 (Artificial) neural networks
6.4 Support vector machines (SVMs)
6.5 Clustering
6.6 Expert systems (knowledge-based systems)
6.7 What type of model is best?
6.8 Ensemble (fusion or combination) systems
6.9 How much benefit can I expect to get from using an ensemble?
6.10 The prospects for better types of predictive models in the future
7 The Predictive Analytics Process
7.1 Project initiation
7.2 Project requirements
7.3 Is predictive analytics the right tool for the job?
7.4 Model building and business evaluation
7.5 Implementation
7.6 Monitoring and redevelopment
7.7 How long should a predictive analytics project take?
8 How to Build a Predictive Model
8.1 Exploring the data landscape
8.2 Sampling and shaping the development sample
8.3 Data preparation (data cleaning)
8.4 Creating derived data
8.5 Understanding the data
8.6 Preliminary variable selection (data reduction)
8.7 Pre-processing (data transformation)
8.8 Model construction (modeling)
8.9 Validation
8.10 Selling models into the business
8.11 The rise of the regulator
9 Text Mining and Social Network Analysis
9.1 Text mining
9.2 Using text analytics to create predictor variables
9.3 Within document predictors
9.4 Sentiment analysis
9.5 Across document predictors
9.6 Social network analysis
9.7 Mapping a social network
10 Hardware, Software and All that Jazz
10.1 Relational databases
10.2 Hadoop
10.3 The limitations of Hadoop
10.4 Do I need a Big Data solution to do predictive analytics?
10.5 Software for predictive analytics
Appendix A. Glossary of Terms
Appendix B. Further Sources of Information
Appendix C. Lift Charts and Gain Charts
Notes
Index
Figures and Tables
Figures
1.1 Loan application model
1.2 Score distribution
2.1 A decision tree
3.1 The fraud process
3.2 The culture cycle
5.1 A UK data protection statement
5.2 Risk of ethically questionable usage
6.1 A linear model
6.2 Linear and non-linear relationships
6.3 A linear model using indicator variables (a scorecard)
6.4 A decision tree for grocery spend
6.5 A neuron
6.6 A neural network
6.7 Maximizing the margin
6.8 Clusters
7.1 Process for predictive analytics
7.2 Decision engine
8.1 Model construction process
9.1 Family network
10.1 Sales table
10.2 Many to one relationship
Tables
2.1 Key findings for the test campaign
2.2 Score distribution
2.3 Score distribution for the gross profit model
2.4 Using two models in combination
4.1 Data types
4.2 Data sources
8.1 The final data set
9.1 Class labels for documents
Acknowledgments
First and foremost I would like to thank my wife Samantha and my parents
Paul and Ann for their support, comments and proofreading services. I would
also like to thank my friend Tracy Moore for providing many useful comments
and suggestions on early drafts of the manuscript. Thanks also to the staff of
the Management Science Department at Lancaster University in the UK for
providing access to the university facilities, which proved invaluable to my
writing and research. I am also grateful to the members of the UK Government
Operational Research Service (GORS), and in particular my former colleagues
in Manchester and Liverpool, for many hours spent chewing over the finer
points of predictive analytics, Big Data and life in general during the writing
of the book.
Chapter 1
Introduction
Retailers, banks, governments, social networking sites, credit reference
agencies and telecoms companies, amongst others, hold vast amounts of
information about us. They know where we live, what we spend our money
on, who our friends and family are, our likes and dislikes, our lifestyles and our
opinions. Every year the amount of electronic information about us grows as
we increasingly use internet services, social media and smart devices to move
more and more of our lives into the online environment.
Until the early 2000s the primary source of individual (consumer) data was
the electronic footprints we left behind as we moved through life, such
as credit card transactions, online purchases and requests for insurance
quotations. This information is required to generate bills, keep accounts up to
date, and to provide an audit of the transactions that have occurred between
service providers and their customers. In recent years organizations have
become increasingly interested in the spaces between our transactions and
the paths that led us to the decisions that we made. As we do more things
electronically, information that gives insights about our thought processes and
the influences that led us to engage in one activity rather than another has
become available. A retailer can gain an understanding of why we purchased
their product rather than a rival’s by examining what route we took before we
bought it – what websites did we visit? What other products did we consider?
Which reviews did we consult? Similarly, social media provides all sorts of
information about ourselves (what we think, who we talk to and what we talk
about), and our phones and other devices provide information about where
we are and where we’ve been.
All this information about people is incredibly useful for all sorts of different
reasons, but one application in particular is to predict future behavior. By
using information about people’s lifestyles, movements and past behaviors,
organizations can predict what they are likely to do, when they will do it and
where that activity will occur. They then use these predictions to tailor how they
interact with people. Their reason for doing this is to influence people’s behavior,
in order to maximize the value of the relationships that they have with them.
In this book I explain how predictive analytics is used to forecast what people
are likely to do and how those forecasts are used to decide how to treat
people. If your organization uses predictive analytics; if you are wondering
whether predictive analytics could improve what you do; or if you want
to find out more about how predictive models are constructed and used in
practical real-world environments, then this is the book for you.
1.1 What are data mining and predictive analytics?
By the 1980s many organizations found themselves with customer databases
that had grown to the point where the amount of data they held had become
too large for humans to be able to analyze it on their own. The term “data
mining” was coined to describe a range of automated techniques that could
be applied to interrogate these databases and make inferences about what the
data meant. If you want a concise definition of data mining, then “The analysis
of large and complex data sets” is a good place to start.
Many of the tools used to perform data mining are standard statistical methods
that have been around for decades, such as linear regression and clustering.
However, data mining also includes a wide range of other techniques for
analyzing data that grew out of research into artificial intelligence (machine
learning), evolutionary computing and game theory.
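As a rough illustration of one of the standard statistical methods mentioned above, the following Python sketch fits a simple linear regression by ordinary least squares and uses it to make a prediction. It is not taken from the book: the function name fit_line and the variable names are hypothetical, and the figures are invented purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor.

    Returns (intercept, slope) for the line y = intercept + slope * x.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented example data: number of past purchases per customer
# and the amount each customer spent in the following month.
past_purchases = [1, 2, 3, 4, 5]
next_month_spend = [12.0, 14.0, 16.0, 18.0, 20.0]

intercept, slope = fit_line(past_purchases, next_month_spend)

# Predicted spend for a customer with 6 past purchases
predicted_spend = intercept + slope * 6
print(f"spend = {intercept:.1f} + {slope:.1f} * purchases")
print(f"predicted spend for 6 purchases: {predicted_spend:.1f}")
```

With real customer data the relationship would be far noisier than in this toy example, and a fitted line would only describe the general trend rather than each individual case.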
Data mining is a very broad topic, used for all sorts of things. Detecting
patterns in satellite data, anticipating stock price movements, face recognition
and forecasting traffic congestion are just a few examples of where data
mining is routinely applied. However, the most prolific use of data mining is to
identify relationships in data that give an insight into individual preferences,
and most importantly, what someone is likely to do in a given scenario.
This is important because if an organization knows what someone is likely
to do, then it can tailor its response in order to maximize its own objectives.
For commercial organizations the objective is usually to maximize profit.