
Machine Learning
by John Paul Mueller
and Luca Massaron
Machine Learning For Dummies®
Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com
Copyright © 2016 by John Wiley & Sons, Inc., Hoboken, New Jersey
Media and software compilation copyright © 2016 by John Wiley & Sons, Inc. All rights reserved.
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections
107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to
the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River
Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related
trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written
permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not
associated with any product or vendor mentioned in this book.
LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO
REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS
OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION
WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY
SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE
SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS
NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL
ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT.
NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT
THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL
SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE
INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE.
FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE
CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.
For general information on our other products and services, please contact our Customer Care Department within
the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit
www.wiley.com/techsupport.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with
standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to
media such as a CD or DVD that is not included in the version you purchased, you may download this material at
http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2016940023
ISBN: 978-1-119-24551-3
ISBN: 978-1-119-24577-3 (ebk); ISBN: 978-1-119-24575-9 (ePDF)
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1
Contents at a Glance
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Part 1: Introducing How Machines Learn . . . . . . . . . . . . . . . . . . . . . 7
CHAPTER 1: Getting the Real Story about AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
CHAPTER 2: Learning in the Age of Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
CHAPTER 3: Having a Glance at the Future . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Part 2: Preparing Your Learning Tools . . . . . . . . . . . . . . . . . . . . . . . . 45
CHAPTER 4: Installing an R Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
CHAPTER 5: Coding in R Using RStudio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
CHAPTER 6: Installing a Python Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
CHAPTER 7: Coding in Python Using Anaconda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
CHAPTER 8: Exploring Other Machine Learning Tools . . . . . . . . . . . . . . . . . . . . . . . . 137
Part 3: Getting Started with the Math Basics . . . . . . . . . . . . . . 145
CHAPTER 9: Demystifying the Math Behind Machine Learning . . . . . . . . . . . . . . . . 147
CHAPTER 10: Descending the Right Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
CHAPTER 11: Validating Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
CHAPTER 12: Starting with Simple Learners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Part 4: Learning from Smart and Big Data . . . . . . . . . . . . . . . . . 217
CHAPTER 13: Preprocessing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
CHAPTER 14: Leveraging Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
CHAPTER 15: Working with Linear Models the Easy Way . . . . . . . . . . . . . . . . . . . . . . . 257
CHAPTER 16: Hitting Complexity with Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 279
CHAPTER 17: Going a Step beyond Using Support Vector Machines . . . . . . . . . . . . . 297
CHAPTER 18: Resorting to Ensembles of Learners . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
Part 5: Applying Learning to Real Problems . . . . . . . . . . . . . . . 331
CHAPTER 19: Classifying Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
CHAPTER 20: Scoring Opinions and Sentiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
CHAPTER 21: Recommending Products and Movies . . . . . . . . . . . . . . . . . . . . . . . . . . 369
Part 6: The Part of Tens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
CHAPTER 22: Ten Machine Learning Packages to Master . . . . . . . . . . . . . . . . . . . . . . . 385
CHAPTER 23: Ten Ways to Improve Your Machine Learning Models . . . . . . . . . . . . . 391
INDEX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
Table of Contents
INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
About This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Foolish Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Icons Used in This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Beyond the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Where to Go from Here . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
PART 1: INTRODUCING HOW MACHINES LEARN . . . . . . . . . . . 7
CHAPTER 1: Getting the Real Story about AI . . . . . . . . . . . . . . . . . . . . . . . . 9
Moving beyond the Hype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Dreaming of Electric Sheep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Understanding the history of AI and machine learning . . . . . . . . . 12
Exploring what machine learning can do for AI . . . . . . . . . . . . . . . 13
Considering the goals of machine learning . . . . . . . . . . . . . . . . . . . 13
Defining machine learning limits based on hardware . . . . . . . . . . 14
Overcoming AI Fantasies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Discovering the fad uses of AI and machine learning . . . . . . . . . . 16
Considering the true uses of AI and machine learning . . . . . . . . . 16
Being useful; being mundane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Considering the Relationship between AI and Machine Learning . . . 19
Considering AI and Machine Learning Specifications . . . . . . . . . . . . . 20
Defining the Divide between Art and Engineering . . . . . . . . . . . . . . . . 20
CHAPTER 2: Learning in the Age of Big Data . . . . . . . . . . . . . . . . . . . . . . . 23
Defining Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Considering the Sources of Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Building a new data source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Using existing data sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Locating test data sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Specifying the Role of Statistics in Machine Learning . . . . . . . . . . . . . 29
Understanding the Role of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Defining what algorithms do . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Considering the five main techniques . . . . . . . . . . . . . . . . . . . . . . . 30
Defining What Training Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
CHAPTER 3: Having a Glance at the Future . . . . . . . . . . . . . . . . . . . . . . . . 35
Creating Useful Technologies for the Future . . . . . . . . . . . . . . . . . . . . 36
Considering the role of machine learning in robots . . . . . . . . . . . . . 36
Using machine learning in health care . . . . . . . . . . . . . . . . . . . . . . . 37
Creating smart systems for various needs . . . . . . . . . . . . . . . . . . . 37
Using machine learning in industrial settings . . . . . . . . . . . . . . . . . 38
Understanding the role of updated processors
and other hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Discovering the New Work Opportunities with
Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Working for a machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Working with machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Repairing machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Creating new machine learning tasks . . . . . . . . . . . . . . . . . . . . . . 42
Devising new machine learning environments . . . . . . . . . . . . . . . . 42
Avoiding the Potential Pitfalls of Future Technologies . . . . . . . . . . . . 43
PART 2: PREPARING YOUR LEARNING TOOLS . . . . . . . . . . . . . . 45
CHAPTER 4: Installing an R Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Choosing an R Distribution with Machine Learning in Mind . . . . . . . . 48
Installing R on Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Installing R on Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Installing R on Mac OS X . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Downloading the Datasets and Example Code . . . . . . . . . . . . . . . . . . . 59
Understanding the datasets used in this book . . . . . . . . . . . . . . . . 59
Defining the code repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
CHAPTER 5: Coding in R Using RStudio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Understanding the Basic Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Working with Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Organizing Data Using Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Working with Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Creating a basic matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Changing the vector arrangement . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Accessing individual elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Naming the rows and columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Interacting with Multiple Dimensions Using Arrays . . . . . . . . . . . . . . . 71
Creating a basic array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Naming the rows and columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Creating a Data Frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Understanding factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Creating a basic data frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Interacting with data frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Expanding a data frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Performing Basic Statistical Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Making decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Working with loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Performing looped tasks without loops . . . . . . . . . . . . . . . . . . . . . . 84
Working with functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Finding mean and median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Charting your data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
CHAPTER 6: Installing a Python Distribution . . . . . . . . . . . . . . . . . . . . . . 89
Choosing a Python Distribution with Machine Learning in Mind . . . . 90
Getting Continuum Analytics Anaconda . . . . . . . . . . . . . . . . . . . . . 91
Getting Enthought Canopy Express . . . . . . . . . . . . . . . . . . . . . . . . . 92
Getting pythonxy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Getting WinPython . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Installing Python on Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Installing Python on Mac OS X . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Installing Python on Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Downloading the Datasets and Example Code . . . . . . . . . . . . . . . . . . . 99
Using Jupyter Notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Defining the code repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Understanding the datasets used in this book . . . . . . . . . . . . . . . 106
CHAPTER 7: Coding in Python Using Anaconda . . . . . . . . . . . . . . . . . . 109
Working with Numbers and Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Performing variable assignments . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Doing arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Comparing data using Boolean expressions . . . . . . . . . . . . . . . . . 115
Creating and Using Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Interacting with Dates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Creating and Using Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Creating reusable functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Calling functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Working with global and local variables . . . . . . . . . . . . . . . . . . . . . 123
Using Conditional and Loop Statements . . . . . . . . . . . . . . . . . . . . . . . 124
Making decisions using the if statement . . . . . . . . . . . . . . . . . . . . 124
Choosing between multiple options using nested decisions . . . 125
Performing repetitive tasks using for . . . . . . . . . . . . . . . . . . . . . . . 126
Using the while statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Storing Data Using Sets, Lists, and Tuples . . . . . . . . . . . . . . . . . . . . . . 128
Creating sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Performing operations on sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Creating lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Creating and using tuples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Defining Useful Iterators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Indexing Data Using Dictionaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Storing Code in Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
CHAPTER 8: Exploring Other Machine Learning Tools . . . . . . . . . 137
Meeting the Precursors SAS, Stata, and SPSS . . . . . . . . . . . . . . . . . . . 138
Learning in Academia with Weka . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Accessing Complex Algorithms Easily Using LIBSVM . . . . . . . . . . . . . 141
Running As Fast As Light with Vowpal Wabbit . . . . . . . . . . . . . . . . . . 142
Visualizing with Knime and RapidMiner . . . . . . . . . . . . . . . . . . . . . . . . 143
Dealing with Massive Data by Using Spark . . . . . . . . . . . . . . . . . . . . . 144
PART 3: GETTING STARTED WITH THE MATH BASICS . . . . 145
CHAPTER 9: Demystifying the Math Behind
Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Working with Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Creating a matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Understanding basic operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Performing matrix multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Glancing at advanced matrix operations . . . . . . . . . . . . . . . . . . . . 155
Using vectorization effectively . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Exploring the World of Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Operating on probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Conditioning chance by Bayes’ theorem . . . . . . . . . . . . . . . . . . . . 160
Describing the Use of Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
CHAPTER 10: Descending the Right Curve . . . . . . . . . . . . . . . . . . . . . . . . . 167
Interpreting Learning As Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 168
Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
The learning process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Exploring Cost Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Descending the Error Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Updating by Mini-Batch and Online . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
CHAPTER 11: Validating Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . 181
Checking Out-of-Sample Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
Looking for generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Getting to Know the Limits of Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Keeping Model Complexity in Mind . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Keeping Solutions Balanced . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Depicting learning curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Training, Validating, and Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Resorting to Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Looking for Alternatives in Validation . . . . . . . . . . . . . . . . . . . . . . 193
Optimizing Cross-Validation Choices . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Exploring the space of hyper-parameters . . . . . . . . . . . . . . . . . . . 195
Avoiding Sample Bias and Leakage Traps . . . . . . . . . . . . . . . . . . . . . . 196
Watching out for snooping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
CHAPTER 12: Starting with Simple Learners . . . . . . . . . . . . . . . . . . . . . . . 199
Discovering the Incredible Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . 200
Falling short of a miracle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Touching the nonseparability limit . . . . . . . . . . . . . . . . . . . . . . . . . 202
Growing Greedy Classification Trees . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Predicting outcomes by splitting data . . . . . . . . . . . . . . . . . . . . . . 204
Pruning overgrown trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
Taking a Probabilistic Turn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Understanding Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Estimating response with Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . 212
PART 4: LEARNING FROM SMART AND BIG DATA . . . . . . . 217
CHAPTER 13: Preprocessing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Gathering and Cleaning Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
Repairing Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
Identifying missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
Choosing the right replacement strategy . . . . . . . . . . . . . . . . . . . 222
Transforming Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Creating Your Own Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Understanding the need to create features . . . . . . . . . . . . . . . . . 227
Creating features automatically . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
Compressing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Delimiting Anomalous Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
CHAPTER 14: Leveraging Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Measuring Similarity between Vectors . . . . . . . . . . . . . . . . . . . . . . . . . 238
Understanding similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
Computing distances for learning . . . . . . . . . . . . . . . . . . . . . . . . . . 239
Using Distances to Locate Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
Checking assumptions and expectations . . . . . . . . . . . . . . . . . . . 241
Inspecting the gears of the algorithm . . . . . . . . . . . . . . . . . . . . . . 243
Tuning the K-Means Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Experimenting K-means reliability . . . . . . . . . . . . . . . . . . . . . . . . . 245
Experimenting with how centroids converge . . . . . . . . . . . . . . . . 247
Searching for Classification by K-Nearest Neighbors . . . . . . . . . . . . . 251
Leveraging the Correct K Parameter . . . . . . . . . . . . . . . . . . . . . . . . . . 252
Understanding the k parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
Experimenting with a flexible algorithm . . . . . . . . . . . . . . . . . . . . 253
CHAPTER 15: Working with Linear Models the Easy Way . . . . . . . 257
Starting to Combine Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Mixing Variables of Different Types . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
Switching to Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
Specifying a binary response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
Handling multiple classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Guessing the Right Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
Defining the outcome of features that don’t work together . . . . 271
Solving overfitting by using selection . . . . . . . . . . . . . . . . . . . . . . . 272
Learning One Example at a Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
Using gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Understanding how SGD is different . . . . . . . . . . . . . . . . . . . . . . . 275
CHAPTER 16: Hitting Complexity with Neural Networks . . . . . . . 279
Learning and Imitating from Nature . . . . . . . . . . . . . . . . . . . . . . . 280
Going forth with feed-forward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Going even deeper down the rabbit hole . . . . . . . . . . . . . . . . . . . 283
Getting Back with Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . 286
Struggling with Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Understanding the problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Opening the black box . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
Introducing Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
CHAPTER 17: Going a Step beyond Using Support
Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Revisiting the Separation Problem: A New Approach . . . . . . . . . . . . 298
Explaining the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
Getting into the math of an SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
Avoiding the pitfalls of nonseparability . . . . . . . . . . . . . . . . . . . . . 302
Applying Nonlinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
Demonstrating the kernel trick by example . . . . . . . . . . . . . . . . . 305
Discovering the different kernels . . . . . . . . . . . . . . . . . . . . . . . . . . 306
Illustrating Hyper-Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
Classifying and Estimating with SVM . . . . . . . . . . . . . . . . . . . . . . . . . . 309
CHAPTER 18: Resorting to Ensembles of Learners . . . . . . . . . . . . . . . . 315
Leveraging Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
Growing a forest of trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
Understanding the importance measures . . . . . . . . . . . . . . . . . . . 321
Working with Almost Random Guesses . . . . . . . . . . . . . . . . . . . . . . . . 324
Bagging predictors with Adaboost . . . . . . . . . . . . . . . . . . . . . . . . . 324
Boosting Smart Predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
Meeting again with gradient descent . . . . . . . . . . . . . . . . . . . . . . . 328
Averaging Different Predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
PART 5: APPLYING LEARNING TO REAL PROBLEMS . . . . . 331
CHAPTER 19: Classifying Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
Working with a Set of Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
Extracting Visual Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
Recognizing Faces Using Eigenfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
Classifying Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
CHAPTER 20: Scoring Opinions and Sentiments . . . . . . . . . . . . . . . . . . 349
Introducing Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . 349
Understanding How Machines Read . . . . . . . . . . . . . . . . . . . . . . . . . . 350
Processing and enhancing text . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
Scraping textual datasets from the web . . . . . . . . . . . . . . . . . . . . 357
Handling problems with raw text . . . . . . . . . . . . . . . . . . . . . . . . . . 360
Using Scoring and Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
Performing classification tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
Analyzing reviews from e-commerce . . . . . . . . . . . . . . . . . . . . . . . 365
CHAPTER 21: Recommending Products and Movies . . . . . . . . . . . . . . . 369
Realizing the Revolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
Downloading Rating Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
Trudging through the MovieLens dataset . . . . . . . . . . . . . . . . . . . 371
Navigating through anonymous web data . . . . . . . . . . . . . . . . . . 373
Encountering the limits of rating data . . . . . . . . . . . . . . . . . . . . . . 374
Leveraging SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
Considering the origins of SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
Understanding the SVD connection . . . . . . . . . . . . . . . . . . . . . . . . 377
Seeing SVD in action . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
PART 6: THE PART OF TENS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
CHAPTER 22: Ten Machine Learning Packages to Master . . . . . . . 385
Cloudera Oryx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
CUDA-Convnet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
ConvNetJS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
e1071 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
gbm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
Gensim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
glmnet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
randomForest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
SciPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
XGBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
CHAPTER 23: Ten Ways to Improve Your Machine
Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
Studying Learning Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
Using Cross-Validation Correctly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
Choosing the Right Error or Score Metric . . . . . . . . . . . . . . . . . . . . . . 394
Searching for the Best Hyper-Parameters . . . . . . . . . . . . . . . . . . . . . . 395
Testing Multiple Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
Averaging Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
Stacking Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
Applying Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
Selecting Features and Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
Looking for More Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
INDEX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
Introduction
The term machine learning has all sorts of meanings attached to it today,
especially after Hollywood’s (and others’) movie studios have gotten into
the picture. Films such as Ex Machina have tantalized the imaginations of
moviegoers the world over and made machine learning into all sorts of things that
it really isn’t. Of course, most of us have to live in the real world, where machine
learning actually does perform an incredible array of tasks that have nothing to do
with androids that can pass the Turing Test (fooling their makers into believing
they’re human). Machine Learning For Dummies provides you with a view of machine
learning in the real world and exposes you to the amazing feats you really can
perform using this technology. Even though the tasks that you perform using
machine learning may seem a bit mundane when compared to the movie version,
by the time you finish this book, you realize that these mundane tasks have the
power to impact the lives of everyone on the planet in nearly every aspect of their
daily lives. In short, machine learning is an incredible technology — just not in
the way that some people have imagined.
About This Book
The main purpose of Machine Learning For Dummies is to help you understand what
machine learning can and can’t do for you today and what it might do for you in
the future. You don’t have to be a computer scientist to use this book, even though
it does contain many coding examples. In fact, you can come from any discipline
that heavily emphasizes math because that’s how this book focuses on machine
learning. Instead of dealing with abstractions, you see the concrete results of
using specific algorithms to interact with big data in particular ways to obtain a
certain, useful result. The emphasis is on useful because machine learning has the
power to perform a wide array of tasks in a manner never seen before.
Part of the emphasis of this book is on using the right tools. This book uses both
Python and R to perform various tasks. These two languages have special features
that make them particularly useful in a machine learning setting. For example,
Python provides access to a huge array of libraries that let you do just about anything you can imagine and more than a few you can’t. Likewise, R provides an ease
of use that few languages can match. Machine Learning For Dummies helps you understand that both languages have their role to play and gives examples of when one
language works a bit better than the other to achieve the goals you have in mind.
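To give you a taste of what a concrete task looks like in practice, the short Python listing that follows trains a simple classifier on the classic Iris dataset and reports its accuracy on held-out data. (This listing is a minimal sketch written for this introduction rather than one of the book's own examples; it assumes the scikit-learn library, which ships with the Anaconda distribution described in Chapter 6.)

# A minimal sketch (not from the book): fit and score a simple classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small, well-known dataset of flower measurements and species labels.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to check how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit a simple linear classifier and score it on the held-out examples.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

Chapter 15 covers linear models like this one in far more depth; the point here is simply that a useful, measurable result takes only a handful of lines.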