Practical Java
Machine
Learning
Projects with Google Cloud Platform and
Amazon Web Services
—
Mark Wickham
Practical Java Machine Learning: Projects with Google Cloud Platform and
Amazon Web Services
ISBN-13 (pbk): 978-1-4842-3950-6 ISBN-13 (electronic): 978-1-4842-3951-3
https://doi.org/10.1007/978-1-4842-3951-3
Library of Congress Control Number: 2018960994
Copyright © 2018 by Mark Wickham
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with
every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an
editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the
trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not
identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to
proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.
Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Steve Anglin
Development Editor: Matthew Moodie
Coordinating Editor: Mark Powers
Cover designed by eStudioCalamar
Cover image designed by Freepik (www.freepik.com)
Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street,
6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springersbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member
(owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a
Delaware corporation.
For information on translations, please e-mail [email protected]; for reprint, paperback, or audio rights,
please email [email protected].
Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and
licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales
web page at www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the author in this book is available to
readers on GitHub via the book's product page, located at www.apress.com/9781484239506. For more
detailed information, please visit www.apress.com/source-code.
Printed on acid-free paper
Mark Wickham
Irving, TX, USA
Table of Contents
About the Author
About the Technical Reviewer
Preface
Chapter 1: Introduction
1.1 Terminology
1.2 Historical
1.3 Machine Learning Business Case
Machine Learning Hype
Challenges and Concerns
Data Science Platforms
ML Monetization
The Case for Classic Machine Learning on Mobile
1.4 Deep Learning
Identifying DL Applications
1.5 ML-Gates Methodology
ML-Gate 6: Identify the Well-Defined Problem
ML-Gate 5: Acquire Sufficient Data
ML-Gate 4: Process/Clean/Visualize the Data
ML-Gate 3: Generate a Model
ML-Gate 2: Test/Refine the Model
ML-Gate 1: Integrate the Model
ML-Gate 0: Deployment
Methodology Summary
1.6 The Case for Java
Java Market
Java Versions
Installing Java
Java Performance
1.7 Development Environments
Android Studio
Eclipse
NetBeans IDE
1.8 Competitive Advantage
Standing on the Shoulders of Giants
Bridging Domains
1.9 Chapter Summary
Key Findings
Chapter 2: Data: The Fuel for Machine Learning
2.1 Megatrends
Explosion of Data
Highly Scalable Computing Resources
Advancement in Algorithms
2.2 Think Like a Data Scientist
Data Nomenclature
Defining Data
2.3 Data Formats
CSV Files and Apache OpenOffice
ARFF Files
JSON
2.4 JSON Integration
JSON with Android SDK
JSON with Java JDK
2.5 Data Preprocessing
Instances, Attributes, Labels, and Features
Data Type Identification
Missing Values and Duplicates
Erroneous Values and Outliers
Macro Processing with OpenOffice Calc
JSON Validation
2.6 Creating Your Own Data
Wifi Gathering
2.7 Visualization
JavaScript Visualization Libraries
D3 Plus
2.8 Project: D3 Visualization
2.9 Project: Android Data Visualization
2.10 Summary
Key Data Findings
Chapter 3: Leveraging Cloud Platforms
3.1 Introduction
Commercial Cloud Providers
Competitive Positioning
Pricing
3.2 Google Cloud Platform (GCP)
Google Compute Engine (GCE) Virtual Machines (VM)
Google Cloud SDK
Google Cloud Client Libraries
Cloud Tools for Eclipse (CT4E)
GCP Cloud Machine Learning Engine (ML Engine)
GCP Free Tier Pricing Details
3.3 Amazon AWS
AWS Machine Learning
AWS ML Building and Deploying Models
AWS EC2 AMI
Running Weka ML in the AWS Cloud
AWS SageMaker
AWS SDK for Java
AWS Free Tier Pricing Details
3.4 Machine Learning APIs
Using ML REST APIs
Alternative ML API Providers
3.5 Project: GCP Cloud Speech API for Android
Cloud Speech API App Overview
GCP Machine Learning APIs
Cloud Speech API Authentication
Android Audio
Cloud Speech API App Summary
3.6 Cloud Data for Machine Learning
Unstructured Data
NoSQL Databases
NoSQL Data Store Methods
Apache Cassandra Java Interface
3.7 Cloud Platform Summary
Chapter 4: Algorithms: The Brains of Machine Learning
4.1 Introduction
ML-Gate 3
4.2 Algorithm Styles
Labeled vs. Unlabeled Data
4.3 Supervised Learning
4.4 Unsupervised Learning
4.5 Semi-Supervised Learning
4.6 Alternative Learning Styles
Linear Regression Algorithm
Deep Learning Algorithms
Reinforcement Learning
4.7 CML Algorithm Overview
4.8 Choose the Right Algorithm
Functional Algorithm Decision Process
4.9 The Seven Most Useful CML Algorithms
Naive Bayes Algorithm (NB)
Random Forest Algorithm (RF)
K-Nearest Neighbors Algorithm (KNN)
Support Vector Machine Algorithm (SVM)
K-Means Algorithm
DBSCAN Algorithm
Expectation-Maximization (EM) Algorithm
4.10 Algorithm Performance
MNIST Algorithm Evaluation
4.11 Algorithm Analysis
Confusion Matrix
ROC Curves
K-Fold Cross-Validation
4.12 Java Source Code
Classification Algorithms
Clustering Algorithms
Java Algorithm Modification
Chapter 5: Machine Learning Environments
5.1 Overview
ML Gates
5.2 Java ML Environments
Weka
RapidMiner
KNIME
ELKI
Java-ML
5.3 Weka Installation
Weka Configuration
Java Parameters Setup
Modifying Weka .prop Files
Weka Settings
Weka Package Manager
5.4 Weka Overview
Weka Documentation
Weka Explorer
Weka Filters
Weka Explorer Key Options
Weka KnowledgeFlow
Weka Simple CLI
5.5 Weka Clustering Algorithms
Clustering with DBSCAN
Clustering with KnowledgeFlow
5.6 Weka Classification Algorithms
Preprocessing (Data Cleaning)
Classification: Random Forest Algorithm
Classification: K-Nearest Neighbor
Classification: Naive Bayes
Classification: Support Vector Machine
5.7 Weka Model Evaluation
Multiple ROC Curves
5.8 Weka Importing and Exporting
Chapter 6: Integrating Models
6.1 Introduction
6.2 Managing Models
Device Constraints
Optimal Model Size
Model Version Control
Updating Models
Managing Models: Best Practices
6.3 Weka Java API
Loading Data
Working with Options
Applying Filters
Setting the Label Attribute
Building a Classifier
Training and Testing
Building a Clusterer
Loading Models
Making Predictions
6.4 Weka for Android
Creating Android Weka Libraries in Eclipse
Adding the Weka Library in Android Studio
6.5 Android Integration
Project: Weka Model Create
Project: Weka Model Load
6.6 Android Weka Model Performance
6.7 Raspberry Pi Integration
Raspberry Pi Setup for ML
Raspberry Pi GUI Considerations
Weka API Library for Raspberry Pi
Project: Raspberry Pi Old Faithful Geyser Classifier
6.8 Sensor Data
Android Sensors
Raspberry Pi with Sensors
Sensor Units of Measure
Project: Android Activity Tracker
6.9 Weka License Notes
Index
About the Author
Mark Wickham is a frequent speaker at Android developer
conferences and has written two books, Practical Android
and Practical Java Machine Learning. As a freelance Android
developer, Mark currently resides in Dallas, TX after living
and working in China for nearly 20 years. While at Motorola,
Mark led product management, product marketing, and
software development teams in the Asia Pacific region.
Before joining Motorola, Mark worked on software projects
for TRW’s Space Systems Division. Mark has a degree in
Computer Science and Physics from Creighton University, an MBA from the University
of Washington, and jointly studied business at the Hong Kong University of Science
and Technology. In his free time, Mark also enjoys photography and recording live
music. Mark can be contacted via his LinkedIn profile (www.linkedin.com/in/mark-jwickham/) or GitHub page (www.github.com/wickapps).
About the Technical Reviewer
Jason Whitehorn is an experienced entrepreneur and
software developer. He has helped many oil and gas
companies automate and enhance their oilfield solutions
through field data capture, SCADA, and machine learning.
Jason obtained his Bachelor of Science in Computer Science
from Arkansas State University, but he traces his passion
for development back many years before then, having first
taught himself to program BASIC on his family’s computer
while still in middle school.
When he’s not mentoring and helping his team at work,
writing, or pursuing one of his many side projects, Jason enjoys spending time with his
wife and four children and living in the Tulsa, Oklahoma region. More information about
Jason can be found on his website at https://jason.whitehorn.us.
Preface
It is interesting to watch trends in software development come and go, and to watch
languages become fashionable, and then just as quickly fade away. As machine learning
and AI began to reemerge a few years ago, it was easy to look upon the hype with a great
deal of skepticism.
• AlphaGo, a program developed by the UK-based company DeepMind,
used deep learning to defeat the Go masters. Go is a Chinese board
game that is very complicated due to a huge number of combinations.
I was living in China at the time, and there was a lot of discussion
about the panicked Go masters who refused to play the machine for
fear that their techniques would be exposed or "learned" by the
machines.
• An AI Poker Bot named Libratus individually defeated four top
human professional players in 2017. This was surprising because
poker is a difficult game for machines to master. In poker, unlike
Go, there is a lot of unknown information, making it an "imperfect
information" game.
• Machine traders are replacing human traders at many of the large
investment banks. The rise of the "quant" on Wall Street is well
documented. Examining the job opportunities at investment banks
reveals a trend favoring math majors, data scientists, and machine
learning experts.
• IBM's Watson can do amazing things, such as fix the elevator before
it breaks, adjust the sprinkler system in the vineyard to optimize yield,
and help oilfield workers manage a drilling rig.
Despite the hype, it was not until confronted with problems that were very difficult
to solve with existing software tools that I began to explore and appreciate the power of
machine learning techniques.
Today, after several years of gaining an understanding of what these new
techniques can do, and how to apply them, I find myself thinking differently about each
problem I encounter. Almost every piece of software can benefit in some way from
machine learning techniques.
Developing machine learning software requires us to think differently about
problems, resulting in a new way to partition our development efforts. However, change
is good, and using machine learning with a data-driven development methodology can
allow us to solve previously unsolvable problems.
In this book, I will describe what I have discovered along my journey. I hope that it
can help you in your future software endeavors.
Objectives
The book will meet the following objectives:
• Introduce readers to the exciting developments in the AI subfield
of machine learning (ML). The book will summarize the types of
problems machine learning can solve. Without machine learning,
such solutions would be very difficult to accomplish.
• Help readers understand the importance of data as the critical input
for any machine learning solution, and how to identify, organize, and
architect the data required for ML. Strategies and techniques for the
visualization and preprocessing of data will also be covered using
available Java packages. The book will help readers who know Java to
become more proficient in data science.
• Explore how to deploy ML solutions in conjunction with cloud
service providers such as Google and Amazon.
• Focus exclusively on Java libraries and Java-based solutions for
ML. The book will NOT cover other popular ML languages such as
Python or C++.
• Focus on classic machine learning solutions. The book will not cover
implementations for deep learning, which use neural networks. Deep
learning is a topic that requires a complete text of its own for proper
exploration.
• Provide readers an overview of ML algorithms. Rather than cover
these algorithms from a mathematical viewpoint, the book will
present a practical review of the algorithms and explain to readers
which algorithm to select for a particular problem.
• Introduce readers to the most important Java-based ML platforms.
The book will provide a deep dive into the popular Weka Java
environments. The book will show readers how to port the latest
Weka version to Android.
• Java developers have the advantage of easily transitioning to the
Android Mobile platform. The book will show readers how to deploy
ML apps for Android devices using the Weka API.
• One of the fastest growing sources of data is sensor data. Embedded
devices often produce sensor data, enabling a significant opportunity
to deploy ML solutions for these devices. The book will show readers
how to implement ML solutions for sensor data using Java.
Audience
This book is intended for the following audiences:
• Developers looking to implement ML solutions for Java platforms
• Data scientists looking to explore Java implementation options
• Business decision makers looking to explore entry into machine
learning for their organizations
The book will be of most value to experienced Java developers who have not
implemented ML techniques before. The book will explain the various ML techniques
that are now feasible due to recent advances in performance, storage, and algorithms.