Data Warehousing in the Age of Big Data
Krish Krishnan
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann is an imprint of Elsevier
Acquiring Editor: Andrea Dierna
Development Editor: Heather Scherer
Project Manager: Punithavathy Govindaradjane
Designer: Maria Inês Cruz
225 Wyman Street, Waltham, MA, 02451, USA
Copyright © 2013 Elsevier Inc. All rights reserved
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including
photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher.
Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with
organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website:
www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be
noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding,
changes in research methods or professional practices may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information
or methods described herein. In using such information or methods they should be mindful of their own safety and the safety
of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any
injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or
operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
Krishnan, Krish.
Data warehousing in the age of big data / Krish Krishnan.
pages cm
Includes bibliographical references and index.
ISBN 978-0-12-405891-0 (pbk.)
1. Data warehousing. 2. Big data. I. Title.
QA76.9.D37K75 2013
005.74'5—dc23
2013004151
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
Printed and bound in the United States of America
13 14 15 16 17 10 9 8 7 6 5 4 3 2 1
For information on all MK publications visit our website at www.mkp.com
This book is dedicated to
William Harvey Inmon, a dear friend, mentor, teacher, advisor, and business
partner—you are an inspiration for generations to come.
My wonderful wife and our sons, who are my source of motivation and
inspiration—without your unquestioning support, chasing my dreams
would have been dreams.
Contents
Acknowledgments
About the Author
Introduction
PART 1 BIG DATA
CHAPTER 1 Introduction to Big Data
  Introduction
  Big Data
  Defining Big Data
  Why Big Data and why now?
  Big Data example
    Social Media posts
    Survey data analysis
    Survey data
    Weather data
    Twitter data
    Integration and analysis
    Additional data types
  Summary
  Further reading
CHAPTER 2 Working with Big Data
  Introduction
  Data explosion
  Data volume
    Machine data
    Application log
    Clickstream logs
    External or third-party data
    Emails
    Contracts
    Geographic information systems and geo-spatial data
    Example: Funshots, Inc.
  Data velocity
    Amazon, Facebook, Yahoo, and Google
    Sensor data
    Mobile networks
    Social media
  Data variety
  Summary
CHAPTER 3 Big Data Processing Architectures
  Introduction
  Data processing revisited
  Data processing techniques
  Data processing infrastructure challenges
    Storage
    Transportation
    Processing
    Speed or throughput
  Shared-everything and shared-nothing architectures
    Shared-everything architecture
    Shared-nothing architecture
    OLTP versus data warehousing
  Big Data processing
    Infrastructure explained
    Data processing explained
  Telco Big Data study
    Infrastructure
    Data processing
CHAPTER 4 Introducing Big Data Technologies
  Introduction
  Distributed data processing
  Big Data processing requirements
  Technologies for Big Data processing
    Google File System
  Hadoop
    Hadoop core components
    Hadoop summary
  NoSQL
    CAP theorem
    Key-value pair: Voldemort
    Column family store: Cassandra
    Document database: Riak
    Graph databases
    NoSQL summary
  Textual ETL processing
  Further reading
CHAPTER 5 Big Data Driving Business Value
  Introduction
  Case study 1: Sensor data
    Summary
    Vestas
    Overview
    Producing electricity from wind
    Turning climate into capital
    Tackling Big Data challenges
    Maintaining energy efficiency in its data center
  Case study 2: Streaming data
    Summary
    Surveillance and security: TerraEchos
    The need
    The solution
    The benefit
    Advanced fiber optics combine with real-time streaming data
    Solution components
    Extending the security perimeter creates a strategic advantage
    Correlating sensor data delivers a zero false-positive rate
  Case study 3: The right prescription: improving patient outcomes with Big Data analytics
    Summary
    Business objective
    Challenges
    Overview: giving practitioners new insights to guide patient care
    Challenges: blending traditional data warehouse ecosystems with Big Data
    Solution: getting ready for Big Data analytics
    Results: eliminating the “Data Trap”
    Why Aster?
    About Aurora
  Case study 4: University of Ontario Institute of Technology: leveraging key data to provide proactive patient care
    Summary
    Overview
    Business benefits
    Making better use of the data resource
    Smarter healthcare
    Solution components
    Merging human knowledge and technology
    Broadening the impact of Artemis
  Case study 5: Microsoft SQL Server customer solution
    Customer profile
    Solution spotlight
    Business needs
    Solution
    Benefits
  Case study 6: Customer-centric data integration
    Overview
    Solution design
    Enabling a better cross-sell and upsell opportunity
  Summary
PART 2 THE DATA WAREHOUSING
CHAPTER 6 Data Warehousing Revisited
  Introduction
  Traditional data warehousing, or data warehousing 1.0
    Data architecture
    Infrastructure
    Pitfalls of data warehousing
    Architecture approaches to building a data warehouse
  Data warehouse 2.0
    Overview of Inmon’s DW 2.0
    Overview of DSS 2.0
  Summary
  Further reading
CHAPTER 7 Reengineering the Data Warehouse
  Introduction
  Enterprise data warehouse platform
    Transactional systems
    Operational data store
    Staging area
    Data warehouse
    Datamarts
    Analytical databases
    Issues with the data warehouse
  Choices for reengineering the data warehouse
    Replatforming
    Platform engineering
    Data engineering
  Modernizing the data warehouse
  Case study of data warehouse modernization
    Current-state analysis
    Recommendations
    Business benefits of modernization
    The appliance selection process
  Summary
CHAPTER 8 Workload Management in the Data Warehouse
  Introduction
  Current state
  Defining workloads
  Understanding workloads
    Data warehouse outbound
    Data warehouse inbound
  Query classification
    Wide/Wide
    Wide/Narrow
    Narrow/Wide
    Narrow/Narrow
    Unstructured/semi-structured data
  ETL and CDC workloads
  Measurement
  Current system design limitations
  New workloads and Big Data
    Big Data workloads
  Technology choices
  Summary
CHAPTER 9 New Technologies Applied to Data Warehousing
  Introduction
  Data warehouse challenges revisited
    Data loading
    Availability
    Data volumes
    Storage performance
    Query performance
    Data transport
  Data warehouse appliance
    Appliance architecture
    Data distribution in the appliance
    Key best practices for deploying a data warehouse appliance
    Big Data appliances
  Cloud computing
    Infrastructure as a service
    Platform as a service
    Software as a service
    Cloud infrastructure
    Benefits of cloud computing for data warehouse
    Issues facing cloud computing for data warehouse
  Data virtualization
    What is data virtualization?
    Increasing business intelligence performance
    Workload distribution
    Implementing a data virtualization program
    Pitfalls to avoid when using data virtualization
    In-memory technologies
    Benefits of in-memory architectures
  Summary
  Further reading
PART 3 BUILDING THE BIG DATA – DATA WAREHOUSE
CHAPTER 10 Integration of Big Data and Data Warehousing
  Introduction
  Components of the new data warehouse
    Data layer
    Algorithms
    Technology layer
  Integration strategies
    Data-driven integration
    Physical component integration and architecture
    External data integration
  Hadoop & RDBMS
  Big Data appliances
  Data virtualization
  Semantic framework
    Lexical processing
    Clustering
    Semantic knowledge processing
    Information extraction
    Visualization
  Summary
CHAPTER 11 Data-Driven Architecture for Big Data
  Introduction
  Metadata
    Technical metadata
    Business metadata
    Contextual metadata
    Process design–level metadata
    Program-level metadata
    Infrastructure metadata
    Core business metadata
    Operational metadata
    Business intelligence metadata
  Master data management
  Processing data in the data warehouse
  Processing complexity of Big Data
    Processing limitations
    Processing Big Data
  Machine learning
  Summary
CHAPTER 12 Information Management and Life Cycle for Big Data
  Introduction
  Information life-cycle management
    Goals
    Information management policies
    Governance
    Benefits of information life-cycle management
  Information life-cycle management for Big Data
    Example: information life-cycle management and social media data
    Measuring the impact of information life-cycle management
  Summary
CHAPTER 13 Big Data Analytics, Visualization, and Data Scientists
  Introduction
  Big Data analytics
  Data discovery
  Visualization
  The evolving role of data scientists
  Summary
CHAPTER 14 Implementing the Big Data – Data Warehouse – Real-Life Situations
  Introduction: building the Big Data – Data Warehouse
  Customer-centric business transformation
    Outcomes
  Hadoop and MySQL drives innovation
    Benefits
  Integrating Big Data into the data warehouse
    Empowering decision making
    Outcomes
  Summary
Appendix A: Customer Case Studies
Appendix B: Building the Healthcare Information Factory
Summary
Index