Big Data Imperatives

For your convenience Apress has placed some of the front

matter material after the index. Please use the Bookmarks

and Contents at a Glance links to access them.

Contents at a Glance

Preface

About the Authors

About the Technical Reviewer

Acknowledgments

Introduction

■ Chapter 1: “Big Data” in the Enterprise

■ Chapter 2: The New Information Management Paradigm

■ Chapter 3: Big Data Implications for Industry

■ Chapter 4: Emerging Database Landscape

■ Chapter 5: Application Architectures for Big Data and Analytics

■ Chapter 6: Data Modeling Approaches for Big Data and Analytics Solutions

■ Chapter 7: Big Data Analytics Methodology

■ Chapter 8: Extracting Value From Big Data: In-Memory Solutions, Real Time Analytics, And Recommendation Systems

■ Chapter 9: Data Scientist

Index

Introduction

You may be wondering—is this book for me? If you are seeking a textbook on Hadoop,

then clearly the answer is no. This book does not attempt to fully explain the theory and

derivation of the various algorithms and techniques behind products such as Hadoop.

Some familiarity with Hadoop techniques and related concepts, like NoSQL, is useful in

reading this book, but not assumed.

If you are developing, implementing, or managing modern, intelligent applications,

then the answer is yes. This book provides a practical rather than a theoretical treatment

of big data concepts, along with complete examples and recipes for solutions. It develops

some insights gleaned by experienced practitioners in the course of demonstrating how

big data analytics can be deployed to solve problems.

If you are a researcher in big data, analytics, and related areas, then the answer is

yes. Chances are, your biggest obstacle is translating new concepts into practice. This

book provides a few methodologies, frameworks, and collections of patterns from a

practical implementation perspective. This book can serve as a reference explaining how

you can leverage traditional data warehousing and BI architectures along with big data

technologies like Hadoop to develop big data solutions.

If you are client-facing and always in search of bright ideas to help seize business

opportunities, then the answer is yes, this book is also for you. Through real-world

examples, it will plant ideas about the many ways these techniques can be deployed.

It will also help your technical team jump directly to a cost-effective implementation

approach that can handle volumes of data previously only realistic for organizations with

large technology resources.

Roadmap

This book is broadly divided into three parts, covering concepts and industry-specific use

cases, Hadoop and NoSQL technologies, and methodologies and new skills like those of

the data scientist.

Part 1 consists of chapters 1 to 3. Chapter 1 introduces big data and its role in the

enterprise. This chapter will get you set up for all of the chapters that follow. Chapter 2

covers the need for a new information management paradigm. It explains why the

traditional approaches can’t handle the big data scale and what you need to do about

this. Chapter 3 discusses several industry use cases, bringing to life several interesting

implementation scenarios.

Part 2 consists of chapters 4 to 6. Chapter 4 presents the technology evolution,

and explains the rationale for NoSQL databases. Given that background, Chapter 5 presents

application architectures for implementing big data and analytics solutions. Chapter 6 then

gives you a first look at NoSQL data modeling techniques in a distributed environment.

Part 3 of the book consists of chapters 7 to 9. Chapter 7 presents a methodology

for developing and implementing big data and analytics solutions. Chapter 8 discusses

several additional technologies like in-memory data grids and in-memory analytics.

Chapter 9 presents the need for a new breed of skills (a.k.a. “data scientist”), shows how

it is different from traditional data warehousing and BI skills, tells you what the key

characteristics are, and also covers the importance of data visualization techniques.

Chapter 1

“Big Data” in the Enterprise

Humans have been generating data for thousands of years. More recently we have seen

an amazing progression in the amount of data produced from the advent of mainframes

to client server to ERP and now everything digital. For years the overwhelming amount

of data produced was deemed useless. But data has always been an integral part of every

enterprise, big or small. As the importance and value of data to an enterprise became

evident, so did the proliferation of data silos within an enterprise. This data was primarily

structured, standardized, and heavily governed (either through enterprise-wide

programs or through business functions or IT). Typical data volumes were in the

range of a few terabytes, although compliance and regulatory requirements sometimes

pushed the volumes considerably higher.

Big data is a combination of transactional data and interactive data. While

technologies have mastered the art of managing volumes of transaction data, it is the

interactive data that is adding variety and velocity characteristics to the ever-growing data

reservoir and subsequently poses significant challenges to enterprises.

Irrespective of how data is managed within an enterprise, if it is leveraged properly,

it can deliver immense business value. Figure 1-1 illustrates the value cycle of data,

from raw data to decision making. In the early 2000s, the acceptance of concepts like

Enterprise Data Warehouse (EDW), Business Intelligence (BI), and analytics helped

enterprises to transform raw data collections into actionable wisdom. Analytics

applications such as customer analytics, financial analytics, risk analytics, product

analytics, and health-care analytics became an integral part of the business applications

architecture of any enterprise. But all of these applications were dealing with only one

type of data: structured data.

The ubiquity of the Internet has dramatically changed the way enterprises function.

Essentially, almost every business became a “digital” business. The result was a data explosion.

New application paradigms such as Web 2.0, social media applications, cloud computing,

and software-as-a-service applications further contributed to the data explosion. These new

application paradigms added several new dimensions to the very definition of data. Data

sources for an enterprise were no longer confined to data stores within the corporate firewalls

but extended to what was available outside them. Companies such as LinkedIn, Facebook,

Twitter, and Netflix took advantage of these newer data sources to launch innovative product

offerings to millions of end users; a new business paradigm of “consumerism” was born.

Data regardless of type, location, and source increasingly has become a core business

asset for an enterprise and is now categorized as belonging to two camps: internal data

(enterprise application data) and external data (e.g., web data). With that, a new term has

emerged: big data. So, what is the definition of this all-encompassing arena called “big data”?

To start with, the definition of big data is usually framed in terms of the 3Vs (exploding data volumes, data

generated at high velocity, and data offering more variety); however, if you

scan the Internet for a definition of big data, you will find many more interpretations.

There are also other interesting observations around big data: it is not only the 3Vs

that need to be considered; rather, when the scale of data poses real challenges to the

traditional data management principles, it can then be considered a big data problem.

The heterogeneous nature of big data across multiple platforms and business functions

makes it difficult to manage with traditional data management principles,

and there is no single platform or solution that has answers to all the questions related to

big data. On the other hand, there is still a vast trove of data within the enterprise firewalls

that is unused (or underused) because it has historically been too voluminous and/or raw

(i.e., minimally structured) to be exploited by conventional information systems, or too

costly or complex to integrate and exploit.

Big data is more a concept than a precise term. Some categorize big data as a volume

issue, applying the term only to petabyte-scale data collections (more than one million GB); others associate big data

with the variety of data types, even if the volume is only in terabytes. These differing interpretations have

made big data issues situational.

Figure 1-1. Transforming raw data into action-guiding wisdom

[Figure labels: Collecting, Organizing, Summarizing, Analyzing, Synthesizing, Decision Making; Actionable, Insight, Knowledge, Information, Data]

The pervasiveness of the Internet has pushed generation and usage of data to

unprecedented levels. This aspect of digitization has taken on a new meaning. The term

“data” is now expanding to cover events captured and stored in the form of text, numbers,

graphics, video, images, sound, and signals.

Table 1-1 illustrates the measures of scale of data.

Table 1-1. Measuring Big Data

1000 Gigabytes (GB) = 1 Terabyte (TB)

1000 Terabytes = 1 Petabyte (PB)

1000 Petabytes = 1 Exabyte (EB)

1000 Exabytes = 1 Zettabyte (ZB)

1000 Zettabytes = 1 Yottabyte (YB)
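
The conversions in Table 1-1 can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names `convert` and `humanize` are assumptions made for this example and do not come from any particular library.

```python
# Decimal units from Table 1-1; each unit is 1000 times the previous one.
UNITS = ["GB", "TB", "PB", "EB", "ZB", "YB"]


def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Convert a data size between two of the units in Table 1-1."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    factor = 1000 ** abs(steps)
    return value * factor if steps >= 0 else value / factor


def humanize(gigabytes: float) -> str:
    """Express a size given in GB in the largest unit that keeps the value at 1 or more."""
    value, index = float(gigabytes), 0
    while value >= 1000 and index < len(UNITS) - 1:
        value /= 1000
        index += 1
    return f"{value:g} {UNITS[index]}"


if __name__ == "__main__":
    print(convert(1, "PB", "GB"))   # one petabyte expressed in gigabytes
    print(humanize(2_500_000))      # a size in GB expressed in a larger unit
```

Converting one petabyte to gigabytes this way gives one million, which matches the petabyte-scale threshold mentioned earlier in the chapter.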

Is big data a new problem for enterprises? Not necessarily.

Big data has been of concern in a few select industries and scenarios for some time:

physical sciences (meteorology, physics), life sciences (genomics, biomedical research),

financial institutions (banking, insurance, and capital markets) and government (defense,

treasury). For these industries, big data was primarily a data volume problem, and to solve

these data-volume-related issues they relied heavily on a mash-up of custom-developed

technologies and a set of complex programs to collect and manage the data. But, when doing

so, these industries and vendor products generally made the total cost of ownership (TCO) of

the IT infrastructure rise exponentially every year.

CIOs and CTOs have always grappled with dilemmas like how to lower IT costs to

manage the ever-increasing volumes of data, how to build systems that are scalable,

how to address performance-related concerns to meet business requirements that are

becoming increasingly global in scope and reach, and how to manage data security,

privacy, and data-quality concerns. The poly-structured nature of big data has

multiplied these concerns: how does an industry effectively utilize

poly-structured data (structured data like database content, semi-structured

data like log files or XML files, and unstructured content like text documents, web pages,

or graphics) in a cost-effective manner?

We have come a long way from the first mainframe era. Over the last few years,

technologies have evolved, and now we have solutions that can address some or all

of these concerns. Indeed a second mainframe wave is upon us to capture, analyze,

classify, and utilize the massive amount of data that can now be collected. There are

many instances where organizations, embracing new methodologies and technologies,

effectively leverage these poly-structured data reservoirs to innovate. Some of these

innovations are described below:

• Search at scale

• Multimedia content

• Sentiment analysis

• Enriching and contextualizing data

• Data discovery or exploratory analytics

• Operational analytics or embedded analytics

In this chapter, we will briefly discuss these use cases; there are several more such

use cases, which will be discussed in later chapters.

Search at Scale

In the early days of the Internet, search was primarily used to page through simple lists of

results, matching the search objective or key words. Search as a technology has evolved

immensely since then. Concepts like iteratively refining a search request by selecting

(or excluding) clusters or categories of results, parametric search and guided navigation,

type-ahead query suggestions, automatic spelling correction, and fuzzy matching (matching via

synonyms, phonetics, and approximate spelling) have revolutionized the way users

search and navigate large volumes of information.
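
As a rough illustration of two of these ideas, type-ahead suggestions and approximate-spelling matching, the sketch below uses Python's standard `difflib` module. The catalog contents and the helper names `type_ahead` and `suggest` are hypothetical, and production search engines use far more sophisticated indexing and ranking than this.

```python
import difflib

# Hypothetical vocabulary; a real engine would draw on an index of millions of terms.
CATALOG = ["flashlight", "batteries", "strawberry pastries", "bottled water"]


def type_ahead(prefix: str, vocabulary: list[str]) -> list[str]:
    """Return the vocabulary entries that start with the typed prefix."""
    prefix = prefix.lower()
    return [term for term in vocabulary if term.startswith(prefix)]


def suggest(query: str, vocabulary: list[str], limit: int = 3) -> list[str]:
    """Return entries whose spelling is close to the query (approximate matching)."""
    return difflib.get_close_matches(query.lower(), vocabulary, n=limit, cutoff=0.6)


if __name__ == "__main__":
    print(type_ahead("ba", CATALOG))     # prefix lookup for query suggestions
    print(suggest("flashlite", CATALOG)) # misspelled query mapped to a near match
```

The `cutoff` value of 0.6 is simply `difflib`'s default similarity threshold; raising it makes matching stricter, lowering it more forgiving.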

Using natural language processing (NLP) technologies and semantic analysis,

it is possible to automatically classify and categorize even big-data-size collections of

unstructured content; web search engines like Google, Yahoo!, and Bing are exploiting

these advances in technologies today.

Multimedia Content

Multimedia content is fascinating, as it consists of user-generated content like photos,

audio files, and videos. From a user perspective this content contains a lot of information:

e.g., where the photo was taken, when it was taken, and what the occasion was. But from

a technology perspective all this metadata needs to be manually tagged with the content

to make some meaning out of it, which is a daunting task. Analyzing and categorizing

images is an area of intense research. Exploiting this type of content at big data scale is

a real challenge. Recent technologies like automatic speech-to-text transcription and

object-recognition processing (Content-Based Image Retrieval, or CBIR) are enabling

us to structure this content in an automated fashion. If these technologies are used in an

industrialized fashion, significant impacts could be made in areas like medicine, media,

publishing, environmental science, forensics, and digital asset management.

Sentiment Analysis

Sentiment analysis technology is used to automatically discover, extract, and summarize

the context behind unstructured content. It helps discover sentiments, opinions,

and polarity concerning everything from ideas and issues to people, products,

and companies. The most cited use case of sentiment analysis is brand or reputation

analysis. The task entails collecting data from select web sources (industry sites, the

media, blogs, forums, social networks, etc.), cross-referencing this content with target

entities represented in internal systems (services, products, people, programs, etc.), and

extracting and summarizing the sentiments expressed in this cross-referenced content.

Companies have started leveraging sentiment analysis technology to understand the

voice of consumers and take timely actions such as the ones specified below:

• Monitoring and managing public perceptions of an issue, brand,

organization, etc. (called reputation monitoring)

• Analyzing reception of a new or revamped service or product

• Anticipating and responding to potential quality, pricing,

or compliance issues

• Identifying nascent market growth opportunities and trends in

customer demand
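
The brand-analysis task described in this section (collect content, cross-reference it with internal entities, summarize the sentiment) can be sketched as follows. This is a deliberately naive, lexicon-based illustration: the word lists, the sample posts, and the product names are all made up for the example, and real sentiment analysis systems rely on much richer language models.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicon; real systems use large curated or learned models.
POSITIVE = {"love", "great", "excellent", "reliable"}
NEGATIVE = {"hate", "poor", "broken", "overpriced"}


def polarity(text: str) -> int:
    """Score a text as (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def summarize(posts: list[str], entities: list[str]) -> dict[str, int]:
    """Cross-reference posts with known entities and total the polarity per entity."""
    totals: dict[str, int] = defaultdict(int)
    for post in posts:
        lowered = post.lower()
        for entity in entities:
            if entity.lower() in lowered:
                totals[entity] += polarity(post)
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical web content and internal product names.
    posts = [
        "I love the Acme router, great range",
        "The Acme app is poor and often broken",
        "Nothing to say about widgets",
    ]
    print(summarize(posts, ["Acme router", "Acme app"]))
```

A positive total suggests favorable mentions of an entity and a negative total suggests complaints; posts that mention no known entity are ignored.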

Enriching and Contextualizing Data

While it is a common understanding that there is a lot of noise in unstructured data, once

you are able to collect, analyze, and organize unstructured data, you can then potentially

use it to merge and cross-reference with your enterprise data to further enhance and

contextualize your existing structured data. There are already several examples of such

initiatives across companies where they have extracted information from high-volume

sources like chat, website logs, and social networks to enrich customer profiles in

a Customer Relationship Management (CRM) system. Using innovative approaches like

Facebook ID and Google ID, several companies have started to capture more details of

customers, thereby improving the quality of master data management.

Data Discovery or Exploratory Analytics

Data discovery or exploratory analytics is the process of analyzing data to discover something

that had not been previously noticed. It is a type of analytics that requires an open mind and

a healthy sense of curiosity to delve deep into data: the paths followed during analysis fit

no predetermined pattern, and success is heavily dependent on the analyst’s curiosity as

they uncover one intriguing fact and then another, till they arrive at a final conclusion.

This process is in stark contrast to conventional analytics and Online Analytical

Processing (OLAP) analysis. In classic OLAP, the questions are pre-defined with additional

options to further drill down or drill across to get to the details of the data, but these activities

are still confined to finite sets of data and finite sets of questions. Since the activity is primarily

to confirm or refute hypotheses, classic OLAP is also sometimes referred to as Confirmatory

Data Analysis (CDA).

It is not uncommon for analysts to cross-reference individual and disconnected

collections of data sets during exploratory analysis. For example, analysts at

Walmart cross-referenced big data collections of weather and sales data and discovered

that hurricane warnings trigger sales of not just flashlights and batteries (expected) but

also strawberry Pop Tarts breakfast pastries (not expected). And they also found that the

top-selling pre-hurricane item is beer (surprise again).

It is interesting to note that Walmart chanced upon this discovery not through

exploratory analytics (as is often reported), but through conventional analytics.

In 2004, with Hurricane Frances approaching, Walmart analysts analyzed sales data

from their data warehouse; they were looking for any tell-tale signs of sales that happened

due to the recently passed Hurricane Charley. They found beer and pastries were the

most-purchased items in a pre-hurricane timeframe, and they took action to increase

supplies of these products in stores in Frances’s path.

The fascinating aspect of Walmart’s example is imagining what could happen if we

leverage machine-learning algorithms to discover such correlations in an automated way.
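
To give a feel for what automating such a discovery might involve, the sketch below cross-references sales figures with flagged event days and ranks products by sales lift. It is a simple descriptive statistic, not a machine-learning algorithm, and it is not Walmart's actual method; the data, product names, and function names are invented for illustration.

```python
from statistics import mean


def sales_lift(daily_sales: dict, event_days: set) -> dict:
    """For each product, divide mean units sold on flagged days by mean units on other days.

    daily_sales maps product -> {day: units sold}; event_days holds the flagged days
    (for example, days with a hurricane warning in effect).
    """
    lifts = {}
    for product, by_day in daily_sales.items():
        flagged = [units for day, units in by_day.items() if day in event_days]
        normal = [units for day, units in by_day.items() if day not in event_days]
        if flagged and normal and mean(normal) > 0:
            lifts[product] = mean(flagged) / mean(normal)
    return lifts


def rank_by_lift(lifts: dict) -> list:
    """List products from the highest lift to the lowest."""
    return sorted(lifts, key=lifts.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical daily unit sales; days 3 and 4 carry a storm warning.
    sales = {
        "flashlights": {1: 10, 2: 10, 3: 30, 4: 30},
        "pastries": {1: 20, 2: 20, 3: 140, 4: 140},
        "milk": {1: 50, 2: 50, 3: 50, 4: 50},
    }
    print(rank_by_lift(sales_lift(sales, {3, 4})))
```

A lift well above 1 marks a product whose sales rise around the flagged event. Scanning every product this way can surface unexpected candidates, which an analyst would then need to validate before acting on them.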

Operational Analytics or Embedded Analytics

While exploratory analytics serve discovery and strategy, operational analytics

deliver actionable intelligence on meaningful operational metrics in real or near-real

time. The realm of operational analytics is machine-generated data and

machine-to-machine interaction data. Companies (particularly in sectors like

telecommunications, logistics, transport, retailing, and manufacturing) are producing

real-time operational reporting and analytics based on such data and significantly

improving agility, operational visibility, and day-to-day decision making as a result.

Dr. Carolyn McGregor of the University of Ontario Institute of Technology is using big data and analytics

technology to collect and analyze real-time streams of data like respiration, heart rate,

and blood pressure readings captured by medical equipment (with electrocardiograms

alone generating 1,000 readings per second) for early detection of potentially fatal

infections in premature babies.

Another fascinating example is in the home appliances area. Fridges can be

embedded with analytics modules that sense data from the various items kept in the

fridge. These modules give readings on things like expiry dates and calories and provide

timely alerts either to discard or avoid consuming the items.

Realizing Opportunities from Big Data

Big data is now more than a marketing term. Across industries, organizations are

assessing ways and means to make better business decisions utilizing such untapped

and plentiful information. As big data technologies evolve and more and

more business use cases come into the fray, groundbreaking new approaches

to computing, in both hardware and software, are needed.

As enterprises look to innovate at a faster pace, launch innovative products, and

improve customer service, they need to find better ways of managing and utilizing data

both inside and outside the corporate firewall. Organizations are realizing the need for

and the importance of scaling up their existing data management practices and adopting

newer information management paradigms to combat the perceived risk of reduced

business insight: while the volume of data is increasing rapidly, analyzing

that data to find meaningful insights is becoming increasingly complex.

This is why analyst group IDC defines the type of technology needed to tackle big

data as: “A new generation of technologies and architectures, designed to economically

extract value from very large volumes of a wide variety of data, by enabling high-velocity

capture, discovery, and/or analysis.”

Big data technology and capability adoption across different enterprises is varied,

ranging from Web 2.0 companies such as Google, LinkedIn, and Facebook (their business

being wholly dependent on these technologies) to Fortune 500 companies embarking on

pilot projects to evaluate how big data capability can co-exist with existing traditional data

management infrastructures. Many of the current success stories with big data have come

about with companies enabling analytic innovation and creating data services, embedding

a culture of innovation to create and propagate new database solutions, enhancing

existing solutions for data mining, implementing predictive analytics and machine

learning techniques, complemented by the creation of new skills and roles such as data

scientists, big data architects, data visualization specialists, and data engineers leveraging

NoSQL products, among others. These enterprises’ experiences in the big data landscape

are characterized by the following categories: innovation, acceleration, and collaboration.

Innovation

Innovation is characterized by the usage of commodity hardware and distributed

processing, scalability through cloud computing and virtualization, and the impetus

to deploy NoSQL technologies as an alternative to relational databases. Open-source

solution offerings from Apache such as the Hadoop ecosystem are getting into

mainstream data management, with solution offerings from established companies such

as IBM, Oracle, and EMC, as well as emerging startups such as Cloudera, Hortonworks,

and MapR. The development of big data platforms is perhaps the logical evolution

of this trend, resulting in a comprehensive solution across the access, integration,

storage, processing, and computing layers. Enterprises will continue to establish big

data management capabilities to scale utilization of these innovative offerings, realizing

growth in a cost-effective manner.

Acceleration

Enterprises across all industry domains are beginning to embrace the potential of big data

to impact core business processes. Upstream oil and gas companies collect and process

sensor data to drive real-time production operations, maintenance, and reliability

programs. Electronic health records, home health monitoring, tele-health, and new

medical imaging devices are driving a data deluge in a connected health world. Emerging

location-based data, group purchasing, and online leads allow retailers to continuously

listen, engage, and act on customer intent across the purchasing cycle. Mobile usage data

for telecom service providers unlocks new business models and revenue streams from

outdoor ad placements.

The imperative for these enterprises is to assess their current Enterprise Information

Management (EIM) capabilities, adopt and integrate big data initiatives and embark on

programs to enhance their business capabilities and increase their competitiveness.

Collaboration

Collaboration is the new trend in the big data scenario, whereby data assets are

commoditized, shared, and offered as a product of data services. Data democratization is

a leading motivator for this trend. Large data sets from academia, government, and even

space research are now available for the public to view, consume, and utilize in creative

ways. Data.gov is an example of a public service initiative where public data is shared; it

has sparked similar initiatives across the globe. Big data use cases are reported in climate

modeling, political campaign strategy, poll predictions, environment management,

genetic engineering, space science, and other areas.

Data aggregators, data exchanges and data markets such as those from InfoChimps,

Factual, Microsoft Azure Marketplace, Acxiom, and others have come up with data service

offerings whereby “trusted” data sets are made available for free or on a subscription basis.

This is an example where data sets are assessed with an inherent value as data products.

Crowdsourcing is a rapidly growing trend where skilled and passionate people

collaborate on innovative approaches to developing insights and recommendation

schemes. Kaggle offers a big data platform for predictive modeling and analytic

competitions, effectively making “data science a sport.” Visual.ly offers one of the largest

data visualization showcases in the world, effectively exemplifying the collective talent

and creativity of a large user base.

New ideas and offerings will emerge at a tremendous

rate in the coming years. As big data technologies mature and become easier to deploy

and use, expect to see more solutions emerge, especially ones that merge with the related areas

of cloud, mobile, and social media.

There is widespread awareness of the revenue and growth potential from enterprise

data assets. Data management is no longer seen as a cost center. Enterprise information

management is now perceived to be a critical initiative that can potentially impact the

bottom line. Data-driven companies can offer services like data democratization and data

monetization to launch new business models.

■ Note Data democratization, making data that was once available only to a select few

available to anyone, is leading to creative usage of data such as

data mashups and enhanced data visualization. Data monetization (i.e., the business model

of offering data sets as a shareable commodity) has resulted in data service providers such

as data aggregators and data exchanges.

Big data analytics can thus enable new business opportunities from an operational

perspective. They provide effective utilization of data assets; rapid insight into

business processes and enterprise applications; enhanced analytical capabilities to

derive deeper, more meaningful insights quickly; action on business strategies informed by

these insights; and exploitation of opportunities in areas

previously overlooked. These opportunities arise from the key premise in big data: all data

has potential value if it can be collected, analyzed, and used to generate actionable insight

and enhance operational business capabilities.

New Business Models

There is a growing awareness and realization that big data analytics platforms are enabling

new business models that were previously not possible or were difficult to realize.

Utilizing big data technologies and processes holds the promise of improving operational

efficiency and generating more revenue from new and/or enhanced sales channels.

Enterprises have already realized the benefits obtained by managing enterprise data

as an integral and core asset to manage their business and gain competitive advantage

from enhanced data utilization and insight.

Over the years, tremendous volumes of data have been generated. Many enterprises

have had the foresight not to discard this data and have headed down the path of establishing

enhanced analytical capabilities by leveraging large-scale transactional and interaction data

and, lately, social media and machine-generated data. Even then, Forrester estimates

that only 1 to 1.5 percent of the available data is leveraged. Hence, there is the tantalizing

picture of all the business opportunities that can come about with increased utilization of

available data assets and newer ways of putting data to good use.

New Revenue Growth Opportunities

The big data age has enabled enterprises of all sizes, from startups and small businesses

to established large enterprises, to utilize a new generation of processes and technologies.

In many instances the promise of overcoming the scalability and agility challenges of

traditional data management, coupled with the creative usage of data from multiple

sources, has enterprise stakeholders taking serious notice of their big data potential.

McKinsey’s analysis (summarized in Figure 1-2) indicates that big data has the

potential to add value across all industry segments. Companies likely to get the most out

of big data analytics include:

• Financial services: Capital markets generate large quantities of

stock market and banking transaction data that can help in fraud

detection, maximizing successful trades, etc.

• Supply chain, logistics, and manufacturing: With RFID sensors,

handheld scanners, and on-board GPS vehicle and shipment

tracking, logistics and manufacturing operations produce vast

quantities of information to aid in route optimization, cost

savings, and operational efficiency.

• Online services and web analytics: Firms can greatly benefit from

increasing their customer intelligence and using it for effective

cross-selling and up-selling.

• Energy and utilities: “Smart grids” and electronic sensors

attached to machinery, oil pipelines and equipment generate

streams of incoming data that can be used for preventive measures

to avoid disastrous failures.

• Media and telecommunications: Streaming media, smartphones,

tablets, browsing behavior, and text messages aid in analyzing

user interests and behavior, improving customer retention, and

avoiding churn.

• Health care and life sciences: Analyzing electronic medical records

to identify optimum patient treatment options, and analyzing

data from clinical studies, can heavily influence both individual

patients’ care and public health management and policy.

• Retail and consumer products: Retailers can analyze vast

quantities of sales transaction data to understand

buying behaviors, and can create effective, individually

customized campaigns by analyzing social networking data.

Industry                          Volume    Velocity  Variety   Under-Utilized Data   Big Data Value
                                  of Data   of Data   of Data   (‘Dark Data’)         Potential
Banking and Securities            High      High      Low       Medium                High
Communications & Media Services   High      High      High      Medium                High
Education                         Very Low  Very Low  Very Low  High                  Medium
Government                        High      Medium    High      High                  High
Healthcare Providers              Medium    High      Medium    Medium                High
Insurance                         Medium    Medium    Medium    Medium                Medium
Manufacturing                     High      High      High      High                  High
Chemicals & Natural Resources     High      High      High      High                  Medium
Retail                            High      High      High      Low                   High
Transportation                    Medium    Medium    Medium    High                  Medium
Utilities                         Medium    Medium    Medium    Medium                Medium

Figure 1-2. Big data value across industries

When big data is distilled and analyzed in combination with traditional enterprise

data, enterprises can develop a more thorough and insightful understanding of their
