
Data Warehousing and Data Mining
MSIT-116C:
Data Warehousing and
Data Mining
_____________________________________________________________
Course Design and Editorial Committee
Prof. M.G. Krishnan
Vice Chancellor
Karnataka State Open University
Mukthagangotri, Mysore – 570 006

Prof. Vikram Raj Urs
Dean (Academic) & Convener
Karnataka State Open University
Mukthagangotri, Mysore – 570 006
Head of the Department and Course Co-Ordinator
Rashmi B.S
Assistant Professor & Chairperson
DoS in Information Technology
Karnataka State Open University
Mukthagangotri, Mysore – 570 006
Course Editor
Ms. Nandini H.M
Assistant Professor of Information Technology
DoS in Information Technology
Karnataka State Open University
Mukthagangotri, Mysore – 570 006
Course Writers
Dr. B. H. Shekar
Associate Professor
Department of Computer Science
Mangalagangothri
Mangalore
Dr. Manjaiah
Professor
Department of Computer Science
Mangalagangothri
Mangalore
Publisher
Registrar
Karnataka State Open University
Mukthagangotri, Mysore – 570 006
Developed by Academic Section, KSOU, Mysore
Karnataka State Open University, 2014
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or
any other means, without permission in writing from the Karnataka State Open University.
Further information on the Karnataka State Open University Programmes may be obtained
from the University's Office at Mukthagangotri, Mysore – 6.
Printed and Published on behalf of Karnataka State Open University, Mysore-6 by the
Registrar (Administration)
Karnataka State Open University
Mukthagangothri, Mysore – 570 006
Third Semester M.Sc in Information Technology
MSIT-116C: Data Warehousing and Data Mining
Module 1
Unit-1 Basics of Data Mining and Data Warehousing 001-020
Unit-2 Data Warehouse and OLAP Technology: An Overview 021-060
Unit-3 Data Cubes and Implementation 061-083
Unit-4 Basics of Data Mining 084-102
Module 2
Unit-5 Frequent Patterns for Data Mining 103-117
Unit-6 FP Growth Algorithms 118-128
Unit-7 Classification and Prediction 129-138
Unit-8 Approaches for Classification 139-165
Module 3
Unit-9 Classification Techniques 166-191
Unit-10 Genetic Algorithms, Rough Set and Fuzzy Sets 192-212
Unit-11 Prediction Theory of Classifiers 213-236
Unit-12 Algorithms for Data Clustering 237-259
Module 4
Unit-13 Cluster Analysis 260-276
Unit-14 Spatial Data Mining 277-290
Unit-15 Text Mining 291-308
Unit-16 Multimedia Data Mining 309-334
PREFACE
The objective of data mining is to extract relevant information from a large collection of data. Such large amounts of data exist because of advances in sensors, information technology, and high-performance computing across many scientific disciplines. These data sets are
not only very large, measured in terabytes and petabytes, but are also quite complex. This
complexity arises as the data are collected by different sensors, at different times, at different
frequencies, and at different resolutions. Further, the data are usually in the form of images or
meshes, and often have both a spatial and a temporal component. These data sets arise in diverse
fields such as astronomy, medical imaging, remote sensing, nondestructive testing, physics,
materials science, and bioinformatics. This increasing size and complexity of data in scientific
disciplines has resulted in a challenging problem. Many of the traditional techniques from
visualization and statistics that were used for the analysis of these data are no longer suitable.
Visualization techniques, even for moderate-sized data, are impractical due to their subjective
nature and human limitations in absorbing detail, while statistical techniques do not scale up to
massive data sets. As a result, much of the data collected are never even looked at, and the full
potential of our advanced data collecting capabilities is only partially realized.
Data mining is the process concerned with uncovering patterns, associations, anomalies, and
statistically significant structures in data. It is an iterative and interactive process involving data
preprocessing, search for patterns, and visualization and validation of the results. It is a
multidisciplinary field, borrowing and enhancing ideas from domains including image understanding,
statistics, machine learning, mathematical optimization, high-performance computing, information
retrieval, and computer vision. Data mining techniques hold the promise of assisting scientists and
engineers in the analysis of massive, complex data sets, enabling them to make scientific discoveries,
gain fundamental insights into the physical processes being studied, and advance their
understanding of the world around us.
We introduce basic concepts and models of Data Mining (DM) systems from a computer science
perspective. The focus of the course is the study of different approaches to data mining,
models used in the design of DM systems, search issues, and text and multimedia data clustering
techniques. Different types of clustering and classification techniques, which find
applications in diversified fields, are also discussed. This course will equip students to design data
mining systems, and an in-depth analysis is provided for designing multimedia-based data mining systems.
This concise textbook provides an accessible introduction to data mining and its organization, supporting
a foundation or module course on data mining and data warehousing and covering a broad
selection of the sub-disciplines within this field. The textbook presents concrete algorithms and
applications in the areas of business data processing, multimedia data processing, text mining, etc.
Organization of the material: The book introduces its topics in ascending order of complexity and is
divided into four modules, containing four units each.
In the first module, we begin with an introduction to data mining highlighting its applications and
techniques. The basics of data mining and data warehousing concepts, along with OLAP technology, are
discussed in detail.
In the second module, we discuss approaches to data mining. The frequent pattern mining
approach is presented in detail, as is the role of classification and association rule based
classification. We also present the prediction model of classification and different
approaches to classification.
The third module contains the basics of soft computing paradigms such as fuzzy theory, rough sets and
genetic algorithms, which are the basis for designing data mining algorithms. Algorithms for data
clustering, which are central to any data mining technique, are presented in detail in this module.
In the fourth module, metrics for cluster analysis are discussed. In addition, data mining concepts
for spatial, textual and multimedia data are presented in detail.
Every module covers a distinct problem and includes a quick summary at the end, which can be used
as reference material while studying data mining and data warehousing. Much of the material
found here is interesting as a view into how data mining works, even if you do not need it for a
specific purpose.
Happy reading to all the students.
Structure
1.1 Objectives
1.2 Introduction
1.3 Data warehouse
1.4 Operational data store
1.5 Extraction, transformation and loading (ETL)
1.6 Data warehouse metadata
1.7 Summary
1.8 Keywords
1.9 Exercises
1.10 References
1.1 Objectives
The objectives covered under this unit include:
Introduction to data mining and data warehousing
Techniques for data mining
Basics of operational data stores (ODS)
Basics of extraction, transformation and loading (ETL)
Building data warehouses
The role of metadata
UNIT-1: BASICS OF DATA MINING AND DATA WAREHOUSING

1.2 Introduction
What is data mining?
The amount of data collected by organizations grows by leaps and bounds, increasing year
after year, and there may be payoffs in uncovering the hidden information behind these data.
Data mining is a way to gain market intelligence from this huge amount of data. The problem
today is not the lack of data, but how to learn from it. Data mining mainly deals with
structured data organized in a database. It uncovers anomalies, exceptions, patterns,
irregularities or trends that may otherwise remain undetected within the immense volumes of
data.
What is data warehousing?
A data warehouse is a database designed to support decision making in an organization. Data
from the production databases are copied to the data warehouse so that queries can be
performed without disturbing the performance or the stability of the production systems.
For data mining to occur, it is crucial that data warehousing is present.
An example of how well data warehousing and data mining can be utilized is Walmart.
Walmart maintains a 7.5 TB data warehouse. The retailer captures Point of Sale (POS)
transaction data from over 2,900 stores across 6 countries and transmits it to Walmart's
data warehouse. Walmart then allows its suppliers to access the data to collect information
on their products and analyse how they can improve their sales.
These suppliers will then better understand customer buying patterns and manage local store
inventory, etc.
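In miniature, the kind of query a supplier might run against such POS data is a simple aggregation of units sold per product per store. The sketch below uses invented rows and column names, not Walmart's actual schema:

```python
# Toy POS aggregation: total units sold per (store, product) pair.
from collections import defaultdict

pos = [
    ("store_1", "soap", 3), ("store_1", "soap", 2),
    ("store_1", "shampoo", 1), ("store_2", "soap", 5),
]
totals = defaultdict(int)
for store, product, qty in pos:
    totals[(store, product)] += qty
print(dict(totals))
# {('store_1', 'soap'): 5, ('store_1', 'shampoo'): 1, ('store_2', 'soap'): 5}
```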
Data mining techniques: What is it and how is it used?
Data mining is not a method of attacking the data; on the contrary, it is a way of learning
from the data and then using that information. For that reason, we need a new mind-set in
data mining. We must be open to finding relationships and patterns that we never imagined
existed. We let the data tell us the story rather than impose a model on the data that we feel will
replicate the actual patterns.
There are four categories of data mining techniques/tools (Keating, 2008):
1. Prediction
2. Classification
3. Clustering Analysis
4. Association Rules Discovery
Prediction Tools: These are methods derived from traditional statistical forecasting for
predicting a variable's value. The most common and important applications in data mining
involve prediction. This technique involves traditional statistics such as regression analysis,
multiple discriminant analysis, etc. Non-traditional methods used in prediction tools include
Artificial Intelligence and Machine Learning.
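As a sketch of the regression analysis mentioned above, here is ordinary least squares for a single predictor in plain Python. The spend/sales figures are invented for illustration:

```python
# Ordinary least squares for one predictor: fit y = a + b*x by
# minimising the squared errors (closed-form solution).
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Monthly advertising spend vs. sales (illustrative numbers).
spend = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]
a, b = fit_line(spend, sales)
print(a, b)            # intercept 5.0, slope 2.0
print(a + b * 60)      # predicted sales at a spend of 60 -> 125.0
```

The same closed form generalises to multiple predictors, which is where tools such as multiple discriminant analysis come in.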
Classification Tools: These are the most commonly used tools in data mining. Classification
tools attempt to distinguish different classes of objects or actions. For example, in the case of
a credit card transaction, these tools could classify it as either fraudulent or legitimate. This
can save the credit card company a considerable amount of money.
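The fraud-or-legitimate decision can be sketched with a nearest-centroid rule, one simple classifier among many; the feature values below are invented for illustration:

```python
# Nearest-centroid classifier: label a new observation with the class
# whose mean feature vector is closest in Euclidean distance.
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def classify(x, centroids):
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(centroids, key=lambda label: dist(x, centroids[label]))

# Features: (amount, transactions in the last hour) -- illustrative data.
legit = [(20, 1), (35, 2), (50, 1)]
fraud = [(900, 8), (1200, 12), (700, 9)]
centroids = {"legitimate": centroid(legit), "fraudulent": centroid(fraud)}
print(classify((850, 10), centroids))   # -> fraudulent
print(classify((30, 1), centroids))     # -> legitimate
```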
Clustering Analysis Tools: These are very powerful tools for clustering products into groups
that naturally fall together. The groups are identified by the program, not by the
researchers. Most of the clusters discovered may have little use in business decisions.
However, one or two that are discovered may be extremely important and can be exploited
to give the business an edge over its competitors. The most common use for
clustering tools is probably in what economists refer to as "market segmentation."
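One standard way a program finds such groups on its own is k-means. This is a minimal sketch with invented customer coordinates; real segmentation would use many more features:

```python
# Minimal k-means: alternately assign points to the nearest centre,
# then move each centre to the mean of its assigned points.
import math

def kmeans(points, centres, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            i = min(range(len(centres)),
                    key=lambda j: math.dist(p, centres[j]))
            clusters[i].append(p)
        centres = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centre
            for cl, centre in zip(clusters, centres)
        ]
    return centres, clusters

# Two obvious spending segments (illustrative data).
customers = [(1, 2), (2, 1), (1, 1), (9, 10), (10, 9), (10, 10)]
centres, clusters = kmeans(customers, centres=[(0, 0), (5, 5)])
print(centres)   # one centre near each natural group
```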
Association Rules Discovery: Here the data mining tools discover associations; e.g., what
kinds of books certain groups of people read, what products certain groups of people
purchase, what movies certain groups of people watch, etc. Businesses can use this
information to target their markets. Online retailers like Netflix and Amazon use these tools
quite intensively. For example, Netflix recommends movies based on movies people have
watched and rated in the past. Amazon does something similar in recommending books when
you re-visit their website.
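Association rules of the "people who buy X also buy Y" kind are usually scored by support (how often the items occur together) and confidence (how often Y follows X). A sketch over invented shopping baskets:

```python
# Support and confidence for a candidate rule A -> B over a list of
# "baskets" (sets of items bought together).
def support(baskets, items):
    return sum(1 for b in baskets if items <= b) / len(baskets)

def confidence(baskets, lhs, rhs):
    return support(baskets, lhs | rhs) / support(baskets, lhs)

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
]
# How often do bread and butter appear together, and how often does
# butter accompany bread?
print(support(baskets, {"bread", "butter"}))        # 0.5
print(confidence(baskets, {"bread"}, {"butter"}))   # ~0.667
```

Algorithms such as Apriori and FP-growth (covered in later units) find all itemsets above a support threshold without enumerating every candidate.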
The two major pieces of software used at the moment for data mining are PASW Modeler
(formerly known as SPSS Clementine) and SAS Enterprise Miner. Both packages
include an array of capabilities that enable the data mining tools mentioned above. Newcomers to
data mining can use an Excel add-in called XLMiner, available from Resampling Stats, Inc.
This Excel add-in lets potential data miners not only examine the usefulness of such a
program but also get familiar with some of the data mining techniques. Although Excel is
quite limited in the number of observations it can handle, it can give the user a taste of how
valuable data mining can be without incurring much cost up front.
Examples of use of information extracted from data mining exercises
Data mining has been used to help in credit scoring of customers in the financial industry
(Peng, 2004). Credit scoring can be defined as a technique that helps credit providers decide
whether to grant credit to customers. Its most common use is in making credit decisions for
loan applications. Credit scoring is also applied in decisions on personal loan applications,
the setting of credit limits, the management of existing accounts, and the forecasting of the
profitability of consumers and customers (Punch, 2000).
Data mining and data warehousing have been particularly successful in the realm of customer
relationship management. By utilizing a data warehouse, retailers can embark on customer-specific
strategies like customer profiling, customer segmentation, and cross-selling. By
using the information in the data warehouse, the business can divide its customers into four
quadrants of customer segmentation: (1) customers that should be eliminated (i.e., they cost
more than what they generate in revenues); (2) customers with whom the relationship should
be re-engineered (i.e., those that have the potential to be valuable, but may require the
company's encouragement, cooperation, and/or management); (3) customers that the
company should engage; and (4) customers in which the company should invest (Buttle, 1999;
Verhoef & Donkers, 2001). The company can then use the corresponding strategies to
manage the customer relationships (Cunningham et al., 2006).
Data mining can also help in the detection of spam in electronic mail (email) (Shih et al,
2008).
Data mining has also been used in healthcare and acute care. A medical center in the US used
data mining technology to help its physicians work more efficiently and reduce mistakes
(Veluswamy, 2008).
There are other examples, which we will not deal with here, that have been flagship success
stories of data mining: the beer and diapers association; Harrah's; Amazon and Netflix.
Essentials before you data mine
Apart from management buy-in and financial backing, certain basics are needed before you
embark on a data mining project. As data mining can only uncover patterns already present in
the data, you must already have the target dataset, residing in a data warehouse or a data
mart, and it must be large enough to contain these patterns while remaining concise enough
to be mined in an acceptable timeframe. The target set then needs to be "cleaned". This
process removes observations with noise and missing data. The cleaned data are then reduced
into feature vectors, one vector per observation. A feature vector is a summarised version of
the raw data observation.
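The clean-then-summarise step can be sketched as follows; the records and the choice of summary features are invented for illustration:

```python
# Turn raw observations into cleaned, fixed-length feature vectors:
# drop records with missing fields, then summarise each record.
raw = [
    {"customer": "a", "purchases": [20, 35, 50]},
    {"customer": "b", "purchases": None},            # missing data
    {"customer": "c", "purchases": [900, 1200]},
]

def to_feature_vector(record):
    p = record["purchases"]
    return [len(p), sum(p) / len(p), max(p)]   # count, mean, maximum

cleaned = [r for r in raw if r["purchases"]]       # remove missing data
features = [to_feature_vector(r) for r in cleaned]
print(features)   # one vector per surviving observation
```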
Limitations of data mining
The quality of data mining applications depends on the quality and availability of data. As the
data set to be mined should be of a certain quality, time and expense may be needed to
"clean" the data.
In addition, the amount of data to be mined should be sufficiently large for the software to
extract meaningful patterns and associations.
Also, as data mining requires huge amounts of resources, both in man-hours and financially, the
user must be a domain specialist who understands the business problems and is familiar with
data mining tools and techniques, so that resources are not wasted on a data mining project
that will fail from the start.
Also, once data have been mined, it is up to the management and decision makers to use the
information that has been extracted. Data mining is not a magic wand that points the
organization to what it should do. The human intellect and business acumen of the decision
makers are still very much required to make sense of the information extracted from a data
mining exercise.
Some issues surrounding data mining and data warehousing
1. You've data mined – do you think the bosses will take the proper and appropriate action?
The dichotomy between the use of sophisticated data mining software and techniques and the
conventionality of how organizations make decisions.
Brydon and Gemino (2008) highlighted the dichotomy between the use of sophisticated data
mining software and techniques as opposed to the conventionality of how organisations make
decisions. They believed, rightly so, that "tools and techniques for data mining and decision
making integration are still in their infancy. Firms must be willing to reconsider the ways in
which they make decisions if they are to realize a payoff from their investments in data
mining technology."
2. One size fits all data mining packages for industry. Does this fit the purpose of data mining
at all?
There are now "one size fits all" vertical applications available for certain industries and
industry segments, developed by consultants. The consultants market these packages to all
competitors within that segment. This poses a potential risk for companies that are new to
data mining, as the vertical "off the shelf" solutions they explore can be just as easily
obtained by their competitors.
Nevertheless, having said that, the application of this technology is limited only by our
imagination, so it is up to the companies to decide how and why they wish to use the
technology. They should also be aware that data mining is a long and resource-intensive
exercise, which an "off the shelf" solution deceptively presents as easy and affordable. Only
companies that learn to be comfortable using these tools on all varieties of company data
will benefit.
3. The use of data mining for prediction – use in non-commercial and "problematic" areas,
e.g. the prediction of terrorist acts.
In 2002, the US government embarked on a massive data mining effort called Total
Information Awareness. The basic idea was to collect as much data on everyone as possible,
sift it through massive computers, and investigate patterns that might indicate terrorist plots
(Schneier, 2006). However, a backlash of public opinion drove the US Congress to stop
funding the programme. Nevertheless, there is a belief that the programme merely changed its
name and moved inside the walls of the US Defence Department (Harris, 2006).
According to Schneier (2006), data mining in such a situation will fail because terrorist plots
are different from credit card fraud: terrorist acts have no well-defined profile and attacks are
very rare. "Taken together, these facts mean that data-mining systems won't uncover any
terrorist plots until they are very accurate, and that even very accurate systems would be so
flooded with false alarms that they will be useless."
This highlights the principle pointed out earlier in this unit: data mining is not a panacea for
all information problems, nor a magic wand to guide anyone out of the wilderness.
4. Ethical concerns over data warehousing and data mining – do you have any? Should
companies be concerned?
Data mining produces results only when it works with high volumes of information at its
disposal. With the larger amounts of data that need to be gathered, should we also be
concerned with the ethics behind the collection and use of that data?
As highlighted by Linstedt (2004), the implementers of the technology are simply told to
integrate data and the project manager builds a project to make it happen – these people
simply do not have the time to ponder whether the data had been handled ethically. Linstedt
proposes a checklist for project managers and technology implementers to address ethical
concerns over data:
1. Develop SLAs with end users that define who has access to what levels of information.
2. Have end users involved in defining the ethical standards of use for the data that will be
delivered.
3. Define the bounds around the integration efforts of public data – where it will be
integrated and where it will not – so as to avoid conflicts of interest.
4. Do not use "live" or real data for testing purposes, or else lock down the test
environment; too often test environments are left wide open and accessible to too many
individuals.
5. Define where, how, and by whom data mining will be used; restrict the mining efforts to
specific sets of information. Build a notification system to monitor data mining usage.
6. Allow customers to "block" the integration of their own information (this one is
questionable), depending on whether the customer information will be made available on
the web after integration.
7. Remember that any efforts made are still subject to governmental laws. Nothing is
sacred: if a government wants access to the information, they will get it.
1.3 Data warehouse
In computing, a data warehouse (DW, DWH), or enterprise data warehouse (EDW), is
a database used for reporting and data analysis. Integrating data from one or more
disparate sources creates a central repository of data, the data warehouse. Data
warehouses store current and historical data and are used to create trending reports for
senior management, such as annual and quarterly comparisons.
The data stored in the warehouse is uploaded from the operational systems (such as
marketing, sales, etc.). The data may pass through an
operational data store for additional operations before being used in the DW for reporting.
The typical extract-transform-load (ETL)-based data warehouse uses staging, data integration,
and access layers to house its key functions. The staging layer, or staging database, stores raw
data extracted from each of the disparate source data systems. The integration layer integrates
the disparate data sets by transforming the data from the staging layer, often storing this
transformed data in an operational data store (ODS) database. The integrated data are then
moved to yet another database, often called the data warehouse database, where the data are
arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts.
The combination of facts and dimensions is sometimes called a star schema. The access layer
helps users retrieve data.
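In miniature, the extract-transform-load flow into a star schema looks like the sketch below. The source formats, table names and rows are invented for illustration:

```python
# Extract raw rows from two "sources", transform them into a common
# shape, and load them into a dimension/fact layout (star schema).
staging = {
    "sales_system": [("2014-01-03", "widget", 3)],               # tuples
    "web_orders":   [{"day": "2014-01-03", "sku": "widget", "qty": 2}],
}

# Transform: normalise both source formats into (date, product, qty).
rows = list(staging["sales_system"])
rows += [(o["day"], o["sku"], o["qty"]) for o in staging["web_orders"]]

# Load: a product dimension plus a fact table keyed by dimension ids.
product_dim = {name: i for i, name in enumerate(sorted({r[1] for r in rows}))}
fact_sales = [(date, product_dim[prod], qty) for date, prod, qty in rows]
print(product_dim)   # {'widget': 0}
print(fact_sales)    # [('2014-01-03', 0, 3), ('2014-01-03', 0, 2)]
```

A real ETL pipeline would of course persist each layer in databases and handle errors, history and incremental loads, but the staging / integration / warehouse division is the same.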
A data warehouse constructed from integrated data source systems does not require ETL,
staging databases, or operational data store databases. The integrated data source systems
may be considered to be a part of a distributed operational data store layer. Data federation
methods or data virtualization methods may be used to access the distributed integrated
source data systems to consolidate and aggregate data directly into the data warehouse
database tables.
Unlike the ETL-based data warehouse, the integrated source data systems and the data
warehouse are all integrated since there is no transformation of dimensional or reference data.
This integrated data warehouse architecture supports the drill down from the aggregate data
of the data warehouse to the transactional data of the integrated source data systems.
A data mart is a small data warehouse focused on a specific area of interest. Data warehouses
can be subdivided into data marts for improved performance and ease of use within that area.
Alternatively, an organization can create one or more data marts as first steps towards a larger
and more complex enterprise data warehouse.
This definition of the data warehouse focuses on data storage. The main source of the data is
cleaned, transformed, catalogued and made available for use by managers and other business
professionals for data mining, online analytical processing, market research and decision
support (Marakas & O'Brien 2009). However, the means to retrieve and analyze data, to
extract, transform and load data, and to manage the data dictionary are also considered
essential components of a data warehousing system. Many references to data warehousing
use this broader context. Thus, an expanded definition for data warehousing includes business
intelligence tools, tools to extract, transform and load data into the repository, and tools to
manage and retrieve metadata.
Difficulties of Implementing Data Warehouses
Some significant operational issues arise with data warehousing: construction, administration,
and quality control. Project management—the design, construction, and implementation of