
Data Warehousing and Data Mining
MSIT-116C:
Data Warehousing and
Data Mining
_____________________________________________________________
Course Design and Editorial Committee
Prof. M.G. Krishnan
Vice Chancellor
Karnataka State Open University
Mukthagangotri, Mysore – 570 006

Prof. Vikram Raj Urs
Dean (Academic) & Convener
Karnataka State Open University
Mukthagangotri, Mysore – 570 006
Head of the Department and Course Co-Ordinator
Rashmi B.S
Assistant Professor & Chairperson
DoS in Information Technology
Karnataka State Open University
Mukthagangotri, Mysore – 570 006
Course Editor
Ms. Nandini H.M
Assistant Professor of Information Technology
DoS in Information Technology
Karnataka State Open University
Mukthagangotri, Mysore – 570 006
Course Writers
Dr. B. H. Shekar
Associate Professor
Department of Computer Science
Mangalagangothri
Mangalore
Dr. Manjaiah
Professor
Department of Computer Science
Mangalagangothri
Mangalore
Publisher
Registrar
Karnataka State Open University
Mukthagangotri, Mysore – 570 006
Developed by Academic Section, KSOU, Mysore
Karnataka State Open University, 2014
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or
any other means, without permission in writing from the Karnataka State Open University.
Further information on the Karnataka State Open University Programmes may be obtained
from the University's Office at Mukthagangotri, Mysore – 6.
Printed and Published on behalf of Karnataka State Open University, Mysore-6 by the
Registrar (Administration)
Karnataka State Open University
Mukthagangothri, Mysore – 570 006
Third Semester M.Sc in Information Technology
MSIT-116C: Data Warehousing and Data Mining
Module 1
Unit-1 Basics of Data Mining and Data Warehousing 001-020
Unit-2 Data Warehouse and OLAP Technology: An Overview 021-060
Unit-3 Data Cubes and Implementation 061-083
Unit-4 Basics of Data Mining 084-102
Module 2
Unit-5 Frequent Patterns for Data Mining 103-117
Unit-6 FP Growth Algorithms 118-128
Unit-7 Classification and Prediction 129-138
Unit-8 Approaches for Classification 139-165
Module 3
Unit-9 Classification Techniques 166-191
Unit-10 Genetic Algorithms, Rough Set and Fuzzy Sets 192-212
Unit-11 Prediction Theory of Classifiers 213-236
Unit-12 Algorithms for Data Clustering 237-259
Module 4
Unit-13 Cluster Analysis 260-276
Unit-14 Spatial Data Mining 277-290
Unit-15 Text Mining 291-308
Unit-16 Multimedia Data Mining 309-334
PREFACE
The objective of data mining is to extract relevant information from a large collection of data. Such large amounts of data exist because of advances in sensors, information technology, and high-performance computing across many scientific disciplines. These data sets are
not only very large, measured in terabytes and petabytes, but are also quite complex. This
complexity arises as the data are collected by different sensors, at different times, at different
frequencies, and at different resolutions. Further, the data are usually in the form of images or
meshes, and often have both a spatial and a temporal component. These data sets arise in diverse
fields such as astronomy, medical imaging, remote sensing, nondestructive testing, physics,
materials science, and bioinformatics. This increasing size and complexity of data in scientific
disciplines has resulted in a challenging problem. Many of the traditional techniques from
visualization and statistics that were used for the analysis of these data are no longer suitable.
Visualization techniques, even for moderate-sized data, are impractical due to their subjective
nature and human limitations in absorbing detail, while statistical techniques do not scale up to
massive data sets. As a result, much of the data collected are never even looked at, and the full
potential of our advanced data collecting capabilities is only partially realized.
Data mining is the process concerned with uncovering patterns, associations, anomalies, and
statistically significant structures in data. It is an iterative and interactive process involving data
preprocessing, search for patterns, and visualization and validation of the results. It is a
multidisciplinary field, borrowing and enhancing ideas from domains including image understanding,
statistics, machine learning, mathematical optimization, high-performance computing, information
retrieval, and computer vision. Data mining techniques hold the promise of assisting scientists and
engineers in the analysis of massive, complex data sets, enabling them to make scientific discoveries,
gain fundamental insights into the physical processes being studied, and advance their
understanding of the world around us.
We introduce basic concepts and models of Data Mining (DM) systems from a computer science
perspective. The focus of the course is the study of different approaches to data mining,
models used in the design of DM systems, search issues, and text and multimedia data clustering
techniques. Different types of clustering and classification techniques, which find
applications in diversified fields, are also discussed. This course will equip students to design data
mining systems, and an in-depth analysis is provided for designing multimedia-based data mining systems.
This concise textbook provides an accessible introduction to data mining and its organization, supporting
a foundation or module course on data mining and data warehousing and covering a broad
selection of the sub-disciplines within this field. The textbook presents concrete algorithms and
applications in the areas of business data processing, multimedia data processing, text mining, etc.
Organization of the material: The book introduces its topics in ascending order of complexity and is
divided into four modules, containing four units each.
In the first module, we begin with an introduction to data mining highlighting its applications and
techniques. The basics of data mining and data warehousing concepts, along with OLAP technology, are
discussed in detail.
In the second module, we discuss approaches to data mining. The frequent pattern mining
approach is presented in detail, as is the role of classification and association rule based
classification. We also present the prediction model of classification and different
approaches to classification.
The third module contains the basics of soft computing paradigms such as fuzzy theory, rough sets and
genetic algorithms, which are the basis for designing data mining algorithms. Algorithms for data
clustering, which are central to any data mining technique, are presented in detail in this module.
In the fourth module, metrics for cluster analysis are discussed. In addition, data mining concepts
for spatial, textual and multimedia data are presented in detail.
Every module covers a distinct problem and includes a quick summary at the end, which can be used
as reference material while studying data mining and data warehousing. Much of the material
found here is interesting as a view into how data mining works, even if you do not need it for a
specific purpose.
Happy reading to all the students.
Structure
1.1 Objectives
1.2 Introduction
1.3 Data warehouse
1.4 Operational data store
1.5 Extraction, transformation and loading (ETL)
1.6 Data warehouse metadata
1.7 Summary
1.8 Keywords
1.9 Exercises
1.10 References
1.1 Objectives
The objectives covered under this unit include:
Introduction to data mining and data warehousing
Techniques for data mining
Basics of operational data stores (ODS)
Basics of extraction, transformation and loading (ETL)
Building data warehouses
The role of metadata
UNIT-1: BASICS OF DATA MINING AND DATA WAREHOUSING

1.2 Introduction
What is data mining?
The amount of data collected by organizations grows by leaps and bounds, increasing year
after year, and there may be payoffs in uncovering the hidden information behind these data.
Data mining is a way to gain market intelligence from this huge amount of data. The problem
today is not the lack of data, but how to learn from it. Data mining mainly deals with
structured data organized in a database. It uncovers anomalies, exceptions, patterns,
irregularities or trends that may otherwise remain undetected within the immense volumes of
data.
What is data warehousing?
A data warehouse is a database designed to support decision making in an organization. Data
from the production databases are copied to the data warehouse so that queries can be
performed without disturbing the performance or the stability of the production systems.
For data mining to occur, it is crucial that data warehousing is present.
An example of how well data warehousing and data mining can be utilized is Walmart.
Walmart maintains a 7.5 TB data warehouse. The retailer captures Point of Sale (POS)
transaction data from over 2,900 stores across 6 countries and transmits it to Walmart's
data warehouse. Walmart then allows its suppliers to access the data to collect information
on their products and analyse how they can improve their sales.
These suppliers will then better understand customer buying patterns and manage local store
inventory, etc.
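In miniature, the kind of query a supplier might run against such POS data is a simple aggregation of units sold per product per store. The sketch below uses invented rows and column names, not Walmart's actual schema:

```python
# Toy POS aggregation: total units sold per (store, product) pair.
from collections import defaultdict

pos = [
    ("store_1", "soap", 3), ("store_1", "soap", 2),
    ("store_1", "shampoo", 1), ("store_2", "soap", 5),
]
totals = defaultdict(int)
for store, product, qty in pos:
    totals[(store, product)] += qty
print(dict(totals))
# {('store_1', 'soap'): 5, ('store_1', 'shampoo'): 1, ('store_2', 'soap'): 5}
```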
Data mining techniques: What is it and how is it used?
Data mining is not a method of attacking the data; on the contrary, it is a way of learning
from the data and then using that information. For that reason, we need a new mind-set in
data mining. We must be open to finding relationships and patterns that we never imagined
existed. We let the data tell us the story rather than impose a model on the data that we feel will
replicate the actual patterns.
There are four categories of data mining techniques/tools (Keating, 2008):
1. Prediction
2. Classification
3. Clustering Analysis
4. Association Rules Discovery
Prediction Tools: These are methods derived from traditional statistical forecasting for
predicting a variable's value. The most common and important applications in data mining
involve prediction. This technique involves traditional statistics such as regression analysis,
multiple discriminant analysis, etc. Non-traditional methods used in prediction tools include
Artificial Intelligence and Machine Learning.
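As a sketch of the regression analysis mentioned above, here is ordinary least squares for a single predictor in plain Python. The spend/sales figures are invented for illustration:

```python
# Ordinary least squares for one predictor: fit y = a + b*x by
# minimising the squared errors (closed-form solution).
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Monthly advertising spend vs. sales (illustrative numbers).
spend = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]
a, b = fit_line(spend, sales)
print(a, b)            # intercept 5.0, slope 2.0
print(a + b * 60)      # predicted sales at a spend of 60 -> 125.0
```

The same closed form generalises to multiple predictors, which is where tools such as multiple discriminant analysis come in.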
Classification Tools: These are the most commonly used tools in data mining. Classification
tools attempt to distinguish different classes of objects or actions. For example, in the case of
a credit card transaction, these tools could classify it as either fraudulent or legitimate. This
can save the credit card company a considerable amount of money.
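The fraud-or-legitimate decision can be sketched with a nearest-centroid rule, one simple classifier among many; the feature values below are invented for illustration:

```python
# Nearest-centroid classifier: label a new observation with the class
# whose mean feature vector is closest in Euclidean distance.
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def classify(x, centroids):
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(centroids, key=lambda label: dist(x, centroids[label]))

# Features: (amount, transactions in the last hour) -- illustrative data.
legit = [(20, 1), (35, 2), (50, 1)]
fraud = [(900, 8), (1200, 12), (700, 9)]
centroids = {"legitimate": centroid(legit), "fraudulent": centroid(fraud)}
print(classify((850, 10), centroids))   # -> fraudulent
print(classify((30, 1), centroids))     # -> legitimate
```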
Clustering Analysis Tools: These are very powerful tools for clustering products into groups
that naturally fall together. The groups are identified by the program, not by the
researchers. Most of the clusters discovered may have little use in business decisions.
However, one or two that are discovered may be extremely important and can be exploited
to give the business an edge over its competitors. The most common use for
clustering tools is probably in what economists refer to as "market segmentation."
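One standard way a program finds such groups on its own is k-means. This is a minimal sketch with invented customer coordinates; real segmentation would use many more features:

```python
# Minimal k-means: alternately assign points to the nearest centre,
# then move each centre to the mean of its assigned points.
import math

def kmeans(points, centres, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            i = min(range(len(centres)),
                    key=lambda j: math.dist(p, centres[j]))
            clusters[i].append(p)
        centres = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centre
            for cl, centre in zip(clusters, centres)
        ]
    return centres, clusters

# Two obvious spending segments (illustrative data).
customers = [(1, 2), (2, 1), (1, 1), (9, 10), (10, 9), (10, 10)]
centres, clusters = kmeans(customers, centres=[(0, 0), (5, 5)])
print(centres)   # one centre near each natural group
```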
Association Rules Discovery: Here the data mining tools discover associations; e.g., what
kinds of books certain groups of people read, what products certain groups of people
purchase, what movies certain groups of people watch, etc. Businesses can use this
information to target their markets. Online retailers like Netflix and Amazon use these tools
quite intensively. For example, Netflix recommends movies based on movies people have
watched and rated in the past. Amazon does something similar in recommending books when
you re-visit their website.
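Association rules of the "people who buy X also buy Y" kind are usually scored by support (how often the items occur together) and confidence (how often Y follows X). A sketch over invented shopping baskets:

```python
# Support and confidence for a candidate rule A -> B over a list of
# "baskets" (sets of items bought together).
def support(baskets, items):
    return sum(1 for b in baskets if items <= b) / len(baskets)

def confidence(baskets, lhs, rhs):
    return support(baskets, lhs | rhs) / support(baskets, lhs)

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
]
# How often do bread and butter appear together, and how often does
# butter accompany bread?
print(support(baskets, {"bread", "butter"}))        # 0.5
print(confidence(baskets, {"bread"}, {"butter"}))   # ~0.667
```

Algorithms such as Apriori and FP-growth (covered in later units) find all itemsets above a support threshold without enumerating every candidate.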
The two major pieces of software used at the moment for data mining are PASW Modeler
(formerly known as SPSS Clementine) and SAS Enterprise Miner. Both packages
include an array of capabilities that enable the data mining tools mentioned above. Newcomers to
data mining can use an Excel add-in called XLMiner, available from Resampling Stats, Inc.
This Excel add-in lets potential data miners not only examine the usefulness of such a
program but also get familiar with some of the data mining techniques. Although Excel is
quite limited in the number of observations it can handle, it can give the user a taste of how
valuable data mining can be without incurring much cost up front.
Examples of use of information extracted from data mining exercises
Data mining has been used to help in credit scoring of customers in the financial industry
(Peng, 2004). Credit scoring can be defined as a technique that helps credit providers decide
whether to grant credit to customers. Its most common use is in making credit decisions for
loan applications. Credit scoring is also applied in decisions on personal loan applications,
the setting of credit limits, the management of existing accounts, and the forecasting of the
profitability of consumers and customers (Punch, 2000).
Data mining and data warehousing have been particularly successful in the realm of customer
relationship management. By utilizing a data warehouse, retailers can embark on customer-specific
strategies like customer profiling, customer segmentation, and cross-selling. By
using the information in the data warehouse, the business can divide its customers into four
quadrants of customer segmentation: (1) customers that should be eliminated (i.e., they cost
more than what they generate in revenues); (2) customers with whom the relationship should
be re-engineered (i.e., those that have the potential to be valuable, but may require the
company's encouragement, cooperation, and/or management); (3) customers that the
company should engage; and (4) customers in which the company should invest (Buttle, 1999;
Verhoef & Donkers, 2001). The company can then use the corresponding strategies to
manage the customer relationships (Cunningham et al., 2006).
Data mining can also help in the detection of spam in electronic mail (email) (Shih et al,
2008).
Data mining has also been used in healthcare and acute care. A medical center in the US used
data mining technology to help its physicians work more efficiently and reduce mistakes
(Veluswamy, 2008).
There are other examples, which we will not deal with here, that have been flagship success
stories of data mining: the beer and diapers association; Harrah's; Amazon and Netflix.
Essentials before you data mine
Apart from management buy-in and financial backing, certain basics are needed before you
embark on a data mining project. As data mining can only uncover patterns already present in
the data, you must already have the target dataset, residing in a data warehouse or a data
mart, and it must be large enough to contain these patterns while remaining concise enough
to be mined in an acceptable timeframe. The target set then needs to be "cleaned". This
process removes observations with noise and missing data. The cleaned data are then reduced
into feature vectors, one vector per observation. A feature vector is a summarised version of
the raw data observation.
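The clean-then-summarise step can be sketched as follows; the records and the choice of summary features are invented for illustration:

```python
# Turn raw observations into cleaned, fixed-length feature vectors:
# drop records with missing fields, then summarise each record.
raw = [
    {"customer": "a", "purchases": [20, 35, 50]},
    {"customer": "b", "purchases": None},            # missing data
    {"customer": "c", "purchases": [900, 1200]},
]

def to_feature_vector(record):
    p = record["purchases"]
    return [len(p), sum(p) / len(p), max(p)]   # count, mean, maximum

cleaned = [r for r in raw if r["purchases"]]       # remove missing data
features = [to_feature_vector(r) for r in cleaned]
print(features)   # one vector per surviving observation
```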
Limitations of data mining
The quality of data mining applications depends on the quality and availability of data. As the
data set to be mined should be of a certain quality, time and expense may be needed to
"clean" the data.
In addition, the amount of data to be mined should be sufficiently large for the software to
extract meaningful patterns and associations.
Also, as data mining requires huge amounts of resources, both in man-hours and financially, the
user must be a domain specialist who understands the business problems and is familiar with
data mining tools and techniques, so that resources are not wasted on a data mining project
that will fail from the start.
Also, once data have been mined, it is up to the management and decision makers to use the
information that has been extracted. Data mining is not a magic wand that points the
organization to what it should do. The human intellect and business acumen of the decision
makers are still very much required to make sense of the information extracted from a data
mining exercise.
Some issues surrounding data mining and data warehousing
1. You've data mined – do you think the bosses will take the proper and appropriate action?
The dichotomy between the use of sophisticated data mining software and techniques and the
conventionality of how organizations make decisions.
Brydon and Gemino (2008) highlighted the dichotomy between the use of sophisticated data
mining software and techniques as opposed to the conventionality of how organisations make
decisions. They believed, rightly so, that "tools and techniques for data mining and decision
making integration are still in their infancy. Firms must be willing to reconsider the ways in
which they make decisions if they are to realize a payoff from their investments in data
mining technology."
2. One size fits all data mining packages for industry. Does this fit the purpose of data mining
at all?
There are now "one size fits all" vertical applications available for certain industries and
industry segments, developed by consultants. The consultants market these packages to all
competitors within that segment. This poses a potential risk for companies that are new to
data mining, as the vertical "off the shelf" solutions they explore can be just as easily
obtained by their competitors.
Nevertheless, having said that, the application of this technology is limited only by our
imagination, so it is up to the companies to decide how and why they wish to use the
technology. They should also be aware that data mining is a long and resource-intensive
exercise, which an "off the shelf" solution deceptively presents as easy and affordable. Only
companies that learn to be comfortable using these tools on all varieties of company data
will benefit.
3. The use of data mining for prediction – use in non-commercial and "problematic" areas,
e.g. the prediction of terrorist acts.
In 2002, the US government embarked on a massive data mining effort called Total
Information Awareness. The basic idea was to collect as much data on everyone as possible,
sift it through massive computers, and investigate patterns that might indicate terrorist plots
(Schneier, 2006). However, a backlash of public opinion drove the US Congress to stop
funding the programme. Nevertheless, there is a belief that the programme merely changed its
name and moved inside the walls of the US Defence Department (Harris, 2006).
According to Schneier (2006), data mining in such a situation will fail because terrorist plots
are different from credit card fraud: terrorist acts have no well-defined profile and attacks are
very rare. "Taken together, these facts mean that data-mining systems won't uncover any
terrorist plots until they are very accurate, and that even very accurate systems would be so
flooded with false alarms that they will be useless."
This highlights the principle pointed out earlier in this unit: data mining is not a panacea for
all information problems, nor a magic wand to guide anyone out of the wilderness.
4. Ethical concerns over data warehousing and data mining – do you have any? Should
companies be concerned?
Data mining produces results only when it works with high volumes of information at its
disposal. With the larger amounts of data that need to be gathered, should we also be
concerned with the ethics behind the collection and use of that data?
As highlighted by Linstedt (2004), the implementers of the technology are simply told to
integrate data and the project manager builds a project to make it happen – these people
simply do not have the time to ponder whether the data had been handled ethically. Linstedt
proposes a checklist for project managers and technology implementers to address ethical
concerns over data:
1. Develop SLAs with end users that define who has access to what levels of information.
2. Have end users involved in defining the ethical standards of use for the data that will be
delivered.
3. Define the bounds around the integration efforts of public data – where it will be
integrated and where it will not – so as to avoid conflicts of interest.
4. Do not use "live" or real data for testing purposes, or else lock down the test
environment; too often test environments are left wide open and accessible to too many
individuals.
5. Define where, how, and by whom data mining will be used; restrict the mining efforts to
specific sets of information. Build a notification system to monitor data mining usage.
6. Allow customers to "block" the integration of their own information (this one is
questionable), depending on whether the customer information will be made available on
the web after integration.
7. Remember that any efforts made are still subject to governmental laws. Nothing is
sacred: if a government wants access to the information, they will get it.
1.3 Data warehouse
In computing, a data warehouse (DW, DWH), or enterprise data warehouse (EDW), is
a database used for reporting and data analysis. Integrating data from one or more
disparate sources creates a central repository of data, the data warehouse. Data
warehouses store current and historical data and are used to create trending reports for
senior management, such as annual and quarterly comparisons.
The data stored in the warehouse is uploaded from the operational systems (such as
marketing, sales, etc.). The data may pass through an
operational data store for additional operations before being used in the DW for reporting.
The typical extract-transform-load (ETL)-based data warehouse uses staging, data integration,
and access layers to house its key functions. The staging layer, or staging database, stores raw
data extracted from each of the disparate source data systems. The integration layer integrates
the disparate data sets by transforming the data from the staging layer, often storing this
transformed data in an operational data store (ODS) database. The integrated data are then
moved to yet another database, often called the data warehouse database, where the data are
arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts.
The combination of facts and dimensions is sometimes called a star schema. The access layer
helps users retrieve data.
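In miniature, the extract-transform-load flow into a star schema looks like the sketch below. The source formats, table names and rows are invented for illustration:

```python
# Extract raw rows from two "sources", transform them into a common
# shape, and load them into a dimension/fact layout (star schema).
staging = {
    "sales_system": [("2014-01-03", "widget", 3)],               # tuples
    "web_orders":   [{"day": "2014-01-03", "sku": "widget", "qty": 2}],
}

# Transform: normalise both source formats into (date, product, qty).
rows = list(staging["sales_system"])
rows += [(o["day"], o["sku"], o["qty"]) for o in staging["web_orders"]]

# Load: a product dimension plus a fact table keyed by dimension ids.
product_dim = {name: i for i, name in enumerate(sorted({r[1] for r in rows}))}
fact_sales = [(date, product_dim[prod], qty) for date, prod, qty in rows]
print(product_dim)   # {'widget': 0}
print(fact_sales)    # [('2014-01-03', 0, 3), ('2014-01-03', 0, 2)]
```

A real ETL pipeline would of course persist each layer in databases and handle errors, history and incremental loads, but the staging / integration / warehouse division is the same.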
A data warehouse constructed from integrated data source systems does not require ETL,
staging databases, or operational data store databases. The integrated data source systems
may be considered to be a part of a distributed operational data store layer. Data federation
methods or data virtualization methods may be used to access the distributed integrated
source data systems to consolidate and aggregate data directly into the data warehouse
database tables.
Unlike the ETL-based data warehouse, the integrated source data systems and the data
warehouse are all integrated since there is no transformation of dimensional or reference data.
This integrated data warehouse architecture supports the drill down from the aggregate data
of the data warehouse to the transactional data of the integrated source data systems.
A data mart is a small data warehouse focused on a specific area of interest. Data warehouses
can be subdivided into data marts for improved performance and ease of use within that area.
Alternatively, an organization can create one or more data marts as first steps towards a larger
and more complex enterprise data warehouse.
This definition of the data warehouse focuses on data storage. The main source of the data is
cleaned, transformed, catalogued and made available for use by managers and other business
professionals for data mining, online analytical processing, market research and decision
support (Marakas & O'Brien 2009). However, the means to retrieve and analyze data, to
extract, transform and load data, and to manage the data dictionary are also considered
essential components of a data warehousing system. Many references to data warehousing
use this broader context. Thus, an expanded definition for data warehousing includes business
intelligence tools, tools to extract, transform and load data into the repository, and tools to
manage and retrieve metadata.
Difficulties of Implementing Data Warehouses
Some significant operational issues arise with data warehousing: construction, administration,
and quality control. Project management—the design, construction, and implementation of