Data Mining: Opportunities and Challenges
John Wang
Montclair State University, USA
Hershey • London • Melbourne • Singapore • Beijing
IDEA GROUP PUBLISHING
Acquisition Editor: Mehdi Khosrow-Pour
Senior Managing Editor: Jan Travers
Managing Editor: Amanda Appicello
Development Editor: Michele Rossi
Copy Editor: Jane Conley
Typesetter: Amanda Appicello
Cover Design: Integrated Book Technology
Printed at: Integrated Book Technology
Published in the United States of America by
Idea Group Publishing (an imprint of Idea Group Inc.)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.idea-group.com
and in the United Kingdom by
Idea Group Publishing (an imprint of Idea Group Inc.)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 3313
Web site: http://www.eurospan.co.uk
Copyright © 2003 by Idea Group Inc. All rights reserved. No part of this book may be
reproduced in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Library of Congress Cataloging-in-Publication Data
Wang, John, 1955-
Data mining : opportunities and challenges / John Wang.
p. cm.
ISBN 1-59140-051-1
1. Data mining. I. Title.
QA76.9.D343 W36 2002
006.3--dc21
2002014190
eISBN
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.
NEW from Idea Group Publishing
Excellent additions to your institution’s library! Recommend these titles to your Librarian!
To receive a copy of the Idea Group Publishing catalog, please contact (toll free) 1/800-345-4332,
fax 1/717-533-8661, or visit the IGP Online Bookstore at:
[http://www.idea-group.com]!
Note: All IGP books are also available as ebooks on netlibrary.com as well as other ebook sources.
Contact Ms. Carrie Stull at [[email protected]] to receive a complete list of sources
where you can obtain ebook information or IGP titles.
• Digital Bridges: Developing Countries in the Knowledge Economy, John Senyo Afele/ ISBN: 1-59140-039-2; eISBN 1-59140-067-8, © 2003
• Integrative Document & Content Management: Strategies for Exploiting Enterprise Knowledge, Len Asprey and Michael Middleton/ ISBN: 1-59140-055-4; eISBN 1-59140-068-6, © 2003
• Critical Reflections on Information Systems: A Systemic Approach, Jeimy Cano/ ISBN: 1-59140-040-6; eISBN 1-59140-069-4, © 2003
• Web-Enabled Systems Integration: Practices and Challenges, Ajantha Dahanayake and Waltraud Gerhardt/ ISBN: 1-59140-041-4; eISBN 1-59140-070-8, © 2003
• Public Information Technology: Policy and Management Issues, G. David Garson/ ISBN: 1-59140-060-0; eISBN 1-59140-071-6, © 2003
• Knowledge and Information Technology Management: Human and Social Perspectives, Angappa Gunasekaran, Omar Khalil and Syed Mahbubur Rahman/ ISBN: 1-59140-032-5; eISBN 1-59140-072-4, © 2003
• Building Knowledge Economies: Opportunities and Challenges, Liaquat Hossain and Virginia Gibson/ ISBN: 1-59140-059-7; eISBN 1-59140-073-2, © 2003
• Knowledge and Business Process Management, Vlatka Hlupic/ ISBN: 1-59140-036-8; eISBN 1-59140-074-0, © 2003
• IT-Based Management: Challenges and Solutions, Luiz Antonio Joia/ ISBN: 1-59140-033-3; eISBN 1-59140-075-9, © 2003
• Geographic Information Systems and Health Applications, Omar Khan/ ISBN: 1-59140-042-2; eISBN 1-59140-076-7, © 2003
• The Economic and Social Impacts of E-Commerce, Sam Lubbe/ ISBN: 1-59140-043-0; eISBN 1-59140-077-5, © 2003
• Computational Intelligence in Control, Masoud Mohammadian, Ruhul Amin Sarker and Xin Yao/ ISBN: 1-59140-037-6; eISBN 1-59140-079-1, © 2003
• Decision-Making Support Systems: Achievements and Challenges for the New Decade, M.C. Manuel Mora, Guisseppi Forgionne and Jatinder N.D. Gupta/ ISBN: 1-59140-045-7; eISBN 1-59140-080-5, © 2003
• Architectural Issues of Web-Enabled Electronic Business, Nansi Shi and V.K. Murthy/ ISBN: 1-59140-049-X; eISBN 1-59140-081-3, © 2003
• Adaptive Evolutionary Information Systems, Nandish V. Patel/ ISBN: 1-59140-034-1; eISBN 1-59140-082-1, © 2003
• Managing Data Mining Technologies in Organizations: Techniques and Applications, Parag Pendharkar/ ISBN: 1-59140-057-0; eISBN 1-59140-083-X, © 2003
• Intelligent Agent Software Engineering, Valentina Plekhanova/ ISBN: 1-59140-046-5; eISBN 1-59140-084-8, © 2003
• Advances in Software Maintenance Management: Technologies and Solutions, Macario Polo, Mario Piattini and Francisco Ruiz/ ISBN: 1-59140-047-3; eISBN 1-59140-085-6, © 2003
• Multidimensional Databases: Problems and Solutions, Maurizio Rafanelli/ ISBN: 1-59140-053-8; eISBN 1-59140-086-4, © 2003
• Information Technology Enabled Global Customer Service, Tapio Reponen/ ISBN: 1-59140-048-1; eISBN 1-59140-087-2, © 2003
• Creating Business Value with Information Technology: Challenges and Solutions, Namchul Shin/ ISBN: 1-59140-038-4; eISBN 1-59140-088-0, © 2003
• Advances in Mobile Commerce Technologies, Ee-Peng Lim and Keng Siau/ ISBN: 1-59140-052-X; eISBN 1-59140-089-9, © 2003
• Mobile Commerce: Technology, Theory and Applications, Brian Mennecke and Troy Strader/ ISBN: 1-59140-044-9; eISBN 1-59140-090-2, © 2003
• Managing Multimedia-Enabled Technologies in Organizations, S.R. Subramanya/ ISBN: 1-59140-054-6; eISBN 1-59140-091-0, © 2003
• Web-Powered Databases, David Taniar and Johanna Wenny Rahayu/ ISBN: 1-59140-035-X; eISBN 1-59140-092-9, © 2003
• E-Commerce and Cultural Values, Theerasak Thanasankit/ ISBN: 1-59140-056-2; eISBN 1-59140-093-7, © 2003
• Information Modeling for Internet Applications, Patrick van Bommel/ ISBN: 1-59140-050-3; eISBN 1-59140-094-5, © 2003
• Data Mining: Opportunities and Challenges, John Wang/ ISBN: 1-59140-051-1; eISBN 1-59140-095-3, © 2003
• Annals of Cases on Information Technology – vol 5, Mehdi Khosrowpour/ ISBN: 1-59140-061-9; eISBN 1-59140-096-1, © 2003
• Advanced Topics in Database Research – vol 2, Keng Siau/ ISBN: 1-59140-063-5; eISBN 1-59140-098-8, © 2003
• Advanced Topics in End User Computing – vol 2, Mo Adam Mahmood/ ISBN: 1-59140-065-1; eISBN 1-59140-100-3, © 2003
• Advanced Topics in Global Information Management – vol 2, Felix Tan/ ISBN: 1-59140-064-3; eISBN 1-59140-101-1, © 2003
• Advanced Topics in Information Resources Management – vol 2, Mehdi Khosrowpour/ ISBN: 1-59140-062-7; eISBN 1-59140-099-6, © 2003
Table of Contents
Preface
John Wang, Montclair State University, USA
Chapter I
A Survey of Bayesian Data Mining
Stefan Arnborg, Royal Institute of Technology and Swedish Institute of Computer Science, Sweden
Chapter II
Control of Inductive Bias in Supervised Learning Using Evolutionary Computation: A Wrapper-Based Approach
William H. Hsu, Kansas State University, USA
Chapter III
Cooperative Learning and Virtual Reality-Based Visualization for Data Mining
Herna Viktor, University of Ottawa, Canada
Eric Paquet, National Research Council, Canada
Gys le Roux, University of Pretoria, South Africa
Chapter IV
Feature Selection in Data Mining
YongSeog Kim, University of Iowa, USA
W. Nick Street, University of Iowa, USA
Filippo Menczer, University of Iowa, USA
Chapter V
Parallel and Distributed Data Mining through Parallel Skeletons and Distributed Objects
Massimo Coppola, University of Pisa, Italy
Marco Vanneschi, University of Pisa, Italy
Chapter VI
Data Mining Based on Rough Sets
Jerzy W. Grzymala-Busse, University of Kansas, USA
Wojciech Ziarko, University of Regina, Canada
Chapter VII
The Impact of Missing Data on Data Mining
Marvin L. Brown, Hawaii Pacific University, USA
John F. Kros, East Carolina University, USA
Chapter VIII
Mining Text Documents for Thematic Hierarchies Using Self-Organizing Maps
Hsin-Chang Yang, Chang Jung University, Taiwan
Chung-Hong Lee, Chang Jung University, Taiwan
Chapter IX
The Pitfalls of Knowledge Discovery in Databases and Data Mining
John Wang, Montclair State University, USA
Alan Oppenheim, Montclair State University, USA
Chapter X
Maximum Performance Efficiency Approaches for Estimating Best Practice Costs
Marvin D. Troutt, Kent State University, USA
Donald W. Gribbin, Southern Illinois University at Carbondale, USA
Murali S. Shanker, Kent State University, USA
Aimao Zhang, Georgia Southern University, USA
Chapter XI
Bayesian Data Mining and Knowledge Discovery
Eitel J. M. Lauria, State University of New York, Albany, USA, Universidad del Salvador, Argentina
Giri Kumar Tayi, State University of New York, Albany, USA
Chapter XII
Mining Free Text for Structure
Vladimir A. Kulyukin, Utah State University, USA
Robin Burke, DePaul University, USA
Chapter XIII
Query-By-Structure Approach for the Web
Michael Johnson, Madonna University, USA
Farshad Fotouhi, Wayne State University, USA
Sorin Draghici, Wayne State University, USA
Chapter XIV
Financial Benchmarking Using Self-Organizing Maps – Studying the International Pulp and Paper Industry
Tomas Eklund, Turku Centre for Computer Science, Finland
Barbro Back, Åbo Akademi University, Finland
Hannu Vanharanta, Pori School of Technology and Economics, Finland
Ari Visa, Tampere University of Technology, Finland
Chapter XV
Data Mining in Health Care Applications
Fay Cobb Payton, North Carolina State University, USA
Chapter XVI
Data Mining for Human Resource Information Systems
Lori K. Long, Kent State University, USA
Marvin D. Troutt, Kent State University, USA
Chapter XVII
Data Mining in Information Technology and Banking Performance
Yao Chen, University of Massachusetts at Lowell, USA
Joe Zhu, Worcester Polytechnic Institute, USA
Chapter XVIII
Social, Ethical and Legal Issues of Data Mining
Jack S. Cook, Rochester Institute of Technology, USA
Laura L. Cook, State University of New York at Geneseo, USA
Chapter XIX
Data Mining in Designing an Agent-Based DSS
Christian Böhm, GIDSATD–UTN–FRSF, Argentina
María Rosa Galli, GIDSATD–UTN–FRSF and INGAR–CONICET, Argentina
Omar Chiotti, GIDSATD–UTN–FRSF and INGAR–CONICET, Argentina
Chapter XX
Critical and Future Trends in Data Mining: A Review of Key Data Mining Technologies/Applications
Jeffrey Hsu, Fairleigh Dickinson University, USA
About the Authors
Index
Preface
Data mining (DM) is the extraction of hidden predictive information from large databases (DBs). With the automatic discovery of knowledge implicit within DBs, DM uses
sophisticated statistical analysis and modeling techniques to uncover patterns and relationships hidden in organizational DBs. Over the last 40 years, the tools and techniques to
process structured information have continued to evolve from DBs to data warehousing
(DW) to DM. DW applications have become business-critical. DM can extract even more
value out of these huge repositories of information.
Approaches to DM are varied and often confusing. This book presents an overview
of the state of the art in this new and multidisciplinary field. DM is taking off for several reasons:
organizations are gathering more data about their businesses, costs of storage have dropped
drastically, and competitive business pressures have increased. Other factors include the
emergence of pressures to control existing IT investments, and last, but not least, the marked
reduction in the cost/performance ratio of computer systems. There are four basic mining
operations supported by numerous mining techniques: predictive model creation supported
by supervised induction techniques; link analysis supported by association discovery and
sequence discovery techniques; DB segmentation supported by clustering techniques; and
deviation detection supported by statistical techniques.
Although DM is still in its infancy, companies in a wide range of industries - including
retail, banking and finance, health care, manufacturing, telecommunication, and aerospace -
as well as government agencies are already using DM tools and techniques to take advantage of historical data. By using pattern-recognition technologies and statistical and mathematical techniques to sift through warehoused information, DM helps analysts recognize
significant facts, relationships, trends, patterns, exceptions, and anomalies that might otherwise go unnoticed.
In my February 2001 call for chapters, I sought contributions to this book that would
address a vast number of issues ranging from the breakthrough of new theories to case
studies of firms’ experiences with their DM. After one and a half years of preparation and a strict peer-review process, I am delighted to see the book appear on the
market. The primary objective of this book is to explore the myriad issues regarding DM,
specifically focusing on those areas that explore new methodologies or examine case studies. A broad spectrum of scientists, practitioners, graduate students, and managers, who
perform research and/or implement the discoveries, are the envisioned readers of this book.
The book contains a collection of twenty chapters written by a truly international team
of forty-four experts representing the leading scientists and talented young scholars from
eight countries (or areas): Argentina, Canada, Finland, Italy, South Africa, Sweden, Taiwan, and the
United States.
Chapter 1 by Arnborg reviews the fundamentals of inference and gives a motivation
for Bayesian analysis. The method is illustrated with dependency tests in data sets with
categorical data variables, and the Dirichlet prior distributions. Principles and problems for
deriving causality conclusions are reviewed and illustrated with Simpson’s paradox. Selection of decomposable and directed graphical models illustrates the Bayesian approach.
Bayesian and Expectation Maximization (EM) classification is described briefly. The material is illustrated by two cases, one in personalization of media distribution, and one in
schizophrenia research. These cases are illustrations of how to approach problems that exist
in many other application areas.
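Simpson's paradox, mentioned above as a warning about causal conclusions, can be reproduced with a few counts. The sketch below is only an illustration: the counts are illustrative numbers chosen for this example and are not taken from Arnborg's chapter, and the names `counts`, `stratum_rates`, and `pooled_rate` are hypothetical.

```python
from fractions import Fraction

# (successes, trials) per treatment and stratum; illustrative counts only,
# not data from the chapter.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def stratum_rates(treatment):
    """Success rate of a treatment within each stratum."""
    return {s: Fraction(k, n) for s, (k, n) in counts[treatment].items()}

def pooled_rate(treatment):
    """Success rate of a treatment after pooling all strata."""
    successes = sum(k for k, _ in counts[treatment].values())
    trials = sum(n for _, n in counts[treatment].values())
    return Fraction(successes, trials)

# A is better than B inside every stratum ...
a_wins_each_stratum = all(
    stratum_rates("A")[s] > stratum_rates("B")[s] for s in ("small", "large")
)
# ... yet B looks better once the strata are pooled.
b_wins_pooled = pooled_rate("B") > pooled_rate("A")

print(a_wins_each_stratum, b_wins_pooled)
```

The reversal arises because the two treatments were applied to the strata in very different proportions, so the stratum acts as a confounder in the pooled comparison.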
Chapter 2 by Hsu discusses the problem of Feature Selection (also called Variable
Elimination) in supervised inductive learning approaches to DM, in the context of controlling Inductive Bias - i.e., any preference for one (classification or regression) hypothesis
other than pure consistency with training data. Feature selection can be achieved using
combinatorial search and optimization approaches. This chapter focuses on data-driven
validation-based techniques, particularly the WRAPPER approach. Hsu presents a wrapper
that uses Genetic Algorithms for the search component and a validation criterion, based
upon model accuracy and problem complexity, as the Fitness Measure. This method is
related to the Classifier System of Booker, Goldberg and Holland (1989). Current research
relates the Model Selection criterion in the fitness to the Minimum Description Length
(MDL) family of learning criteria. Hsu presents two case studies in large-scale commercial
DM and decision support: crop condition monitoring, and loss prediction for insurance
pricing. Part of these case studies includes a synopsis of the general experimental framework, using the Machine Learning in Java (MLJ) and Data to Knowledge (D2K) Java-based
visual programming systems for DM and information visualization.
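The wrapper idea described above can be sketched in a few lines: a genetic algorithm searches over feature subsets, and each subset is scored by validating a model trained on it, with a penalty for subset size. This is a minimal sketch under stated assumptions and not Hsu's implementation: the data are synthetic, a nearest-centroid classifier stands in for the inducer, and every name and parameter value (`make_data`, `fitness`, `ga_wrapper`, `penalty=0.05`, and so on) is hypothetical.

```python
import random

def make_data(rng, n=200, n_noise=6):
    """Synthetic data: two informative features, then n_noise noise features."""
    rows = []
    for i in range(n):
        label = i % 2  # alternate labels so both classes are always present
        informative = [rng.gauss(2.0 * label, 1.0), rng.gauss(-2.0 * label, 1.0)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        rows.append((informative + noise, label))
    return rows

def centroid_accuracy(train, valid, mask):
    """Validation accuracy of a nearest-centroid classifier on masked features."""
    cols = [i for i, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    centroids = {}
    for label in (0, 1):
        members = [x for x, y in train if y == label]
        centroids[label] = [sum(x[c] for x in members) / len(members) for c in cols]
    def predict(x):
        return min(centroids, key=lambda lab: sum(
            (x[c] - m) ** 2 for c, m in zip(cols, centroids[lab])))
    return sum(predict(x) == y for x, y in valid) / len(valid)

def fitness(mask, train, valid, penalty=0.05):
    """Validation accuracy minus a complexity penalty on the subset size."""
    return centroid_accuracy(train, valid, mask) - penalty * sum(mask) / len(mask)

def ga_wrapper(train, valid, n_features, rng, pop_size=20, generations=15, p_mut=0.1):
    """Genetic search over bit masks; the fitness wraps the classifier."""
    score = lambda m: fitness(m, train, valid)
    population = [[1] * n_features]  # the full feature set serves as a baseline
    population += [[rng.randint(0, 1) for _ in range(n_features)]
                   for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        next_gen = ranked[:2]  # elitism: the two best masks survive unchanged
        while len(next_gen) < pop_size:
            p1 = max(rng.sample(ranked, 3), key=score)  # tournament selection
            p2 = max(rng.sample(ranked, 3), key=score)
            cut = rng.randrange(1, n_features)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=score)

rng = random.Random(0)
data = make_data(rng)
train, valid = data[:100], data[100:]
best = ga_wrapper(train, valid, n_features=8, rng=rng)
print(best, fitness(best, train, valid))
```

Because the best masks are carried over unchanged in each generation, the returned subset never scores worse than the full feature set that seeded the population.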
Chapter 3 by Herna Viktor, Eric Paquet, and Gys le Roux explores the use of visual DM
and virtual reality-based visualization in a cooperative learning environment. The chapter
introduces a cooperative learning environment in which multiple DM tools reside and describes the ViziMine DM tool used to visualize the cooperative DM process. The aim of the
ViziMine tool is twofold. Firstly, the data repository is visualized during data preprocessing
and DM. Secondly, the knowledge, as obtained through DM, is assessed and modified
through the interactive visualization of the cooperative DM process and its results. In this
way, the user is able to assess and possibly improve the results of DM to reflect his or her
domain expertise. Finally, the use of three-dimensional visualization, virtual reality-based
visualization, and multimedia DM is discussed. The chapter shows how these leading-edge
technologies can be used to visualize the data and its descriptors.
Feature subset selection is an important problem in knowledge discovery, not only for
the insight gained from determining relevant modeling variables but also for the improved
understandability, scalability, and possibly, accuracy of the resulting models. The purpose
of Chapter 4 is to provide a comprehensive analysis of feature selection via evolutionary
search in supervised and unsupervised learning. To achieve this purpose, Kim, Street, and
Menczer first discuss a general framework for feature selection based on a new search
algorithm, Evolutionary Local Selection Algorithm (ELSA). The search is formulated as a
multi-objective optimization problem to examine the trade-off between the complexity of the
generated solutions and their quality. ELSA considers multiple objectives efficiently
while avoiding computationally expensive global comparison. The authors combine ELSA
with Artificial Neural Networks (ANNs) and the EM algorithm for feature selection in supervised
and unsupervised learning, respectively. Further, they present a new two-level evolutionary algorithm, Meta-Evolutionary Ensembles (MEE), in which feature selection is used
to promote diversity among classifiers for ensemble classification.
Coppola and Vanneschi consider the application of parallel programming environments to develop portable and efficient high-performance DM tools. They discuss the main
issues in exploiting parallelism in DM applications to improve the scalability of several
mining techniques to large or geographically distributed DBs. The main focus of Chapter 5
is on parallel software engineering, showing that the skeleton-based, high-level approach
can be effective both in developing portable high-performance DM kernels, and in easing
their integration with other data management tools. Three test cases are described that
present parallel algorithms for association rules, classification, and clustering, starting from
the problem and going up to a concrete implementation. Experimental results are discussed
with respect to performance and software costs. To help the integration of high-level applications with existing environments, an object-oriented interface is proposed. This interface
complements the parallel skeleton approach and allows the use of a number of external
libraries and software modules as external objects, including shared-memory-distributed
objects.
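As a rough illustration of the skeleton idea, the sketch below applies a generic map-then-reduce pattern to the counting step of association-rule mining: each worker counts item pairs in its own partition, and the partial counts are merged. It is not the chapter's programming environment; Python's standard `concurrent.futures` module stands in for it, and the function names and the toy transactions are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def count_pairs(partition):
    """Worker task: count co-occurring item pairs in one partition."""
    counts = Counter()
    for transaction in partition:
        for pair in combinations(sorted(set(transaction)), 2):
            counts[pair] += 1
    return counts

def parallel_pair_support(transactions, n_workers=2):
    """Map count_pairs over the partitions, then reduce by summing counters."""
    partitions = [transactions[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(count_pairs, partitions))
    total = Counter()
    for counts in partial:
        total.update(counts)
    return total

# Toy transactions, invented for this example.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
support = parallel_pair_support(transactions)
print(support)
```

The application code only supplies the per-partition function and the merge step; the parallel structure itself stays generic. Threads are used here to keep the example self-contained, and in CPython they would not speed up CPU-bound counting.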
Rough set theory, originated by Z. Pawlak in 1982, among other applications, is a
methodological tool for DM and machine learning. The main advantage of rough set theory
is that it does not need any preliminary or additional information about data (such as probability distribution assumptions in probability classifier theory, grade of membership in fuzzy
set theory, etc.). Numerical estimates of uncertainty of rough set theory have immediate
interpretation in evidence theory (Dempster-Shafer theory). The chapter “Data Mining
Based on Rough Sets” by Grzymala-Busse and Ziarko starts from fundamentals of rough set
theory. Then two generalizations of rough set theory are presented: Variable Precision
Rough Set Model (VPRSM) and Learning from Examples using Rough Sets (LERS). The
prime concern of VPRSM is forming decision tables, while LERS produces rule sets. The two
generalizations of rough set theory are independent and neither can be reduced to the other.
Among many applications of LERS, those related to the medical area and natural language are
briefly described.
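The basic construction of rough set theory, the lower and upper approximation of a concept, needs nothing beyond the data table itself. The following minimal sketch uses a small invented decision table; the attribute names, the cases, and the function names are assumptions made for illustration and do not come from the chapter.

```python
from collections import defaultdict

def indiscernibility_classes(table, attributes):
    """Group cases that have identical values on the chosen attributes."""
    blocks = defaultdict(set)
    for case, row in table.items():
        blocks[tuple(row[a] for a in attributes)].add(case)
    return list(blocks.values())

def approximations(table, attributes, concept):
    """Return the (lower, upper) approximation of a set of cases."""
    lower, upper = set(), set()
    for block in indiscernibility_classes(table, attributes):
        if block <= concept:   # block lies entirely inside the concept
            lower |= block
        if block & concept:    # block overlaps the concept
            upper |= block
    return lower, upper

# Invented decision table: cases 1 and 2 are indiscernible but disagree on flu.
table = {
    1: {"temperature": "high", "headache": "yes", "flu": "yes"},
    2: {"temperature": "high", "headache": "yes", "flu": "no"},
    3: {"temperature": "normal", "headache": "no", "flu": "no"},
    4: {"temperature": "high", "headache": "no", "flu": "yes"},
    5: {"temperature": "normal", "headache": "yes", "flu": "no"},
}
concept = {case for case, row in table.items() if row["flu"] == "yes"}
lower, upper = approximations(table, ["temperature", "headache"], concept)
print(lower, upper)  # {4} and {1, 2, 4}
```

Case 4 certainly belongs to the concept, while cases 1 and 2 only possibly belong to it; the difference between the two approximations is the boundary region that makes the set rough.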
DM is based upon searching the concatenation of multiple DBs that usually contain
some amount of missing data along with a variable percentage of inaccurate data, pollution,
outliers, and noise. During the last four decades, statisticians have attempted to address the
impact of missing data on IT. Chapter 7 by Brown and Kros commences with a background
analysis, including a review of both seminal and current literature. Reasons for data inconsistency along with definitions of various types of missing data are discussed. The chapter
mainly focuses on methods of addressing missing data and the impact that missing data has
on the knowledge discovery process via prediction, estimation, classification, pattern recognition, and association rules. Finally, trends regarding missing data and DM are discussed,
in addition to future research opportunities and concluding remarks.
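To make the impact of a missing-data method concrete, the sketch below compares two common treatments, deleting incomplete cases and imputing the mean, on one hypothetical attribute. These are generic textbook methods and not necessarily the ones Brown and Kros emphasize; the values are invented.

```python
from statistics import mean, pvariance

# Hypothetical attribute with two missing values (None).
values = [2.0, 4.0, None, 8.0, None, 6.0]

def listwise_delete(xs):
    """Keep only the observed values."""
    return [x for x in xs if x is not None]

def mean_impute(xs):
    """Replace each missing value with the mean of the observed values."""
    fill = mean(listwise_delete(xs))
    return [fill if x is None else x for x in xs]

observed = listwise_delete(values)
imputed = mean_impute(values)

print(mean(observed), pvariance(observed))
print(mean(imputed), pvariance(imputed))
```

Mean imputation keeps the sample size and the mean but shrinks the variance, which can make later estimates look more certain than the data warrant.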
In Chapter 8, Yang and Lee use a self-organizing map to cluster documents and form
two feature maps. One of the maps, namely the document cluster map, clusters documents
according to the co-occurrence patterns of the terms that appear in the documents. The other
map, namely the word cluster map, is obtained by selecting the words of common interest for
those documents in the same cluster. They then apply an iterative process to these maps to
discover the main themes and generate hierarchies of the document clusters. The hierarchy
generation and theme discovery process both utilize the synaptic weights developed after
the clustering process using the self-organizing map. Thus, their technique incorporates the
knowledge from the neural networks and may provide promising directions in other knowledge-discovery applications. Although this work was originally designed for text categorization tasks, the hierarchy mining process developed by these authors also poses an
interesting direction in discovering and organizing unknown knowledge.
Although DM may often seem a highly effective tool for companies to be using in their
business endeavors, there are a number of pitfalls and/or barriers that may impede these
firms from properly budgeting for DM projects in the short term. In Chapter 9, Wang and
Oppenheim indicate that the pitfalls of DM can be categorized into several distinct categories. The authors explore the issues of accessibility and usability, affordability and efficiency, scalability and adaptability, systematic patterns vs. sample-specific patterns, explanatory factors vs. random variables, segmentation vs. sampling, accuracy and cohesiveness, and standardization and verification. Finally, they present the technical challenges
regarding the pitfalls of DM.
Chapter 10 by Troutt, Gribbin, Shanker, and Zhang proposes the principle of Maximum
Performance Efficiency (MPE) as a contribution to the DM toolkit. This principle seeks to
estimate optimal or boundary behavior, in contrast to techniques like regression analysis
that predict average behavior. This MPE principle is explained and used to estimate best-practice cost rates in the context of an activity-based costing situation where the authors
consider multiple activities contributing to a single cost pool. A validation approach for this
estimation method is developed in terms of what the authors call normal-like-or-better performance effectiveness. Extensions to time series data on a single unit, and marginal cost-oriented basic cost models are also briefly described.
One of the major problems faced by DM technologies is how to deal with uncertainty.
Bayesian methods provide an explicit way of using probability for quantifying uncertainty.
The purpose of Chapter 11 by Lauria and Tayi is twofold: to provide an overview of the
theoretical framework of Bayesian methods, and to describe its application to DM, with special emphasis
on statistical modeling and machine learning techniques. Topics covered include Bayes
Theorem and its implications, Bayesian classifiers, Bayesian belief networks, statistical computing, and an introduction to Markov Chain Monte Carlo techniques. The coverage of
these topics has been augmented by providing numerical examples.
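A small numerical example shows how Bayes' theorem turns a prior and a likelihood into a posterior. The sketch is illustrative only: the hypotheses, the probabilities, and the function name `posterior` are assumptions for this example and are not taken from the chapter.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Bayes' theorem over a finite set of hypotheses.

    priors[h] is P(h); likelihoods[h] is P(evidence | h).
    Returns P(h | evidence) for every hypothesis h.
    """
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # P(evidence), by total probability
    return {h: joint[h] / evidence for h in joint}

# Hypothetical numbers: 1% of transactions are fraudulent; an alert fires
# on 90% of fraudulent and on 5% of legitimate transactions.
priors = {"fraud": Fraction(1, 100), "legit": Fraction(99, 100)}
likelihoods = {"fraud": Fraction(9, 10), "legit": Fraction(5, 100)}

post = posterior(priors, likelihoods)
print(post["fraud"])  # 2/13
```

Even with a fairly accurate alert, the posterior probability of fraud is only about 15 percent, because the prior is so low. Explicit quantification of this kind is what the Bayesian approach contributes.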
Knowledge of the structural organization of information in documents can be of significant assistance to information systems that use documents as their knowledge bases. In
particular, such knowledge is of use to information retrieval systems that retrieve documents
in response to user queries. Chapter 12 by Kulyukin and Burke presents an approach to
mining free-text documents for structure that is qualitative in nature. It complements the
statistical and machine learning approaches insomuch as the structural organization of information in documents is discovered through mining free text for content markers left behind
by document writers. The ultimate objective is to find scalable DM solutions for free-text
documents in exchange for modest knowledge engineering requirements.
Chapter 13 by Johnson, Fotouhi, and Draghici presents three systems that incorporate
document structure information into a search of the Web. These systems extend existing
Web searches by allowing the user to not only request documents containing specific
search words, but also to specify that documents be of a certain type. In addition to being
able to search a local DB, all three systems are capable of dynamically querying the Web.
Each system applies a query-by-structure approach that captures and utilizes structure
information as well as content during a query of the Web. Two of the systems also employ
Neural Networks (NNs) to organize the information based on relevancy of both the content
and structure. These systems utilize a supervised Hamming NN and an unsupervised competitive
NN, respectively. Initial testing of these systems has shown promising results when
compared to straight keyword searches.
Chapter 14 seeks to evaluate the feasibility of using self-organizing maps (SOMs) for
financial benchmarking of companies. Eklund, Back, Vanharanta, and Visa collected a number of annual reports from companies in the international pulp and paper industry, for the
period 1995-2000. They then created a financial DB consisting of a number of financial ratios,
calculated based on the values from the income and balance sheets of the annual reports.
The financial ratios used were selected based on their reliability and validity in international
comparisons. The authors also briefly discuss issues related to the use of SOMs, such as
data pre-processing, and the training of the map. The authors then perform a financial
benchmarking of the companies by visualizing them on a SOM. This benchmarking includes
finding the best and poorest performing companies, illustrating the effects of the Asian
financial crisis, and comparing the performance of the five largest pulp and paper companies.
The findings are evaluated using existing domain knowledge, i.e., information from the textual parts of the annual reports. The authors found the SOM to be a feasible tool for financial
benchmarking.
In Chapter 15, general insight into DM with emphasis on the health care industry is
provided by Payton. The discussion focuses on earlier electronic commerce health care
initiatives, namely community health information networks (CHINs). CHINs continue to be
widely debated by leading industry groups, such as The Healthy Cities Organization and
The IEEE-USA Medical Technology and Policy Committee. These applications raise issues
about how patient information can be mined to enable fraud detection, profitability analysis,
patient profiling, and retention management. Notwithstanding these DM capabilities, social
issues abound.
In Chapter 16, Long and Troutt discuss the potential contributions DM could make
within the Human Resource (HR) function. They provide a basic introduction to DM techniques and processes and survey the literature on the steps involved in successfully mining
this information. They also discuss the importance of DW and datamart considerations. A
discussion of the contrast between DM and more routine statistical studies is given. They
examine the value of HR information to support a firm’s competitive position and for support
of decision-making in organizations. Examples of potential applications are outlined in terms
of data that is ordinarily captured in HR information systems. They note that few DM
applications have been reported to date in the literature and hope that this chapter will spur
interest among upper management and HR professionals.
The banking industry spends large amounts on IT with the expectation that
the investment will result in higher productivity and improved financial performance. However, bank managers make decisions on how to spend large IT budgets without accurate
performance measurement systems on the business value of IT. It is a challenging DM task
to investigate banking performance as a result of IT investment, because numerous financial
and banking performance measures are present with the new IT cost category. Chapter 17 by
Chen and Zhu presents a new DM approach that examines the impact of IT investment on
banking performance, measures the financial performance of banking, and extracts performance patterns. The information obtained will provide banks with the most efficient and
effective means to conduct business while meeting internal operational performance goals.
Chapter 18 by Cook and Cook highlights both the positive and negative aspects of
DM. Specifically, the social, ethical, and legal implications of DM are examined through
recent case law, current public opinion, and small industry-specific examples. There are
many issues concerning this topic. Therefore, the purpose of this chapter is to expose the
reader to some of the more interesting ones and provide insight into how information systems (ISs) professionals and businesses may protect themselves from the negative ramifications associated with improper use of data. The more experience with and exposure to social,
ethical, and legal concerns with respect to DM, the better prepared the reader will be to
prevent trouble in the future.
Chapter 19 by Böhm, Galli, and Chiotti presents a DM application to software engineering. In particular, it describes the use of DM in different parts of the design process of an
agent-based architecture for a dynamic decision-support system. By using DM techniques, a
discriminating function to classify the system domains is defined. From this discriminating
function, a system knowledge base is designed that stores the values of the parameters
required by such a function. Also, by using DM, a data structure for analyzing the system
operation results is defined. Based on this structure, a case base that stores information on the
quality of performed searches is designed. By mining this case base, rules to infer possible
causes of domain classification errors are specified. Based on these rules, a learning mechanism to update the knowledge base is designed.
DM is a field that is experiencing rapid growth and change, and new applications and
developments are constantly being introduced. While many of the traditional statistical
approaches to DM are still widely used, new technologies and uses for DM are coming to the
forefront. The purpose of Chapter 20 is to examine and explore some of the newer areas of
DM that are expected to have much impact not only for the present, but also for the future.
These include the expanding areas of Web and text mining, as well as ubiquitous, distributed/collective, and phenomenal DM. From here, the discussion turns to the dynamic areas
of hypertext, multimedia, spatial, and geographic DM. For those who love numbers and
analytical work, constraint-based and time-series mining are useful ways to better understand
complex data. Finally, some of the most critical applications are examined, including
bioinformatics.
References
Booker, L.B., Goldberg, D.E., & Holland, J.H. (1989). Classifier Systems and Genetic Algorithms. Artificial Intelligence, 40, 235-282.
Pawlak, Z. (1982). Rough Sets. International Journal of Computer and Information Sciences, 11, 341-356.
Acknowledgments
The editor would like to acknowledge the help of all involved in the development and review process of the book, without whose support the project could not have been satisfactorily completed. Thanks go to all those who provided constructive and comprehensive reviews. However, some of the reviewers must be mentioned as their reviews set the benchmark. Reviewers who provided the most comprehensive, critical, and constructive comments include: Nick Street of University of Iowa; Marvin D. Troutt of Kent State University; Herna Viktor of University of Ottawa; William H. Hsu of Kansas State University; Jack S. Cook of Rochester Institute of Technology; and Massimo Coppola of University of Pisa.
The support of the Office of Research and Sponsored Programs at Montclair State University is hereby gratefully acknowledged for awarding me a Career Development Project Fund in 2001.
A further special note of thanks goes also to the publishing team at Idea Group Publishing, whose contributions throughout the whole process, from inception of the idea to final publication, have been invaluable. In particular, thanks to Michele Rossi, whose continuous prodding via e-mail kept the project on schedule, and to Mehdi Khosrowpour, whose enthusiasm motivated me to initially accept his invitation to take on this project. In addition, Amanda Appicello at Idea Group Publishing made numerous corrections, revisions, and beautifications. Also, Carrie Stull Skovrinskie helped lead the book to the market.
In closing, I wish to thank all of the authors for their insights and excellent contributions to this book. I also want to thank a group of anonymous reviewers who assisted me in the peer-review process. In addition, I want to thank my parents (Houde Wang & Junyan Bai) for their encouragement, and last but not least, my wife Hongyu Ouyang for her unfailing support and dedication during the long development period, which culminated in the birth of both this book and our first boy, Leigh Wang, almost at the same time. Like a baby, DM has a bright and promising future.
John Wang, Ph.D.
Montclair State University
March 31, 2002