Data Mining Solutions: Methods and Tools for Solving Real-World Problems

Overview

Cutting-edge data mining techniques and tools for solving your toughest analytical problems

Data Mining Solutions

In down-to-earth language, data mining experts Christopher Westphal and Teresa Blaxton introduce a brand-new approach to data mining analysis. Through their extensive real-world experience, they have developed and documented many practical and proven techniques to make your own data mining efforts more successful. You'll get a refreshing "out-of-the-box" approach to data mining that will help you maximize your time and problem-solving resources, and prepare for the next wave of data mining: visualization. You will read about ways in which data mining has been used to:
* Discover patterns of insider trading in the stock market
* Evaluate the utility of marketing campaigns
* Analyze retail sales patterns across geographic regions
* Identify money laundering operations
* Target DNA sequences for pharmaceutical testing and development

The book is accompanied by a CD-ROM that contains:
* Demo and trial versions of numerous visual data mining tools
* Active web-page links for each of the products profiled
* GIF files corresponding to all book images

Editorial Reviews

Booknews

Constructed along two broad dimensions, this text first presents a set of methodologies for preparing environments that perform data mining. The second part presents an overview of data mining as it is currently being done with a variety of technologies, emphasizing the use of data visualization with case studies of actual data mining engagements. A CD-ROM contains demo and trial versions of visual data mining tools, active web-page links for each profiled product and GIF files corresponding to book images. Annotation c. by Book News, Inc., Portland, Or.

Product Details

ISBN-13: 9780471253846
Publisher: Wiley, John & Sons, Incorporated
Publication date: 1/28/1998
Series: Toolkits Series, #3
Edition description: BK&CD-ROM
Edition number: 1
Pages: 640
Product dimensions: 7.51 (w) x 9.18 (h) x 1.44 (d)

Meet the Author

CHRISTOPHER WESTPHAL has been successfully performing visual data mining engagements and research for over a decade for both government and commercial interests. He has developed a wide range of methodologies and unique techniques to support the data mining process. He may be reached at chris_westphal@yahoo.com.

TERESA BLAXTON is an Associate Professor of Psychiatry in the University of Maryland School of Medicine. She has published over 100 articles on knowledge storage, representation, and retrieval, both in biological (human) and artificial (computer) systems. She may be reached at tblaxton@mprcwb.ab.umd.edu.

Read an Excerpt

Data Mining Solutions: Methods and Tools for Solving Real-World Problems

ISBN: 0471253847

Christopher Westphal / Teresa Blaxton

FOREWORD
The term data mining can be used to describe a wide range of activities. A marketing company using historical response data to build models to predict who will respond to a direct mail or telephone solicitation is using data mining. A manufacturer analyzing sensor data to isolate conditions that lead to unplanned production stoppages is also using data mining. The government agency that sifts through the records of financial transactions looking for patterns that indicate money laundering or drug smuggling is mining data for evidence of criminal activity. An automated search through a document archive or the World Wide Web for articles on a certain topic may be thought of as data mining as well.

With so many different activities all going by the same name, there is ample room for confusion. Talk to two self-described data miners, and you may find that they work with completely different kinds of data to address completely different problems using completely different tools and techniques. Recently, at a data mining conference, I went to dinner with several other data mining practitioners. Over a delicious and elegantly prepared meal of sushi and sashimi, we discussed our work. By the time the second pot of green tea arrived it was clear there was very little overlap in our experience. One of us had worked on systems to tell the difference between whales and submarines in SONAR data, another had studied property insurance claims from areas hit by earthquakes and hurricanes in order to help plan federal disaster relief programs. My own project at the time involved studying data from several printing plants in order to get a better understanding of the factors contributing to paper waste for a publishing company. It was fascinating to hear each other's stories. It was also clear that we were all using the same term to describe very different activities.

How can we make sense of such an amorphous topic? Taking a cue from the online analytic processing (OLAP) world, we can impose some order on the field by thinking of it as a three-dimensional space defined by the following axes:
* Underlying task
* Nature of goal
* Degree of structure in the data

By locating a given data mining project along each of these axes, we can place it in a box with other similar projects. Each box calls for a different mix of data mining algorithms and approaches. Let's take a closer look at each of these dimensions.

Data Mining Tasks
There are many ways that data mining tasks could be classified. In fact, Gordon Linoff and I present a slightly different classification of data mining tasks in our book Data Mining Techniques for Marketing, Sales and Customer Support (John Wiley & Sons, 1997). To keep our model simple, we want a small number of broadly defined tasks that can be used to categorize the underlying activity in a data mining project. For this purpose, I have given our data mining task dimension four members, each of which is defined below:
* Classification
* Estimation
* Segmentation
* Description

Classification
A classification task consists of placing labels on records. The labels come from a small predefined set (good/bad or red/white/rosé). The job of the data miner is to build a model that will successfully route incoming records into the correctly labeled bucket. Prediction is simply a special case of classification where you may have to wait a bit longer to find out if your classification was correct.

Estimation
Estimation is the task of filling in a missing (usually, but not always, numeric) value in a particular field of an incoming record as a function of other fields in the record. The usual statistical regression techniques are most often employed for estimation. Estimation is also a popular application of artificial neural networks. Figuring out who is likely to respond to a credit card balance transfer offer is a classification task. Figuring out the likely size of the transferred balance is an estimation task.
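
The balance-transfer distinction can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a method taken from the book: the records, the field names (`balance`, `transferred`), the `min_balance` cutoff, and the helper functions are all hypothetical. The classifier places one label from a fixed set on a record, while the estimator fills in a numeric field using ordinary least squares on a single input field.

```python
# Hypothetical customer history; every value is invented for illustration.
history = [
    {"balance": 1_000, "transferred": 900},
    {"balance": 2_000, "transferred": 1_900},
    {"balance": 3_000, "transferred": 2_900},
    {"balance": 4_000, "transferred": 3_900},
]

def classify_responder(record, min_balance=1_500):
    """Classification: place one label from a small predefined set."""
    # The cutoff is an assumed rule standing in for a trained model.
    return "responder" if record["balance"] >= min_balance else "non-responder"

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def estimate_transfer(record, slope, intercept):
    """Estimation: fill in a missing numeric field from another field."""
    return slope * record["balance"] + intercept

slope, intercept = fit_line(
    [r["balance"] for r in history],
    [r["transferred"] for r in history],
)

incoming = {"balance": 2_500}          # a new record with no known outcome
label = classify_responder(incoming)   # a label from the predefined set
amount = estimate_transfer(incoming, slope, intercept)  # a numeric value
```

The two functions answer different questions about the same incoming record: `classify_responder` says which bucket it belongs in, and `estimate_transfer` says how large the missing value is likely to be.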

Segmentation
It often happens that, in a large population, there are so many competing patterns that they cancel one another. In order to see what is really going on, you have to break the population into smaller sub-populations having similar behavior. Within these sub-populations, all kinds of predictions may be possible. Various cluster detection, affinity grouping and link analysis techniques can be applied to segmentation tasks.
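
A small numeric sketch shows how competing patterns can cancel. The data are invented, and the segment labels `A` and `B` are assumed to be known already; in practice a cluster detection step would have to find them. In segment A the second value rises with the first, in segment B it falls, and the correlation over the whole population comes out at zero even though each sub-population shows a perfect relationship.

```python
import math
from collections import defaultdict

def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical records: (segment, x, y). The two segments behave oppositely.
records = [("A", x, x) for x in range(1, 6)] + \
          [("B", x, 6 - x) for x in range(1, 6)]

# Whole population: the two opposing trends cancel.
overall = pearson([(x, y) for _, x, y in records])

# Break the population into sub-populations and measure each separately.
by_segment = defaultdict(list)
for segment, x, y in records:
    by_segment[segment].append((x, y))

per_segment = {seg: pearson(pairs) for seg, pairs in by_segment.items()}
```

Here `overall` is zero while `per_segment` holds a strong positive relationship for A and a strong negative one for B, which is the situation the paragraph above describes.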

Description
Data mining can also be employed to give the analyst a clearer idea of what is going on in the data. Visualization techniques that allow clusters and linkages that would be imperceptible in a table or textual display to be spotted quickly and intuitively by eye fall under this heading. Descriptive data mining can be used to perform classification tasks even when the classes themselves are not well defined. In fraud detection, for example, a decision tree classifier can only be built based on known examples of fraudulent behavior. A good link analysis program coupled with visualization can turn this problem into one that humans perform very well--looking at a picture and spotting the anomalous or "interesting" bits.

Predictive and Descriptive Data Mining Goals
The next axis of our data mining cube puts data mining activities into one of two bins depending on whether the primary goal of the exercise is to produce models or information. On the predictive end of the spectrum, the goal of data mining is to produce a model, expressed as executable code, that can be used to score a database, route claims, or perform some other classification or estimation task. On the descriptive end, the goal is to gain understanding by uncovering patterns and relationships. For example, this book contains numerous examples of forensic data mining where the goal is to uncover evidence of criminal activity.

Structured and Unstructured Data
Structured data is the kind of thing found in most computer databases--fixed length, fixed format records with fields that contain numeric values, character codes or short strings. The data mining techniques that work well for this kind of data are often incapable of coping with less structured data such as free text or data that are structured very differently such as pixel maps derived from photographic images.

The Book You are About to Read
Now that I have constructed this cubic conceptual framework for data mining, I can see that my own work is concentrated in the corner where the primary task is classification or prediction, the goal is a predictive model and the data are highly structured. It was refreshing to read a book written by people whose primary practice is at the opposite corner of the cube where the tasks are largely descriptive, the goal is to gain understanding, and much of the data are unstructured.

Christopher Westphal and Teresa Blaxton have expanded my view of the data mining landscape with their wide-ranging coverage of topics such as phonetic name matching, visual methods for analyzing data, and the distinction between procedural and declarative knowledge. Their extensive coverage of data mining software tools will also prove valuable to readers who are planning to set up their own data mining environments. This book covers ground that has not been explored in any other book geared to a general audience. Enjoy it!

Michael Berry

INTRODUCTION
Our work within the field of data mining has convinced us that there is an overwhelming demand for a resource text that addresses all aspects of how to best perform a successful data mining endeavor. Although the enterprise of data mining has been part of the information technology vernacular for several years, many people still view the process as black magic. This is partly because no established road maps or procedures have been formally identified for guiding analysts to profitable outcomes. Thus, data mining activities often proceed in a haphazard fashion with mixed results. This book will lead you through a set of methodologies and related technologies that address all phases of the data mining process and bring together issues that have never before been discussed in the context of data mining. Our vision is that this book will become an invaluable resource for people engaged in analyzing patterns and trends within complex data sources. Therefore, the book is designed to provide value to curious students, sophisticated investigators, programmers and system designers, as well as corporate management personnel.

In this book, we describe how best to prepare environments for performing data mining and discuss approaches that have proven to be critical in revealing important patterns and trends in large data sets. From our experience, one of the best data mining technologies available is visualization because graphical display methods often offer superior results compared to other more conventional data mining techniques. Visual tools have traditionally been used by high-end intelligence agencies but have recently become accessible and useful as a practical, cost-effective approach for many businesses and corporate settings. In the interest of providing a complete picture of data mining, this book contains descriptions of a variety of approaches and tools that offer useful capabilities for the solution of pattern detection problems in data sets. As the use of visualization techniques becomes more mainstream, we feel confident that many non-visual tools will be extended to include the use of visual display interfaces.

This book will further provide you with a comprehensive review of the principles, techniques, and applications that can be used to perform data mining. It is our expectation that once you have completed this text, you will be able to initiate a data mining activity effectively. Data Mining Solutions is a text that can be used by a wide range of readers, from beginners wishing to learn about the enterprise of data mining to analysts and technical programmers who will be engaged directly in the process. This book reviews some fundamental methods that can be used to structure and analyze large quantities of data derived from multiple sources. It also discusses a wide array of technologies and products that have been developed to support data mining analysis. Although it is easy to focus on the technologies, as you read through the book please bear in mind that technology alone does not provide the entire solution. Rather, it is the methodologies you use for applying the technologies to specific problems that are of paramount importance in determining the success of data mining activities. No tool, no matter how advanced, will produce useful results unless it is applied with the proper methodological approach.

It is our expectation that after reading this text, you will have the requisite knowledge to tackle all phases of the data mining process. To this end, the book is constructed along two broad dimensions. First, we present a set of methodologies for approaching the data mining process. These are based largely on the theories and processes that we have developed and found useful over the past several years while implementing visual data mining systems. We have addressed all aspects of data mining, including formatting and extraction of data, the creation of models to be applied to extracted data, selection of display techniques, and interpretation of patterns revealed by the process.

The second aim of the book is to present a comprehensive overview of data mining as it is currently being done with a variety of technologies, with heavy emphasis on the use of data visualization. There is presently no other publication of this length and detail that allows the user to make comparisons among systems and/or techniques in order to choose the one(s) that provide the most appropriate solution for a given application. Our philosophy is that it is highly unlikely that any single product or approach can handle the entire data mining process. Rather, solutions are based on a combination of technologies and methodologies. Therefore data mining practitioners must consider the organizational requirements, available data sources, and corporate policies and procedures pending on every project. This book provides the background information necessary to evaluate options and make informed decisions at all phases of the data mining activity. All topics discussed have real-world examples provided to make understanding the process simple, straightforward, and transferable to the reader's own domain of interest.

The book consists of four sections with sixteen chapters written by experienced authors who have been performing advanced data mining using information visualization technologies in such varied domains as narco-terrorism, money laundering, insider trading, claims fraud, retail sales, biomedical research, and telecommunications. We have worked with many different clients on all sorts of data mining applications to help them detect patterns, expose inconsistent or suspicious structures, and reveal criteria activities within their data sets. As you go through the book, you will see real-world examples that serve as illustrations of the ways in which data mining principles and technologies can be applied. These examples are drawn from actual engagements, many of which we have participated in directly. The various examples and tools were selected for their utility in addressing various classes of problems encountered during data mining activities. We have placed special emphasis on the utilization of visualization, but also discuss and contrast nonvisual methods where appropriate.

One of our goals in writing this book was to minimize the hype associated with data mining. Rather than making false promises that overstep the bounds of what can reasonably be expected from data mining, we have tried to take a more objective approach. We describe the processes and procedures that are necessary to produce reliable and useful results in data mining applications. We do not advocate the use of any particular product over another. Rather, we have tried to give you enough information to decide which data mining tools and techniques are most appropriate for your particular situation.

SECTION I: Defining the Data Mining Approach. This section addresses important principles about the nature of data. To be an effective data mining practitioner, you need to understand all aspects of the data that will be used in the analyses. Knowing how to exploit data effectively can help you to use available technologies to reveal the hidden patterns and trends contained therein. To some readers, this discussion may appear trivial; however, it is probably one of the most important sections of the book. This discussion provides you with an overview of many important factors and approaches to consider when performing data mining.

SECTION II: Preparing and Analyzing Data. The techniques used to manipulate data are based on a variety of analytical methodologies. There are indicators contained within every data set that will reveal patterns; the difficulty is in knowing what type of patterns are available, what the predominate cues are, and how to interpret the results. Although every domain is unique, many features contained within different sources of data will embody consistent traits that can be globally exploited. This discussion reviews many of these approaches and methodologies and is structured to reflect an intelligence production cycle for accessing data, analyzing patterns, and presenting results.

SECTION III: Assessing Data Mining Tools and Technologies. There are a host of software tools and technologies available to businesses interested in performing data mining; this section presents you with an overview of categories of such tools and a sampling of the approaches available. The intent is not to provide a feature-specific review of tools, but to discuss the effectiveness of different data mining solutions for specific types of problems.

In Section III we focus on the application of visualization technologies available in data mining tools. Visualization tools are becoming more and more popular among data miners, and you need to know what types of tools are available, when to use them, and how to use them appropriately in your analytical engagements. In this section we introduce a host of visualization systems and briefly discuss their unique capabilities and typical uses. We divide visualization paradigms into three categories: link analysis, landscapes, and quantitative displays. Each chapter provides you with relevant examples and screen captures, information about ease of use, and other factors that businesses need to consider when applying visual data mining technologies to their environments. All tools presented in this section apply to both large and small business concerns. The section concludes with a future trends chapter that provides insight into pending developments in data mining.

SECTION IV: Case Studies. To tie everything neatly together, we present a series of descriptions of actual data mining engagements. The examples provide you with explicit illustrations of the types of patterns that typically can be found within the application domains presented. Each chapter in this section provides the background knowledge necessary to understand the domain and its associated problems. In addition we have tried to include descriptions of the logistical difficulties that the analysts experienced during the engagement. Therefore each case study includes war stories about real problems that were encountered as well as the useful outcomes that were derived from the analyses.

APPENDIX: Tool and Technology Resources. This section has information on the tools referenced throughout the book. We strongly suggest that you consult directly with vendors if you are interested in applying any of these tools in an application of your own. As with all quickly progressing technologies, numerous enhancements and features will have been added to many of these systems by the time this book goes to press. Therefore, we ask that you check directly with the vendors to learn about these changes before making any final decisions.

Our hope for this book is that it will fulfill a need that has not been met previously by providing you with a roadmap for the data mining process. The information presented here was gleaned from years of experience with many real-world data mining cases. We have tried to extrapolate general principles from those experiences that you can put to use in your own work. We believe that this book provides information that will put you ahead of the game when developing your own data mining applications. We hope you enjoy reading this book and learn from its content.

Section I: DEFINING THE DATA MINING APPROACH

We have gradually grown accustomed to the fact that there are tremendous volumes of data filling our computers, networks, and lives. Government agencies and businesses have all dedicated enormous resources to collecting and storing this information. Presumably, information is amassed because someone at some point in time imagined an important use for it. In reality, however, only a small amount of these data will ever be used because in many cases the volumes are simply too large to manage or the data structures themselves are too complicated to be accessed effectively.

How could this happen? The primary reason is that the original effort to create a data set is often focused on issues such as storage efficiency and does not include a plan for how the data will eventually be used and analyzed. Thus, by the time analysts wish to use a data set to answer questions, they often find themselves at a significant disadvantage with little hope for future success. Data mining methodologies are aimed at solving some of these problems. The ultimate goal of a data mining exercise is to discover hidden patterns in these complex information sources.

Making the Most of Your Resources
What data do you have available to you for analysis? Data warehouses, legacy databases, and corporate information systems are ideal information sources to be used for detecting patterns and trends through the application of data mining techniques. As you think about the sources of information that can be used to perform data mining, keep an open mind about what sorts of data might be useful. Of course your main customer-tracking databases, accounting systems, inventory control, and other very obvious and explicit data sets are good candidates. Now dig a bit deeper. Where else can you derive relevant data within your organization? If you are doing forensic accounting, you might be able to use phone-call data collected by the internal telephone switch at most corporations. Communication patterns may also be traced by accessing records of e-mail transactions maintained on your network servers. If your application question bears on issues concerning access to building facilities, consider that some security departments have badge reader data that record time and ID numbers of people entering and leaving certain locations. Your personnel department maintains all of the home address and phone data on nonexempt staff. Inquiries about embezzlement or collusion may benefit from the comparison of these data to those contained on suspicious vendor invoices. When you think about it, expense reports, corporate credit card statements, travel advances, insurance claims, equipment maintenance reports, point-of-sale charges, warranty records, damaged goods receipts, shipping invoices, merchandise sales and returns, weather reports, and just about any other piece of information collected can be used in a data mining application.

In the event that the necessary data are not available or do not yet exist, they can usually be generated. For example, suppose that your information is reported at a level of detail above what is needed for the analysis. Imagine that you had a medical application in which you were trying to trace the origins of a series of apparent adverse reactions to some combination of medications. You might have coded the names of all medications being taken for each patient case in your study, yet still fail to find a definitive link showing that a particular drug was consistently present in the adverse reaction cases. One approach might be to create a new data source by listing the names of all ingredients making up the individual medications included in your original data set. By doing this you would add a level of detail to the analysis that might then permit you to discover that it is actually a combination of ingredients that produces the adverse reactions.
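
The medication example can be sketched as follows. Everything in the sketch is hypothetical: the medication names, the ingredient listing, the patient cases, and the rule used to flag a suspect pair are assumptions made for illustration, not the procedure used in any real engagement. The point is only that a derived medication-to-ingredient listing lets the analysis find a combination that no single medication reveals.

```python
from itertools import combinations

# Hypothetical derived data source: medication -> ingredients.
ingredients = {
    "MedA": {"ing1", "ing2"},
    "MedB": {"ing3"},
    "MedC": {"ing1", "ing4"},
    "MedD": {"ing3", "ing5"},
    "MedE": {"ing2"},
}

# Hypothetical patient cases: medications taken and whether a reaction occurred.
cases = [
    {"meds": {"MedA", "MedB"}, "reaction": True},
    {"meds": {"MedC", "MedD"}, "reaction": True},
    {"meds": {"MedA", "MedE"}, "reaction": False},
    {"meds": {"MedB", "MedE"}, "reaction": False},
]

def expand(case):
    """Replace a case's medications with the union of their ingredients."""
    return set().union(*(ingredients[m] for m in case["meds"]))

adverse = [c for c in cases if c["reaction"]]
benign = [c for c in cases if not c["reaction"]]

# At the medication level, nothing is shared by every adverse case.
common_meds = set.intersection(*(c["meds"] for c in adverse))

# At the ingredient level, some ingredients are shared by every adverse case.
common_ings = set.intersection(*(expand(c) for c in adverse))

# Assumed rule: a pair is suspect if it occurs in every adverse case
# and never occurs together in a case without a reaction.
suspect_pairs = [
    pair
    for pair in combinations(sorted(common_ings), 2)
    if not any(set(pair) <= expand(c) for c in benign)
]
```

With these invented cases, `common_meds` is empty, while `suspect_pairs` contains the single pair `("ing1", "ing3")`: each ingredient appears alone in a case without a reaction, but the two appear together only in the adverse cases.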

We have often found it useful to generate new data sources to be used as supplemental information during an analysis. In one particular data mining engagement we manually created a list that categorized various military units according to a set of unique codes which we analyzed in terms of requests for certain types of support resources. In another application for law enforcement, we created a list of vehicle types and their corresponding price ranges to help determine certain income-to-asset ratios, which were then used to expose noncompliant tax filing patterns. Simple lists such as these may be added as extra data sources and can contribute enormous value and insight to the data mining process. Since data mining is an iterative process, sources can be introduced at any time throughout the entire engagement. This occurs frequently because as you progress in identifying patterns, you realize that there are certain dimensions that may be missing. The easiest way to resolve this is to create a new data set that contains the information needed.
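
A supplemental list of this kind can be as simple as a lookup table. The sketch below is a hedged illustration of the vehicle example: the vehicle types, price ranges, filings, and the `RATIO_THRESHOLD` cutoff are all invented, and using the low end of each price range as the asset value is an assumption, not the calculation used in the engagement described above.

```python
# Hypothetical supplemental list: vehicle type -> (low, high) price range.
vehicle_prices = {
    "economy sedan": (12_000, 18_000),
    "luxury sedan": (60_000, 90_000),
    "sports car": (80_000, 150_000),
}

# Hypothetical filings: reported income and registered vehicle types.
filings = [
    {"id": "F1", "income": 50_000, "vehicles": ["economy sedan"]},
    {"id": "F2", "income": 25_000, "vehicles": ["luxury sedan", "sports car"]},
    {"id": "F3", "income": 200_000, "vehicles": ["luxury sedan"]},
]

def income_to_asset_ratio(filing):
    """Reported income divided by a conservative estimate of vehicle assets."""
    # Assumption: value each vehicle at the low end of its price range.
    assets = sum(vehicle_prices[v][0] for v in filing["vehicles"])
    return filing["income"] / assets

# Assumed cutoff: income below half the estimated asset value looks unusual.
RATIO_THRESHOLD = 0.5

flagged = [f["id"] for f in filings if income_to_asset_ratio(f) < RATIO_THRESHOLD]
```

In this toy data only filing `F2` is flagged, because its reported income is small relative to the vehicles registered to it. A filing with no vehicles would need separate handling, since the ratio is undefined when the asset estimate is zero.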

Data Mining as Problem Solving
From this discussion you can see that the success of a data mining engagement can depend largely on the amount of energy and creativity that you bring to the table. You will get as much out of a data mining analysis as you put into it. In essence, data mining is like solving a puzzle. The individual pieces of the puzzle are not complex structures in and of themselves. However, taken as a collective whole they can constitute very elaborate systems. As you try to unravel these systems you will probably get frustrated, start forcing parts together, and generally become annoyed at the entire process. However, once you know how to work with the pieces, you realize that it was not really that hard in the first place.

The same analogy can be applied to data mining. At this point in time, you probably do not know much about the details of your data sources. Otherwise you would most likely not be interested in performing data mining. Individually, the data records seem simple, complete, and uncomplicated. But collectively, they take on a whole new appearance that is intimidating and difficult to comprehend, like the puzzle. As you will see, being an analyst requires creative thinking and a willingness to see problems in a different light. You cannot expect automatic answers. One of our goals in writing this book is to encourage analysts to think outside the box, letting their imaginations wander. Once you delve into the problem and are able to discover some interesting patterns and trends, data mining becomes less awe-inspiring and more routine.

An Overview of Section I
In this section we cover a number of issues that will expand the ways in which you think about your data and the types of analyses that might be most useful to you. In Chapter 1, "What is Data Mining?," we describe the data mining process and give examples of applications in which data mining methods have been used successfully. A large portion of this chapter is devoted to issues that should be addressed before any data mining activities actually begin. Chapter 2, "Understanding Data Modeling," provides an introduction to the modeling process, including a discussion of object-oriented modeling techniques. We describe the basic differences between descriptive and transactional models, giving examples of each. The importance of choosing models wisely is illustrated in the discussion of the distinctions between intra- and inter-domain pattern detection. Last, in Chapter 3, "Defining the Problems to be Solved," we present a set of conceptual frameworks within which you can consider your particular application problems. Following this, we describe the differences between reactive and proactive modes of analysis, and how you can best use these in your own engagement. We hope that this section opens up your mind to new possibilities in terms of ways to approach your own data.

Chapter 1: WHAT IS DATA MINING?
Data mining is one of the fastest growing fields in the computer industry. Once a small interest area within computer science, it has quickly expanded into a field of its own. Though you may have heard descriptions of data mining techniques, been exposed to knowledge discovery in databases (KDD), or read reports of successful applications, chances are that you may still have some basic questions about what data mining is and why it is useful. The purpose of this chapter is to clear up a number of uncertainties you may have regarding the definition of data mining and to provide some examples of applications in which data mining methods have been used successfully. We go on to describe the boundaries that set data mining apart from other information technology approaches. Finally, we present some practical advice that you should bear in mind as you begin any data mining engagement.

Data Mining Defined
How do we explain what data mining really is? Let us begin with a few basic facts.
* Many organizations ranging from private businesses to government bureaucracies have devoted a tremendous amount of resources to the construction and maintenance of large information databases over recent decades, including the development of large scale data warehouses.
* Frequently the data cannot be analyzed by standard statistical methods, either because there are numerous missing records or because the data are in the form of qualitative rather than quantitative measures.
* In many cases the information contained in these databases is undervalued and underutilized because the data cannot be easily accessed or analyzed.
* Some databases have grown so large that even the system administrators do not always know what information might be represented or how relevant it might be to the questions at hand.
* It would be beneficial to organizations to have a way to "mine" these large databases for important information or patterns that may be contained within.
* There are a variety of data mining methodologies that may be used to analyze data sources in order to discover new patterns and trends.

There you have it. This is what the general idea of data mining is all about. Unlike situations in which you might employ standard mathematical or statistical analyses to test predefined hypotheses, data mining is most useful in exploratory analysis scenarios in which there are no predetermined notions about what will constitute an "interesting" outcome. Data mining is an iterative process within which progress is defined by discovery, through either automatic or manual methods. Usually you begin by getting an overall picture of the available data. This is followed by a series of steps in which subsets of the data are modeled and analyzed. Based on the discovery of interesting patterns, there may be subsequent resampling of the data set along with the formulation of new models designed to emphasize particular aspects of the data, and so forth. You might continue iterations down one path for a time, and then retreat back out to a higher level and begin again with another modeling approach when the first path is exhausted. Data mining may occur within a single data source or across multiple sources. Whatever exact form the analysis takes, the key is in adopting a flexible approach that will allow you to make unexpected discoveries beyond the bounds of the established expectations within your problem domain. The most important strategy to keep in mind for data mining is to keep your options open--there are many occasions when interesting discoveries may be made only when the data are approached from multiple perspectives.

There are many technologies and tools available for data mining applications. From our perspective, there are certain technologies that have better track records than others in terms of ease of use and return on investment. Nevertheless, despite all of their attractive bells and whistles, which we describe in detail later, the tools alone will never provide the entire solution. There will always be the need for the practitioner to make the important decisions regarding how these systems will be employed. This analyst must decide how best to manipulate, exploit, and expose critical patterns and relationships in the data by using a combination of techniques. There are general lessons and principles that may be commonly applied to all application areas. No matter which data mining tool(s) the analyst may employ for a given engagement, the analyst will want to be guided by these general principles in deciding how to construct models so as to get the most out of the data mining exercise.

Using Data Mining to Solve Specific Problems
One of the greatest strengths of data mining is reflected in its wide range of methodologies that can be applied to a host of problem sets. Since data mining is a natural activity to be performed on a data warehouse, one of the largest target markets is the entire data-warehousing, data-mart, and decision support community encompassing professionals from such industries as retail, manufacturing, telecommunications, health-care, insurance, and transportation. In the business community, you can use data mining to discover new purchasing trends, plan investment strategies, and detect unauthorized expenditures in your accounting system. Further, you can apply data mining to improve your marketing campaigns, using the outcomes to provide your customers with more focused support and attention. As another example, you can apply data mining techniques to problems of business process reengineering in which the goal is to understand interactions and relationships among business practices and organizations.

Many law enforcement and special investigative units whose mission is to identify fraudulent activities and discover crime trends have also used data mining successfully. Data mining methodologies can aid analysts in identifying critical behavior patterns in the communication interactions of narcotics organizations, the monetary transactions of money laundering and insider trading operations, the movements of serial killers, and the targeting of smugglers at border crossings. Data mining techniques have also been employed by people in the intelligence community who maintain many large data sources as part of activities relating to matters of national security. Four examples of areas where data mining has been applied successfully follow.

Improved Marketing Campaigns

Marketing programs can cost a company a significant amount of money in terms of the design, production, and distribution of materials. If a marketing campaign is not designed for the appropriate client base, the response to the offering might suffer, not only in terms of the expenses required to produce the campaign but also in lost sales. Additionally, if distribution of marketing materials is not handled correctly, the campaign might not be as effective as it might otherwise be. Inconsistent data sets that include wrong names, outdated addresses, duplicated records, and other incomplete data fields will most often produce nonresponsive marketing targets. Clearly this is a waste of marketing funds and company resources.

Losses on marketing initiatives were such a problem in one automobile industry segment that some companies used to include a fee (about $500, or 3-5 percent of a car's price) on every car sold to help recover these marketing costs. It is perhaps fortunate news, then, that there are now several examples within the automotive industry in which data mining has been used to help improve this marketing process. The effectiveness of infusing data mining into marketing campaign design can be measured in terms of the observed response to the improved campaign. Companies have saved literally millions of dollars by better managing their marketing concerns using data mining techniques.

We performed an analysis of the new car sales database for a local car dealership to determine whether specific problem areas could be identified in their marketing activities. Almost immediately we discovered several situations that the dealership was able to act upon quickly. For example, we found that many of the addresses maintained for car owners were assigned to the post office box of the credit agency financing many of the vehicles. Obviously these addresses were recorded incorrectly from sales transactions and were useless for marketing purposes. Since this single dealership was spending tens of thousands of dollars each month marketing to their clients, it was important for them to have a well-focused ad campaign, and the resources spent sending materials to these wrong addresses were completely wasted. On a grander scale, you can imagine what the larger retailers spend for their marketing efforts. Mailings, flyers, and letters can run several dollars a copy after the design, printing, postage, and handling costs are included. Thus, sending thousands of marketing packages each month to clients with an improper address, such as was discovered at this dealership, is a waste of time, resources, and money. Identifying this pattern helped the dealership clean up their data and do a better job at marketing to their clients.

Sometimes the discoveries that you make during a data mining engagement have narrower ramifications that nevertheless turn out to be useful. During this same analysis for the car dealership, we discovered several groups of individuals in the data set who shared the same business phone number. The client acted upon this information and designed a special marketing ad for these groups, offering them free rides to and from work if they scheduled dealership services (e.g., an oil change or routine maintenance) on a specific day. The dealership was able to maximize the use of its shuttle-van operations while satisfying a select set of customers through a very focused offering. This was an unexpected result, and it illustrates how serendipitous discoveries can be put to good use.
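Both dealership findings (many owners recorded at one post office box, and several customers sharing one business phone number) come down to grouping records by a field and looking for values that too many records share. The following is a minimal sketch of that idea, not the method used in the engagement; the record layout, field names, and values are all hypothetical.

```python
from collections import defaultdict

def shared_values(records, field, min_size=2):
    """Group customer ids by a field; keep values shared by at least min_size records."""
    groups = defaultdict(list)
    for rec in records:
        value = rec.get(field)
        if value:  # skip missing or empty fields
            groups[value].append(rec["customer_id"])
    return {v: ids for v, ids in groups.items() if len(ids) >= min_size}

# Hypothetical customer records (illustrative values only).
customers = [
    {"customer_id": 1, "business_phone": "555-0100", "address": "PO Box 12"},
    {"customer_id": 2, "business_phone": "555-0100", "address": "14 Elm St"},
    {"customer_id": 3, "business_phone": "555-0199", "address": "PO Box 12"},
    {"customer_id": 4, "business_phone": None, "address": "PO Box 12"},
]

coworkers = shared_values(customers, "business_phone")
suspect_addresses = shared_values(customers, "address", min_size=3)
print(coworkers)
print(suspect_addresses)
```

Whether a shared value signals a data-quality problem (the credit agency's post office box) or a marketing opportunity (coworkers) is still a judgment the analyst has to make after looking at the groups.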

Improved Operational Procedures
Another arena in which data mining can improve the way an organization functions is within the operations and communications of the organization itself. The organizational structure itself is, of course, one source of information. Organizational structures are typically very hierarchical and rigid. The CEO is at the top level with several supporting VPs at the next level down, and so on (see Figure 1.1a). But does this sort of hierarchy really give a complete picture of how an organization functions in operational terms? Usually not. Social network theories tell us that there are all kinds of subnetworks and informal groups contained within any organization. These subgroups might be defined by how often individuals communicate with others in the group, who works with whom on projects, or where individuals go within the organization to get advice. Taken collectively, all of these form the operational structure of a company, as shown in Figure 1.1b.

This information is not usually contained within any existing corporate database. Rather, it must be collected through well-engineered questionnaires and communications analyses. Were this information to be collected and analyzed, it could be used to derive a realistic model of the way an organization functions. In cases where this has been done, clients report that this type of information allows them to determine which key business processes are actually being hindered, rather than supported, within the current structure. We have seen many examples of unbalanced interactions among departments within companies that are detrimental to the overall performance of the organization. These include marketing and sales departments having very little interaction with one another, technical engineers dominating the advice-networks of the management staff, and top management in accounting having almost no direct involvement with the rest of the company. Discovery of these unhealthy situations can lead to restructuring that will allow your company to achieve its stated goals.

Identifying Fraud
Corporate security offices and law enforcement agencies have been applying data mining technologies to their data sets for quite some time. They have analyzed all sorts of data sets including telephone toll calls, narcotics operations, financial crime enterprises, criminal organizations, border crossings, street crime patterns, gang relationships, terrorist activities, tax evasion, embezzlement, insider trading, and a wide range of other activities.

In one particular data mining application that we conducted, the goal was to investigate the operations of a suspected money laundering operation. Money laundering is a term applied to any process that is used to take the monetary proceeds of illegal activities and transform them into assets that appear to have been obtained by legitimate means. Profit is the principal motive in such criminal activities, which usually involve cash transactions. In today's marketplace, money laundering is detected most easily at the point at which the funds enter a legitimate financial system.

Regulated financial institutions such as banks, savings and loans, and credit unions are required to comply with the federal statutes imposed on them as stated in the Money Laundering Control Act of 1986. These institutions collect various types of information about cash transactions that are over $10,000 including names, addresses, identification numbers, accounts, and amounts involved. The members of the particular group that we were helping to investigate were all foreign nationals and the monetary proceeds of their money laundering activities were suspected of being tied directly to narcotics operations.

The investigating organization had access to all of the forms listing the cash transactions of this particular group. It soon became apparent that one of the methods they were using to launder their money was to mix the funds in with the legitimate proceeds of a restaurant being run as a front company. Starting with a list of names provided by the law enforcement agency, we were able to identify an ongoing loan-back scheme. The loan-back scheme begins when a corrupt organization takes out a loan to cover business expenses. Collateral for the loan is usually property procured illegally with the original drug money. The approval of the loan serves to legitimize the source of the funds used to purchase the collateral property. The loan is then paid back using "dirty" money. We were in fact able to demonstrate this chain of events by applying data mining techniques to data sets containing information about the financial transactions of the individuals involved.

The investigation also exposed various illegal layering practices (also called structuring) of financial transactions. Recall that by law financial institutions must report cash transactions of more than $10,000. Money launderers often try to move money through a series of separate transactions, all falling beneath this $10,000 threshold. In this case we discovered numerous $9,500-$9,900 deposits made at various banks on the same days by members of this organization (see Figure 1.2). As it turns out, this form of structuring transactions to avoid documentation is a federal crime that carries severe criminal penalties.
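Once a structuring pattern like this has been noticed, checking for it is straightforward. The sketch below groups cash deposits by organization and day and keeps the days with several deposits just under the threshold. It is an illustration only: the record fields, the "within 5 percent of the threshold" cutoff, and the sample data are assumptions, not details from the investigation.

```python
from collections import defaultdict

REPORT_THRESHOLD = 10_000   # reporting threshold cited in the text
NEAR_FRACTION = 0.95        # assumption: "just under" means within 5% of the threshold

def flag_structuring(deposits, min_deposits=2):
    """Return {(group, day): [amounts]} for days with several just-under-threshold deposits."""
    floor = REPORT_THRESHOLD * NEAR_FRACTION
    by_group_day = defaultdict(list)
    for d in deposits:
        if floor <= d["amount"] < REPORT_THRESHOLD:
            by_group_day[(d["group"], d["day"])].append(d["amount"])
    return {k: v for k, v in by_group_day.items() if len(v) >= min_deposits}

# Hypothetical deposits; "group" stands for the organization a depositor is linked to.
deposits = [
    {"group": "A", "day": "day-1", "bank": "First", "amount": 9_500},
    {"group": "A", "day": "day-1", "bank": "Second", "amount": 9_900},
    {"group": "A", "day": "day-1", "bank": "Third", "amount": 9_700},
    {"group": "A", "day": "day-2", "bank": "First", "amount": 9_800},
    {"group": "B", "day": "day-1", "bank": "First", "amount": 4_000},
    {"group": "B", "day": "day-1", "bank": "Second", "amount": 12_000},
]

flagged = flag_structuring(deposits)
print(flagged)
```

Linking individual depositors to a common group is the harder part of the analysis and is taken as given here.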

Examining Medical Records
Understanding medical data within a military environment can be particularly useful, especially when it pertains to the disposition and overall health of the troops. As a matter of fact, over 80 percent of the illnesses and incapacitation reported during wartime result not from physical injuries received on the battlefield, but from dysentery, infection, and disease. Planning for treatment of these conditions ahead of time can facilitate medical response and treatment during critical situations. In a medical application we performed for the U.S. Government, we analyzed data records of medical reports and diagnoses made onboard a set of military ships. Using data mining methodologies, we were able to identify several interesting patterns and trends. As an example, we discovered a relatively high incidence of chicken pox among young recruits between the ages of 17 and 19. As you may know, chicken pox in adults can be quite a serious health matter and identification of problematic subgroups within the enlisted population can facilitate the establishment of policies and procedures aimed at minimizing this health threat. Another pattern revealed in our data mining analyses involved the detection of a specific secondary respiratory neoplasm (e.g., cancer) occurring in a set of military units whose soldiers all originated from the same recruiting location. The cases observed all had a particular set of attributes in common that allowed us to identify them as a group. Once the group was characterized, the client was able to perform further investigations of that particular location to determine whether the elevated incidence of this form of cancer could be attributed to the presence of chemicals or other carcinogens in the area.

What Data Mining Is Not
It is important always to bear in mind that the focus of the data mining process is to discover hidden patterns and trends. Once a particular pattern has been identified, it may contain certain characteristics that prompt the data mining practitioner to move forward along a path of further discovery. However, once that particular pattern is identified, it can be described as a known quantity. The pattern may be put to a multitude of uses including becoming the content for a standard report, serving as the training input to a neural network, or being encoded as a rule into an expert system. At this point, the process of discovering that particular pattern is finished. From the perspective of the data mining process, it may be regarded as a known pattern. Further inquiries about known patterns are made only if there is a need to confirm whether they are still valid or if variations of the patterns should be considered. Analytic approaches that search data sets on the basis of known patterns are not doing data mining, although they may use inputs from data mining exercises to form the basis of target matches. For this reason, we do not regard techniques that require implementation of rules, predefined training examples, or automated supervised learning to be data mining approaches. This of course does not mean that those techniques are not useful in many instances; it simply means that, in our opinion, those processes do not constitute data mining.

Analysis versus Monitoring
This brings up a distinction that is important to make in information processing between analysis and monitoring. The majority of data mining applications are focused on analyzing information that has been previously collected. For these cases, the data are static and represent the state of the world in some past interval of time. You may review the information at your own pace, confirming the accuracy of the data, making considered decisions about which patterns are important. The data are not changing while the analysis is being performed. The results generated will therefore be reliable and consistent for that data set. Within reason, you do not concern yourself with the amount of time it takes to make a decision. Rather, you are in the mode of discovering new patterns and are more interested in following hypotheses wherever they may lead.

In contrast, monitoring often involves online pattern matching operations in which incoming data are compared against a set of conditions or boundaries. Monitoring often occurs in real time and involves the processing of data that are continually being updated. What is true one moment may abruptly become out-of-date and invalid the next moment. Monitoring systems have been developed for such application areas as financial markets, air traffic control centers, and nuclear reactors. Monitoring systems often make quick responses in order to take advantage of information as it is being presented. Thus, predictive models and forecasters can be used to help identify critical values, unusual behaviors, and criteria data. These systems are not usually performing data mining since they are not discovering new patterns or classifications. In most cases, the patterns of interest have been identified and generated during a previous analysis and the monitoring process involves detection of matches or violations of those patterns. True data mining is difficult, although not impossible in these types of environments. Several of the tools described in Section III support real-time data feeds and can achieve this level of analysis.

The following are publicized examples from industry that have been characterized as data mining applications. However, as you will see, they do not really fit the definition of data mining since they are dedicated to monitoring rather than discovery of new patterns and trends. They perform pattern matching rather than interactive discovery. Keep in mind that once a pattern has been identified, it can be easily encoded into a rule or report which can then be run across a large quantity of information using other matching, correlation, or classification techniques. Once this stage occurs, data mining is finished and a new type of analysis has begun.

Monitoring Credit Card Transactions
As we all know, fraud is rampant within the consumer industry. In response to this problem, credit card companies have created elaborate systems to curb the misuse of their services. Many of us have been called at home by a credit card company representative inquiring about certain expenditures that do not appear to fit their normal client profile. The more unfortunate among us have endured the unpleasant experience of having a purchase interrupted by the revelation that our card actually has been suspended by the company based on an unusual pattern of recent buying activity. Credit card companies have a lot at stake and it is important for them to discriminate between good and bad transactions. What constitutes a "bad" set of transactions in this context? One example that is used by a large credit card company as a trigger for further analysis is the use of a card to make gasoline purchases more than a certain number of times in a given 24-hour period. As it turns out, criminals who steal credit cards often want to test the cards initially to determine whether they are valid before using them for a large purchase. One particular pattern that has been discovered is that credit card thieves will go to a gas station with pay-at-the-pump service so they can swipe the card to see if it has yet been reported as stolen. This initial test is structured to allow the thief a quick getaway in case the card gets identified as being lost or stolen. Once the thief has confirmed that the card is still operational, he or she usually visits another gas station, fills up the tank, and starts using the card for other purchases. So a small gasoline purchase quickly followed by a series of other gasoline and product purchases is often flagged as a questionable pattern that is identified for further inquiry.

Note that in this case the pattern of gasoline purchases is not being "discovered." Rather, this pattern is already known and is matched against incoming charge data. In cases such as these, the patterns of interest are discovered off-line using data mining analyses and are then encoded into the systems for future classifications and matches, usually by a neural network. There is no actual data mining occurring within these online systems, only simple classifications, profile matching, or value boundary exception handling. All the hard work and fun of discovering and formulating the profile of the suspicious behavior was performed elsewhere and subsequently encoded into the matching system. Just a word for the average credit card user--the credit companies know that one corrupt merchant can do much more damage than several individual card holders, partially due to credit card spending limits. Therefore, the companies often devote more of their resources to detecting merchant fraud than individual consumer fraud.
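To make the contrast concrete, here is roughly what a monitoring rule of this kind might look like once the pattern is known. This is a sketch under stated assumptions: the text does not give the actual cutoff, so `MAX_GAS_PURCHASES` is a made-up value, the transaction fields are hypothetical, and a plain rule stands in for whatever matching system a card issuer really uses.

```python
from datetime import datetime, timedelta

MAX_GAS_PURCHASES = 2            # assumption: the real cutoff is not given in the text
WINDOW = timedelta(hours=24)

def exceeds_gas_rule(transactions):
    """True if more than MAX_GAS_PURCHASES gasoline purchases fall in any 24-hour window."""
    times = sorted(t["time"] for t in transactions if t["category"] == "gasoline")
    start = 0
    for end in range(len(times)):
        # Slide the window start forward until it is within 24 hours of the current purchase.
        while times[end] - times[start] > WINDOW:
            start += 1
        if end - start + 1 > MAX_GAS_PURCHASES:
            return True
    return False

# Hypothetical card histories.
tested_card = [
    {"time": datetime(2000, 1, 3, 9, 0), "category": "gasoline"},
    {"time": datetime(2000, 1, 3, 9, 40), "category": "gasoline"},
    {"time": datetime(2000, 1, 3, 11, 0), "category": "electronics"},
    {"time": datetime(2000, 1, 3, 15, 0), "category": "gasoline"},
]
ordinary_card = [
    {"time": datetime(2000, 1, 3, 9, 0), "category": "gasoline"},
    {"time": datetime(2000, 1, 5, 9, 0), "category": "gasoline"},
    {"time": datetime(2000, 1, 7, 9, 0), "category": "gasoline"},
]

print(exceeds_gas_rule(tested_card), exceeds_gas_rule(ordinary_card))
```

Nothing in this routine discovers anything. It only checks incoming charges against a threshold that someone chose after the pattern had already been found.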

Monitoring Medical Billing Fraud
One of the most common services provided by medical billing review companies is the ability to detect CPT (Current Procedural Terminology) code unbundling. CPT codes are an established list of five-digit numbers used to identify the medical procedures and services provided by physicians. The unbundling problem occurs when doctors submit their bills listing charges for routine procedures that should be classified under a specific CPT code, but instead are broken up and filed as a combination of several separate CPT codes. Physicians do this because they can get more reimbursement dollars for the sum of the individual procedures than for the single composite service. This behavior is illegal and constitutes insurance fraud, but it occurs on a regular basis. Companies performing review of medical billing for insurance purposes often claim to perform data mining on the submitted bills to look for these filing patterns. In actuality they usually are running some low-level expert systems or neural networks that have been programmed to look for specific types of known patterns. Again, the data mining involving the discovery of the suspicious patterns initially occurred off-line and the known patterns were incorporated into a set of rules to be matched automatically against online data.
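A known-pattern check for unbundling can be as simple as a table lookup. In the sketch below, a claim is flagged when it lists every component of a composite service without listing the composite code itself. The bundle table and its labels are placeholders invented for illustration. They are deliberately not real CPT codes, and a real review system would be far more involved.

```python
# Hypothetical bundle table: composite service -> its component services.
# The labels are placeholders, NOT real CPT codes.
BUNDLES = {
    "BUNDLE-1": {"PART-1A", "PART-1B", "PART-1C"},
    "BUNDLE-2": {"PART-2A", "PART-2B"},
}

def find_unbundling(claim_codes):
    """Return composite codes whose components were all billed separately on one claim."""
    codes = set(claim_codes)
    return sorted(
        bundle
        for bundle, parts in BUNDLES.items()
        if parts <= codes and bundle not in codes
    )

suspicious = find_unbundling(["PART-1A", "PART-1B", "PART-1C", "OTHER-9"])
clean = find_unbundling(["BUNDLE-1", "PART-2A"])
print(suspicious, clean)
```

The table itself is the product of the earlier, off-line discovery work. The matching step only applies it.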

Marketing with Coupons
Think back to the last time that you bought groceries at the supermarket. Along with your receipt you may have received store coupons for items you purchased that day. You might also have noticed that you received coupons for items you did not purchase, but which might go well with some of the items in your cart. For example, you may have bought a case of soda and received a coupon for potato chips. How would the computer know to generate a potato chip coupon when you did not purchase potato chips that day? Is data mining being performed online as your groceries are being scanned? No. The discovery of sets of items that consumers tend to purchase together is a data mining activity that occurs offline using large data sets comprised of thousands of purchase transactions. Some supermarket chains spend a good deal of time looking at their data in order to understand these purchasing patterns more completely. Stores can take advantage of these known patterns by giving out coupons or placing associated items close to one another in the store in order to increase sales. When you go through the checkout line and receive coupons based on your purchases that day, the computer routine that selects your coupons is merely performing pattern matching on previously discovered patterns.
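The split between off-line discovery and checkout-time matching can be sketched in a few lines. The first function counts how often pairs of items appear in the same historical basket. The second is a pure lookup of the kind a register might run. This is a simplified illustration with made-up baskets and a made-up `min_count` cutoff; it does not describe any particular store's system.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_count=2):
    """Off-line step: count how often each pair of items shares a basket."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

def coupons_for(cart, pairs):
    """Checkout step: look up known pairs; offer the partner item the shopper did not buy."""
    cart = set(cart)
    offers = set()
    for a, b in pairs:
        if a in cart and b not in cart:
            offers.add(b)
        elif b in cart and a not in cart:
            offers.add(a)
    return sorted(offers)

# Hypothetical purchase history.
history = [
    ["soda", "potato chips", "bread"],
    ["soda", "potato chips"],
    ["bread", "milk"],
    ["soda", "potato chips", "milk"],
]

pairs = frequent_pairs(history)
offers = coupons_for(["soda", "eggs"], pairs)
print(pairs)
print(offers)
```

Only `frequent_pairs` resembles discovery, and even it is a very crude stand-in for the exploratory analysis a chain would perform on thousands of transactions.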

Avoiding the Oversell
It has been estimated that data mining services will become a $20 billion industry by the year 2000. As always the promise of an emerging field of technology brings excitement, but also the tendency on the part of some to make grandiose claims beyond the scope of the capabilities of the technology. Who can forget the claims in the 1980s that there would be a neural network in every toaster by now? Does anyone remember the promises of automated high-level language processing made by some in the field of artificial intelligence? Clearly these prophecies have not come to pass. This is not to say that either of those technologies was flawed--in fact, each has produced many wonderful innovations that have been a great help in a variety of application areas. The problem was in the overselling. Unfortunately the field of data mining will be no different.

Already we are bombarded with a host of new buzzwords that sound great, but in reality are old ideas being touted as part of the new data mining technology. Let us state from the outset that data mining is not just running correlations, statistics, or a set of sorted reports on a data set. Rather, data mining is a process of uncovering new patterns and trends within data that would not necessarily be revealed through traditional methods of analysis. Data mining is interactive discovery. You need to bear these caveats in mind as you decide whether or how to incorporate data mining into your arsenal of information technologies. You need to be well informed so that you can use this technology to your organization's full advantage and not be misled into adopting approaches that are not appropriate for your application.

Data mining is a unique and challenging process. Although consistent principles regarding the data mining process may be identified, to some extent each data mining exercise has its own defining characteristics. There is currently no road map to follow for performing data mining, no cookbook of directions that can anticipate all starting conditions and guarantee a successful outcome. Anyone claiming to sell you a system as a silver bullet that will automate an analysis and always produce the desired answer should be regarded with caution. If it sounds too good to be true, it probably is. In actuality, the process of data mining almost never involves simply running a single application on your data set. Rather, it is a process in which the data mining practitioner utilizes combinations of technologies and methodologies. The person performing this task must be able to think creatively and be flexible in approaching a problem. Since data mining is a very iterative process, the practitioner will not simply be repeating a known scenario, but will constantly refine his or her approach based on outcomes or patterns discovered along the way.

Practical Advice before You Begin
The field of data mining shows exceptional promise in terms of its potential contributions to a host of analytical applications. We have seen our share of successful applications that have provided a significant return on investment. Presumably you have purchased this book in anticipation of learning more about this technology and perhaps using it in applications of your own. Before you begin, however, we would like to offer some cautionary words of advice on some real-world issues that can limit the utility of data mining engagements unless addressed directly.

Justifying the Data Mining Investment
How do you determine whether investment in a data mining system is justified? In some cases the math is fairly straightforward. Consider the amount of money an organization expends on marketing and service-related activities compared to what it gets in return. Given the difference between these numbers, you can derive a reasonable estimate of how much you would expect to improve this bottom line using data mining technologies (see the following sidebar). If there are methods that can be used to expand your marketing campaigns, to reduce fraud, or generally improve your profits, and the amount of improvement exceeds the cost of implementation, then consider it a good investment. Based on our experiences, companies usually look for the investment made in data mining to be about 15-20 percent of the value of estimated losses or expected improvements.

Evaluating Return on Your Data Mining Investment:
A Sample Analysis
Sometimes an organization will have estimates of how much money is being wasted on expenditures. For example, in one data mining engagement our customers believed that there was a certain amount of fraud within the operating environment. They were spending over $350,000,000 each year servicing the merchandise that they produced and were receiving approximately 1,800,000 service claims per year. Simple arithmetic shows that the company was spending on average about $194.44 for each claim filed.
Clearly the identification of fraudulent claims within this context could quickly add up to big savings. The client estimated the percentage of fraud to be anywhere between 3 and 8 percent. Thus, the amount of money being paid for fraudulent claims every year was calculated to be between $10,500,000 and $28,000,000. If we take the middle of the road at 5 percent, this equates to $17,500,000.
Using data mining techniques, we were able to identify several critical patterns of fraud for this client very quickly. In this particular situation, one of the patterns found was based on the replacement of parts that were either removed, misplaced, or stolen while in possession of the distributor before the merchandise was actually sold. Since the part was a basic feature of the product and was both a functioning component as well as a cosmetic feature, customers who eventually purchased the product wanted the part intact. The distributors did not feel obligated to replace the part since the product had not yet been sold when it was lost or stolen. Thus, the distributors decided to charge the cost of replacing the part back to the company as a basic warranty repair. Since these claims were relatively minor charges as compared to other types of work being performed, this fraud scheme was hidden easily among the larger number of legitimate warranty repairs.
Our discovery of this scheme put our clients in a position to save quite a bit on wasteful expenditures. They calculated that the resources spent on this data mining engagement provided at least a 50-to-1 return on investment. This calculation does not even take into consideration the future reduction of losses due to the identification of these patterns. This was not a bad return for a relatively small-scale investment.
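The sidebar's figures are easy to reproduce. The short sketch below recomputes the cost per claim and the estimated annual fraud losses at the 3, 5, and 8 percent rates quoted above, using only numbers stated in the sidebar. The variable names are our own, and percentages are kept as integers so the dollar amounts come out exact.

```python
annual_service_cost = 350_000_000   # dollars per year, from the sidebar
claims_per_year = 1_800_000         # service claims per year, from the sidebar

# Average cost of one service claim.
cost_per_claim = annual_service_cost / claims_per_year

def fraud_loss(percent):
    """Annual dollars lost if `percent` of service spending is fraudulent."""
    return annual_service_cost * percent / 100

low, mid, high = (fraud_loss(p) for p in (3, 5, 8))

print(f"cost per claim: ${cost_per_claim:,.2f}")
print(f"fraud at 3%: ${low:,.0f}")
print(f"fraud at 5%: ${mid:,.0f}")
print(f"fraud at 8%: ${high:,.0f}")
```

Substituting your own spending and claim counts gives a quick first estimate of what is at stake before any tool is purchased.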

Virtually any organization involved with communication, retail, insurance, finance, commerce, or transportation activities has areas of vulnerability in which fraud can occur. The previous example was just one aspect of one area within that particular company. Many frauds go undetected for years because they are hidden carefully among large numbers of normal business dealings. No wonder our insurance rates, car prices, and medical costs are so astronomical. As consumers, we are forced to cover the fraud, waste, and abuse of services through increased premiums and costs of products. This equates to billions upon billions of dollars lost every year by commercial businesses to fraudulent activities.

In many cases, fraud, malpractice, and malfeasance succeed because people do not know how to interpret their data sets or recognize the telltale symptoms. The stories we read in the newspaper about investors or accountants running off to Tahiti with large quantities of money are special cases and thankfully do not occur very frequently. So you should not be surprised if there are no million-dollar patterns exposed when you first apply data mining to your environment. The majority of theft and fraud is carried out in a large number of relatively small exchanges. Thus, instead of perpetrators trying to take advantage of organizations for millions of dollars at a single time, fraud usually is achieved through a series of frequent claims or transactions with relatively small amounts of money being stolen on any one occasion. Over an extended time period an organization may pay out millions of dollars, although not in one single chunk. This sort of fraud is of course subtle and not directly detectable through usual methods of oversight. Data mining approaches can be applied to these sorts of problems with great success at relatively low cost. The investment in data mining usually is repaid quickly with multiplicative returns when applied to these types of problems.
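The idea that many small claims can add up to a large loss can be illustrated with a simple aggregation. The following is only a minimal sketch, not the authors' method: the record layout (submitter, amount), the function name, the thresholds, and the sample data are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def flag_small_claim_accumulators(claims, small_limit=200.0, total_limit=1000.0):
    """Return {submitter: (count, total)} for submitters whose small claims add up.

    claims: iterable of (submitter_id, amount) pairs (hypothetical layout).
    small_limit: a claim at or below this amount counts as "small" (assumed value).
    total_limit: flag a submitter when its small claims sum above this (assumed value).
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    for submitter, amount in claims:
        # Large claims tend to attract scrutiny on their own; only small ones are summed.
        if amount <= small_limit:
            counts[submitter] += 1
            totals[submitter] += amount
    return {s: (counts[s], totals[s]) for s in totals if totals[s] > total_limit}


# Made-up sample: dist_C files thirty small claims that together exceed the limit.
sample_claims = [
    ("dist_A", 45.0), ("dist_A", 50.0), ("dist_B", 1200.0),
    ("dist_A", 48.0), ("dist_B", 60.0),
] + [("dist_C", 40.0)] * 30

flagged = flag_small_claim_accumulators(sample_claims)
print(flagged)
```

In a real engagement the thresholds would have to be tuned against the volume of legitimate claims, and a flagged submitter is only a lead for further review, not proof of fraud.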

Working Efficiently: Timeliness Is a Virtue
A data mining engagement should not be a lifelong commitment. Although some engagements can go on for an extended period of time, you should expect to see tangible results within a period of days or weeks at the most. The only obstacle that should reasonably keep you from this goal is lack of access to data sets. If you do not see discernible results within a reasonable period, it is time to go back to first principles. Perhaps the data mining tool being used is too limited in the features it provides. Perhaps the data are not being modeled in the most effective way. Perhaps the scope of the analysis is too broad or too narrow. Perhaps the whole analytical approach is inappropriate for the problem at hand. Or, finally, perhaps there just are not any interesting or surprising patterns in the data set.

On rare occasions we have analyzed data sets that contained no interesting patterns. This was due either to poor selection of the data extracted for analysis or to poor quality control in the original collection process. Typically these situations can be identified in advance and avoided, especially if you can help guide and control the initial identification and selection of the data to be used in the engagement. We have seen some self-proclaimed experts use the "bad data" excuse inappropriately. They often plead this defense either to stall for more time or to request additional data in the hopes of eventually producing an interesting result with the wrong methodology. In either case, the data mining practitioner should be able to explain and demonstrate the reasons behind a failure to produce usable results.

The beauty of data mining is that you begin to see patterns almost immediately if you apply the methodologies properly. In most of our data mining experiences, we have confirmed patterns of interest to our clients within several days of the start of the engagement. In one particular engagement, it took less than four hours to completely build and test models once we had the data. Furthermore, after producing our initial results we were able to reconfigure our models on-the-fly to search for other patterns of interest to the client. Does this mean that we are exceptionally brilliant people blessed with otherworldly intellect? We would like to think so, but probably not. More likely it is that we chose the most appropriate approach to the problem and so were able to produce the most usable results in a timely fashion. Remember, it is not always the army with the best weapons that wins the war, but the army that knows how best to use the weapons at hand.

Establishing the Limitations of Your Data Resources
Before you begin an engagement you should ascertain whether there are indeed sufficient data sources available to make the effort worthwhile. In the worst case, investigation of this question may reveal that very little of the critical information is coded into electronic format so as to be accessible to analytical tools. We noted one example of this problem during a counter-terrorism application performed for a U.S. Government agency. In one office that we dealt with was a lovely woman nearing retirement age who had a wealth of knowledge about the operations of a wide range of organizations important to the agency and, of course, to our national security. Stacks upon stacks of hard copy files were piled high on her desk and she knew exactly where everything was located. She could pull out any piece of appropriate information required to respond to a situation faster than you could conceive of doing it electronically. Perhaps you can think of similarly indispensable people within your own organization whose safety is no doubt prayed for every night. This is fine as far as it goes, but for obvious reasons it is preferable that the information be coded electronically and be made accessible to more than one individual in the group.

Before beginning any data mining exercise, you will need to determine from the outset that this type of roadblock will not hinder your progress. You must have accurate, well-coded, and properly maintained information in order to produce reasonable results. Additionally, you must make sure that the organization is going to give you permission to access all of the information that you will need to perform the analysis. If the organization has not already made an investment in this technology before you begin, it does not bode well for your chances of success. The two things we generally hope for are that the data are represented in an electronic format and that they are made available to the analyst. Once these two hurdles are overcome, at least some degree of data mining can usually be performed.

Keep in mind that you do not necessarily need online and interactive access to the data sources. In most cases, the data mining is not done in real time. Therefore, static extractions of data will satisfy the requirements of most data mining applications. In one particular engagement that we performed for a state agency, we needed to collect the real property/assets records for a particular county. After several phone calls we jumped in a car, drove over to the county court records facility, and picked up a nine-track reel tape containing our requested information. This was subsequently loaded into our computing system as a local data set and we were able to perform our analyses successfully. Since property records are fairly stable and do not change significantly on a daily basis, we were able to use the information to produce reliable results for the better part of a year.

Defining the Problem Up Front
When performing data mining, you need to have an understanding of what things are of interest or importance. This allows you to set the boundaries of the problem space. If you set your focus too narrowly, you will miss the objective. Of course it is possible to err in the other direction as well, and in fact this is the more common mistake. Typically the original scope of a data mining project is very generalized and nonspecific in its definition. Sometimes in these cases the client may send you on a fishing expedition to find something "interesting." This is not necessarily a problem if the analyst is careful to narrow the scope continually as the engagement proceeds. One way to approach problem definition is to consider and discuss hypothetical examples with your client before analysis begins. By devoting time to these exchanges in the early stages of the process, you can develop a more accurate sense of the sorts of findings that are likely to be of interest in a particular application.

We have found that clients initially like to have a quick, definitive success in a data mining engagement before committing any additional resources to the project. As proof-of-principle successes are provided and the client becomes more educated about the potential usefulness of the data mining approach, new and more ambitious analyses will be requested. Thus, if successful early on, the data mining process will likely become iterative in its development. What usually happens is that you may start by looking for large-scale patterns that confirm that the approach selected is valid. Once the initial results have been presented, a more directed and focused effort can be initiated.

By taking the analysis in stages, hopefully you will avoid the pitfall of setting objectives that cannot be met. One mistake that is often made by analysts is in promising more insights and results than can be produced in a single engagement. If the application area is a complex one for which there are many classes of questions to be answered, you are well-advised to break the problem up into component parts. The best approach is to do a series of smaller-scale applications that might eventually be combined into a final system once useful results are produced. Do not bite off more than you can chew. Start small and make additions to the system only if they add value to the application. Your customer will see the potential of what is being done and will appreciate your approach.

Knowing Your Target Audience
One important issue that you will want to consider when formulating your approach to an engagement is the composition of the target audience. Who will be the recipient of the results? What will the results represent? What are the repercussions of the data mining engagement likely to be? Always keep your target audience in mind when performing your data mining activities. The approach used to solve the problem must satisfy the intended recipient. The data mining practitioner needs to know if the results are going to be used for internal review, informational purposes, formal presentations, or official publications. There is a wide range of issues to address when determining the target audience.

We have worked with everyone from high-tech computer programmers and intelligence analysts to corporate executives and members of a jury. Although we will not discuss presentation techniques until Chapter 5, we will point out that the degree of detail that is appropriate will vary among different types of target audiences. In some cases, the audience just wants a general overview of where, who, and how much. In other cases, the recipient will want to see every detail regarding the analytical process. Some audience members may even want to become collaborators in the analysis and may suggest alternative representation strategies, different analytical models, and what-if scenarios. You need to be careful and select the correct mixture of methods and techniques to match the requirements of your target audience.

In one application that we did for the banking industry, we had an interesting interaction with the corporate personnel who were responsible for collecting the data used in the analysis. Because we presented the results in ways that made sense to them, they were able to contribute insights into how these data could actually be used in future data mining applications. This helped to increase their confidence in the process because they understood how their efforts were feeding into an analysis that had practical implications. We have also seen similar reactions in other industries where the results of the data mining efforts were used to justify existing data collection activities and, more importantly, help guide future collection efforts.

In another application, we were working with a set of non-technical lawyers (could there be any other kind?) on a grand jury case involving the prosecution of methamphetamine dealers. In this application we had two target audiences. Our ultimate target audience was the jury. However, in the early stages of the project we had to work closely with the prosecuting attorneys and educate them about the analyses and presentation formats being used. In this particular case, we were looking at the telephone call patterns of the defendants--about 120,000 records. The attorneys needed to feel comfortable with the fact that the data were being accurately depicted and that nothing was being changed or altered in its meaning during our analysis. Once they accepted that our representation of the data was valid, the next step was to coach them on the different types of analyses that could be performed. Much of what we were doing was completely new to them, and they were not sure initially about what questions to ask. Since they were the ones who would eventually explain, justify, and defend the diagrams (e.g., patterns) to the jury, it was imperative that the information being presented be crystal clear to them. We started with very simple analyses and proceeded slowly. The more exposure they received, the more confident they became. When all was said and done, they had a very successful trial and will now be likely to use similar data mining analyses in future cases.

Anticipating and Overcoming Institutional Inertia
Because of the intransigence of established institutional policies and practices, it may be difficult for an organization to act on the results of data mining analyses, even when those results are quite dramatic and have serious implications for ongoing operations. During one data mining engagement, we worked for a major insurance company, helping them look for patterns of fraud within their claims database. This particular engagement was focused on the detection of corrupt doctors and lawyers who were submitting fraudulent claims for reimbursement. The insurance company personnel in charge of supporting this effort were from a Special Investigative Unit (SIU) within the company composed of people with backgrounds in law enforcement, data analysis, and computer science. As the engagement progressed, we quickly discovered that fraud was rampant. Soon we were identifying high-value targets faster than the company could deal with them under their established procedures. Although there were literally millions of dollars at stake, the company was not in a position to expand the SIU and provide necessary resources to follow up on many of these leads, nor were they willing to change existing filing procedures to reduce some of the fraud. The rationale given was that they had identified enough targets to keep them busy for the foreseeable future and would focus only on the extreme cases. The company regarded the fraud as an annoying but calculated overhead cost.

Lest you think this is an isolated example, consider these events that occurred in a separate data mining engagement. We were using a database representing a large portion of the client base for a life insurance carrier and were looking at benefits paid for death claims. During the early stages of demonstrating a prototype data mining application, it became apparent that there were numerous people who had multiple claims submitted to the company. These individuals were not the family members or relatives of the deceased, but the actual (or shall we say alleged) people who were supposedly deceased. This was obviously a clear-cut form of insurance fraud. Nevertheless the company was not willing to endure the effort and expense of changing their existing policies to avoid these situations. Further, the company opted not to pursue many of the fraudulent claims identified since they believed that the cost of prosecuting the cases would have exceeded the returns.
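The pattern described here, more than one death claim filed for the same insured person, amounts to grouping claims by the insured and looking for repeats. The sketch below illustrates that idea only. The identifiers, the (claim, insured) record layout, and the function name are hypothetical, and it assumes each insured person already has a single reliable identifier, which real claims data often lacks.

```python
from collections import defaultdict


def insured_with_multiple_death_claims(claims):
    """Return {insured_id: [claim_ids]} for insured people with more than one claim.

    claims: iterable of (claim_id, insured_id) pairs (hypothetical layout).
    """
    by_insured = defaultdict(set)
    for claim_id, insured_id in claims:
        # A set ignores the same claim record appearing twice in an extract.
        by_insured[insured_id].add(claim_id)
    return {
        insured: sorted(ids)
        for insured, ids in by_insured.items()
        if len(ids) > 1
    }


# Made-up sample: person P100 is reported deceased on three separate claims.
sample = [
    ("C1", "P100"), ("C2", "P200"), ("C3", "P100"),
    ("C4", "P300"), ("C5", "P100"),
]

repeats = insured_with_multiple_death_claims(sample)
print(repeats)
```

When no clean identifier exists, the grouping key would have to be derived from names, dates of birth, or addresses, which calls for record matching well beyond this sketch.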

Many readers may find these stories difficult to contemplate. However, take a step back and think about how your organization would respond if a set of threat patterns (e.g., fraud, embezzlement, and process improvement) were suddenly identified. If the problem is circumscribed you might be able to solve it by purchasing hardware, installing some new software, or changing specific vendors. In all likelihood, though, the problem would be more widespread and efforts to solve it would require changes in organization, personnel, and policy that are slow to be realized. Also, it would have to be determined whether the investment in these changes significantly offsets the cost of the damages or improvements identified in the first place. Thus you should be prepared to be realistic in terms of the benefits that can be derived by the use of data mining technology. Since our involvement with the different insurance companies, they have made some significant improvements and progress is occurring, albeit slowly.

In describing these events it is not our intention to discourage anyone from doing data mining analyses. Quite the contrary. As we will show throughout the book, data mining has been used in any number of problem sets to great effect. We simply present these vignettes as a reminder that, no matter how powerful an analysis, it will only be successful to the degree that its results may be put to some use. Thus, in making the decision to use data mining or any other form of analysis, you should give consideration to the types of data available for analysis and the types of outcomes that will be most useful within the context of the particular application area.

Summing Up
In this chapter we have given you a definition of data mining, provided some examples of the successful use of data mining in various application areas, and given examples of analytical information processing techniques that might be misconstrued as data mining approaches. In addition we have provided some cautionary information that you will want to consider before beginning any data mining engagements so that you can avoid potential pitfalls that might lie along your analytical path. Now that this background context has been established, you are ready to begin thinking about the mechanics of the real-world data mining exercise. In the next chapter we move into a more technical discussion of the process of data modeling, which takes place at the start of all data mining analyses.


DEFINING THE DATA MINING APPROACH.

What is Data Mining?

Understanding Data Modeling.

Defining the Problems to be Solved.

DATA PREPARATION AND ANALYSIS.

Accessing and Preparing the Data.

Visual Methods for Analyzing Data.

Nonvisual Analytical Methods.

ASSESSING DATA MINING TOOLS AND TECHNOLOGIES.

Link Analysis Tools.

Landscape Visualization Tools.

Quantitative Data Mining Tools.

Future Trends in Visual Data Mining.

CASE STUDIES.

Mapping the Human Genome.

Telecommunication Services.

Banking and Finance.

Retail Data Mining.

Financial Market Data Mining.

Money Laundering and Other Financial Crimes.

Appendix.

What's on the CD-ROM.

Index.


Interviews & Essays

From the Author

Visual Data Mining. In creating this book, we felt it important to cover an area of data mining that is quickly becoming the preferred method of choice for discovering patterns and trends -- visualization. The use of visualization provides benefits that support fast training, rapid application development, excellent pattern detection, and most importantly a quick return on investment (ROI). Many other data mining approaches can't make these types of claims. Our experiences speak for themselves.

This is a book about visual data mining techniques and technologies. Although these topics are briefly discussed and generally useful, this is not a book about statistical testing, information theory, artificial intelligence, or non-visual data mining algorithms (e.g., neural networks, decision trees, and unsupervised learning). Each section of the book was designed to convey information pertaining to the use of visual data mining. Most of the discussions are derived from real-world experiences, so the descriptions are based on actual accounts of performing the work, rather than theoretical discussions or complex mathematical proofs. This means that you can apply these techniques to your own domains. There is even a section with comprehensive product descriptions that you can reference to get an understanding of the current state of visual data mining systems.

If anything, the book will give you a different perspective on the world of data mining. The topics are new, refreshing, and unlike anything else that has ever been written on the field of data mining.
— Chris Westphal (westphal@visualanalytics.com), the author
