The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses (Includes CD-Rom)

Want to read this on your NOOK? Request as NOOK Book from the publisher

Thank you for requesting this book as a NOOK book from the publisher.

More About This Book

Overview
Editorial Reviews
Product Details
Related Subjects
Meet the Author
Read an Excerpt
Table of Contents
Introduction

Overview

One of the most dramatic new developments in database design, the dimensional data warehouse is a powerful database model that significantly enhances managers' ability to quickly analyze large, multidimensional data sets. Written by the leading proponent of this revolutionary new approach, this valuable book/CD toolkit outfits you with all the nuts-and-bolts information you need to design, build, manage, and use dimensional data warehouses for virtually any type of business application, as well as software for querying dimensional data warehouses. Employing many real-life case studies of data warehouses, Ralph Kimball provides clear-cut guidelines on how to model data and design data warehouses to support advanced multidimensional decision support systems. Beginning with the relatively simple example of a data warehouse for a grocery store, he progresses, step-by-step, through an increasingly complex array of business applications in retail, manufacturing, banking, insurance, subscriptions, and airline reservations. By the end of the book, you will have mastered the full range of powerful techniques for creating, controlling, and navigating dimensional business databases that are easy to understand and navigate.

Designed for IS managers and database designers, this practical and thoughtful exposition explores dimensional data warehouse concepts. Dimensional database warehouses incorporate three or more dimensions, and lend themselves to multi-dimensional data modeling and analysis. Multi-dimensional databases can also include multimedia data, not just heterogeneous or encapsulated legacy data. This book focuses on data visualization for decision support systems.

Read More Show Less

Editorial Reviews

Booknews

A leading proponent of the dimensional data warehouse, a database model designed to improve managers' ability to quickly analyze large, multidimensional data sets, provides concrete tools for designing, building, managing, and using dimensional data warehouses for different types of business applications. A CD- ROM includes software for querying dimensional data warehouses and working models of all databases described in the book. Annotation c. Book News, Inc., Portland, OR (booknews.com)

Read More Show Less

Product Details

ISBN-13: 9780471153375
Publisher: Wiley, John & Sons, Incorporated
Publication date: 1/28/1996
Edition description: Older Edition
Edition number: 1
Pages: 416
Product dimensions: 7.54 (w) x 9.22 (h) x 0.89 (d)

Meet the Author

Ralph Kimball currently designs large data warehouses for a living, in addition to teaching and consulting. He also writes the Data Warehouse Architect column for DBMS magazine and was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. Dr. Kimball was vice president of applications at Metaphor Computer Systems, and founder and CEO of Red Brick Systems. He earned his PhD in electrical engineering from Stanford University.

Read More Show Less

Read an Excerpt

Chapter 9: Insurance

In this chapter we bring together ideas from nearly all the previous chapters to build a data warehouse for a typical insurance company. If you are in the insurance industry and you have jumped directly to this chapter for a quick fix, please accept the author's apologies, but this chapter depends heavily on ideas from the previous chapters. You need to read the previous chapters!

We will imagine that our insurance company is a $3 billion property and casualty insurer for automobiles, home fire protection, and personal liability. There are two main production data sources: all transactions relating to the formulation of policies, and all transactions involved in processing claims. Based on extensive interviews of claims processing managers, field sales managers, financial managers, and senior management, we conclude that the insurance company wants to analyze both the written policies as well as the claims. They want to see which coverages are most profitable and which are least profitable. They want to measure profit over time by covered item type (i.e., which kinds of houses and which kinds of cars), state and county, demographic profile, underwriter, sales broker and sales region, and event. Events are usually called catastrophes in this industry. The desire to see profit implies that both revenues and costs can be identified and tracked. The insurance company wants to understand what happens during the life of a policy, particularly when a claim is processed. The time from when a claim is first made to when the first payment is made is an important measure of the efficiency of the claims handling process. This can be measured only by looking at the detailed transactions.

In this chapter we will develop the design in the same way we developed the household data warehouse for the bank. We will start with the central dimensional schemas for policy creation and claims processing and then embellish them to handle monthly snapshots, heterogeneous products, and demographic minidimensions.

POLICY CREATION

We will assume that a policy is a "header" for a set of coverages sold to the insured party. Coverages are really the products that insurance companies sell. Homeowner coverages include fire, flood, theft, and personal liability. Automobile coverages include comprehensive, collision damage, uninsured motorist, and personal liability. In a property and casualty insurance company, such as the one we are modeling, coverages typically apply to a specific covered item, such as a particular house or car. Both the coverage and covered item are identified carefully in the policy. A particular covered item will usually have several coverages listed in the policy. We will assume that a policy can contain multiple covered items.

A policy is sold to an insured party by an agent or a broker. Agents are either employees of the insurance company or may be independent. Brokers are independent. Before the final policy can be created, a rater in the insurance company determines the rate that will be charged, given the particular coverages, the particular covered items, and the qualifications of the insured party. The final step of approval is made by an underwriter, who takes the ultimate responsibility for doing the business with the insured party. Typically the agent (or broker), rater, and underwriter are all different people.

We assume we capture the following transactions in the production policy creation system:

Create Policy; Alter Policy; Cancel Policy (with Reason)
Create Coverage on Covered Item; Alter Coverage; Cancel Coverage (with Reason)
Rate Coverage; Decline to Rate Coverage (with Reason)
Underwrite Policy; Decline to Underwrite Policy (with Reason)

In general, a given transaction comes with a lot of context. For example, when a Create Coverage on Covered Item transaction is processed, many things are known, including the insured party, the agent, the policy number, the coverage identification, and the covered item identification. If rating and underwriting have not happened yet, then the rater and underwriter are not known. it is important when designing the production extract system not to take the data found in the atomic transaction record so literally that some of the context is treated as unknown. Or, to state it more positively, a given atomic transaction must be embellished with as much context as possible, to create a full-dimensional description of the transaction.

The dimensional framework for the policy transaction fact table consists of the following:

Transaction date
Effective date
Insured party
Employee
Coverage
Covered item
Policy
Transaction

The transaction date is the date when the transaction was entered into the production system. The effective date is when the transaction is legally effective. These two independent dimensions can be implemented via a single physical table, which is declared via the SYNONYM construct to be two different tables as far as SQL is concerned, as described in Chapter 2.

The insured party is the customer. In all probability, the insured party dimension is a ditty customer dimension, in that no really serious attempt is made to identify previous instances of an insured party on other policies. The insured party can be multiple people, such as a person and their spouse, or the insured party can be a business entity. In general, insured party is a big dimension, which in the case of our $3 billion insurance company probably has a million or more entries. Later in this chapter we will definitely be using the big-dimension techniques on insured party that we developed in Chapter 6. . . .

Read More Show Less

	Introduction
1	Two Different Worlds	1
2	The Grocery Store	21
3	The Warehouse	49
4	Shipments: The Most Powerful Database	65
5	The Value Chain	81
6	The Big Dimensions	89
7	Financial Services, Especially Banks	107
8	Subscription Businesses	117
9	Insurance	125
10	Factless Fact Tables	143
11	Voyage Businesses	153
12	Building a Dimensional Data Warehouse	161
13	Aggregates	187
14	The Back Room	211
15	The Front Room	231
16	Front End Applications	243
17	The Future	279
	Appendix A. Design Principles for a Dimensional Data Warehouse	289
	Appendix B. A System Checklist for a Perfect Dimensional Data Warehouse	301
	Appendix C. A Glossary for a Dimensional Data Warehouse	307
	Appendix D. User's Guide for Star Tracker	321
	Appendix E. System Administrator's Guide for Star Tracker	367
	Index	381

Read More Show Less

Introduction

This is a book for you, the owners, managers, and implementers of a data ware house. It is a toolkit of design principles and techniques that when applied tc specific end user needs and specific legacy databases will allow you to plan and build an enterprise-level data warehouse. This book is much more specific than most books on this subject. We will take you all the way down to the bedrock of how to organize the data for legibility, and how to organize the data to answer important kinds of business questions. Once this bedrock is established, you will be able to build a major, enterprise-level data warehouse that will grow with your organization's changes and with your capacity to source more data for the warehouse itself.

This book is intended for the technical business analyst who works in the Information Systems (IS) department of a large orgaruzation and who either manages or is directly responsible for building the enterprise data warehouse If you are a technical business analyst then you have a multidisciplinary jot with skills and responsibilities both for the business information in your organization and for the appropriate use of computer technology to deliver this information to workers throughout your organization. Ideally, you have one of more of the following responsibilities:

Manager of corporate data warehouse development
Data Architect for the data warehouse
End User Applications Development Manager for the data warehouse
Data Extract Programmer between a legacy system and the data warehouse
Database Administrator for the data warehouse
End User Applications Programmer/Business Analyst for thedata warehouse

Conversely, if you have one ofthe following roles, you will find this book very useful as background information in order to understand what the people listed above need in their jobs:

* Vice President, Information Systems

* Chief Information Officer (CIO)

* System Administrator, UNIX Server Operations

* Network System Administrator

* End User Department Manager (the primary client of the data warehouse)

This book is aimed at the serious professional who has some understanding of his or her own IS department and some understanding of how computer technology is deployed in a typical large organization. The book is mildly technical, and definitely assumes that you can provide a one-sentence definition of a number of the major operational computers and applications in your environment, and that you can provide a one- sentence definition of terms like transaction processing decision support, legacy system, mainframe, network, personal computer, relational database, and graphical user interface.

This book develops the techniques of data warehouse design by example. In Chapters 2 through 11 we systematically develop all of the major examples that are needed to design any data warehouse. These chapters are not discrete handbooks for each industry type. 'These chapters are meant to be read by every reader, in order. For example, the chapter on insurance will be incomprehensible unless you have read the preceding chapters on grocery stores, manufactoring shipments, and big dimensions. In the author's experience, it is often easier to grasp the main elements of a design technique by stepping away temporarily from the all too familiar complexities of one's own applications in order to think about "another business." This approach also has the benefit that you Frill broaden your business exposure by thinking about other bustnesses.

In order to bridge the gap between your detailed understanding of your own business and the examples from other businesses, some care has been taken to provide an unusually complete glossary. Whenever a word is used in a narrow business context for the first time (like glossary), it is set in bold type and expanded in Appendix C. We hope you will find this breadth of industry examples and terminology understandable and interesting.

If you read this book, you will learn how to design the data warehouse for both end user legibility and the capacity to address the Key issues in your organization (Chapters 2 through 11). You will learn how to assess the needs of your organization and how to match these needs to the available data (Chapter 12). You will learn the classic design techniques for data warehousing including handling slowly changing dimensions, heterogeneous products, and accumulating snapshots (Chapters 2 through 6). You will learn the main administrative responsibilities of running a data warehouse (Chapters 12 through 16). You will learn how to plan for the crucial step of adding prestored summaries or aggregates to your data warehouse (Chapter 13). You will learn what the back room and front room administrative and development responsibilities are (Chapters 14 and 15). You will learn in detail what the application and reporting challenges are in transferring data from the database management system (DBMS) onto the user's desktop in usable formats (Chapter 16). And finally, you will learn about some of the new areas of research and development in data warehousing from which you can expect significant new product development from data warehouse vendors in the next two to three years (Chapter 17).

This book comes with a CD-ROM that contains data and software. Please refer to the read.me.TXT on the CD-ROM for any last minute changes and updates. Each database example in this book is provided on the GD-ROM as a Microsoft Access database in exactly the form shown in the book. Although each database is artificially generated, some care has been taken to make the examples realistic and big enough to be interesting. Many of the central fact tables have up to 10,000 records. The enclosed software is a Visual Basic reporting application called Star Tracker. There is an enclosed license agreement that you will have to read and agree to in order to use Star Tracker and in order to keep me from being liable for your data warehouse. End user and administrative manuals for using Star Tracker are included in this book as Appendices D and E. Star Tracker is all you need to use the CD-ROM and the enclosed databases in Microsoft Windows environments. You can also take Star Tracker and point it at any other appropriately structured dimensional database that presents an ODBC interface to your PC, including Red Brick, Oracle, Sybase, and Informix. Software updates for Star Tracker are continuouslyy available for free on the author's web home page at http://www.rkimball.com.

DIMENSIONAL MODELING

Fundamentally, this is a book about dimensional modeling and how to build a dimensional data warehouse and keep it running. Dimensional modeling is a new name for an old technique for making databases simple and understandable. When a database can be visualized as a "cube" of three, four, or even five or more dimensions, people can imagine sheing and dicing that cube along each of its dimensions. This book shows how to build dimensional models of businesses that are easily understood and navigated by end users. Perhaps just as important, these models can be understood and navigated by software so that user interfaces can be made simple and the performance of the quedes can be made acceptable. Finally, this book is about applying the techniques of :dimensional modeling specifically to relational databases.

WHY MODEL?

Dimensional modeling gives us the ability to visualize data The ability to visualize something as abstract as a set of data in a concrete and tangible way is the secret of understandability. Let's try a simple example. Imagine a business where the CEO describes what the company does as follows:

'We sell products in various markets, and we measure our performance over time."

As data warehouse designers, we listen carefully to these words and we add o ur own special emphasis:

'We sell Products in various Markets, and we measure our performance over Time."

Most people find it easy to think of this business as a cube of data, with labels on each of the edges of the cube, such as shown in Figure 1.1. Any point inside the cube is at the intersection of the coordinates defined by the edges of the cube. For the business described above, we label the edges of the cube as Product, Market, and Time. Most people can imagine that the points inside the cube are where the measurements of the business for that combination of Product, Market, and T me are stored. This is the dimensional model.

If this perspective seems too simple, then good! A model of data that starts by being simple has a chance of remaining simple at the end of the design. A model of a business that starts by being complicated will surely be complicated at the end. In Chapter 1 we will show how this simple "cubist" approach is implemented in a relational database.

Now compare the dimensional model of the business with a data dependencies model of the business as shown in Figure I.2. Here we have taken the same business, but instead of talking to the CEO we have dug up the detailed data entity charts of the business that describe how every item on a sales invoice relates to every other item and what all of the many-to-many and many-to-one relationships between data elements are. This picture certainly reveals more detail about the data relationships than the dimensional picture does. But does it contribute to understanding the business? Unfortunately, most people cannot hold a diagram like this in their minds and cannot understand how to navigate it usefully. There is a false sense of security in the diagram detail. "If it's detailed, it must be good." The truth is that relationships among data are best viewed dynamically on a screen, not by static road maps that the users try to hold in their minds. Both the dimensional model of a business and the data dependencies model of a business are capable of storing exactly the same data, and are capable of supporting extactly the same final business analyses. It's just that we are choosing to present the data differently. The dimensional model is a topdown model (notice that we started by talking to the CEO), and the data dependencies model is a bottom-up model.

Even if people don't think about a dimensional model so explicitly, they still have a natural sense of Reconstructing our example business by the obvious components of Product, Market, and Time. They will even use the word dimension in such a conversation, whether or not they are visualizing a cube of data.

The central attraction of the dimensional model of a business is its simplicity. We will see in this book that simplicity is the fundamental key that allows users to understand databases, and allows software to navigate databases effciently. In many ways the dimensional design process amounts to "holding the fort" against assaults on simplicity. By consistently returning to the top-down perspective, and by refusing to compromise on the goals of user understandability and software performance, we can maintain a coherent design of a database that serves the needs of a data warehouse.

The distinction between the dimensional model and the data dependencies model is at the very center of data warehouse design. Crudely put, if you as the reader get nothing else from this book than the conviction that your data warehouse must be built from a simple dimensional perspective rather than from a complex data dependencies perspective, then it will have served its purpose. The rest of this book systematically extends and deepens the ideas behind dimensional modeling.

SNATCHING DEFEAT FROM THE JAWS OF VICTORY

We are in danger of letting the relational database revolution pass by without making good on the original promise of delivering data that can be accessed every which way. Why did we begin turning our IS shops upside down in the early 1980s, replacing our flat file databases and our hierarchical databases with relational databases, if it wasn't for the dream of much more flexible access? We are now very far along with replacing all of our production database systems with relational technology, yet amazingly, we have lost track of the original "fatal attraction" that led us to relational databases in the first place.

Relational databases were supposed to provide equal access. If your sales database was built around product, market, and time entities, it wasn't supposed to matter whether you asked the product question before you asked the market question, or vice versa. Either approach was just as good. The older hierarchical databases, like IMS, forced you to ask your business questions in a fixed order, beginning with the root of the database. Freed from the shackles of asking business questions in a set way, with relational databases we were supposed to finally unlock our corporate information.

The early writings on relational databases were full of the promise of equal access. Chris Date's wonderful book, An Introduction to Database Systems, published in the early 1980s by Addison-Wesley, is filled with simple and compelling examples of equal access. Who can forget his examples of parts and suppliers and cities? Or is it suppliers and cities and parts? It doesn't matter . . . it's relational. It is very revealing to page through his book and see that there are no discussions of transaction processing and EIR diagrams. All of that came later.

Even though we bought the relational dream in the early 1980s, we couldn't use relational databases very effectively because there wasn't any data in them yet. So we began to use relational technology to capture primary business production data like orders, invoices, and business transactions. Almost immediately we ran into a serious problem: The early relational systems were pitifully slow for transaction processing. A typical transaction rate was about one per second. No large business could possibly run at one transaction per second. For example, today's SABRE system, the reservation system for American Airlines, routinely processes 4,000 transactions per second during heavy loads, and is capable of peak bursts in excess of 13,000 transactions per second.

The transaction processing performance crisis we had in the 1980s was both good news and bad news. It was good because it caused the database vendors to seriously improve and strengthen the relational database software so that transactions could be performed fast and reliably. This was a necessary precursor to making relational databases "production capable." Today it is possible to buy off the shelf a UNIX processor and a relational database that are capable of a sustained transaction rate of 1,000 per second. This is a monstrous improvement from the early 1980's that would have seemed impossible at that time. It used to be said that "the price we have to pay for the wonderful flexibtlity of relational databases is that they will always be slow." Fortunately, this view was wrong.

The good news of the big transaction processing performance gain, however, is tempered with the bad news that we have become fixated by transaction processing. We have become so steeped in the terminology and discipline of online transaction processing (OLTP), that we have a whole generation of IS professionals who think that the point of relational databases is to get data in rather than get data out. Actually, the opposite is true. Transaction processing is an awkward graft onto relational databases, and to do transaction processing well in this environment we have to ignore or even misuse many of the core facilities in relational databases. In Chapter 1, we will systematically explore the differences between OLTP and dimensional data warehousing, and learn how to think separately about getting the data in and getting the data out. We need to return to our roots.

THE GOALS OF A DATA
WAREHOUSE

The data warehouse is the place where people can access their data. The fundamental goals of a data warehouse can be developed by walking around the halls of any large organization and listening to management talk. The recurring themes heard from management sound like this:

"We have mountains of data in this company but we can't get access to it."
"Nothing drives senior management crazier than to have two people present the same business result but with different numbers."
"We want to slice and dice the data every which way."
"Just show me what is important."
"Everyone knows that some of the data isn't very good.".

These concerns are so universal that they drive the bedrock requirements for the data warehouse. Let us turn these problems into opportunities and state them as requirements:

1. The data warehouse provides access to corporate or organizational data.

Access means several things. The managers and analysts of an organizaffon must be able to connect to the data warehouse from their personal computers. This connection must be immediate, on demand, and with high performance. It is not acceptable if the access is through another person, or if the access is unreliable or slow. High- performance access means the tiniest queries run in less than one second. Access also means the tools available to the managers and analysts are very easy to use. Chapter 16 will discuss this issue in much more detail, but "easy to use" means a useful report can be run with one button click after opening the tool, and the report can be changed and rerun with two button clicks.

2. The data in a data warehouse is consistent.

Consistency means that when two people request the sales for the Southeast region for January they get the same number, even if they request the data at different times. Consistency also means that when these people ask the data warehouse what the definition of the "sales" data element is, they get a useful answer that lets them know what they are fetching from the database. Consistency also means that if yesterday's data has not been completely loaded, the analyst is warned that the data load is not complete and to not expect the final data load until tomorrow.

3. The data in a data warehouse can be separated and combined by means of every possible measure in the business (the classic slice and dice requirement.)

The slicing and dicing requirement speaks directly to the dimensional approach. We will see in Chapter 1 that a more operational definition of slicing and dicing is row headers and constraints. Row headers and constraints will turn out to be the fundamental building blocks of every data warehouse application and they will come directly from the dimensions in our data model.

4. The data warehouse is not just data, but also a set of tools to query, analyze, and present information.

The "back room" components, namely the central data warehouse hardware, the relational database software, and the data itself, are only about 60 percent of what is needed for a successful data warehouse. The remaining 40 percent is the set of front end tools that query, analyze, and present the data. The "show me what is important" requirement needs all of these components. These points are developed in detail in Chapter 16.

5. The data warehouse is the place where we publish used data.

The responsibility to publish is at the very core of the data warehouse. Data is not simply accumulated at a central point and let loose. Rather, data is carefully assembled from a variety of information sources around the organization, cleaned up, quality assured, and then released only if it is fit for use. If the data is unreliable or incomplete, the responsible data quality manager does not allow it to be published to the user community. The data quality manager plays much the same role as a magazine editor or a book publisher. He or she is responsible for the content and quality of the publication and is identified with the deliverable.

6. The quality of the data in the data warehouse is a driver of business reengineering.

The best data in any company is the record of how much money someone else owes the company. Data quality goes downhill after that. Frequently a data element would be very interesting if it were of high quality, but it either isn't collected at all or it is optional. Optional is the kiss of death for data.

The data warehouse cannot fix poor quality data. If an automobile in surer does not require the "cause of accident" data to be collected by the field adjusters, then there is nothing the data warehouse can do when this data arrives at the front door. The only way to fix poor-quality data is for both affected data entry personnel and management to return to the source of the data with better systems, better management, and better visibility of the value of good data. Interestingly, often a good way to justify such a siness reengineering project is to go ahead and publish the incomplete data, and then let natural pressure arise within the organization when people see how valuable the data would be if only it was of better quality. In this way,the data warehouse can play a key role in the business reengineering efforts in an organization.

THE GOALS OF THIS
BOOK

This book is a practical guide to owning and building a data warehouse. A major goal of the book is to teach the processes of (1) assembling a proper set of requirements for the data warehouse in your organization; (2) doing the logical design of the warehouse data structures; (3) planning the data extract and transformation steps down to the individual data element; (4) building a front end tool suite; and (5) managing the completed data warehouse. This book is written from years of hands-on experience designing large data warehouses and launching IS organizations in the right direction. Every one of the examples developed in Chapters 2 through 11 is an operational data warehouse in a large corporation designed by the author. All of these warehouses are in the 10 gigabyte to 1 terabyte range and the largest queryable tables range up 1 billion records. A second goal of this book is to communicate a set of standard techniques for data warehouse design. The OLTP side of the house has developed a powerful and precise vocabulary for OLTP design. Serious OLTP designers know that transaction systems that pass the ACID test possess the qualities of atomicity, consistency, isolation, and durability. We on the data warehouse side of the house need an equally powerful vocabulary and set of design principles. Fortunately, data warehouse design is replete with a number of powerful, consistent principles. This book names and categorizes these principles and ties each of them to characteristic businesses. In this way they are easier to remember. Appendix A of this book summarizes this data warehouse design vocabulary.

A third goal of this book is to help the reader understand where the data warehouse part of the market is going and to help the reader become an effective voice for necessary change. Data warehousing is not very mature. Although the vendors are doing their best to define what is important in this market, most of the advances have been far too technology driven and no enough user driven. It is amazing that we are being sold on the benefits of scanning billion-row tables from start to finish on parallel processors but most of the vendors don't have a STOP command for runaway queries. It is amazing that we have logically sophisticated structured query language (SQL) commands like EXISTS and ANY being implemented faithfully by all the DBM vendors, but we don't have commands or tools that can effectively compare one number with another. If you don't believe this, stay tuned. A final goal a this book is to show how well (and how poorly) SQL and relational database match with simple business analysis.

Read More Show Less

Customer Reviews

Be the first to write a review

( 0 )

5 Star

(0)

4 Star

(0)

3 Star

(0)

2 Star

(0)

1 Star

(0)

If you find inappropriate content, please report it to Barnes & Noble

The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses (Includes CD-Rom)

Overview

More About This Book

Overview

Editorial Reviews

Booknews

Product Details

Related Subjects

Meet the Author

Read an Excerpt

Chapter 9: Insurance

Table of Contents

Introduction

Customer Reviews

5 Star

4 Star

3 Star

2 Star

1 Star