Read an Excerpt
Chapter 6: Populating the Oracle Data Warehouse
Of all the work done before the data warehouse is ready to open, the average warehouse development team will spend less than 20 percent of its time design-ing the warehouse, configuring hardware, installing software products, and the like. Easily 80 percent of the work involved in setting up a data warehouse is devoted to locating, extracting, filtering, scrubbing, and finally loading data. Of these tasks, loading is by far the easiest. Resolving the various (and nefarious!) problems that arise when we bring together data from multiple source systems is invariably more difficult than we anticipated.
We haven't devoted a lot of space in this book to planning an overall data warehouse project, but we will provide one guideline here. Take whatever amount of time you believe will be necessary for data extracts, data scrubbing, and data loading; then double that total amount and add it into the schedule for resolving data problems. And then make sure there is some additional slack in the schedule to allow for the data that still isn't quite right!
It is always hard for the first-time warehouse analyst to accept, in advance, the fact that the majority of the operational systems (with which we are success-fully running our businesses) are not internally consistent-that somehow they continue to function even with dirty data! After the first warehouse load experi-ence, however, a healthy dose of skepticism is inevitably introduced. Among many other valid reasons for starting on a small scale, this is sufficient justifica-tion for basing Phase I of the warehouse on only a very small number of subject areas and sourcesystems.
The complexity of the integration process grows rapidly as the number of subject areas and source systems are increased. For a very simplified example, Figure 6.1 demonstrates that the probability of a data conflict is proportional to the number of system integration points to be resolved. If we are integrating data from two sources, there is only one integration point to be resolved. With a triangle of three sources, there are three "edges" to be integrated. As we move to four sources, we have to deal with six potential conflicts. Five sources yields ten opportunities for integration problems, while with six sources we have to resolve discrepancies arising from fifteen different integration points.
It is easy to see how the complexity of integrating data from many sources rapidly becomes unmanageable. By starting the data warehouse with, say, three subject areas or sources, we can gain experience while dealing with a relatively small number of data integration points. A second phase can then add three more data sources to the existing warehouse with only an additional six areas of potential conflict (since the existing integrated warehouse now acts as a single source). This phased approach actually presents less total work (3 + 6 = 9 integrations) than attempting to do a "big bang" implementation (15 integrations!) and is more manageable.
Dealing with Dirty Data
Integrating data from multiple sources is difficult under the best of circumstances. Unfortunately, we have never seen the best of circumstances! One should expect differences between the names, formats, and encoding of data being extracted from multiple sources. An organization's many operational systems should naturally be expected to have been designed by different teams at different times and probably built using different file or database management systems. The old payroll system tracks employees by social security number in VSAM files while the newly purchased project accounting system tracks consulting employees by name. Sometimes project accounting's employee name doesn't exactly match the name carried in the payroll system. Payroll also keeps track of all employees, some of whom aren't in project accounting because they don't work on billable projects. To make things worse, the project accounting system tracks some independent contractors who aren't even company employees.
On top of these expected differences that have to be resolved, it is virtually guaranteed that we will find other surprises in the data we extract. Our newly designed data warehouse has a carefully architectured environment in which we document and incorporate specific data rules. We find that many of our existing operational systems were built without rigorous enforcement of various business rules. in other cases, we find that the application code has been enforcing some rules that aren't documented-we "discover" these rules as we try to make sense of the extracted data.
It may seem surprising, but it is not uncommon, to encounter dirty data: duplicate records, missing "parents" from what should be parent/child relationships, denormalized data that can't be renormalized. Imagine a system developed in COBOL and VSAM files in which the record layout is designed to hold everything necessary about an employee's weekly timesheet. (This design allowed the original COBOL programmer to avoid complex multiple input logic.) The record contains repeating groups to track the multiple projects, hours, and days on which an employee might work during the week. The record also carries the information about the employee, their pay grade and organization code. Such a structure makes the payroll processing fairly simple. What happens, however, if the organization code or other employee information changes from one week to the next? The next week's payroll record simply carries the new values, and payroll churns along happily. What happens, however, if the pay grade changes on the first day of a month that falls on a Wednesday? Well, the original programmers found an easy solution, they just create two timesheet records for that employee for that week, one with sixteen hours and the other with twenty-four hours at the higher rate of pay. Payroll processing still works fine, but when we extract for the data warehouse we expect to load data into a normalized structure with primary keys. We have to resolve the two records into one parent row with multiple child rows for each day and project code.
We also expect that every project code we encounter in a payroll record will already be part of our PROJECTS dimension table. Since the payroll system is only concerned with how many hours the employee worked and how much money to pay per hour, it never cared about specific project codes....