Read an Excerpt
Microsoft SQL Server 2008 Integration Services Problem-Design-Solution
By Erik Veerman Jessica M. Moss Brian Knight Jay Hackney
John Wiley & Sons
Copyright © 2010 John Wiley & Sons, LtdAll right reserved.
ISBN: 978-0-470-52576-0
Chapter One
SSIS Solution Architecture
Imagine that this is the first day of a new internal or client project. You will have responsibility on the data and processing layer of the solution, which involves processing data - a lot of data - from several sources, and then either integrating systems, or maybe consolidating data for reporting. Maybe your background is a developer, or a database administrator (DBA), or a data designer, and you know SSIS fairly well. But now they are calling you the "data and ETL architect."
ETL is the acronym for Extraction, Transformation, and Loading, which is used to describe the data-processing layer in a data-integration or data warehouse solution.
The project manager approaches you and says that the Vice President of Technology has asked the team to give him an estimate of the infrastructure needed. Furthermore, the business owner wants a high-level overview of how the solution architecture will help the company achieve the business need most efficiently. The project manager also wants your thoughts on the best way to approach the solution, how the development should be coordinated between team members, and how deploymentshould be handled.
Three meetings are scheduled for tomorrow to address these things, and you've been asked to provide your feedback in those areas.
Where do you start? How should you approach the solution design with SSIS as the main technology? How should all the pieces work together?
This chapter examines how to lay the foundation for successful solutions based on SQL Server Integration Services (SSIS). And, in fact, this whole book is about SSIS solutions to real-world requirements and challenges. It addresses questions such as the following:
What are the problems and issues?
What are the design principles?
How do you put it all together for a complete and successful solution?
Before you dive into the technical challenges of a project, you must first step back and ensure that you are laying the right foundation. Jumping right in is tempting! But resist the urge, because you want to (and need to) set the precedence and patterns for the solution upfront. If you don't, chances are you won't be able to go back and make changes down the road.
As with all chapters in this book, this chapter is organized into the following three major sections:
"Problem" - Coordinating and architecting an SSIS solution is not an easy task. The "Problem" section reveals some common challenges and common mistakes when planning and extending your ETL environment.
"Design" - The "Design" section in this chapter examines the right approach to a project, and the long-term project aspects that you should set in motion early in a project.
"Solution" - In many ways, the remainder of this book provides you with the solutions to make it all work together. This section launches you into the rest of the book, and shows how you can follow the chapters to build or redesign your SSIS solution.
Problem
Data and ETL projects have many challenges. Some challenges relate to data, some to enterprise integration, some to project coordination, and some to general expectations. This section begins by looking at the bigger picture of data within an organization, but then quickly looks at ETL projects and SSIS packages and execution.
Macro Challenge: Enterprise Data Spaghetti
Maybe your SSIS project is only a small solution in a bigger enterprise pond. The problem is that it can still cause a ripple effect when you tie it into your environment. Or, you can have challenges caused by an unwieldy enterprise environment when you try to implement your solution.
Figure 1-1 shows a not-so-nice telephone/electricity pole that illustrates the data nervous system of many organizations.
The problem with Figure 1-1 is that this mess didn't happen overnight! It grew into this twisted unorganized process because of poor planning, coordination, and execution. However, be aware that, a lot of the time, a corporation's politics may lead to this type of situation. Departments hire their own technical people and try to go around IT. Systems don't talk to each other nicely. Project pressures (such as time and budget) cause designers to cut corners.
Following are a few reasons why this kind of tangled mess happens in an organization's data processing, and examples of the many problems that this "unorganization" causes:
Debilitating dependency chains - The biggest problem is that often systems are built on top of systems on top of systems. The core source data has been connected through so many precedence links that it takes more time and administrative and development overhead. Systems at the source and in the middle of dependencies become un-replaceable because of the amount of effort that switching to a new system would take.
Unknown and uncontrollable data processing - The operations that process data are not centralized, and, in many cases, unknown because of department applications that are created without coordination with other applications. Processes run at uncontrolled times, and may impact systems within the processes even during peak times, which affects work efficiency.
Fragile enterprise changes - Changes to applications are difficult and costly. They may break processes, or cause data integration or reporting applications to be inaccurate.
Delayed data access - Even when the processes are somewhat controlled, the complicated system dependencies cause delays in data availability and nightly overhead processes that often run into mid-morning schedules. When they break, customer perception and employee efficiency are affected.
The "Design" section later in this chapter discusses how to approach your SSIS-based ETL project in the right way, and ensure that you are helping to solve the problem, rather than adding to it.
Micro Challenge: Data-Processing Confusion
Another common problem with data processing is when the logic contained to process data is overly complicated and confusing. Just like the macro enterprise problem, this problem usually is the result of changes over time where logic is modified and appended. It usually comes in one of two ways:
Runaway stored procedures - Procedures that run with complicated logic and lots of temporary tables, inserts, updates, and/or deletes can be difficult to manage, and are often inefficient. Supporting the procedures is also very difficult because the logic is difficult to follow, and, many times, the developers or DBAs who wrote the code are unavailable. Overall, this type of process requires a lot of administration and wasted time spent on following and learning the process.
Unmanageable packages - SSIS packages can also be designed with difficult-to-follow logic and sometimes complex precedence with hundreds of components used in a single package. These kinds of packages have challenges similar to those of runaway stored procedures, such as troubleshooting and the learning curve required for the process. Figure 1-2 shows the control flow of a package that has too many components to effectively manage. (The SSIS designer is zoomed in at 50 percent to fit on the screen.)
The overly complex control flow shown in Figure 1-2 is similar to an overly complex data flow, where too many components are used, thus making the development, troubleshooting, and support difficult to manage. The "Design" section later in this chapter proposes a better approach for SSIS packages called the modular package approach.
In summary, both of these types of processes (runaway procedures and unmanageable packages) are very difficult to support, and not suited to team development, error handling, and scalability (all of which are addressed in Chapter 12).
Problems with Execution and Troubleshooting
A couple of other issues that often come up in an ETL or data-integration solution are poor process coordination and difficulty doing root cause analysis. If the "what happened?" question can't be answered quickly and with confidence, then likely there is a problem with the overall solution execution and logging strategy.
Figure 1-3 shows the command-line output of an example SSIS package execution.
If you were to consider spending time trying to work through this output when trying to figure out what went wrong, then you should consider implementing a better execution and auditing structure. This includes package execution in your development environment.
If you have just turned on the out-of-the-box SSIS logging and are capturing results to output to a table, it still may not be enough. If you write custom queries every time against the SSIS logging table to figure out what happened, then you also need a better strategy.
Infrastructure Challenges
When designing an SSIS ETL solution, how do you determine the infrastructure components such as server hardware, memory, processor cores, network switches, disk arrays, storage networks, and I/O controllers? Related to that, where should you run your SSIS packages taking into consideration sources, destinations, and other applications, while balancing hardware scalability and location within your network topology?
These questions are not trivial, and the answers depend on a lot of factors, including processing windows, source and destination availability, application impact and availability, network bandwidth, fault-tolerance requirements, and so on.
I/O is usually the biggest bottleneck, and the one most often overlooked. I/O (or, more precisely, disk I/O) is the throughput that the system can handle on the drive volumes. And this challenge is not just about trying to get the greatest throughput on a single drive. You must consider staging and temporary environments, logging, and current and historical data. And you must balance it all with hardware availability and budget.
The reason disk I/O is so important when considering a data-integration or ETL effort is because of the nature of what you are doing, including the following:
Bulk operations - ETL and integration efforts typically process data in bulk, meaning that, when a process is kicked off (hourly, daily, or weekly), data is moved from sources to destinations with some transformation processes in-between. The processes usually move or integrate thousands or millions of records. That can be a lot of data that moves between systems, and it generates a lot of disk activity.
Source databases - Processes that are extracting a lot of data from sources incur disk overhead either by the sheer volume of the data, or when complicated queries against large tables require temporary operations that use disks (such as the SQL Server TempDB).
Destination databases - The nature of relational databases requires that data be stored to disk before a transaction is marked as complete. When inserting or updating a lot of data, the server must wait until the data is committed for the process to be complete.
Staging databases - Staging databases and tables are a common intermediate step in an ETL process and can be used for many reasons. Any time that data is landed to a staging database and then extracted from the staging database, it has the overhead of both the insert and the select, which can, at times, be done simultaneously with inserts into the destination database and, therefore, the I/O is compounded.
File management - A lot of ETL operations deal with files such as delimited files, XML, EDI, and so on. Large files require file management, archiving, and sometimes processing, and, therefore, incur disk I/O overhead.
The bottom line is that you will most likely have a disk I/O bottleneck in your data-processing operations, and you'll need to plan and manage for that to meet your service level agreements (SLAs) and performance requirements.
Other Challenges
The list of common project challenges can go on and on, but here are a few more:
Data challenges - Of course, you will have data challenges in your project anywhere from missing records to dirty data to bad data, and you will need to understand those problems as soon as possible so that you can set the expectations upfront about what can and what cannot be done about them. Although you can do a lot in SSIS (including fuzzy matching), magic is not on the component list - you can't make up data that doesn't exist. Don't overpromise. Be realistic.
Corporate political challenges - This book is about solving problems with a technology, namely SSIS. But, because you are trying to solve problems, you are going to be dealing with people. Everyone has an agenda, and a lot of times those agendas will not be in your project's best interest. Watch out for people who are threatened because you are changing the way things are done (even when it is for the better), or because your solution will be replacing one of their legacy applications, or because they are territorial about their "sandbox." You want to fix the enterprise spaghetti shown in Figure 1-1, but don't forget that some people have their tribal knowledge and make their living by keeping the corporate data nervous system tangled.
Requirement and scope challenges - Any project has scope creep. Just be careful about how the changes affect the project timeline, and don't leave data validation until the last day. You'll get burned!
Design
Now that you are scared, step back and take a deep breath. Designing an ETL process is doable, and, with the right approach in mind, you can be successful. This section discusses the overall design approach to an SSIS-based solution by examining the following:
Choosing the right tool
Solution architecture
Project planning
Package design patterns
Server and storage hardware
Package execution location
Choosing the Right Tool
This book is about applying SSIS. You are probably reading it because you assume that SSIS is the right tool for the job. That's probably the case. However, be sure to consider what you are doing, and ensure that using SSIS is in line with what you are doing.
Think about all the different types of data-processing needs that you have across your organization:
Data synchronization between systems
Data extraction from enterprise resource planning (ERP) systems
Ad hoc reporting
Replication (both homogeneous and heterogeneous)
PDA data synchronization
Legacy system integration
Data warehouse ETL processing
Vendors and partner data files integration
Line-of-business data processing
Customer and employee directory synchronization
As you may know, when it comes to data processing, a lot of tools are out there. Some are created for specific situations (such as folder synchronizing tools), whereas other tools are designed to perform a variety of functions for different situations. So, the traditional question often posed is which tool can best meet the business and logical requirements to perform the tasks needed?
Consider the host of tools found in the ever-evolving Microsoft toolset. You can use Transact SQL (TSQL) to hand-code a data load, Host Integration Server to communicate with a heterogeneous data source, BizTalk to orchestrate messages in a transactional manner, or SSIS to load data in batches. Each of these tools plays a role in the data world.
(Continues...)
Excerpted from Microsoft SQL Server 2008 Integration Services Problem-Design-Solution by Erik Veerman Jessica M. Moss Brian Knight Jay Hackney Copyright © 2010 by John Wiley & Sons, Ltd. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.