Read an Excerpt
Index Design for Performance Chapter (first 15 pages) Index Design for Performance ID.1 INTRODUCTION
It is important for both DBAs and programmer/analysts to understand how to use indexes. DBAs are usually concerned with physical design and maintenance, and programmer/analysts must understand how to write SQL statements that make the best use of indexes. Chapter IU describes the methods used by the index manager to efficiently locate rows.
This chapter describes issues related to designing indexes. It identifies the characteristics of columns that should and should not be indexed, and it describes the characteristics of columns that benefit the most from having a clustering index. Information about designing composite indexes are addressed.
ID.2 INDEX DESIGN GUIDELINES
Indexes can be used in a variety of ways and have a number of advantages. The primary advantage of indexes is the ability to process a small percentage of the rows efficiently with minimal I/O and CPU usage. We would like to see index usage when selecting, updating, deleting, and joining a small percentage (less than 5 to 10 percent) of rows in medium to large tables. A clustering index improves performance for processing a larger percentage of rows (less than 30 to 50 percent), particularly when the rows must be processed in sequence. Indexes can also be used to avoid access to data pages with index only processing. There are many reasons for having many indexes. However, there are costs associated with maintaining indexes that requires minimizing the number of indexes on a table.
Each time a row is inserted or deleted, the corresponding operation must be performed on each index. Each time an indexedcolumn is updated, the record identifier (RID) must be deleted and inserted for the new value. The costs of changing indexes is often more than the costs of changing the data. In addition, there is increased exposure to contention (latches and P-locks when using row locking and sysplex data sharing).
Whenever DB2 loads or reorganizes a table, it must build or rebuild each index to it. Whenever it recovers a tablespace, it must also recover or rebuild the indexes. The index itself may need to be reorganized (rebuilt) more frequently than the tablespace, as a result of index page splitting as discussed in Section ID.5. Finally, indexes require DASD space. Using indexes efficiently is a matter of selecting those whose performance benefits outweigh these costs. The benefits depend on the way the data are to be used and the table size.
Columns That Should Be Indexed
Primary keys and foreign keys are often searched or joined over a small percentage of rows and are good candidates for indexes. Indeed, the primary key must have a unique index to guarantee unique values in the column. If there is no index on the foreign key, an update of a primary key value requires a tablespace scan of each dependent table. This scan is done to verify that the old value does not exist in a dependent table. Similarly, when a row is deleted from a parent table and no index exists on the foreign key, it is necessary to do a tablespace scan on each dependent table to enforce the delete rule, described in Chapter CT. Both DB2 enforced and application enforced referential integrity benefit from having an index on the foreign key.
Joins are often performed on the primary key and foreign key columns; therefore, an index on these columns makes the join much more efficient in most cases.
Figure ID.1 Summarizes the characteristics of columns that benefit from an index.
Figure ID.1. Characteristics of columns that benefit from an index
* Primary key and foreign key columns.
* Columns which must have unique values to satisfy a business requirement.
* Columns that have column functions computed frequently (COUNT, SUM, AVG, MIN, and MAX for example).
* Columns used to test for the existence of a value (if the value does not exist, no data pages will be accessed).
* Columns which are searched or joined over less than 5 to 10 percent of the rows when considering a non-clustering index.
* Columns which are searched or joined over less than 30 to 50 percent of the rows when considering a clustering index.
* Columns frequently used together in a WHERE clause can benefit from a composite index to avoid maintaining multiple indexes
* Columns frequently used in an ORDER BY, GROUP BY, or DISTINCT clause to avoid sorts.
Columns That Benefit from Clustering
The choice of the column or columns to be included in the clustering index determines where DB2 inserts rows. The keyword CLUSTER, specified when the index is created, instructs DB2 to maintain the rows on the data pages in sequence according to the indexed column as described in Chapter IU.
The optimizer is likely to use the clustering index to avoid a sort for ORDER BY, GROUP BY, DISTINCT, and join processing. A column where these operations are often performed is a good candidate for the clustering index.
Columns frequently searched or joined over a range of values using the operators BETWEEN, >, /td> clustering index means that values are maintained in sequence on the data pages. A matching index scan can be used followed by a non-matching index scan to satisfy a range predicate. DB2 can use sequential prefetch to scan the data pages and the leaf pages.
A column with few distinct values (a low cardinality) is a good candidate for clustering if an index is required on the column at all. Generally it is not a good idea to create an index on a column with a low cardinality because the index does not narrow the search. However, if an index is required, a clustering index on such a column is a good choice because all of the like values are grouped (clustered) together on the data pages.
In earlier releases of DB2, the primary key was often chosen as the clustering index to allow for cursor repositioning operations. The use of the OPTIMIZE FOR n ROWS clause makes it unnecessary to use the clustering index for cursor repositioning, as described in Chapter PP. However, if there is batch processing requirements in which the input is in primary key sequence, the primary key is a good choice for the clustering index as described in Chapter BP.
The foreign key is a good candidate for frequent one-to-many joins (where many is large such as greater than 1,000, for example). If all of the foreign key rows are grouped together on the data pages, the join processing is more efficient. An index on the foreign key is also useful for cascading deletes and set nulls of foreign key rows with referential integrity constraint processing. If all of the J7s are grouped together on the data pages of a dependent table, a cascade delete of J7s can efficiently locate and delete all J7s that are grouped together.
Figure ID.2 summarizes the characteristics of columns that benefit from a clustering index. No one column has all of the characteristics listed here. Consider the trade-offs when deciding which column or columns should have the clustering index.
Figure ID.2. Characteristics of columns that benefit from a clustering index
* Column frequently processed in sequence using the operators ORDER BY, GROUP BY, or DISTINCT.
* Column frequently searched or joined over a range of values using predicates such as BETWEEN, >, /td>
* Column with a low index cardinality or skewed distribution (if an index is required).
* Primary key column is a good candidate for batch processing.
* Foreign key column is a good candidate for frequent one-to-many joins and enforcement of referential integrity (the many foreign values will be clustered together).
When using a partitioned tablespace, the clustering index must be the partitioning index.
Recommendation: It is very important to specify a clustering index based on your analysis of how the data is processed. If you do not create an index using the parameter CLUSTER, then by default the first index created (or more precisely the first index on the chain of indexes in the database descriptor (DBD)) is used to determine where to insert rows.
One problem with allowing DB2 to default to a clustering index is that when indexes are dropped and recreated, the first index on the chain may not be created first. DB2 starts using the new first index on the chain of indexes in the DBD to determine where to insert rows; what you thought was the clustering index is no longer used as such.
The REORG utility does not resequence rows if a clustering index is not explicitly declared. Further, the parameter SORTDATA on the REORG utility statement is ignored if there is no clustering index declared on a table. This can severely detract from performance. SORTDATA required 74 percent less elapsed time when reorganizing data with a cluster ratio of 80 percent in one case. Another deterrent to good reorganization performance is when SHRLEVEL CHANGE and NOSYSREC is used without an explicitly defined clustering index. With a clustering index defined, the rows are not written to SYSREC during unload and need not be sorted. However, without an explicitly defined cluster index this processing is necessary. The REORG OFFPOSLIMIT threshold applies only to explicitly defined clustering indexes as discussed in Chapter RR.
Consider using the following SELECT statements to verify that all tables created over the years do in fact have a clustering index.
SELECT I.TBCREATOR, I.TBNAME, I.CREATOR, I.NAME, I.CLUSTERING
FROM SYSIBM.SYSINDEXES I
WHERE NOT EXISTS
(SELECT 1
FROM SYSIBM.SYSINDEXES X
WHERE X.CLUSTERING = 'Y'
AND X.CREATOR = I.CREATOR
AND X.NAME = I.NAME)
ORDER BY I.TBCREATOR, I.TBNAME;
Recommendation: The rows that are returned from the SELECT statement do not have a clustering index defined. It is recommended that all tables have an explicitly declared clustering index.
Exception: If MEMBER CLUSTER is specified when creating a partitioned or simple tablespace as discussed in Chapter TS, rows are not clustered.
When Clustering Is Not Useful: Depending on the type of predicates used on the columns, some columns do not benefit from having a clustering index. If equal predicates are used on a column with a unique index, clustering has no advantages. One matching index scan can be used to locate the one row. With the exception of batch processing, there is generally no advantage to having a clustering index on a primary key with a unique index if equal predicates are used. If few rows are processed (less than 20, for example), the cost of using a non-clustering index can be acceptable. In addition, if the processing is mostly index-only processing, there is no need to access the data pages and the order of rows on the data pages does not matter.
Analyze How Data is Used
To understand the cost vs. benefits of using indexes, you must analyze how the data is used to determine which columns should be indexed. Identify columns that are frequently searched or joined, and estimate the percentage of rows processed by frequently executed SQL statements. Consider creating the clustering index on the column most frequently searched or joined in sequence. Consider non-clustering indexes on searched or joined columns in which less than 5 to 10 percent of the rows are to be selected, updated, deleted, and joined.
Refine this analysis with an estimate of the percentage of the rows that are inserted, updated, and deleted over a given period of time. Minimize the number of indexes when inserting, updating, and deleting more than about 10 percent of the rows on a weekly basis. If heavy update activity is concentrated on a weekly, monthly, quarterly, or annual basis, consider the techniques for avoiding index maintenance described in Chapter BP.
Composite indexes are useful when columns are frequently referenced together. They reduce the number of indexes that must be maintained and increase the chance of index-only retrieval.
These are only general guidelines. The physical design of the tables and indexes requires an analysis of how the data is to be processed. A process analysis matrix sometimes called a CRUD (Create, Retrieve, Update, and Delete) matrix is useful in summarizing how the data will be processed. Figure ID.3 is an example of such a matrix.