Assignment 7: Relational Query Optimization
Due: Wednesday, December ??, 1996, at 5 p.m.
Instructor: Raghu Ramakrishnan
In this assignment, you will carry out a number of exercises involving the optimization of relational queries using the Minibase optimizer and visualization tool. You will not have to write any code, and your answers should be turned in as a text file. You will carry out this assignment in teams of two with the same partner as for previous assignments.
The chapter on optimization and the lecture transparencies should provide all the background information that you need on query optimization. The tool itself has an on-line help feature, and additional information is included in this handout as part of exercise descriptions.
To use the Minibase query optimization viewer, add the following line to your .cshrc.local file:
setenv MINIBASE_ROOT /p/coral/564/minibase
and, after your path statements, add:
set path = ($path $MINIBASE_ROOT/bin)
Then, log out and log back in. This is important for the above
changes to take effect.
The program to run is called minibaseview.
These exercises are based on three catalogs, which are located in files catalogA, catalogB and catalogD. Each exercise says which catalog to use. The first step of each exercise should be to open the appropriate catalog, which you do by choosing Open from the Catalogs menu of the visualization tool. Then you enter your query by choosing Enter SQL from the Query menu of the visualization tool.
Catalogs A and B are used in Questions 1 to 11 and are based on the same schema. There are three relations, Emp, Dpt, and Job, which represent employees, departments, and job titles of some organization. Each of these relations has an integer field which is used as the primary key for the relation---empid for the Emp relation, dno for the Dpt relation, and jno for the Job relation. The Emp relation has foreign keys referring to Dpt and Job, representing each employee's department and job title, and the Dpt relation has a foreign key referring to Emp, representing the department's manager.
Catalog D is used for all of the multi-column index exercises (Questions 12 to 15). A multi-column index on a relation is composed of an ordered sequence of a subset of the fields of that relation. Note that catalogD is much more complicated than the simple catalogA and catalogB files. This is because catalogD can be used in a realistic database---the special relations relcat, attrcat and indcat allow the catalog to describe itself and the rest of the relations in the database. However, you will not be using any of these special relations in this exercise. The second thing you should note is the way multi-column indexes are specified---via the use of percent signs. The order in which the field names are specified is also important (and this will be demonstrated in the exercises). CatalogD has five incarnations of the well-known and widely-used emp relation. Each of the Emp relations has its indexes built on either two or three of its fields (ename, title and dname).
Open Catalog A and run the tool on the following query.
select * from dpt
where mgrid = 20223
Answer the following questions, looking at level 0 access methods.
Level 1 of the optimizer visualization tool shows plans that consist of walking a single relation using one of the access methods defined for that relation. If you double-click the LEFT mouse button on a level 1 plan (and higher levels too), its box expands to show you more detailed information about the plan and the estimates made about it.
Use the same catalog, Catalog A, and the following query:
select * from emp
where sal = 58000
Answer the following questions, looking at level 1 plans.
Use the same catalog, Catalog A, and the same query as before:
select * from emp
where ename = "Frank"
Once again, open Catalog A and run this same query as in questions 1 and 2. Answer the following questions, looking at both Level 1 plans and Level 0 access methods.
select * from emp
where empid < 25000
Answer the following questions.
select ename, dname from emp, dpt
where emp.dno = dpt.dno and emp.jno < 75
Answer the following questions, looking at the best plan.
Open Catalog A, and run the following query.
select * from emp, job
where emp.jno = job.jno
Answer the following questions looking at Level 2 plans. (NOTE: be sure not to confuse LEVEL 2 with the line NUMBERED 2. The NUMBER of the line of plans you are to look at is 3, the top line of the display.)
select ename, dname from emp, dpt, job
where emp.dno = dpt.dno and emp.jno = job.jno
and dpt.dname = job.jname
From the Switches menu it is possible to disable index-only plans. If such plans are disabled, the optimizer will only generate plans that always retrieve tuples from the underlying relation, without checking to see if the index search key (and therefore the data entries in the index) contain all necessary information.
Run the tool on the following query, which lists the number of employees that are at each department if that number is greater than 10. Use Catalog A.
select dno, count(*) from emp
group by dno
having count(*) > 10
Run the following query using Catalog B.
select * from emp, dpt
where emp.dno = dpt.dno and emp.jno < 100
Also, if you double-click the RIGHT mouse button on the box that labels any row of the tool's display, you will turn off (and back on) the pruning of plans that the optimizer estimates are not the `least expensive'. (Remember that for each subset of relations, for each ordering of generated tuples, only the least cost plan is retained. The toggling referred to here controls this pruning; it should be used with care since the number of retained plans can increase rapidly without pruning!)
Run the following variant of the query from the previous exercise, using Catalog B as before.
select ename, dname from emp, dpt
where emp.dno = dpt.dno and emp.jno < 100
Now suppose you knew that the run-time environment for this query was limited to 10 pages of memory. Find the best (left-deep) plan that fits this requirement and describe its join method, access methods, and total cost.
Consider the following query, as run against Catalog B. The query lists the number of employees hired before their manager for each of the first 50 departments.
select M.ename, M.empid, count(*)
from emp E, emp M, dpt D where E.dno = D.dno and D.mgrid = M.empid and E.dno <= 50 and E.empid < M.empid group by M.empid, M.ename
Using Catalog D, type in the following query.
select * from emp1
where ename < "Brian"
Answer the following questions looking at Level 1 plans.
Using Catalog D again, type in the following query.
select * from emp1
where ename < "Brian" and title = "Bozo"
Answer the following questions looking at Level 1 plans.
Using Catalog D, type in the following query.
select * from emp1
where ename < "Brian"
Answer the following questions looking at Level 1 plans.
Using Catalog D, type in the following query.
select ename from emp1
where ename < "Brian"
Answer the following questions looking at Level 1 plans.
Using Catalog D, type in the following query.
select title, count(*) from emp1
group by title
Answer the following questions, looking at level 1 plans.
Using Catalog D, type in the following query.
select title, count(*) from emp3
group by title
Answer the following questions, looking at level 1 plans.
Copy all the files from ~cs564-1/fall96/proj6 into your own local directory. The files are:
You should turn in:
The assignment is due at 5PM on Dec 10 1996. The solution will be made public then, and solutions turned in after that time will not receive any credit. So be sure to turn in whatever you have, even if it is not working fully, at that time.
I emphasize that late submissions will not receive any credit. Computers -- and life! -- being what they are, expect everything to take longer than you expect, even taking this expectation into account. So start early, and plan on getting things done well before the due date. Nothing short of a nuclear explosion (in the CS building, not the South Pacific) constitutes a valid reason for an extension.