CS 764, Fall 2019: Topics in Database Management Systems

Coordinates: MWF 8:00-9:15 in 1257 CS (first class on Friday Sept-6)
Instructor: J. Patel
Office Hours: Wed 9:15-10:15AM or by appointment

Description

This course covers a number of advanced topics in the development of data management systems and the use of such systems in modern applications. The topics discussed include advanced concurrency control and recovery techniques, query processing and optimization strategies, advanced access methods, parallel and distributed data systems, extensible data systems, implications of cloud computing for data platforms, and data analysis on large datasets.

The course material will be drawn from a number of papers in the database literature. We will cover about 2-3 papers per week. All students in this class are expected to read the papers before coming to the lecture.

Prerequisites: CS 564 or equivalent. If you have concerns about meeting the prerequisites, please contact the instructor.

Text: There is no formal textbook for this course. The reading list is a collection of papers, which is posted on the course web page.

Reference text: The following two sources will be used occasionally in this course. Note that you don't need to buy these books.

Red Book: Readings in Database Systems (5th edition) - edited by Bailis, Hellerstein, and Stonebraker.
Cow Book: Database Management Systems (3rd edition) - by Raghu Ramakrishnan and Johannes Gehrke, McGraw Hill, 2003.

Course Project

A big component of this course is a research project. For the project, you pick a topic in the area of data management systems and explore it in detail. I will provide a list of suggested project topics, though you are free to select a project outside of this list provided you get prior approval. I require that you meet with me periodically throughout the semester to update me on the progress of your project. I will help with the direction, but unless you take the initiative to actively explore the topic you choose, you are unlikely to accomplish much in the project (which will adversely affect your project grade).

The course project is a group project, and each group must have 2-3 members. Please start looking for project partners right away. I will facilitate informal group formation by holding a socializing session at the end of the first few lectures, but it is your responsibility to form and manage groups.

The course project will include an interim course project report, a short project presentation at the end of the semester, and a final project report. I organize the final project presentation in a workshop-like format. The workshop is called DAWN. To give you an idea of what the workshop looks like, see the program information for DAWN 2019 workshop.

Grading and Deadlines

Paper Summaries	10%	Students must read the papers before class, and turn in a 300-word summary for each paper before class.
Exam	40%	1.5 hours. Start times: 7:30AM, 7:45AM, or 8:00AM (you can choose), Oct 25. Rm: Psych 103 (Notice that this is not the usual classroom).
Course Project	50%	Project selection due Oct 21, 5:00PM. 5% for an interim course report, due Nov 18, 5:00PM. 35% for the project report and project code; the code must include a packaged tarball with instructions on how to run the code, and some demo examples. Final report due Dec 18, 5:00PM. 10% for a short project presentation as part of a workshop called DAWN'19, held on Dec 9 and Dec 11.

Reading List

Here is the reading list. Links to local copies on this page are password protected. Passwords will be given out on the first day of class. This list is subject to change during the semester. When changes happen, I will let you know at least a week in advance.

Query Processing and Buffer Management
I will assume that you have covered chapters 12-15 of the Cow Book, or its equivalent, in your undergraduate course.

Join Algorithms: Leonard D. Shapiro: Join Processing in Database Systems with Large Main Memories. ACM Trans. Database Syst. 11(3): 239-264 (1986)
Additional reading: Goetz Graefe: Query Evaluation Techniques for Large Databases. ACM Comput. Surv. 25(2): 73-170 (1993)
Additional reading: Laura M. Haas, Michael J. Carey, Miron Livny, Amit Shukla: Seeking the Truth About ad hoc Join Costs. VLDB J. 6(3): 241-256 (1997)
Additional reading: Jaeyoung Do, Jignesh M. Patel: Join processing for flash SSDs: remembering past lessons. DaMoN 2009: 1-8
Buffer Management: Hong-Tai Chou, David J. DeWitt: An Evaluation of Buffer Management Strategies for Relational Database Systems. Algorithmica 1(3): 311-336 (1986).
Additional reading: Elizabeth J. O'Neil, Patrick E. O'Neil, Gerhard Weikum: The LRU-K Page Replacement Algorithm For Database Disk Buffering. SIGMOD Conference 1993: 297-306
Additional reading: Jim Gray, Gianfranco R. Putzolu: The 5 Minute Rule for Trading Memory for Disk Accesses and The 10 Byte Rule for Trading Memory for CPU Time. SIGMOD Conference 1987: 395-398
Query Optimization-1: Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, Thomas G. Price: Access Path Selection in a Relational Database Management System. SIGMOD Conference 1979: 23-34
Additional reading: E. F. Codd: A Relational Model of Data for Large Shared Data Banks. Commun. ACM 13(6): 377-387 (1970)
Query Optimization-2: Surajit Chaudhuri: An Overview of Query Optimization in Relational Systems. PODS 1998: 34-43
Additional reading: Kiyoshi Ono, Guy M. Lohman: Measuring the Complexity of Join Enumeration in Query Optimization. VLDB 1990: 314-325

Advanced Transaction Management
I will assume that you have covered chapters 16-18 of the Cow Book, or its equivalent, in your undergraduate course.

Granularity of Locks: Jim Gray, Raymond A. Lorie, Gianfranco R. Putzolu, Irving L. Traiger: Granularity of Locks and Degrees of Consistency in a Shared Data Base. IFIP Working Conference on Modelling in Data Base Management Systems 1976.
Optimistic CC: H. T. Kung, John T. Robinson: On Optimistic Methods for Concurrency Control. ACM Trans. Database Syst. 6(2): 213-226 (1981).

Additional reading: Per-Åke Larson, Spyros Blanas, Cristian Diaconu, Craig Freedman, Jignesh M. Patel, Mike Zwilling: High-Performance Concurrency Control Mechanisms for Main-Memory Databases. PVLDB 5(4): 298-309 (2011) Slides.

B-tree Locking: Philip L. Lehman, S. Bing Yao: Efficient Locking for Concurrent Operations on B-Trees. ACM Trans. Database Syst. 6(4): 650-670 (1981)

Additional reading: Isolation Levels: Hal Berenson, Philip A. Bernstein, Jim Gray, Jim Melton, Elizabeth J. O'Neil, Patrick E. O'Neil: A Critique of ANSI SQL Isolation Levels. SIGMOD Conference 1995: 1-10.

Aries Recovery: C. Mohan, Donald J. Haderle, Bruce G. Lindsay, Hamid Pirahesh, Peter M. Schwarz: ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. ACM Trans. Database Syst. 17(1): 94-162 (1992) Slides.
2-Phase Commit: C. Mohan, Bruce G. Lindsay, Ron Obermarck: Transaction Management in the R* Distributed Database Management System. ACM Trans. Database Syst. 11(4): 378-396 (1986)

Cloud Systems, Parallel DBMSs, and Distributed DBMSs

Parallel DBMSs: David J. DeWitt, Jim Gray: Parallel Database Systems: The Future of High Performance Database Systems. Comm. ACM 35(6): 85-98 (1992).
Distributed DBMSs: Michael Stonebraker, Paul M. Aoki, Witold Litwin, Avi Pfeffer, Adam Sah, Jeff Sidell, Carl Staelin, Andrew Yu: Mariposa: A Wide-Area Distributed Database System. VLDB Journal 5(1): 48-63 (1996).

Additional reading: Replication: Jim Gray, Pat Helland, Patrick E. O'Neil, Dennis Shasha: The Dangers of Replication and a Solution. SIGMOD Conference 1996: 173-182.
Additional reading: Eventually Consistent: Werner Vogels: Eventually consistent. Commun. ACM 52(1): 40-44 (2009).

MapReduce: Jeffrey Dean, Sanjay Ghemawat: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1): 107-113 (2008).

Additional reading: BigTable: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber: Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst. 26(2): (2008).
Additional reading: Parallel UDFs: Eric Friedman, Peter M. Pawlowski, John Cieslewicz: SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions. PVLDB 2(2): 1402-1413 (2009)

Heron: Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, Siddarth Taneja: Twitter Heron: Stream Processing at Scale. SIGMOD 2015: 239-250. Slides.

Advanced Access Methods

R-tree: Antonin Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Conference 1984: 47-57.
Bitmap Indices: Patrick E. O'Neil, Dallan Quass: Improved Query Performance with Variant Indexes. SIGMOD Conference 1997: 38-49.
Additional reading: BitWeaving: Fast Scans for Main Memory Data Processing, Y. Li and J. M. Patel, SIGMOD 2013. Blog. Lecture Slides (Column Store, Bit-based Indexing, and BitWeaving).

Data Models

Data Models: Stonebraker and Hellerstein: What Goes Around Comes Around

Data Analysis and Decision Support

OLAP: Jim Gray et al.: Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab, and Sub Totals. Data Mining and Knowledge Discovery 1(1): 29-53 (1997).
Mining: Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499.

The Art of Reading Papers (read on your own)

Read Efficiently: Michael J. Hanson, updated by D. McNamee: Efficient Reading of Papers in Science and Technology, 1989.
Write Well: Patrick Valduriez, “Some Hints to Improve Writing of Technical Papers”, Correspondence in Engineering of Information Systems, Hermes, Vol. 2, No. 3, 1994.