CS 784: Advanced Topics in Database Management Systems
Spring 2010, Tue/Thur 1:00-2:15pm, room 1325 COMP S&ST Cowzone
Announcements
Instructor
AnHai Doan,
contact information available from my homepage. Office hours: Tue/Thur 2:15-3:15pm
and by appointment (please send email).
Course
Description
The official name of this course is
"Data models and Languages", a legacy name left over from the
past. For this semester, what I intend to cover is interesting
material that is not covered in 564 or 764, and that is relevant to
research and industrial development going on today in the broad
context of data management.
Basically,
- CS 564 is "everything you should know so that you can get an industrial job working with relational databases",
- CS 764 is "all the gory details you may (or may not) want to know about relational data management systems", and
- CS 784 is "all the stuff beyond relational data (e.g., Web, text, data mining, data integration, data extraction) that you should know to broaden your data management knowledge or to work in the field as an advanced developer/researcher".
Prerequisites: Undergraduate knowledge of relational
databases is highly recommended. If you do not have it, you should be willing to do a
"crash course" on the topic in the first few weeks. The recommended books
for the crash course are:
The Cow Book, or
The Complete Book.
Course Format
The
course meets twice a week to discuss research papers. You are
required to read the specified paper before each lecture and attend
the lectures. There will be a midterm, a final, and
an optional project.
Midterm: Mar 18, in class at the usual time/room.
Final: May 6, in class at the usual time/room.
Other important dates: Mar 27 until Apr 4:
no class, spring break; last class is May 6.
Grade: If you do the project, then midterm: 30%,
final: 30%, project: 30%, participation in the class: 10%. Otherwise,
midterm: 45%, final: 45%, participation in the class: 10%.
Course Schedule
The course schedule and paper list are below (they may be revised slightly
as the course progresses). Each paper will be covered in 1-2
lectures.
Datalog
Read Chapter 24 (Deductive Databases) of the Cow Book.
Deductive databases (Datalog), Ullman notes
Evaluation of recursive programs (scan it only)
Data Integration
Several chapters from a
textbook-in-progress on data integration (I will send out the book
draft shortly):
- Overview, virtual integration (Chapters 1-2)
- Query unfolding, query containment, and answering queries using
views (Chapter 3. Read only 3.1, 3.2.1, 3.2.2 (skim 3.2.2 only), 3.3.1, 3.3.2,
3.3.3, 3.3.4 (read only the bucket algorithm in 3.3.4))
- Describing data sources
(Chapter 4. Read only 4.1, 4.2)
- Creating semantic mappings (Chapter 5. Read only 5.1, 5.2, 5.3,
5.4, and the preamble of 5.5. Then read 5.6, and the first part of 5.9
(up to right before the headline "Searching a set of possible schema
mapping")) PPT slides
presented in the class
- Data mapping (no reading, materials covered in the class)
- Other integration approaches (Read
this two-page statement).
IR / Web Search / Large-scale data analysis
Read Chapter 27 (IR and XML Data) of the Cow Book, but only from
27.1 to 27.5.
IR overview
Web search, Pagerank
MapReduce: simplified data processing on large clusters
Information Extraction
Managing information extraction
You can find the SIGMOD tutorial from which I created the above
lecture here.
Datalog applied to information extraction
Wrapper induction for information extraction
Data Warehousing, OLAP
Read Chapter 25 (Data Warehousing and Decision Support) of the Cow Book.
An overview of data warehousing and OLAP technology
Data cube: a relational aggregation operator generalizing
group-by, cross-tab, and sub-totals
Data Mining
Read Chapter 26 (Data Mining) of the Cow Book.
Data mining: association rules
Data mining: clustering
Colliding Worlds
MapReduce and parallel DBMSs: friends or foes?
MapReduce: a flexible data processing tool
Building
community wikipedias: a human-machine approach
Mass collaboration
systems on the World-Wide Web
Keyword search over multiple RDBMSs
Scalable Semantic Web data management using vertical partitioning
Project
Details to be posted later