CS 784: Advanced Topics in Database Management Systems
Spring 2008, Tue/Thur 2:30-3:45pm, room 1289
Announcements
- We had no class on Mar 27.
- The midterm scheduled for Apr 8 will be moved to Thu Apr 17.
Instead, on Apr 8 Xiaoyong will give a lecture on mass collaboration
to build community wikipedias. Please read this
paper.
- We have no class on Apr 10.
- We will make up for the above two missed classes on Wed Apr 16
and Wed Apr 23, at 2:30-3:45pm in the same classroom.
- The exact sections that you have to read from the DI book have
been listed below, in the course schedule.
Instructor
AnHai Doan,
contact information available from my homepage. Office hours: Tue/Thur 3:45-4:30pm
and by appointment (please send email, thanks).
Course Description
The official name of this course is
"Data models and Languages", a legacy name left over from the
past. For this semester, what I intend to cover is interesting
material that is not covered in 564 or 764, and that is relevant to
research and industrial development going on today in the broad
context of data management.
Prerequisites: Undergraduate knowledge of relational
databases is highly recommended. If you lack this background, you should be willing to take a
"crash course" on the topic in the first few weeks. The recommended books
for the crash course are:
The Cow Book, or
The Complete Book.
Course Format
The course meets twice a week to discuss research papers. You are
required to read the specified paper before each lecture and attend
the lectures. There will be three exams, spread roughly evenly
throughout the semester, and a project.
First exam: Tue Feb 26, in class at usual time/room,
Second exam: Tue Apr 8 (moved to Thu Apr 17; see Announcements), in class at usual time/room,
Third exam: Tue May 13, in class at usual time/room.
Other important dates: Jan 29: no class; March 18, 20:
no class, spring break; last class is May 8.
Grade: the three exams and the project will each be worth
25% of the grade.
Course Schedule
The paper list is below (may be revised slightly as the course
progresses). Each paper will be covered in 1-2 lectures.
Intro to the class (read Sections 1-2 of the Cimple paper)
On the universality of data retrieval languages
Deductive databases (Datalog), Ullman notes
Evaluation of recursive programs
Managing information extraction
You can find the SIGMOD tutorial from which I created the above
lecture here.
Datalog applied to information extraction
Data integration: Several chapters from a
textbook-in-progress on data integration (you should have received this by now):
- Overview, virtual integration (Chapters 1-2)
- Query unfolding, query containment, and answering queries using views (Chapter 3.
Read only 3.1, 3.2.1, 3.2.2, 3.2.3, 3.3.1, 3.3.2, 3.3.3, 3.3.4 (only the
bucket algorithm))
- Describing data sources (Chapter 4. Read only 4.1, 4.2, and skim 4.3, 4.4, 4.5, 4.6)
- Creating semantic mappings (Chapter 5. Read only 5.1, 5.2, 5.3, 5.4, and
the preamble of 5.5. Then read 5.6 and the first part of 5.9 (up to just before
the heading "Searching a set of possible schema mapping"))
- Data mapping
- Other integration approaches (Read this
two-page statement).
Wiki/Web 2.0/mass collaboration
IR overview
Web search, PageRank
Web search, Google
Web search and RDBMS
Keyword search over multiple RDBMSs
Data mining: association rules
Data mining: clustering
Column store vs. row store and hardware trends, DeWitt's note
Scalable Semantic Web data management using vertical partitioning
Project
The project submission deadline is Monday May 19, by 11am. Please email me a PDF copy
of your project report AND slide a hard copy of the project report under my office
door. The PDF copy is for record keeping, and the hard copy is for grading.
As discussed in class, your project report is not required to adhere to any fixed format.
If you still have any questions about this, please let me know.