CS 784: Advanced Topics in Database Management Systems
Spring 2009, Wed/Fri 11:00-12:15pm, room 1257 COMP S&ST
Announcements
- Past exams from CS 784, Spring 2008: Midterm #1, Midterm #2, and Final.
(The Spring 2008 offering had two midterms.) For our upcoming midterm,
you should check out Midterm #1 above, and look at the first few
IR-related questions in the Final.
- We will have a midterm on Friday, March 13, in class, at the usual time.
- Welcome. Make sure you are on the class mailing list.
Instructor
AnHai Doan; contact information is available from my homepage.
Office hours: Wed 1:15-2pm and Fri 1:15-2pm, and by appointment
(please send email, thanks).
Course Description
The official name of this course is
"Data models and Languages", a legacy name left over from the
past. For this semester, what I intend to cover is interesting
material that is not covered in 564 or 764, and that is relevant to
research and industrial development going on today in the broad
context of data management.
Basically,
- CS 564 is "everything you should know so that you can get an industrial job working with relational databases",
- CS 764 is "all the gory details you may (or may not) want to know about relational data management systems", and
- CS 784 is "all the stuff beyond relational data (e.g., Web, text, data mining, data integration, data extraction) that you should know to broaden your data management knowledge or to work in the field as an advanced developer/researcher".
Prerequisites: Undergraduate knowledge of relational
databases is highly recommended. If you lack it, you should be willing to do a
"crash course" on the topic in the first few weeks. The recommended books
for the crash course are:
The Cow Book, or
The Complete Book.
Course Format
The course meets twice a week to discuss research papers. You are
required to read the specified paper before each lecture and attend
the lectures. There will be a midterm, a final, and a project.
Midterm: March 13, in class at the usual time/room.
Final: date to be decided, in class at the usual time/room.
Other important dates: Feb 4: no class; March 18, 20:
no class, spring break; last class is May 8.
Grade: midterm: 30%, final: 30%, project: 35%, participation
in the class: 5%.
Course Schedule
The paper list is below (may be revised slightly as the course
progresses). Each paper will be covered in 1-2 lectures.
Intro to the class (read Sections 1-2 of the Cimple paper);
also read this paper
and this paper
IR / Web Search
IR overview
Web search, Pagerank
Web search, Google
Web search and RDBMS
Data Languages
Note: this part will form the foundation for you to study information
extraction and integration.
On the universality of data retrieval languages
Deductive databases (Datalog), Ullman notes
Evaluation of recursive programs
Information Extraction
Managing information extraction
You can find the SIGMOD tutorial from which I created the above
lecture here.
Datalog applied to information extraction
Wrapper induction for information extraction
In case you want to read more:
Data Integration
Several chapters from a
textbook-in-progress on data integration (I will send out the book
draft shortly):
- Overview, virtual integration (Chapters 1-2)
- Query unfolding, query containment, and answering queries using
views (Chapter 3. Read only 3.1, 3.2.1, 3.2.2 (skim 3.2.2 only), 3.3.1, 3.3.2,
3.3.3, 3.3.4 (read only the bucket algorithm in 3.3.4))
- Describing data sources
(Chapter 4. Read only 4.1, 4.2)
- Creating semantic mappings (Chapter 5. Read only 5.1, 5.2, 5.3,
5.4, and the preamble of 5.5. Then read 5.6, and the first part of 5.9
(up to right before the headline "Searching a set of possible schema
mapping"))
- Data mapping (no reading; material covered in class)
- Other integration approaches (Read
this two-page statement).
In case you want to read more:
Others
wiki/Web 2.0/mass collaboration and also the CACM survey paper
(will be emailed to the class on Sunday or early Monday)
Keyword search over multiple RDBMSs
Data mining: association rules (not covered, no need to read)
Data mining: clustering (not covered, no need to read)
Column store vs. row store and hardware trends, DeWitt's note (not covered, no need to read)
Scalable Semantic Web data management using vertical partitioning
Project
By Fri Feb 27: each team, please send me an email listing the names and
email addresses of the team members. Each team has 1-2 members.
By Wed Mar 4: each team, please send me an email briefly describing the
project topic.