CS 784: Advanced Topics in Database Management Systems
Fall 2011, Mon/Wed 2:30-3:45pm, room 1257 COMP S&ST
Announcements
- The final exam is in class on Mon, Dec 19. The project is due by midnight
Thu, Dec 22. Please submit a PDF copy of the project to anhai@cs.wisc.edu
and also slide a hard copy under my door. Please state clearly the names of all
project members.
- Topics for the final
- Topics for the midterm
- The midterm date is set to be Nov 7, in class.
- No lecture or office hours on Wed Oct 12. Instead, we have a make-up lecture and office
hours on Fri Oct 14 in the same classroom at the usual time.
- Welcome. Make sure you are on the class mailing list.
- While the class normally meets only on Mondays and Wednesdays, please reserve the 2:30-3:45pm
slots on Fridays. I will use these slots for make-up lectures. There is a small possibility that
we will hold lectures on Fridays as well (in which case there will be no lectures from mid-November
until the end of the semester).
Instructor
AnHai Doan; contact information is available from my homepage.
Office hours: Mon/Wed 3:45-4:45pm (right after lectures) and by appointment
(please send email, thanks).
Course Description
The official name of this course is "Data Models and Languages", a legacy
name left over from the past. What this course actually covers is fundamental
and currently active data management issues beyond relational data management.
The goals are to help students prepare for the database qualifying exam and to
expose them to current trends in beyond-relational data management.
Another way to view this is:
- CS 564 is "everything you should know so that you can get an industrial
job working with relational databases",
- CS 764 is "all the gory details you may (or may not) want to know about
relational data management systems", and
- CS 784 is "all the stuff beyond relational data (e.g., Web, text, data mining,
data integration, data extraction) that you should know to broaden your data
management knowledge or to work in the field as an advanced developer/researcher".
Prerequisites: Undergraduate knowledge of relational
databases is highly recommended. If you do not have it, you should be willing to do a
"crash course" on the topic in the first few weeks. The recommended books
for the crash course are:
The Cow Book, or
The Complete Book.
Course Format
The course meets twice a week to discuss research papers. You are
required to read the specified paper, textbook chapter, or slides before each lecture and to
attend the lectures. There will be a midterm, a final, and a project.
Midterm: Nov 7, 2011, in class at the usual time and room.
Final: Dec 19, 2011, in class at the usual time and room.
Other important dates: no class on Oct 12 (make-up class on Oct 14); no class
Nov 24-27 (Thanksgiving); the last class is Wed, Dec 14.
Grading: midterm 30%, final 30%, project 30%,
class participation 10%.
Course Schedule
The course schedule and paper list are below (they may be revised slightly
as the course progresses). Each paper will be covered in 1-2
lectures.
Data Integration
Several chapters from a data integration
textbook (to be published soon; I will send out the chapters shortly):
- Overview, big picture, key issues (Chapters 1-3)
- Creating semantic mappings (Chapter 7). PPT slides presented in class.
- String matching, entity resolution (Chapters 6 and 8): Read 6.1, 6.2.1 (only "Edit Distance"),
6.2.2 (only "Overlap", "Jaccard", and "TF/IDF"), 6.3 (only "Inverted Index" and "Size Filtering").
Read 8.1, 8.2, 8.3, 8.4, 8.5.2 (only the part about learning with training data), 8.6 (only
the preamble and 8.6.1).
- Wrapper construction (Chapter 11): Read 11.1, 11.2, 11.3.1, 11.4, 11.5.2, 11.6.
IR / Web Search
Read Chapter 27 (IR and XML Data) of the Cow Book, but only from
27.1 to 27.5.
IR overview
Web search, PageRank
Datalog
Read Chapter 24 (Deductive Databases) of the Cow Book.
Deductive databases (Datalog), Ullman notes
Evaluation of recursive programs (skim only)
Data Mining (tentative, awaiting syncing with CS 764)
Read Chapter 26 (Data Mining) of the Cow Book.
Data mining: association rules
Data mining: clustering
Colliding Worlds: Hot Emerging Topics
We will cover several hot emerging topics, such as big data, NoSQL,
crowdsourcing, information extraction, and social media analysis.
- Information Extraction
Managing information extraction
You can find the SIGMOD tutorial from which I created the above
lecture here.
In case you want to read more:
- Social Media Analysis
Slides will be mailed out later.
Project
Details to be posted later