CS 784: Advanced Topics in Database Management Systems
Fall 2015, Wed/Fri 2:30-3:45pm, room 113 Psychology Bldg
- ROOM CHANGE: Starting Wed Sept 9, we will meet in Room 113 Psychology Bldg
- This semester, the theme of this class will be data science.
- Make sure you are on the class mailing list:
firstname.lastname@example.org. You should have been added to
the list automatically via your wisc.edu address (if you are registered for the class).
- While the class normally meets just on Wednesdays and
Fridays, please reserve the 2:30-3:45pm slots on
Mondays. I will use these slots for additional and make-up
lectures. (We will have just a few of these make-up lectures, if any.)
Instructor: AnHai Doan. Contact
information is available from my homepage. Office hours: Fri 4-5pm and by
appointment (please send email, thanks).
The official name of this course is
"Data models and Languages", a legacy name left over from the
past. What this course actually covers is fundamental and current data
management issues beyond relational data management. The goals are to
help students prepare for the database qualifying exam and to
expose them to current, interesting trends in beyond-relational
data management. Another way to view this is:
- CS 564 is "everything you should know so that you can get an industrial
job working with relational databases",
- CS 764 is "all the gory details you may (or
may not) want to know about relational data management systems", and
- CS 784 is "all the stuff beyond relational data (e.g., Web, text,
data mining, data integration, data extraction) that you should know
to broaden your data management knowledge or to work in the field as
an advanced developer/researcher".
Prerequisites: Undergraduate knowledge of relational
databases is highly recommended. If you lack it, you should be willing to do a
"crash course" on the topic in the first few weeks. The recommended books
for the crash course are:
The Cow Book, or
The Complete Book.
Some knowledge of machine learning (especially supervised learning) is helpful.
If you haven't had any exposure to machine learning before, you can read up
on the topic in the first few weeks of the class. We will also cover the
most basic stuff of supervised learning in the first few lectures.
Knowing Python is helpful for the class project. If you don't know it, use
this as an excuse to soak it up this semester. It is a relatively easy language
to learn and start using. We will discuss prerequisites further in class.
The course meets twice a week to discuss research papers. You are
required to read the specified paper/textbook chapter/slides before each lecture and attend
the lectures. There will be a midterm, a final, and a project.
Midterm: TBD, in class at the usual time/room.
Final: TBD, in class at the usual time/room.
Other important dates: first class: Wed Sept 2,
Thanksgiving break: Nov 26 - Nov 29, last class: Fri Dec 11.
Grade: Midterm: 30%, final: 30%, project: 40%.
The course schedule and paper list are below (they may be revised slightly
as the course progresses). Each paper will be covered in 1-2
lectures. Some topics below refer to chapters in a
textbook. I will email
the scanned copies of these chapters. Slides for the chapters are
available on the book's website.
The Big Picture and Preliminaries
- What are the buzzwords out there and how do they fit together? Big data, data science, NoSQL,
crowdsourcing, social media, cloud computing, what else? Course focus and outline.
- RDBMSs: the key ideas, the state of the art, the need to go beyond RDBMSs, motivation for Big Data and NoSQL.
- Big Data: Beckman Report on Database Research Self-Assessment. Read the introduction
and scan the rest.
- Brief introduction to machine learning.
- Data science (getting value out of data):
- How statisticians view data science (misc/extra materials):
Data Acquisition and Pre-processing
- Data acquisition, knowing the sources, knowing their qualities. Data lake.
- Extraction (aka extracting structured values out of unstructured data):
- Types of extraction: Wrapper and IE (information extraction)
- Wrapper construction (Chapter 9): Read 9.1, 9.2, 9.3.1, 9.4, 9.5.2.
You can find slides for wrapper construction on the
- Case studies:
- IE from text
- Case studies:
- Data understanding, cleaning, and transforming
Data Exploration and Analysis
Beyond Data Science
- Schema matching and mapping, briefly on ontology matching (Chapter 5,
slides presented in the class): Read 5.1 to 5.5, scan 5.6, read
5.7 to 5.9, scan 5.10.
- String matching (Chapter 4, slides on the book website): Read 4.1,
4.2.1 (only "Edit Distance"), 4.2.2 (only "Overlap", "Jaccard", and
"TF/IDF"), 4.2.4, 4.3 (only "Inverted Index" and "Size Filtering").
- Data matching (Chapter 7, slides on the book website): Read 7.1,
7.2, 7.3, 7.4, 7.5.1, 7.5.2, 7.5.3, 7.6 (only the preamble and 7.6.1).
- Main data integration approaches: materialized (ETL, data
warehousing), federated, virtual (GAV, LAV): read Chapter 1.
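As a rough illustration of two of the string-matching measures named in the reading list above (edit distance and Jaccard), here is a minimal Python sketch. It is not taken from the textbook or the slides. The function names, the unit edit costs, and the choice to tokenize on whitespace are assumptions made only for this example.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    # prev[j] holds the distance between the first i-1 chars of s
    # and the first j chars of t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard(x: str, y: str) -> float:
    """Jaccard similarity over lowercased, whitespace-delimited tokens."""
    a, b = set(x.lower().split()), set(y.lower().split())
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    return len(a & b) / len(a | b)
```

For example, edit_distance("kitten", "sitting") is 3, and jaccard("dave smith", "david smith") is 1/3 because the two names share one of three distinct tokens.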
Potentially interesting stuff. I haven't read these carefully.
Students will form 2-person teams for a multi-stage project that addresses
a data science problem. We will discuss the details in class.