CS 784 Advanced Topics in Database Management Systems: Data Science
Spring 2016, Wed/Fri 1-2:15pm, Room 1325 CS Bldg
- The class's Piazza page is now live
here. Pls use
this one to discuss among yourselves. It won't be monitored by the
- Make sure you are on the class mailing list:
email@example.com. You should have been added to
the list automatically via your wisc.edu address (if you are registered for the class).
- While the class normally meets just on Wednesdays and
Fridays, please reserve the 1-2:15pm slots on
Mondays. I will use these slots for additional and make-up
lectures. (We will have just a few of these make-up lectures, if any.)
Instructor: AnHai Doan. Contact information is available from my homepage.
Office hours: Fri 4-5pm and by appointment (pls send email, thanks).
The official name of this course is
"Data models and Languages", a legacy name left over from the
past (which hopefully we will be able to change soon). This semester
the course is an introduction to data science. You can get an idea
of what will be covered by looking at the course syllabus below.
Prerequisites: Undergraduate knowledge of relational
databases is highly recommended. If you do not have it, you should be willing to do a
"crash course" on the topic in the first few weeks. The recommended books
for the crash course are:
The Cow Book, or
The Complete Book.
Some knowledge of machine learning (especially supervised learning) is helpful.
If you haven't had any exposure to machine learning before, you can read up
on the topic in the first few weeks of the class. We will also cover the
most basic stuff of supervised learning in the first few lectures.
Knowing Python is helpful for the class project. If you don't know it, use
this as an excuse to soak it up this semester. It is a relatively easy language
to learn and start using. We will discuss prerequisites more in the class.
The course meets twice a week to discuss research papers. You are
required to read the specified paper/textbook chapter/slides before each lecture and attend
the lectures. There will be a midterm, a final, and a project.
Midterm: Fri Mar 18, in class at usual time/room.
Final: Mon May 9, in class at usual time/room.
Other important dates: first class: Wed Jan 20,
Spring break: Mar 19 - Mar 27, last class: Fri May 6.
Grade: Midterm: 30%, final: 30%, project: 40%.
Slides presented in the class, in chronological order.
Course schedule and the paper list are below (may be revised slightly
as the course progresses). Each paper will be covered in 1-2
lectures. Some topics below refer to chapters in a
textbook. I will email
the scanned copies of these chapters. Slides for the chapters are
available on the book's website.
The Big Picture and Preliminaries
- What are the buzzwords out there and how do they fit together? Big data, data science, NoSQL,
crowdsourcing, social media, cloud computing, what else? Course focus and outline.
- RDBMSs: the key ideas, the state of the art, the need to go beyond RDBMSs, motivation for Big Data and NoSQL.
- Big Data: Beckman Report on Database Research Self-Assessment. Read the introduction
and scan the rest.
- Brief introduction to machine learning.
- Data science (getting value out of data):
- How statisticians view data science (misc/extra materials):
Data Acquisition and Pre-processing
- Data acquisition, knowing the sources, knowing their qualities. Data lake.
- Extraction (aka extracting structured values out of unstructured data):
- Types of extraction: Wrapper and IE (information extraction)
- Wrapper construction (Chapter 9): Read 9.1, 9.2, 9.3.1, 9.4, 9.5.2.
You can find slides for wrapper construction on the
- Case studies:
- IE from text
- Case studies:
- Data understanding, cleaning, and transforming
Data Exploration and Analysis
Beyond Data Science
- Schema matching and mapping, briefly on ontology matching (Chapter 5,
slides presented in the class): Read 5.1 to 5.5, scan 5.6, read
5.7 to 5.9, scan 5.10.
- String matching (Chapter 4, slides on the book website): Read 4.1,
4.2.1 (only "Edit Distance"), 4.2.2 (only "Overlap", "Jaccard", and
"TF/IDF"), 4.2.4, 4.3 (only "Inverted Index" and "Size Filtering").
- Data matching (Chapter 7, slides on the book website): Read 7.1,
7.2, 7.3, 7.4, 7.5.1, 7.5.2, 7.5.3, 7.6 (only the preamble and 7.6.1).
- Main data integration approaches: materialized (ETL, data
warehousing), federated, virtual (GAV, LAV): read Chapter 1.
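For a quick feel of the string matching reading listed above, here is a minimal Python sketch of two of the measures it names, edit distance and Jaccard. It is only an illustration and is not taken from the textbook or the class slides. The function names and the choice to tokenize on whitespace for Jaccard are assumptions made for this sketch.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from the first i chars of a to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two strings' token sets.
    Assumption: tokens are lowercased, whitespace-separated words."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # treat two empty strings as identical
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(jaccard("dave smith", "david smith"))
```

Edit distance compares strings character by character, while Jaccard compares them as sets of tokens, so the two can disagree on which pairs look similar.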
Potentially interesting stuff. I haven't read these carefully.
Students will form 2-person teams for a multi-stage project that addresses
a data science problem. We will discuss the details in class.