CS 784 Project

Fan Ding	Qing Li
fding5@wisc.edu	qing.li@wisc.edu

In this project, we choose electronic products as our study domain. We select two Web sources (Amazon and BestBuy), crawl them to retrieve HTML data, and perform information extraction to convert the HTML data into two relational tables. Next, we use Magellan, a data matching system developed at Wisconsin, to perform blocking and matching on the two tables.

Data Source: Amazon.com, Bestbuy.com
Crawler: scrapy

Stage 1 Delivery

In this stage, we select two Web sites that list data, use scrapy to crawl their HTML pages, and convert the HTML data into two relational tables.

Amazon Electronic Products
Amazon Table CSV File
Amazon Table Related HTMLs
Attributes: id, name, amazon price, original price, features, url

BestBuy Electronic Products
BestBuy Table CSV File
BestBuy Table Related HTMLs
Attributes: id, name, price, description, features, url
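
The conversion from extracted records to a relational table can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual extraction code: the function name records_to_csv, the surrogate id assignment, and the "; " separator for multi-valued features are all assumptions. Only the column names come from the attribute lists above.

```python
import csv
import io

# Column order taken from the Amazon attribute list above.
AMAZON_COLUMNS = ["id", "name", "amazon price", "original price", "features", "url"]


def records_to_csv(records, columns):
    """Serialize a list of dicts into CSV text with a fixed column order.

    Hypothetical helper: assumes each record is a dict produced by the
    extraction step, keyed by attribute name.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row_id, record in enumerate(records, start=1):
        # Keep only the table's columns; missing attributes become empty cells.
        row = {column: record.get(column, "") for column in columns}
        # Assumption: ids are assigned sequentially, one per extracted record.
        row["id"] = row_id
        # Assumption: multi-valued features are flattened into a single cell.
        if isinstance(row.get("features"), (list, tuple)):
            row["features"] = "; ".join(row["features"])
        writer.writerow(row)
    return buffer.getvalue()
```

Passing a different column list (for example the BestBuy attributes) yields the second table with the same code.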

Stage 2 Delivery

In this stage, we apply blocking methods using Magellan to the two tables A and B.
Blocking explanation
Candidate table
IPython file
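
To make the idea of blocking concrete, here is a small standard-library sketch of one common blocking strategy, token overlap on the product name. It is not Magellan's API and not necessarily the blocker used in this project. The function names, the choice of the name attribute, and the min_overlap threshold are illustrative assumptions.

```python
def tokenize(text):
    """Lowercase a string and split it into a set of whitespace tokens."""
    return set(text.lower().split())


def overlap_block(table_a, table_b, attr="name", min_overlap=2):
    """Return candidate (a_id, b_id) pairs sharing at least min_overlap tokens.

    Hypothetical helper: each table is a list of dicts with an "id" key
    and the attribute named by attr.
    """
    # Inverted index from token to the ids of rows in table B containing it,
    # so that only pairs sharing at least one token are ever compared.
    index = {}
    for row in table_b:
        for token in tokenize(row[attr]):
            index.setdefault(token, set()).add(row["id"])

    candidates = []
    for row in table_a:
        shared_counts = {}
        for token in tokenize(row[attr]):
            for b_id in index.get(token, ()):
                shared_counts[b_id] = shared_counts.get(b_id, 0) + 1
        for b_id, shared in sorted(shared_counts.items()):
            if shared >= min_overlap:
                candidates.append((row["id"], b_id))
    return candidates
```

A higher min_overlap shrinks the candidate table but risks dropping true matches, so the threshold is usually tuned by inspecting the pairs that blocking discards.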

Stage 3 Delivery

In this stage, we create golden data and use Magellan to find the best matcher.
Matching explanation
Golden table
IPython file
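
Selecting the best matcher against golden labels can be sketched as below. This assumes that "best" means highest F1 score, which the page itself does not state, and it uses plain Python rather than Magellan's evaluation functions. The names precision_recall_f1 and best_matcher are hypothetical.

```python
def precision_recall_f1(predicted, golden):
    """Compute precision, recall, and F1 for binary match labels (1 = match)."""
    pairs = list(zip(predicted, golden))
    true_pos = sum(1 for p, g in pairs if p == 1 and g == 1)
    false_pos = sum(1 for p, g in pairs if p == 1 and g == 0)
    false_neg = sum(1 for p, g in pairs if p == 0 and g == 1)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def best_matcher(predictions_by_matcher, golden):
    """Return the matcher name whose predictions score the highest F1.

    Assumption: F1 is the selection criterion; swap the index below to
    rank by precision (0) or recall (1) instead.
    """
    return max(
        predictions_by_matcher,
        key=lambda name: precision_recall_f1(predictions_by_matcher[name], golden)[2],
    )
```

In practice the scores would be computed on a held-out portion of the golden table, so that the chosen matcher is not simply the one that memorized the labeled pairs.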

Final File Submission

This part contains all the files in Stage 3.
Original Input Tables: Table A, Table B
Input Tables with Error Fixes: Table A, Table B
Labeled Data without Error Fixes: Golden table
Labeled Data with Error Fixes: Golden table
IPython File for Matching Pipeline: IPython File
Development Set (after rerun; differs from the report): Development Set
Evaluation Set (after rerun; differs from the report): Evaluation Set
Final Matches (after rerun; differs from the report): Final Matches

Report for Bonus Points

Recommendation Report: Report for Bonus Points

Last Updated: December 5th, 2015
