Fan Ding

Qing Li

fding5@wisc.edu

qing.li@wisc.edu

In this project, we choose electronic products as our study domain. We select two Web sources (Amazon and BestBuy), crawl them to retrieve HTML data, and perform information extraction to convert the HTML data into two relational tables. Next, we use Magellan, a data matching system developed at the University of Wisconsin-Madison, to perform blocking and matching on the two tables.

Data Sources: Amazon.com, BestBuy.com
Crawler: Scrapy

Stage 1 Delivery

In this stage, we select two Web sites that list product data, use Scrapy to crawl their HTML pages, and convert the HTML data into two relational tables.

Amazon Electronic Products
Amazon Table CSV File
Amazon Table Related HTMLs
Attributes: id, name, amazon price, original price, features, url

BestBuy Electronic Products
BestBuy Table CSV File
BestBuy Table Related HTMLs
Attributes: id, name, price, description, features, url
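
The extraction step of Stage 1 (HTML in, relational rows out) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual Scrapy spiders. The HTML markup, the class names `product`, `name`, and `price`, the `data-url` attribute, and the example.com URLs are all assumptions made up for the example; real Amazon and BestBuy pages are structured differently.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup: real product pages use different, more complex HTML.
SAMPLE_HTML = """
<div class="product" data-url="http://example.com/p/1">
  <span class="name">Acme USB Cable</span>
  <span class="price">$5.99</span>
</div>
<div class="product" data-url="http://example.com/p/2">
  <span class="name">Acme HDMI Cable</span>
  <span class="price">$12.49</span>
</div>
"""

FIELDS = ["id", "name", "price", "url"]


class ProductParser(HTMLParser):
    """Collect one row (dict) per <div class="product"> block."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # field whose text we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class")
        if tag == "div" and cls == "product":
            # Start a new row; ids are assigned sequentially.
            self.rows.append({
                "id": len(self.rows) + 1,
                "name": "",
                "price": "",
                "url": attrs.get("data-url", ""),
            })
        elif tag == "span" and cls in ("name", "price") and self.rows:
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self.rows[-1][self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def html_to_rows(html):
    """Parse HTML text into a list of row dicts."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialize rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = html_to_rows(SAMPLE_HTML)
print(rows_to_csv(rows))
```

Each product block becomes one tuple of the table, and the CSV header plays the role of the attribute list shown above (reduced here to four attributes for brevity).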



Stage 2 Delivery

In this stage, we use Magellan to apply blocking to the two tables, A and B, which yields the candidate table.

Blocking explanation
Candidate table
IPython file
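
To illustrate what blocking does, here is a small sketch of a token-overlap blocker written in plain Python. It is a stand-in for the idea only and does not use Magellan's API; the function names, the `min_overlap` threshold, and the sample rows are assumptions invented for this example. A pair survives blocking only if the two product names share at least `min_overlap` words.

```python
def tokens(text):
    """Lowercase a string and split it into a set of whitespace tokens."""
    return set(text.lower().split())


def overlap_block(table_a, table_b, attr="name", min_overlap=2):
    """Return (a_id, b_id) pairs whose `attr` values share >= min_overlap tokens."""
    # Inverted index: token -> ids of the rows in B that contain it.
    index = {}
    for b in table_b:
        for tok in tokens(b[attr]):
            index.setdefault(tok, set()).add(b["id"])

    candidates = []
    for a in table_a:
        # Count shared tokens per B row, touching only rows that share something.
        counts = {}
        for tok in tokens(a[attr]):
            for b_id in index.get(tok, ()):
                counts[b_id] = counts.get(b_id, 0) + 1
        for b_id, shared in sorted(counts.items()):
            if shared >= min_overlap:
                candidates.append((a["id"], b_id))
    return candidates


# Made-up rows standing in for tables A and B.
table_a = [
    {"id": "a1", "name": "Acme Wireless Mouse Black"},
    {"id": "a2", "name": "Zenith 4K Monitor 27 inch"},
]
table_b = [
    {"id": "b1", "name": "Acme Wireless Mouse"},
    {"id": "b2", "name": "Zenith Monitor 27 inch 4K"},
    {"id": "b3", "name": "Orbit Bluetooth Speaker"},
]

candidates = overlap_block(table_a, table_b)
print(candidates)
```

On this toy input, the six possible pairs in A x B are reduced to two candidate pairs, which is the purpose of blocking: the later matching step only has to examine pairs that are plausibly the same product.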


Stage 3 Delivery

In this stage, we create golden (labeled) data and use Magellan to find the best matcher.

Matching explanation
Golden table
IPython file
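
Selecting the best matcher amounts to comparing each matcher's predictions with the golden labels. The sketch below shows one common way to do this, using precision, recall, and F1 in plain Python. It does not use Magellan's API, and the matcher names, labels, and the choice of F1 as the selection criterion are assumptions for illustration; they are not results from this project.

```python
def precision_recall_f1(predicted, golden):
    """Score predictions against golden labels.

    Both arguments map a candidate pair to 1 (match) or 0 (non-match).
    """
    tp = sum(1 for pair, label in golden.items()
             if label == 1 and predicted.get(pair) == 1)
    fp = sum(1 for pair, label in golden.items()
             if label == 0 and predicted.get(pair) == 1)
    fn = sum(1 for pair, label in golden.items()
             if label == 1 and predicted.get(pair) != 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up golden labels for four candidate pairs.
golden = {
    ("a1", "b1"): 1,
    ("a1", "b2"): 0,
    ("a2", "b2"): 1,
    ("a3", "b3"): 1,
}

# Made-up predictions from two hypothetical matchers.
predictions = {
    "matcher_x": {("a1", "b1"): 1, ("a1", "b2"): 1, ("a2", "b2"): 1, ("a3", "b3"): 0},
    "matcher_y": {("a1", "b1"): 1, ("a1", "b2"): 0, ("a2", "b2"): 1, ("a3", "b3"): 0},
}

scores = {name: precision_recall_f1(pred, golden)
          for name, pred in predictions.items()}
best = max(scores, key=lambda name: scores[name][2])  # pick the highest F1

for name, (p, r, f1) in scores.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print("best:", best)
```

In practice the golden data would be split into a development set, used to choose the matcher, and an evaluation set, used to report its final accuracy, as the file list in the final submission suggests.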


Final File Submission

This part contains all the files in Stage 3.

Original Input Tables: Table A, Table B
Input Tables with Error Fixes: Table A, Table B
Labeled Data without Error Fixes: Golden table
Labeled Data with Error Fixes: Golden table
IPython File for Matching Pipeline: IPython File
Development Set (after rerun; differs from the report): Development Set
Evaluation Set (after rerun; differs from the report): Evaluation Set
Final Matches (after rerun; differs from the report): Final Matches


Report for Bonus Points
Recommendation Report: Report for Bonus Points


Last Updated: December 5, 2015