Data Matching Project

Description:

In this project, we match restaurants listed in two different web sources.
Data on thousands of restaurants in the ten most populous US cities was crawled from Yelp (~3,000 pages) and Yellow Pages (~9,000 pages).
Two CSV files (tables), one per source, are then generated for processing in later stages. From these CSV files, a candidate set is obtained by blocking.

Sources:

Yelp:

Yellow Pages:

Blocking:

Blocking explanation can be found here.
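
As a rough illustration of what blocking does in general, the sketch below keeps only those Yelp/Yellow Pages pairs that are in the same city and share at least one word in their names. It is not the blocking scheme used in this project (that is described in the linked explanation). The function names `name_tokens` and `block`, the column names `id`, `name`, and `city`, and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict


def name_tokens(name):
    """Lowercase a restaurant name and split it into a set of alphanumeric tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in name)
    return set(cleaned.split())


def block(yelp_rows, yp_rows, min_shared_tokens=1):
    """Return (yelp_id, yp_id) pairs in the same city that share enough name tokens."""
    # Index Yellow Pages rows by normalized city so only same-city pairs are compared.
    yp_by_city = defaultdict(list)
    for row in yp_rows:
        yp_by_city[row["city"].strip().lower()].append(row)

    candidates = []
    for y in yelp_rows:
        y_tokens = name_tokens(y["name"])
        for p in yp_by_city.get(y["city"].strip().lower(), []):
            if len(y_tokens & name_tokens(p["name"])) >= min_shared_tokens:
                candidates.append((y["id"], p["id"]))
    return candidates


# Hypothetical rows; the real tables have their own schema.
yelp = [
    {"id": "y1", "name": "Joe's Pizza", "city": "New York"},
    {"id": "y2", "name": "Taco Palace", "city": "Houston"},
]
yellow_pages = [
    {"id": "p1", "name": "Joes Pizza & Pasta", "city": "new york"},
    {"id": "p2", "name": "Taco Palace", "city": "Chicago"},
    {"id": "p3", "name": "Golden Dragon", "city": "Houston"},
]
candidate_pairs = block(yelp, yellow_pages)
```

Comparing only within a city avoids the full cross product of the two tables, at the cost of missing pairs whose city field is wrong or spelled differently in one source.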

Matching:

The Matching explanation is available here.
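
To make the idea of matching concrete, here is a small rule-based sketch that decides whether a candidate pair refers to the same restaurant. It is only an illustration and not the matcher built in this project (see the linked explanation and the notebook for that). The `phone` and `name` columns, the helper names, and the 0.5 threshold are assumptions chosen for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def digits(phone):
    """Keep only the digits of a phone number so formatting differences are ignored."""
    return "".join(ch for ch in phone if ch.isdigit())


def is_match(yelp_row, yp_row, name_threshold=0.5):
    """Hypothetical rule: same non-empty phone number, or sufficiently similar names."""
    y_phone, p_phone = digits(yelp_row["phone"]), digits(yp_row["phone"])
    if y_phone and y_phone == p_phone:
        return True
    y_name = set(yelp_row["name"].lower().split())
    p_name = set(yp_row["name"].lower().split())
    return jaccard(y_name, p_name) >= name_threshold


# Hypothetical candidate rows.
a = {"name": "Golden Dragon", "phone": "(312) 555-0100"}
b = {"name": "Golden Dragon Restaurant", "phone": "312-555-0100"}
c = {"name": "Dragon Express", "phone": "312-555-0199"}

same_restaurant = is_match(a, b)       # phones agree once formatting is stripped
different_restaurant = is_match(a, c)  # phones differ and names overlap too little
```

A learned matcher would typically replace the hand-written rule with a classifier trained on features like these, using the labeled pairs as training data.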

Golden Set G:

Golden Table (original)
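
A labeled golden set is typically used to score a matcher's predictions. The sketch below computes precision and recall over labeled pairs. The dictionary layout (a pair of ids mapped to a boolean label) and the sample labels are assumptions for illustration and do not reflect the actual format of the golden table.

```python
def precision_recall(predicted, golden):
    """Score predicted match labels against golden labels.

    Both arguments map a (yelp_id, yp_id) pair to True (match) or False (non-match).
    Only pairs present in the golden labels are scored; a pair missing from
    `predicted` is treated as a predicted non-match.
    """
    tp = sum(1 for pair, label in golden.items() if label and predicted.get(pair, False))
    fp = sum(1 for pair, label in golden.items() if not label and predicted.get(pair, False))
    fn = sum(1 for pair, label in golden.items() if label and not predicted.get(pair, False))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for four candidate pairs.
golden = {
    ("y1", "p1"): True,
    ("y2", "p2"): True,
    ("y3", "p3"): False,
    ("y4", "p4"): False,
}
predicted = {
    ("y1", "p1"): True,   # true positive
    ("y2", "p2"): False,  # false negative
    ("y3", "p3"): True,   # false positive
    ("y4", "p4"): False,  # true negative
}
precision, recall = precision_recall(predicted, golden)
```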

IPython code:

Matching.ipynb

Misc:

Relabeled Golden Table, Enhanced Candidate Table

Final File Submission:

Blocking

Original	Revised
Yelp Table	Revised Yelp Table
YP Table	Revised YP Table

Matching

Original	Revised
Candidate Table	Candidate Table (enhanced)
Golden Table (original)	Golden Table (relabel)
Development Set	Development Set (relabel)
Evaluation Set	Evaluation Set (relabel)

Bonus:

Magellan User Report/Survey

Acknowledgement

Special thanks to Prof. AnHai Doan and Pradap Konda

Project: Data Matching

Topic: Restaurants

Team: Clarence Cheung, Jin Ruan
