Web Crawler

Domain chosen: Restaurant review websites.

This page contains the deliverables for the CS784 project.


Websites Chosen

We chose two of the most widely used restaurant review websites, www.zomato.com and www.yelp.com, as sources of HTML data. The datasets that the two websites provide for academic purposes can be used for this project.

HTML Repository

Please find below the links to the folders containing the collection of the HTML pages crawled from Zomato and Yelp respectively.

The BeautifulSoup Python library was used to pull data out of the HTML web pages.
The crawler logic was designed and built in Python.
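As a rough illustration of how BeautifulSoup pulls fields out of a saved page, here is a minimal sketch. The sample markup, the class names (`name`, `address`, `rating`) and the helper `extract_restaurant` are assumptions made for this example; the real Zomato and Yelp pages use their own structure, and the project's crawler may extract different fields.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one crawled restaurant page.
SAMPLE_HTML = """
<div class="restaurant">
  <h1 class="name">Blue Door Cafe</h1>
  <span class="address">123 State St, Madison, WI</span>
  <span class="rating">4.5</span>
</div>
"""


def extract_restaurant(html):
    """Return a dict of fields found in one page (None when a field is missing)."""
    # "html.parser" is Python's built-in parser, so no extra parser is needed.
    soup = BeautifulSoup(html, "html.parser")

    def text_of(css_class):
        # Find the first tag carrying the given class and return its text.
        tag = soup.find(class_=css_class)
        return tag.get_text(strip=True) if tag else None

    return {
        "name": text_of("name"),
        "address": text_of("address"),
        "rating": text_of("rating"),
    }


if __name__ == "__main__":
    print(extract_restaurant(SAMPLE_HTML))
```

Returning `None` for a missing field, rather than raising, keeps one malformed page from stopping a run over a whole folder of HTML files.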

Zomato HTML Files | Yelp HTML Files

Tabulated Data

Please find below the links to tableA.csv and tableB.csv, which tabulate the data crawled from Zomato and Yelp respectively.

The timestamp of each data entry serves as the ID of its tuple.
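One way a timestamp can double as a tuple ID is sketched below. This is only an assumption about the mechanics: the timestamp format, the column name `ID`, and the helpers `add_timestamp_ids` and `to_csv` are hypothetical, and the actual tables may have been produced differently.

```python
import csv
import io
from datetime import datetime, timezone


def add_timestamp_ids(records, clock=lambda: datetime.now(timezone.utc)):
    """Prefix each record with an ID taken from the time it is processed."""
    rows = []
    for record in records:
        # Microsecond resolution keeps IDs distinct in most cases; two entries
        # stamped within the same microsecond would still collide.
        stamp = clock().strftime("%Y%m%d%H%M%S%f")
        rows.append({"ID": stamp, **record})
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = add_timestamp_ids([{"name": "Blue Door Cafe"}, {"name": "Taco House"}])
    print(to_csv(rows))
```

The `clock` parameter exists only so that the ID generation can be made deterministic when checking the code.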

Zomato's data in CSV format | Yelp's data in CSV format

Phase - 2

Please find below the links to blocking_explanation.pdf, the IPython file and the table obtained after blocking on different attributes - tableC.csv.

We carried out blocking in parallel on three attributes and then took the conjunction (intersection) of the three candidate sets to get the final tableC.csv.
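The idea of blocking on several attributes independently and then keeping only the pairs that survive every blocker can be sketched as follows. The attribute names (`name`, `city`, `zip`), the blocking rules and the helper functions are assumptions chosen for illustration; the attributes and rules actually used are described in blocking_explanation.pdf. The all-pairs comparison is also only for clarity and would be too slow for large tables.

```python
from itertools import product


def block_on(table_a, table_b, attribute, same_block):
    """Return the candidate set of (ID_a, ID_b) pairs that pass one blocker."""
    return {
        (a["ID"], b["ID"])
        for a, b in product(table_a, table_b)
        if same_block(a[attribute], b[attribute])
    }


def conjunction(*candidate_sets):
    """Keep only the pairs present in every candidate set."""
    return set.intersection(*candidate_sets)


def share_token(x, y):
    # Hypothetical rule: the two strings share at least one lowercase word.
    return bool(set(x.lower().split()) & set(y.lower().split()))


def equal_ignoring_case(x, y):
    # Hypothetical rule: exact match after trimming and lowercasing.
    return x.strip().lower() == y.strip().lower()


if __name__ == "__main__":
    table_a = [
        {"ID": "a1", "name": "Blue Door Cafe", "city": "Madison", "zip": "53703"},
        {"ID": "a2", "name": "Taco House", "city": "Madison", "zip": "53711"},
    ]
    table_b = [
        {"ID": "b1", "name": "The Blue Door", "city": "madison", "zip": "53703"},
        {"ID": "b2", "name": "Taco Palace", "city": "Madison", "zip": "53715"},
    ]
    by_name = block_on(table_a, table_b, "name", share_token)
    by_city = block_on(table_a, table_b, "city", equal_ignoring_case)
    by_zip = block_on(table_a, table_b, "zip", equal_ignoring_case)
    print(conjunction(by_name, by_city, by_zip))
```

Because each blocker depends only on the two input tables, the three candidate sets can be computed independently of one another, which is what makes running them in parallel possible.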

blocking_explanation.pdf | IPython file | tableC.csv

Phase - 3

Please find below the links to Golden Data, Matching Explanation.pdf, the IPython file and the table obtained after matching.
We ran multiple iterations of training and analysis to find the best machine learning algorithm for matching restaurant entities from Yelp and Zomato.
Based on precision, recall, and F1 scores, Naive Bayes proved to be the best of the algorithms we tried for matching restaurant entities.
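For reference, the three measures used to compare matchers can be computed from a set of predicted matches and a set of gold matches as sketched below. The function name `precision_recall_f1` and the sample pairs are hypothetical; they are not the project's data or results.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 for sets of (ID_a, ID_b) match pairs."""
    true_positives = len(predicted & gold)
    # Precision: fraction of predicted matches that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold matches that were found.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
    gold = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")}
    print(precision_recall_f1(predicted, gold))
```

Comparing matchers on all three numbers, rather than on accuracy alone, matters here because true matches are typically a small minority of the candidate pairs.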

Golden Data | Matching Explanation | IPython file | Final Matches

Final File Submission

Please find below the links to Development Set, Evaluation Set and the Labelled Data without error fixes.

Development Set | Evaluation Set | Labelled Data

All Files | Report For Bonus Points

Our Team

Ashish Shenoy ashenoy@cs.wisc.edu

Shruthi Racha shruthir@cs.wisc.edu