Web Crawler


Domain chosen: Restaurant review websites.

This page contains the deliverables for the CS784 project.


Websites Chosen


We chose two of the most widely used restaurant review websites, www.zomato.com and www.yelp.com, as the sources of HTML data. The datasets that both websites provide for academic purposes can also be used for this project.

HTML Repository


Please find below the links to the folders containing the HTML pages crawled from Zomato and Yelp respectively.

The BeautifulSoup Python library was used to pull data out of the HTML pages.
The crawler logic was designed and built in Python.
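
As a rough illustration of the extraction step, the sketch below pulls two fields out of a page. It uses Python's standard-library html.parser in place of BeautifulSoup so that it runs without extra dependencies; it is not the project's crawler. The sample markup, the class names "name" and "rating", and the helper names FieldExtractor and extract_fields are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Hypothetical markup; real Zomato and Yelp pages are structured differently.
SAMPLE_HTML = """
<div class="restaurant">
  <h1 class="name">Example Bistro</h1>
  <span class="rating">4.5</span>
</div>
"""


class FieldExtractor(HTMLParser):
    """Collect the text of elements whose class attribute is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # class name of the element being read, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.wanted:
            self.current = cls

    def handle_data(self, data):
        # Keep only non-empty text that sits inside a wanted element.
        if self.current is not None and data.strip():
            self.fields[self.current] = data.strip()

    def handle_endtag(self, tag):
        self.current = None


def extract_fields(html, wanted=("name", "rating")):
    """Return a dict mapping each wanted class name to its text."""
    parser = FieldExtractor(wanted)
    parser.feed(html)
    return parser.fields


print(extract_fields(SAMPLE_HTML))
```

With BeautifulSoup the same lookup would typically be a short find call per field; the event-based parser here only keeps the example dependency-free.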


Zomato HTML Files                Yelp HTML Files

Tabulated Data


Please find below the links to the tables tableA.csv and tableB.csv, which contain the tabulated data crawled from Zomato and Yelp respectively.

The timestamp of each data entry serves as the ID of its tuple.
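
A minimal sketch of how a timestamp-based ID could be attached to each tuple before writing it to CSV is shown below. The column names (id, name, city), the nanosecond resolution, and the helper names make_row and rows_to_csv are assumptions for illustration; the page does not state the actual timestamp format or schema.

```python
import csv
import io
import time


def make_row(record, now=None):
    """Prefix a crawled record with a timestamp-based ID.

    `now` can be passed explicitly to make the output reproducible;
    otherwise the current time in nanoseconds is used.
    """
    ts = time.time_ns() if now is None else now
    return {"id": str(ts), **record}


def rows_to_csv(rows, fieldnames):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical record, not taken from the crawled data.
row = make_row({"name": "Example Bistro", "city": "Madison"},
               now=1700000000000000000)
print(rows_to_csv([row], ["id", "name", "city"]))
```

A timestamp is only a valid ID if no two entries share one, so a scheme like this depends on the clock resolution being finer than the rate at which entries are created.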

Zomato's data in CSV format                Yelp's data in CSV format

Phase - 2


Please find below the links to blocking_explanation.pdf, the IPython file, and tableC.csv, the table obtained after blocking on different attributes.

We carried out blocking in parallel on three attributes and then took the conjunction of the three candidate sets to obtain the final tableC.csv.
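
The combination step can be sketched as follows, reading "conjunction" as the set intersection of the candidate pairs produced by each blocker. The three blocking attributes used here (first word of the name, city, zip code), the sample rows, and the helper names block_on and conjunction are hypothetical; the actual attributes are described in blocking_explanation.pdf.

```python
def block_on(table_a, table_b, key):
    """Return the set of (id_a, id_b) pairs whose blocking keys agree."""
    index = {}
    for row in table_b:
        index.setdefault(key(row), set()).add(row["id"])
    return {(a["id"], b_id) for a in table_a for b_id in index.get(key(a), ())}


def conjunction(*candidate_sets):
    """Keep only the pairs that survive every blocker."""
    return set.intersection(*candidate_sets)


# Hypothetical rows, not taken from tableA.csv or tableB.csv.
table_a = [
    {"id": "a1", "name": "Example Bistro", "city": "Madison", "zip": "53703"},
    {"id": "a2", "name": "Sample Diner", "city": "Madison", "zip": "53715"},
]
table_b = [
    {"id": "b1", "name": "example bistro & bar", "city": "madison", "zip": "53703"},
    {"id": "b2", "name": "Sample Cafe", "city": "Madison", "zip": "53703"},
]

# Each blocker can run independently, which is what makes them parallelizable.
by_name = block_on(table_a, table_b, lambda r: r["name"].lower().split()[0])
by_city = block_on(table_a, table_b, lambda r: r["city"].lower())
by_zip = block_on(table_a, table_b, lambda r: r["zip"])

candidates = conjunction(by_name, by_city, by_zip)
print(sorted(candidates))
```

Because a pair must pass all three blockers, the conjunction yields a smaller candidate set than any single blocker, at the cost of dropping true matches that disagree on any one attribute.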


blocking_explanation.pdf                IPython file                tableC.csv               

Phase - 3


Please find below the links to the Golden Data, Matching Explanation.pdf, the IPython file, and the table obtained after matching.
We carried out multiple iterations of analysis to find the best machine learning algorithm for matching restaurant entities from Yelp and Zomato.
Based on the precision, recall, and F1 measures, Naive Bayes proved to be the best algorithm for matching restaurant entities.
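
For reference, the three measures used to compare matchers can be computed from gold labels and predictions as in the sketch below. The label lists are toy values invented for illustration, not results from this project, and the function name precision_recall_f1 is an assumption.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 for binary match labels.

    `gold` and `predicted` are equal-length sequences of 0/1 values,
    where 1 means the pair is labeled (or predicted) as a match.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # true positives
    fp = sum(1 for g, p in pairs if not g and p)    # false positives
    fn = sum(1 for g, p in pairs if g and not p)    # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels for illustration only.
gold = [1, 1, 1, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 1]
print(precision_recall_f1(gold, predicted))
```

F1 is the harmonic mean of precision and recall, so a matcher scores well on it only when it is both accurate in the matches it reports and thorough in finding the true ones.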


Golden Data                Matching Explanation                IPython file                Final Matches

Final File Submission


Please find below the links to Development Set, Evaluation Set and the Labelled Data without error fixes.


Development Set       Evaluation Set       Labelled Data        

All Files       Report For Bonus Points

Our Team



Ashish Shenoy ashenoy@cs.wisc.edu

Shruthi Racha shruthir@cs.wisc.edu