Domain chosen: Restaurant review websites.
This page contains the deliverables for the CS784 project.
We chose two of the most commonly used restaurant review websites, www.zomato.com and www.yelp.com, as the sources of HTML data. Datasets provided by the two websites for academic purposes can also be used for this project.
Please find below the links to the folders containing the collection of the HTML pages crawled from Zomato and Yelp respectively.
The BeautifulSoup Python library was used to pull data out of the HTML pages.
The logic for the crawler was designed and built in Python.
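As an illustration of this kind of extraction, here is a minimal sketch using BeautifulSoup. The sample markup, the class names (name, address, phone) and the helper extract_restaurant are hypothetical and chosen only for the example; the real Zomato and Yelp pages use different markup, and this is not the project's actual crawler code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: real Zomato and Yelp pages use different tags and classes.
SAMPLE_HTML = """
<html><body>
  <div class="restaurant">
    <h1 class="name">Example Diner</h1>
    <span class="address">123 Main St, Madison, WI</span>
    <span class="phone">(608) 555-0100</span>
  </div>
</body></html>
"""


def extract_restaurant(html):
    """Return a dict with the fields parsed from one restaurant page."""
    # "html.parser" is the parser bundled with Python, so no extra backend is needed.
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field in ("name", "address", "phone"):
        tag = soup.find(class_=field)
        # A missing element becomes None instead of raising an error.
        record[field] = tag.get_text(strip=True) if tag else None
    return record


record = extract_restaurant(SAMPLE_HTML)
print(record)
```

Each parsed record could then be written out as one row of a CSV table.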
Please find below the links to the tables tableA.csv and tableB.csv, which contain the tabulated data crawled from Zomato and Yelp respectively.
The timestamp of each data entry is used as the ID for each tuple.
Please find below the links to blocking_explanation.pdf, the IPython file and the table obtained after blocking on different attributes - tableC.csv.
We carried out blocking in parallel on three attributes and then took the
conjunction (intersection) of the three candidate sets to get the final tableC.csv.
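The idea of blocking on several attributes independently and then combining the candidate sets can be sketched as follows. The attributes (name, city, zip), the blocking rules and the toy tuples are assumptions made for the illustration; the attributes and rules actually used are described in blocking_explanation.pdf.

```python
from itertools import product

# Toy tuples standing in for rows of tableA.csv and tableB.csv (hypothetical values).
table_a = [
    {"id": "a1", "name": "Example Diner", "city": "Madison", "zip": "53703"},
    {"id": "a2", "name": "Sample Cafe", "city": "Madison", "zip": "53715"},
]
table_b = [
    {"id": "b1", "name": "Example Diner & Bar", "city": "Madison", "zip": "53703"},
    {"id": "b2", "name": "Other Grill", "city": "Madison", "zip": "53715"},
]


def block(rows_a, rows_b, keep):
    """Return the set of (id_a, id_b) pairs that survive one blocking rule."""
    return {(a["id"], b["id"]) for a, b in product(rows_a, rows_b) if keep(a, b)}


def share_name_token(a, b):
    # Keep pairs whose names have at least one word in common.
    return bool(set(a["name"].lower().split()) & set(b["name"].lower().split()))


def same_city(a, b):
    return a["city"].lower() == b["city"].lower()


def same_zip(a, b):
    return a["zip"] == b["zip"]


# Each rule is applied independently to the full pair of tables.
by_name = block(table_a, table_b, share_name_token)
by_city = block(table_a, table_b, same_city)
by_zip = block(table_a, table_b, same_zip)

# Conjunction of the three rules: keep only pairs present in every candidate set.
candidates = by_name & by_city & by_zip
print(sorted(candidates))
```

Because the rules are independent, each candidate set can be computed separately and the final set intersection is cheap. The nested loop over all pairs is only suitable for small tables; it is used here to keep the sketch short.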
Please find below the links to Golden Data, Matching Explanation.pdf, the IPython file and the table obtained after matching.
We carried out multiple iterations and analysis to find the best machine learning algorithm for matching two restaurant entities from Yelp and Zomato.
Based on the precision, recall and F1 measures, Naive Bayes proved to be the best machine learning algorithm for matching restaurant entities.
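For reference, the three measures can be computed from predicted and gold match labels as in the sketch below. The label lists are made-up values for illustration only and are not results from this project.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 for binary match labels (1 = match)."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)        # predicted match, true match
    fp = sum(1 for p, g in pairs if p and not g)    # predicted match, true non-match
    fn = sum(1 for p, g in pairs if not p and g)    # missed true match
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels for eight candidate pairs (illustration only).
gold = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(precision, recall, f1)
```

Running the same function on the predictions of each candidate matcher over the same labelled set gives directly comparable numbers.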
Please find below the links to Development Set, Evaluation Set and the Labelled Data without error fixes.