We chose two of the most widely used restaurant review websites, www.zomato.com and www.yelp.com, as sources of HTML data. The datasets that the two websites provide for academic purposes can be used for this project.
Please find below the links to the folders containing the collection of the HTML pages crawled from Zomato and Yelp respectively.
The BeautifulSoup Python library was used to pull data out of the HTML web pages.
The crawler logic was designed and built in Python.
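To illustrate the extraction step, here is a minimal sketch of pulling restaurant fields out of one HTML page with BeautifulSoup. The sample markup, the class names (name, address, phone) and the helper extract_restaurant are hypothetical and chosen only for illustration. The real Zomato and Yelp pages use different markup, and this is not the project's actual crawler.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the real Zomato and Yelp markup differs.
SAMPLE_HTML = """
<html><body>
  <div class="restaurant">
    <h1 class="name">Example Diner</h1>
    <span class="address">123 Main St</span>
    <span class="phone">555-0100</span>
  </div>
</body></html>
"""


def extract_restaurant(html):
    """Return a dict with the restaurant fields found in one HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field in ("name", "address", "phone", "website"):
        # Assumed convention: each field lives in a tag whose class is the field name.
        tag = soup.find(class_=field)
        # Missing fields are stored as None rather than raising an error.
        record[field] = tag.get_text(strip=True) if tag else None
    return record


record = extract_restaurant(SAMPLE_HTML)
print(record)
```

Storing None for a missing field keeps every record the same shape, which makes it easier to write the extracted rows into a table later.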
Please find below the links to blocking_explanation.pdf, the IPython file and the table obtained after blocking on different attributes - tableC.csv.
We blocked on three attributes in parallel and then took the conjunction of the three candidate sets to obtain the final tableC.csv.
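The blocking step can be sketched as follows, assuming that the conjunction of candidate sets means their set intersection. The three attributes used here (name, city, zip), the blocking rules and the sample rows are all hypothetical. The actual attributes and rules are described in blocking_explanation.pdf.

```python
from itertools import product

# Made-up sample rows; these are not taken from the project data.
zomato = [
    {"id": "z1", "name": "Example Diner", "city": "Madison", "zip": "53703"},
    {"id": "z2", "name": "Sample Cafe", "city": "Madison", "zip": "53715"},
]
yelp = [
    {"id": "y1", "name": "Example Diner & Bar", "city": "Madison", "zip": "53703"},
    {"id": "y2", "name": "Other Grill", "city": "Chicago", "zip": "60601"},
]


def block(table_a, table_b, keep):
    """Return the set of (id_a, id_b) pairs that survive the rule `keep`."""
    return {(a["id"], b["id"]) for a, b in product(table_a, table_b) if keep(a, b)}


def share_name_token(a, b):
    """True if the two names have at least one lowercase word in common."""
    return bool(set(a["name"].lower().split()) & set(b["name"].lower().split()))


def same_city(a, b):
    return a["city"].lower() == b["city"].lower()


def same_zip(a, b):
    return a["zip"] == b["zip"]


# Each blocker runs independently on the full tables, so they could run in parallel.
by_name = block(zomato, yelp, share_name_token)
by_city = block(zomato, yelp, same_city)
by_zip = block(zomato, yelp, same_zip)

# Conjunction: keep only the pairs that every blocker accepted.
table_c = by_name & by_city & by_zip
print(sorted(table_c))
```

Because a pair must pass all three rules, the intersection gives a smaller candidate set than any single blocker, at the cost of dropping true matches that fail any one rule.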
Please find below the links to Golden Data, Matching Explanation.pdf, the IPython file and the table obtained after matching.
We carried out multiple iterations and analysis to find the best machine learning algorithm for matching two restaurant entities from Yelp and Zomato.
Based on the precision, recall, and F1 measures, Naive Bayes proved to be the best machine learning algorithm for matching restaurant entities.
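For reference, the three measures can be computed from golden labels and matcher predictions as in the sketch below. The helper precision_recall_f1 and the label lists are hypothetical and made up for illustration. They are not the project's results.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 for binary labels (1 = match, 0 = non-match)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels for six candidate pairs.
gold = [1, 1, 1, 0, 0, 1]
predicted = [1, 1, 0, 0, 1, 1]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(precision, recall, f1)
```

Running the same function on each candidate matcher's predictions against the golden data gives comparable numbers for choosing between algorithms.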