Movie Matcher- Amazon vs Rotten Tomatoes

Introduction

Goal- Data Science Pipeline Implementation

Team Members

Mushahid Alam
Shreya Kamath

Deliverables

Develop crawlers to collect raw HTML data from two websites that operate in the same domain.
Develop wrappers to extract the relevant movie attributes from the raw HTML data, and create a CSV file with the extracted attributes.
Perform blocking on the two CSV files to eliminate the obvious non-matches from the combined table.
Manually create golden data by randomly sampling the blocked tables.
Use the golden data to train a set of matchers. Use cross validation to determine the precision, recall and F1 measure of the matchers.
Based on these results, determine which matcher is the best.

Sites Chosen

Amazon

Rotten Tomatoes

Stage 1- Extraction

Selected Amazon.com and RottenTomatoes.com as the two websites to be used for the project.

Developed crawlers to crawl and extract movie attributes for two genres: Action and Drama.

The movie attributes extracted from Amazon are in Table A, and those extracted from Rotten Tomatoes are in Table B.

HTML Pages	CSV Files
Amazon	Table A
Rotten Tomatoes	Table B
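
The wrapper step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual wrapper: the HTML snippet, the `title` and `year` class names, and the helper names `MovieParser` and `to_csv` are all assumptions made for the example. It uses only the Python standard library.

```python
import csv
import io
from html.parser import HTMLParser


class MovieParser(HTMLParser):
    """Collect the text of elements whose class names a wanted field."""

    FIELDS = ("title", "year")  # hypothetical class names, not the real sites' markup

    def __init__(self):
        super().__init__()
        self.current = None  # field currently being read, if any
        self.record = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.FIELDS:
            self.current = cls

    def handle_data(self, data):
        # Store the first non-empty text found inside a wanted element.
        if self.current and data.strip():
            self.record[self.current] = data.strip()
            self.current = None

    def handle_endtag(self, tag):
        self.current = None


def to_csv(records, fields):
    """Serialize a list of attribute dicts to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# Assumed raw HTML for a single movie page.
html = '<div><h1 class="title">Heat</h1> <span class="year">1995</span></div>'
parser = MovieParser()
parser.feed(html)
csv_text = to_csv([parser.record], MovieParser.FIELDS)
```

In practice each crawled page would yield one record, and the records for each site would be written to that site's CSV table.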

Stage 2- Blocking

Performed blocking on the tables obtained in Stage 1.

Applied an overlap blocker, a rule-based blocker, and a black-box blocker. Refer to the blocking explanation document for further details.

Blocked Tables - TableC
Blocking Explanation
Blocking iPython File
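
To give a sense of what an overlap blocker does, here is a minimal sketch in plain Python. It is an illustration only and does not reproduce the blockers in the iPython file: the `title` attribute, the sample rows, the `min_overlap` threshold and the function names are assumptions. A pair survives blocking only if the two titles share at least `min_overlap` word tokens.

```python
from itertools import product


def tokenize(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(text.lower().split())


def overlap_block(table_a, table_b, attr="title", min_overlap=1):
    """Return candidate (a, b) pairs whose `attr` values share enough tokens."""
    candidates = []
    for a, b in product(table_a, table_b):
        shared = tokenize(a[attr]) & tokenize(b[attr])
        if len(shared) >= min_overlap:
            candidates.append((a, b))
    return candidates


# Hypothetical rows standing in for Table A and Table B.
table_a = [
    {"id": "a1", "title": "The Dark Knight"},
    {"id": "a2", "title": "Forrest Gump"},
]
table_b = [
    {"id": "b1", "title": "Dark Knight Rises"},
    {"id": "b2", "title": "Titanic"},
]

pairs = overlap_block(table_a, table_b, min_overlap=2)
candidate_ids = [(a["id"], b["id"]) for a, b in pairs]
```

This version compares every pair, which is fine for a small example; on full tables an inverted index over tokens would avoid the quadratic scan.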

Stage 3- Matching

Manually created golden data by sampling 450 tuples from the table obtained in Stage 2.

Created feature vectors and used them to determine the best matcher using cross validation. Refer to the matching explanation document for further details.

Golden Data- Table G
Final Matched Tables
Matching Explanation
Matching iPython File
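
The evaluation measures used to compare matchers can be sketched as below. This is a hedged, standard-library illustration rather than the code in the matching iPython file: the label lists are made up, and `precision_recall_f1` and `k_fold_indices` are hypothetical helper names. Labels use 1 for a match and 0 for a non-match.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 for binary match labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds over n items."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


# Made-up golden labels and matcher predictions for six candidate pairs.
gold = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(gold, predicted)

folds = list(k_fold_indices(5, 2))
```

In a cross-validation run, each matcher would be trained on the `train` indices of a fold, scored on the `test` indices, and the per-fold scores averaged before comparing matchers.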

Final File Submission

HTML Pages	CSV Files
Original input tables	Table A Table B
Input tables with error fixes	Table A
Labeled data without error fixes	-
Labeled data with error fixes	Table G
Matching iPython file	ipython file
Development Set	-
Evaluation Set	-
Final Matches	Final matches

Bonus Points

Report for bonus points	report