Movie Matcher- Amazon vs Rotten Tomatoes

Introduction

Goal- Data Science Pipeline Implementation

Team Members

Deliverables

Sites Chosen

Stage 1- Extraction

Selected Amazon.com and RottenTomatoes.com as the two websites to be used for the project.

Developed crawlers to crawl the sites and extract movie attributes for the Action and Drama genres.

The movie attributes extracted from Amazon are in Table A, and those extracted from Rotten Tomatoes are in Table B.
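The extraction step can be illustrated with a minimal sketch that pulls attributes out of one downloaded page and writes them as a CSV row. This is not the project's crawler: the page markup, the `class` names, and the attribute list (`title`, `year`, `genre`) are all assumptions made for illustration, and only the Python standard library is used.

```python
import csv
import io
from html.parser import HTMLParser


class MovieAttributeParser(HTMLParser):
    """Collects the text of tags whose `class` names a wanted attribute."""

    FIELDS = ("title", "year", "genre")  # hypothetical attribute names

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None  # attribute whose text we expect next

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.FIELDS:
            self._current = css_class

    def handle_data(self, data):
        if self._current and data.strip():
            self.record[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


def extract(html):
    """Return a dict of movie attributes found in one HTML page."""
    parser = MovieAttributeParser()
    parser.feed(html)
    parser.close()
    return parser.record


def to_csv(records, fields=MovieAttributeParser.FIELDS):
    """Serialize extracted records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# Hypothetical page snippet; real Amazon and Rotten Tomatoes markup differs.
PAGE = (
    '<div><h1 class="title">Gladiator</h1>'
    '<span class="year">2000</span>'
    '<span class="genre">Action</span></div>'
)

table_row = extract(PAGE)
csv_text = to_csv([table_row])
```

Running `extract` over every saved HTML page of a site and passing the results to `to_csv` would give one table per site, in the spirit of Table A and Table B.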

HTML Pages CSV Files
Amazon Table A
Rotten Tomatoes Table B

Stage 2- Blocking

Performed blocking on the tables obtained in Stage 1.

Applied an overlap blocker, a rule-based blocker, and a black-box blocker. Refer to the blocking explanation document for further details.
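As a rough illustration of how these blockers narrow down candidate pairs, the sketch below implements a word-overlap blocker followed by a rule applied as an arbitrary function, which is also the general idea behind a black-box blocker. It is a plain-Python sketch under stated assumptions, not the project's actual blocking code: the `id`, `title`, and `year` columns, the one-year tolerance, and the toy rows are all hypothetical.

```python
import re


def tokens(title):
    """Lowercase word tokens of a title."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def overlap_block(table_a, table_b, attr="title", overlap_size=1):
    """Return (a_id, b_id) pairs whose `attr` values share at least
    `overlap_size` word tokens."""
    # Inverted index: token -> ids of Table B rows containing it.
    index = {}
    for row in table_b:
        for tok in tokens(row[attr]):
            index.setdefault(tok, set()).add(row["id"])

    candidates = []
    for a in table_a:
        shared_counts = {}
        for tok in tokens(a[attr]):
            for b_id in index.get(tok, ()):
                shared_counts[b_id] = shared_counts.get(b_id, 0) + 1
        for b_id, shared in sorted(shared_counts.items()):
            if shared >= overlap_size:
                candidates.append((a["id"], b_id))
    return candidates


def year_rule(a, b, max_gap=1):
    """Hypothetical rule: keep a pair only if release years are close."""
    return abs(a["year"] - b["year"]) <= max_gap


def apply_rule(candidates, table_a, table_b, rule):
    """Keep only the candidate pairs for which `rule(a_row, b_row)` holds."""
    a_by_id = {r["id"]: r for r in table_a}
    b_by_id = {r["id"]: r for r in table_b}
    return [(i, j) for i, j in candidates if rule(a_by_id[i], b_by_id[j])]


# Toy rows for illustration only; they are not taken from the project tables.
table_a = [
    {"id": "a1", "title": "The Dark Knight", "year": 2008},
    {"id": "a2", "title": "Gladiator", "year": 2000},
]
table_b = [
    {"id": "b1", "title": "Dark Knight, The", "year": 2008},
    {"id": "b2", "title": "The Dark Knight Rises", "year": 2012},
    {"id": "b3", "title": "Casablanca", "year": 1942},
]

candidates = overlap_block(table_a, table_b, overlap_size=2)
table_c = apply_rule(candidates, table_a, table_b, year_rule)
```

The inverted index avoids comparing every row of one table with every row of the other, which is the main reason to block before matching.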

Blocked Tables - Table C
Blocking Explanation
Blocking iPython File

Stage 3- Matching

Manually created golden data by sampling 450 tuples from the table obtained in Stage 2.

Created feature vectors and used them to select the best matcher by cross-validation. Refer to the matching explanation document for further details.
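The selection procedure can be sketched as follows: turn each candidate pair into a feature vector, then compare matchers by their mean F1 score over k folds. This is a minimal standard-library sketch, not the project's matching code. The two features, the `ThresholdMatcher` stand-in (a real run would more likely compare learned classifiers), the fold count, and the toy labeled data are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of the lowercase word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def feature_vector(a, b):
    """[title similarity, 1.0 if the release years are equal else 0.0]."""
    return [jaccard(a["title"], b["title"]), float(a["year"] == b["year"])]


def f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


class ThresholdMatcher:
    """Hypothetical matcher: predicts a match when one feature is at or
    above a threshold learned from the training folds."""

    def __init__(self, feature_index):
        self.feature_index = feature_index
        self.threshold = 0.5

    def fit(self, X, y):
        i = self.feature_index
        values = sorted({x[i] for x in X})
        # Keep the observed value that gives the best training F1.
        self.threshold = max(values, key=lambda t: f1(y, [x[i] >= t for x in X]))
        return self

    def predict(self, X):
        return [x[self.feature_index] >= self.threshold for x in X]


def cross_validate(make_matcher, X, y, k=3):
    """Mean F1 over k folds; fold f holds out rows f, f+k, f+2k, ..."""
    scores = []
    for fold in range(k):
        test = list(range(fold, len(X), k))
        train = [i for i in range(len(X)) if i not in test]
        matcher = make_matcher().fit([X[i] for i in train], [y[i] for i in train])
        scores.append(f1([y[i] for i in test], matcher.predict([X[i] for i in test])))
    return sum(scores) / k


# Toy labeled feature vectors (illustrative only, not the golden data).
X = [[0.9, 1.0], [0.8, 1.0], [1.0, 0.0], [0.2, 1.0], [0.1, 0.0],
     [0.3, 0.0], [0.7, 1.0], [0.25, 1.0], [0.0, 0.0]]
y = [True, True, True, False, False, False, True, False, False]

MATCHERS = {
    "title_similarity": lambda: ThresholdMatcher(0),
    "same_year": lambda: ThresholdMatcher(1),
}
scores = {name: cross_validate(make, X, y) for name, make in MATCHERS.items()}
best = max(scores, key=scores.get)
```

Each matcher is refit on the training folds only, so the score it receives reflects pairs it has not seen, which is what makes the comparison between matchers fair.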

Golden Data- Table G
Final Matched Tables
Matching Explanation
Matching iPython File

Final File Submission

HTML Pages CSV Files
Original input tables Table A Table B
Input tables with error fixes Table A
Labeled data without error fixes -
Labeled data with error fixes Table G
Matching iPython file ipython file
Development Set -
Evaluation Set -
Final Matches Final matches

Bonus Points

Report for bonus points report