Goal: Data Science Pipeline Implementation
Team Members
Deliverables
Selected Amazon.com and RottenTomatoes.com as the two websites to be used for the project.
Developed crawlers to crawl and extract movie attributes for the genres Action and Drama.
The movie attributes extracted from Amazon are in Table A, and those extracted from Rotten Tomatoes are in Table B.
HTML Pages | CSV Files |
---|---|
Amazon | Table A |
Rotten Tomatoes | Table B |
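The extraction step above can be sketched roughly as follows. This is only an illustration using the Python standard library, not the project's crawler: the HTML snippet, the class names `title`, `year`, and `genre`, and the helper names `extract_movie` and `to_csv` are all hypothetical, and a real crawler would also need to fetch pages and follow each site's actual markup.

```python
import csv
import io
from html.parser import HTMLParser


class MovieAttributeParser(HTMLParser):
    """Collects text from elements whose class names a movie field."""

    FIELDS = {"title", "year", "genre"}  # hypothetical class names

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.FIELDS:
            self._current = css_class

    def handle_data(self, data):
        # Store the first non-empty text found inside a field element.
        if self._current and data.strip():
            self.record[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


def extract_movie(html):
    """Return a dict of movie attributes found in one HTML page."""
    parser = MovieAttributeParser()
    parser.feed(html)
    parser.close()
    return parser.record


def to_csv(records, fieldnames):
    """Serialize extracted records as CSV text (one row per movie)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up page fragment, used only to exercise the sketch.
sample_page = (
    '<div><h1 class="title">Heat</h1>'
    '<span class="year">1995</span>'
    '<span class="genre">Action</span></div>'
)
movie = extract_movie(sample_page)
csv_text = to_csv([movie], ["title", "year", "genre"])
print(csv_text)
```

Writing one row per movie with a fixed column order keeps the two tables easy to align in the later blocking and matching stages.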
Performed blocking on the tables obtained in stage 1.
Applied an overlap blocker, a rule-based blocker, and a black-box blocker. Refer to the blocking explanation document for further details.
Blocked Tables - Table C |
---|
Blocking Explanation |
Blocking iPython File |
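The idea behind the overlap blocker mentioned above can be illustrated with a minimal, library-independent sketch. It assumes each table is a list of dictionaries with hypothetical `id` and `title` fields and keeps a pair only when the two titles share at least `min_overlap` word tokens. The sample rows and the threshold are illustrative and are not taken from the project's data or notebook.

```python
import re


def tokenize(text):
    """Lower-case a string and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_block(table_a, table_b, attr="title", min_overlap=2):
    """Return candidate (id_a, id_b) pairs whose `attr` values share
    at least `min_overlap` word tokens. All other pairs are dropped."""
    candidates = []
    for row_a in table_a:
        tokens_a = tokenize(row_a[attr])
        for row_b in table_b:
            shared = tokens_a & tokenize(row_b[attr])
            if len(shared) >= min_overlap:
                candidates.append((row_a["id"], row_b["id"]))
    return candidates


# Illustrative rows only (hypothetical ids and schema).
table_a = [
    {"id": "a1", "title": "The Dark Knight"},
    {"id": "a2", "title": "Forrest Gump"},
]
table_b = [
    {"id": "b1", "title": "Dark Knight, The"},
    {"id": "b2", "title": "Forrest Gump"},
    {"id": "b3", "title": "The Godfather"},
]

candidate_pairs = overlap_block(table_a, table_b)
print(candidate_pairs)
```

This nested loop still compares every pair, so it only shows the filtering rule. A rule-based or black-box blocker would replace the token-overlap test with a different predicate over the same pair of rows.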
Manually created golden data by sampling 450 tuples from the table obtained in stage 2.
Created feature vectors and used them to determine the best matcher using cross-validation. Refer to the matching explanation document for further details.
Golden Data- Table G |
---|
Final Matched Tables |
Matching Explanation |
Matching iPython File |
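Selecting a matcher by cross-validation, as described above, can be sketched as follows. This is a toy, standard-library illustration and not the project's notebook: the feature vectors (`[title similarity, year match]`), the labels, the two candidate matchers, and the choice of mean F1 as the selection score are all assumptions made for the example.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive (match) class; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def threshold_matcher(train_x, train_y):
    """Learn the title-similarity cut-off with the best training accuracy."""
    cutoffs = sorted({x[0] for x in train_x})
    best = max(
        cutoffs,
        key=lambda c: sum(int(x[0] >= c) == y for x, y in zip(train_x, train_y)),
    )
    return lambda x: int(x[0] >= best)


def nearest_neighbour_matcher(train_x, train_y):
    """Predict the label of the closest training vector (1-NN)."""
    def predict(x):
        dists = [sum((p - q) ** 2 for p, q in zip(x, tx)) for tx in train_x]
        return train_y[dists.index(min(dists))]
    return predict


def cross_val_f1(make_matcher, xs, ys, k=3):
    """Mean F1 over k folds; fold membership is index modulo k."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        predict = make_matcher([xs[i] for i in train], [ys[i] for i in train])
        preds = [predict(xs[i]) for i in test]
        scores.append(f1_score([ys[i] for i in test], preds))
    return sum(scores) / k


# Hypothetical labeled feature vectors: [title similarity, year match].
features = [[0.9, 1], [0.8, 1], [0.2, 0], [0.1, 1], [1.0, 1],
            [0.3, 0], [0.7, 0], [0.05, 0], [0.95, 1]]
labels = [1, 1, 0, 0, 1, 0, 1, 0, 1]

matchers = {
    "threshold": threshold_matcher,
    "nearest_neighbour": nearest_neighbour_matcher,
}
cv_scores = {name: cross_val_f1(m, features, labels) for name, m in matchers.items()}
best_matcher = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_matcher)
```

Scoring each candidate on held-out folds rather than on the data it was fitted to is what makes the comparison between matchers meaningful.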
HTML Pages | CSV Files |
---|---|
Original input tables | Table A, Table B |
Input tables with error fixes | Table A |
Labeled data without error fixes | - |
Labeled data with error fixes | Table G |
Matching iPython file | ipython file |
Development Set | - |
Evaluation Set | - |
Final Matches | Final matches |
Report for bonus points | report |