V for Ven(data)

One-stop place to search for movies to rent or stream. It currently supports Google Play and the iTunes Store.

Sources

Google Play

Google Play is a digital distribution platform operated by Google. In addition to serving as the app store for the Android operating system, it serves as a digital media store selling music, movies, television shows, books, and magazines.

iTunes Store

The iTunes Store is a digital media store operated by Apple Inc. that sells music, movies, television shows, audiobooks, and more. Apple claims that the iTunes Store is the most popular movie store, with over 45,000 films.

Other sources considered

Netflix, Hulu, Amazon Instant Video

Data Extraction

README

Google Play

HTML Files: Movies, Movie listings, All
JSON Data: tableA.json
Sources: Spider, Movie Item, Convert, All
Number of Tuples: 9233

Extracted Attributes

Attribute Type Description
id TEXT The unique identifier of the movie in the Google Play database.
name TEXT Name of the movie.
year INTEGER Year of the movie release.
genre TEXT Genre of movie, e.g., Romance, Comedy, etc.
description TEXT Short description about the movie or the storyline.
actors TEXT The list of lead actors in the movie.
writers TEXT The list of screenwriters of the movie.
producers TEXT The list of producers of the movie.
directors TEXT The list of directors of the movie.
content_rating TEXT Motion Picture Association of America (MPAA) film rating indicating the movie's suitability for audiences, e.g., PG-13.
rating TEXT A rating on a five-star scale given for the movie.
price TEXT The cost to purchase the movie.
offer_type TEXT The type of offer available for the movie, e.g., Buy HD, Rent HD, etc.
all_offers TEXT All the available offers with prices and offer type for each.
similar_movies_id TEXT The list of movie identifiers of similar movies.
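
The Spider and Movie Item sources above suggest a Scrapy-based crawler. Assuming so, the item definition for these attributes might look like the following sketch; the class name and layout are hypothetical, with field names taken from the table above.

    import scrapy

    class MovieItem(scrapy.Item):
        # One field per extracted attribute listed above.
        id = scrapy.Field()
        name = scrapy.Field()
        year = scrapy.Field()
        genre = scrapy.Field()
        description = scrapy.Field()
        actors = scrapy.Field()
        writers = scrapy.Field()
        producers = scrapy.Field()
        directors = scrapy.Field()
        content_rating = scrapy.Field()
        rating = scrapy.Field()
        price = scrapy.Field()
        offer_type = scrapy.Field()
        all_offers = scrapy.Field()
        similar_movies_id = scrapy.Field()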

iTunes Store

HTML Files: Movies, Movie listings, Page listings, All
JSON Data: tableB.json
Sources: Spider, Movie Item, Convert, All
Number of Tuples: 17341

Extracted Attributes

Attribute Type Description
id TEXT The unique identifier of the movie in the iTunes Store database.
name TEXT Name of the movie.
year INTEGER Year of the movie release.
month TEXT Month of the movie release.
genre TEXT Genre of movie, e.g., Romance, Comedy, etc.
description TEXT Short description about the movie or the storyline.
price TEXT The cost to purchase the movie.
actors TEXT The list of lead actors in the movie.
writers TEXT The list of screenwriters of the movie.
producers TEXT The list of producers of the movie.
directors TEXT The list of directors of the movie.
content_rating TEXT Motion Picture Association of America (MPAA) film rating indicating the movie's suitability for audiences, e.g., PG-13.
rating TEXT A rating on a five-star scale given for the movie.
number_of_ratings TEXT The number of people who have rated the movie.
similar_movies_id TEXT The list of movie identifiers of similar movies.
rotten_tomatoes_tomatometer TEXT The fraction of positive reviews in Rotten Tomatoes website.
rotten_tomatoes_average_rating TEXT The average rating in Rotten Tomatoes website.

Blocking

README: Blocking Explanation
CSV data: Fragment, All
Sources: Blocking

Our tableA had ~9K tuples and tableB had ~17K tuples. The Cartesian product of these two tables would contain 9K * 17K ≈ 153M tuple pairs, far too many to match directly. We went through three iterations of our blocking strategy. In the first attempt, we used the year-of-release attribute to prune unwanted movie pairs. In the second attempt, we added movie-name length filtering on top of the first. In the final strategy, we removed the year-based filtering and instead generated candidate pairs whose movie names have at least one word in common.

Blocking based on year

The first strategy we used was to block the movies in Google Play and iTunes on the "year" the movie was released. Only movies released in the same year were compared across the two tables to form the output tableC, which contains the candidate tuple pairs that are most likely to match.

Number of tuples in tableC after blocking using the year is ~9M.

Given that the smaller of the two sources (Google Play) has only ~9K tuples, we can hope for at most ~9K matches among the candidate tuple pairs. This gives a best-case match rate of 1 tuple pair in every ~1K candidates, which means we would have to generate golden data on the order of 1K (or more) labeled pairs to test the accuracy of our matching algorithm. We therefore decided to augment this strategy with the length of the movie title.

Blocking based on year and length of the title

In the second strategy, we used both the year and the length of the movie's title for blocking. In addition to discarding candidate pairs that differ in the year of release, we also discarded pairs whose title lengths differ by more than a few characters (~5).

The number of tuples in tableC after blocking with this strategy is ~3.6M. The new strategy reduced the candidate tuple pairs to roughly 40% of those produced by the year-only strategy (9M → 3.6M). However, even with this reduction, we can only hope for 1 true tuple pair among every ~330 candidate tuple pairs.

Blocking based on common words in title and length of the title

After getting these poor blocking results, we did an empirical study on a popular movie list. We took the top 100 popular movies on IMDb and matched them manually across Google Play and iTunes to check whether our blocking strategies made sense. We found that only 33% of the movies were present in both sources. More importantly, the year-of-release blocking strategy missed close to 30% of the true movie matches, because one of the sources had the year of release wrong. We also saw that filtering on movie-name length worked well, provided we removed extraneous information at the end of the movie name.

Using the observations from the empirical study, we implemented a new strategy: we tokenized the movie title into words, removed stop words, lower-cased the tokens, and normalized Unicode characters to their simpler ASCII variants. We then built an inverted index from normalized word tokens to movies and generated the candidate movie pairs that share at least one token. We additionally used name-length filtering to prune more unwanted candidate pairs.
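
The following is a minimal sketch of this token-based blocking, not the project's actual code; the stop-word list and record layout are assumptions.

    import re
    import unicodedata
    from collections import defaultdict

    STOP_WORDS = {"the", "a", "an", "of", "and"}  # assumed stop-word list

    def tokens(title):
        # Fold accented characters to ASCII, lower-case, split on
        # non-alphanumerics, and drop stop words.
        ascii_title = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode()
        return {w for w in re.split(r"\W+", ascii_title.lower())
                if w and w not in STOP_WORDS}

    def candidate_pairs(table_a, table_b, max_len_diff=5):
        # Inverted index: normalized token -> ids of tableB movies containing it.
        index = defaultdict(set)
        names_b = {}
        for b in table_b:
            names_b[b["id"]] = b["name"]
            for tok in tokens(b["name"]):
                index[tok].add(b["id"])
        pairs = set()
        for a in table_a:
            for tok in tokens(a["name"]):
                for b_id in index[tok]:
                    # Keep pairs that share a token and whose title
                    # lengths differ by at most max_len_diff characters.
                    if abs(len(a["name"]) - len(names_b[b_id])) <= max_len_diff:
                        pairs.add((a["id"], b_id))
        return pairs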

The number of tuples in tableC using this blocking strategy is ~3.2M. Although this number is very close to that of the previous strategy, this approach is much less likely to miss correctly matching pairs in the blocking phase.

Implementation Details

We used SQLite3 as an in-memory database to perform the blocking. We created an index on tableA over the year and the length of the title. We then probed this index with each tableB tuple's year to find the movies with the same year, while also requiring the two title lengths to be within a specified range. We used a range of 10, i.e., the two titles being compared can differ in length by at most +/- 5 characters.
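
A minimal sketch of this join, with an assumed schema and illustrative names (not the project's actual code):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # in-memory SQLite database
    cur = conn.cursor()
    cur.execute("CREATE TABLE a (id TEXT, name TEXT, year INTEGER, name_len INTEGER)")
    cur.execute("CREATE TABLE b (id TEXT, name TEXT, year INTEGER, name_len INTEGER)")
    # Index tableA on (year, title length) so probes from tableB are fast.
    cur.execute("CREATE INDEX idx_a_year_len ON a (year, name_len)")
    # ... load the extracted tuples into a and b ...
    cur.execute("""
        SELECT a.id, b.id
        FROM b JOIN a
          -- NULL years are never pruned (see 'Tuples with year as NULL' below)
          ON (a.year = b.year OR a.year IS NULL OR b.year IS NULL)
         AND a.name_len BETWEEN b.name_len - 5 AND b.name_len + 5
    """)
    candidate_pairs = cur.fetchall()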

Tuples with year as NULL

Tuples whose year is NULL are compared against all the tuples in the other table. This ensures that we don't miss correct matches just because the year attribute wasn't specified for them.

Title length calculation

Movies often have the year of release at the end of the title to differentiate remakes of the same movie or different movies with the same title, but this is not done consistently. For example, Scarface (1983) appears as 'Scarface' in Google Play and as 'Scarface (1983)' in iTunes, so this candidate pair would have been discarded by our blocking strategy (their title lengths differ by more than 5 characters). To overcome this, we strip the year from movie titles using a regular expression before indexing and probing on title length.
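
For instance, a pattern along these lines (the exact regular expression we used may differ) strips a trailing parenthesized year:

    import re

    def strip_year(title):
        # Remove a trailing parenthesized four-digit year, e.g. " (1983)".
        return re.sub(r"\s*\(\d{4}\)\s*$", "", title)

    assert strip_year("Scarface (1983)") == "Scarface"
    assert strip_year("Scarface") == "Scarface"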

Matching

Golden Data

Data: CSV
Tuples: 3000
Sources: Tagging Golden Data

Interface for tagging golden data: Golden data tagger

Features

Year Match: exact match of google-play:year with itunes:year
Name 3-gram: TRIGRAM match of google-play:name with itunes:name
Actor 3-gram: TRIGRAM match of google-play:actors with itunes:actors
Directors 3-gram: TRIGRAM match of google-play:directors with itunes:directors
Name Levenshtein: edit-distance similarity of google-play:name with itunes:name

Rules

  • Year Match == 1.0 AND Name 3-gram > 0.8
  • Name 3-gram > 0.6 AND Actor 3-gram > 0.8
  • Name 3-gram > 0.6 AND Actor 3-gram > 0.55 AND Directors 3-gram > 0.9
  • Name 3-gram > 0.9 AND Directors 3-gram > 0.9
Precision: 1.00
Recall: 0.91
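
To illustrate how these rules combine, here is a minimal sketch that assumes 3-gram similarity is Jaccard similarity over character trigrams; EMS's built-in functions may compute the scores differently.

    def trigrams(s):
        s = s.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)} if len(s) >= 3 else {s}

    def trigram_sim(x, y):
        # Jaccard similarity over character 3-grams (assumed definition).
        a, b = trigrams(x), trigrams(y)
        return len(a & b) / len(a | b) if a | b else 0.0

    def is_match(gp, it):
        year = 1.0 if gp["year"] == it["year"] else 0.0
        name = trigram_sim(gp["name"], it["name"])
        actor = trigram_sim(gp["actors"], it["actors"])
        director = trigram_sim(gp["directors"], it["directors"])
        return ((year == 1.0 and name > 0.8)                          # rule 1
                or (name > 0.6 and actor > 0.8)                       # rule 2
                or (name > 0.6 and actor > 0.55 and director > 0.9)   # rule 3
                or (name > 0.9 and director > 0.9))                   # rule 4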

Actions Log

Time spent arriving at the rules: 3 hours.

Attempt 1

Add rule 1.

  1. Year Match == 1.0 AND Name 3-gram > 0.8
Precision: 1.00
Recall: 0.65

Attempt 2

Replace rule 1 in attempt 1 with rule 1 here.

  1. Year Match == 1.0 AND Name Levenshtein > 0.8
Precision: 1.00
Recall: 0.67

Attempt 3

Add rule 2 to attempt 1.

  1. Year Match == 1.0 AND Name 3-gram > 0.8
  2. Name 3-gram > 0.6 AND Actor 3-gram > 0.8
Precision: 1.00
Recall: 0.81

Attempt 4

Add rule 3 to attempt 3.

  1. Year Match == 1.0 AND Name 3-gram > 0.8
  2. Name 3-gram > 0.6 AND Actor 3-gram > 0.8
  3. Name 3-gram > 0.6 AND Directors 3-gram > 0.8
Precision: 0.91
Recall: 0.93

Attempt 5

Replace rule 3 in attempt 4 with rule 3 here.

  1. Year Match == 1.0 AND Name 3-gram > 0.8
  2. Name 3-gram > 0.6 AND Actor 3-gram > 0.8
  3. Name 3-gram > 0.6 AND Actor 3-gram > 0.75 AND Directors 3-gram > 0.8
Precision: 1.00
Recall: 0.81

Attempt 6

Replace rule 3 in attempt 5 with rule 3 here.

  1. Year Match == 1.0 AND Name 3-gram > 0.8
  2. Name 3-gram > 0.6 AND Actor 3-gram > 0.8
  3. Name 3-gram > 0.6 AND Actor 3-gram > 0.5 AND Directors 3-gram > 0.9
Precision: 0.97
Recall: 0.88

Attempt 7

Replace rule 3 in attempt 6 with rule 3 here.

  1. Year Match == 1.0 AND Name 3-gram > 0.8
  2. Name 3-gram > 0.6 AND Actor 3-gram > 0.8
  3. Name 3-gram > 0.6 AND Actor 3-gram > 0.55 AND Directors 3-gram > 0.9
Precision: 1.00
Recall: 0.86

Attempt 8

Add rule 4 to attempt 7.

  1. Year Match == 1.0 AND Name 3-gram > 0.8
  2. Name 3-gram > 0.6 AND Actor 3-gram > 0.8
  3. Name 3-gram > 0.6 AND Actor 3-gram > 0.55 AND Directors 3-gram > 0.9
  4. Name 3-gram > 0.9 AND Directors 3-gram > 0.9
Precision: 1.00
Recall: 0.91

Matching Experience with EMS

We followed the steps for matching with EMS as described by Sanjib and as written on our course webpage. We found the matching process very straightforward and did not face any major issues while using EMS. We wrote rules using the predefined functions such as 3-gram, Jaccard, and Levenshtein, and computed precision and recall. Finally, we were able to achieve a precision of 1.00 and a recall of 0.91.

Some minor issues we faced while using EMS are:

  1. We weren't able to edit the rules using the "Edit Rule" option.
  2. The predefined functions were case-sensitive while computing similarity scores.

Our suggestions for the improvement of EMS:

  • Adding simple pre-processing functions for each attribute in the feature definition, applied before the similarity-score computation (see the sketch after this list). For example, the following pre-processing functions would have been useful:
    • Converting attribute values to lower case (or upper case).
    • More generally, allowing a regular expression (or a set of regular expressions) to be run on attribute values. (Use cases: some movie titles have the year of release after the name in a standard format, inside parentheses, and removing it would boost the accuracy of the similarity score; converting telephone numbers in different formats into a single canonical format.)
    • Normalizing Unicode characters with accents to their simple ASCII versions. Some sources pay special attention to using correct Unicode characters in names, while others use plain ASCII characters without accents. (For example, the movie Amélie might be listed as Amelie.)
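
A sketch of what such a pre-processing pipeline could look like (the function and its defaults are hypothetical, not part of EMS):

    import re
    import unicodedata

    def preprocess(value, patterns=(r"\s*\(\d{4}\)\s*$",)):
        # Fold accents to ASCII (e.g., Amélie -> Amelie), lower-case,
        # then apply user-supplied regular expressions.
        value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
        value = value.lower()
        for pattern in patterns:
            value = re.sub(pattern, "", value)
        return value

    # Applied to both attribute values before the similarity computation:
    # score = trigram_sim(preprocess(a["name"]), preprocess(b["name"]))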

Overall, we had a great experience with EMS and feel that it would be really helpful for data scientists and students learning data integration if it were open-sourced.

© Shoban Chandrabose & Adalbert Gerald Soosai Raj, 2014.

CS784 Final Project.