One-stop place to search for movies to rent or stream. Currently, it supports Google Play and the iTunes Store.
Google Play is a digital distribution platform operated by Google. In addition to serving as the app store for the Android operating system, it serves as a digital media store selling music, movies, television shows, books, and magazines.
The iTunes Store is a digital media store operated by Apple Inc. selling music, movies, television shows, audiobooks, etc. Apple claims that the iTunes Store is the most popular movie store, containing over 45,000 films.
Attribute | Type | Description |
---|---|---|
id | TEXT | The unique identifier of the movie in the Google Play database. |
name | TEXT | Name of the movie. |
year | INTEGER | Year of the movie release. |
genre | TEXT | Genre of the movie, e.g., Romance, Comedy, etc. |
description | TEXT | Short description of the movie or its storyline. |
actors | TEXT | The list of lead actors in the movie. |
writers | TEXT | The list of screenwriters of the movie. |
producers | TEXT | The list of producers of the movie. |
directors | TEXT | The list of directors of the movie. |
content_rating | TEXT | Motion Picture Association of America film rating, indicating the suitability of the movie for an audience, e.g., PG-13. |
rating | TEXT | A rating on a five-star scale given for the movie. |
price | TEXT | The cost to purchase the movie. |
offer_type | TEXT | The type of offer available for the movie, e.g., Buy HD, Rent HD, etc. |
all_offers | TEXT | All the available offers, with the price and offer type of each. |
similar_movies_id | TEXT | The list of movie identifiers of similar movies. |
Attribute | Type | Description |
---|---|---|
id | TEXT | The unique identifier of the movie in the iTunes Store database. |
name | TEXT | Name of the movie. |
year | INTEGER | Year of the movie release. |
month | TEXT | Month of the movie release. |
genre | TEXT | Genre of the movie, e.g., Romance, Comedy, etc. |
description | TEXT | Short description of the movie or its storyline. |
price | TEXT | The cost to purchase the movie. |
actors | TEXT | The list of lead actors in the movie. |
writers | TEXT | The list of screenwriters of the movie. |
producers | TEXT | The list of producers of the movie. |
directors | TEXT | The list of directors of the movie. |
content_rating | TEXT | Motion Picture Association of America film rating, indicating the suitability of the movie for an audience, e.g., PG-13. |
rating | TEXT | A rating on a five-star scale given for the movie. |
number_of_ratings | TEXT | The number of people who have rated the movie. |
similar_movies_id | TEXT | The list of movie identifiers of similar movies. |
rotten_tomatoes_tomatometer | TEXT | The fraction of positive reviews on the Rotten Tomatoes website. |
rotten_tomatoes_average_rating | TEXT | The average rating on the Rotten Tomatoes website. |
Our tableA had ~9K tuples and tableB had ~17K tuples. The Cartesian product of these two tables would contain 9K * 17K = 153M tuples, far too many to compare exhaustively. We went through three iterations of our blocking strategy. In the first attempt, we used the year-of-release attribute to prune unwanted movie pairs. In the second attempt, we added movie-name-length filtering on top of the first. In the final strategy, we dropped the year-of-release filtering and instead generated candidate pairs that share at least one word in the movie name.
The first strategy we used was to block the movies in Google Play and iTunes on the year the movie was released. Thus, only movies released in the same year were compared across the two tables to form the output tableC, which contains the tuple pairs that are most likely to match.
The number of tuples in tableC after blocking on the year is ~9M.
Given that the smaller of the sources (Google Play) has only ~9K tuples, we can hope for at most ~9K matches among the candidate tuple pairs. This gives a best-case match rate of 1 tuple pair in every ~1K candidate tuple pairs, which means we would have to label golden data on the order of 1K pairs (or more) to test the accuracy of our matching algorithm. We therefore decided to augment this strategy with the length of the movie title.
In the second strategy, we used both the year and the length of the movie's title for blocking. In addition to discarding candidate pairs that differ in the year of release, we also discarded pairs whose title lengths differ by more than a small number of characters (~5).
The number of tuples in tableC after blocking with this strategy is ~3.6M. The new blocking strategy cut the candidate tuple pairs by roughly 60% compared with the year-only strategy (from ~9M down to ~3.6M). Even with this reduction, however, we can at best hope for 1 true match among every ~400 candidate tuple pairs.
After these poor blocking results, we ran an empirical study on a popular movie list. We took the top 100 popular movies on IMDb and matched them manually between Google Play and iTunes to see whether our blocking strategies made sense. Only 33% of the movies were present in both sources. More importantly, the blocking strategy based on year of release missed close to 30% of the movie matches, because one of the sources had the year of release wrong. We also saw that filtering on movie-name length worked well, provided we removed extraneous information at the end of the movie name.
Using the observations from the empirical study, we implemented a new strategy: we tokenized the movie title into words, removed stop words, lower-cased the tokens, and normalized Unicode characters to their simpler ASCII variants. We then built an inverted index from normalized word tokens to movies and generated candidate movie pairs that have at least one token in common. We additionally applied the name-length filter to prune more unwanted candidate pairs, as shown in the sketch below.
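The sketch below illustrates this strategy. It is an assumed reconstruction, not our exact code: the (id, name) table layout, the stop-word list, and the function names are all hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "and"}  # assumed stop-word list

def normalize_tokens(title):
    # Fold Unicode characters to their closest ASCII equivalents,
    # lower-case, split into words, and drop stop words.
    ascii_title = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode()
    return [t for t in re.findall(r"[a-z0-9]+", ascii_title.lower())
            if t not in STOP_WORDS]

def candidate_pairs(table_a, table_b, max_len_diff=5):
    # Inverted index: normalized token -> movies from tableA.
    index = defaultdict(list)
    for movie_id, name in table_a:
        for token in set(normalize_tokens(name)):
            index[token].append((movie_id, len(name)))
    # Probe with tableB: keep pairs sharing at least one token.
    pairs = set()
    for movie_id_b, name_b in table_b:
        for token in set(normalize_tokens(name_b)):
            for movie_id_a, len_a in index.get(token, []):
                # Name-length filter carried over from the second strategy.
                if abs(len_a - len(name_b)) <= max_len_diff:
                    pairs.add((movie_id_a, movie_id_b))
    return pairs
```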
The number of tuples in tableC under this blocking strategy is ~3.2M. Although this number is very close to that of the previous strategy, this strategy is far less likely to miss correctly matching pairs during the blocking phase.
We used SQLite3 as an in-memory database to perform the blocking. We created an index on tableA over the year and the length of the title, then probed the index with the year attribute of each tuple in tableB to find movies with the same year whose title length falls within a specified range. We used a range of 10 (i.e., the two titles being compared can differ by at most +/- 5 characters).
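A minimal sketch of this step, assuming a simplified schema with a precomputed name_len column (loading code omitted; the table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database
conn.executescript("""
    CREATE TABLE tableA (id TEXT, name TEXT, year INTEGER, name_len INTEGER);
    CREATE TABLE tableB (id TEXT, name TEXT, year INTEGER, name_len INTEGER);
    -- Index probed during blocking.
    CREATE INDEX idx_a_year_len ON tableA (year, name_len);
""")

# ... load both tables here ...

# Probe: same release year, title lengths within +/- 5 characters.
candidates = conn.execute("""
    SELECT a.id, b.id
    FROM tableB AS b
    JOIN tableA AS a
      ON a.year = b.year
     AND a.name_len BETWEEN b.name_len - 5 AND b.name_len + 5
""").fetchall()
```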
For tuples whose year is NULL, our strategy was to compare them against all tuples in the other table. This ensured that we did not miss any correct matches simply because the year attribute was not specified.
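In terms of the sketch above, the year comparison reduces to a predicate like this hypothetical helper, where a missing year matches any year:

```python
def years_compatible(year_a, year_b):
    # A NULL (None) year never rules a pair out; such tuples are
    # compared against every tuple in the other table.
    if year_a is None or year_b is None:
        return True
    return year_a == year_b
```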
Movies often carry the year of release at the end of the title to differentiate remakes of the same movie or different movies that share a title. However, this is not done consistently. For example, the 1983 film Scarface appears as 'Scarface' in Google Play but as 'Scarface (1983)' in iTunes, so this candidate pair would have been discarded by our blocking strategy (the title lengths differ by more than 5 characters). To overcome this, we strip the year from the movie title using a regular expression before indexing and probing on title length.
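The exact regular expression we used is not reproduced here; a pattern along these lines (an assumption) would do the stripping:

```python
import re

def strip_year(title):
    # Remove a trailing parenthesized year: "Scarface (1983)" -> "Scarface".
    return re.sub(r"\s*\((19|20)\d{2}\)\s*$", "", title)

assert strip_year("Scarface (1983)") == "Scarface"
assert strip_year("Scarface") == "Scarface"
```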
[Screenshot: interface for tagging golden data]
- google-play:year with itunes:year
- google-play:name with itunes:name
- google-play:actors with itunes:actors
- google-play:directors with itunes:directors
- google-play:name with itunes:name
Time spent on arriving at the rules: 3 hours.
1. Add rule 1.
2. Replace rule 1 in attempt 1 with rule 1 here.
3. Add rule 2 to attempt 1.
4. Add rule 3 to attempt 3.
5. Replace rule 3 in attempt 4 with rule 3 here.
6. Replace rule 3 in attempt 5 with rule 3 here.
7. Replace rule 3 in attempt 6 with rule 3 here.
8. Add rule 4 to attempt 7.
We followed the steps for matching with EMS as described by Sanjib and as written on our course webpage. We found the matching process very straightforward and did not face any major issues while using EMS. We wrote rules using the predefined functions such as 3-gram, Jaccard, and Levenshtein similarity, and then computed precision and recall. Finally, we achieved a precision of 1.0 and a recall of 0.91.
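For illustration, a rule over the name attribute in the spirit of the ones we wrote might look like the plain-Python sketch below; EMS's actual rule syntax and our thresholds are not shown, so the function names and the 0.8 cutoff are assumptions.

```python
def ngrams(s, n=3):
    # Character 3-grams of a lower-cased string.
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def jaccard(a, b):
    # Jaccard similarity of two sets (0.0 when both are empty).
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def name_rule(name_a, name_b, threshold=0.8):
    # Declare a match when the titles' 3-gram sets are similar enough.
    return jaccard(ngrams(name_a), ngrams(name_b)) >= threshold
```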
Some minor issues we faced while using EMS are:
Our suggestions for the improvement of EMS:
Overall, we had a great experience with EMS and feel that it would be really helpful for data scientists and students learning data integration if it were open-sourced.
© Shoban Chandrabose & Adalbert Gerald Soosai Raj, 2014.
CS784 Final Project.