========
Blocking
========

Our tableA had ~9K tuples and tableB had ~17K tuples. The Cartesian product of the two tables would contain 9K * 17K = 153M tuple pairs, which is far too many to match directly. We therefore perform blocking to reduce the number of tuple pairs in tableC (the output of the blocking stage).

Blocking based on year:
=======================

The first strategy we used was to block the movies in Google Play and iTunes on the "year" in which the movie was released. Only movies released in the same year were paired across the two tables to form the output tableC, which contains the tuple pairs that are most likely to match.

The number of tuple pairs in tableC after blocking on the year is ~9M. Given that the smaller of the sources (Google Play) has only ~9K tuples, we can hope for at most ~9K matches among the candidate tuple pairs. This gives a best-case match rate of 1 matching pair in every ~1K candidate pairs, which means we would have to generate golden data on the order of 1K pairs (or more) to test the accuracy of our matching algorithm. We therefore decided to augment this strategy with the length of the movie title.

Blocking based on year and length of the title:
===============================================

In the second strategy, we used both the year and the length of the movie's title for blocking. In addition to discarding candidate pairs that differ in the year of release, we also discarded pairs whose title lengths differ by more than a fixed number of characters (~5).

The number of tuple pairs in tableC after blocking on the year and the title length is ~3.6M. The new blocking strategy reduced the candidate tuple pairs to roughly 40% of those produced by the previous, year-only strategy (from ~9M to ~3.6M).

Implementation Details:
=======================

We used an in-memory SQLite3 database to perform blocking.
We created an index on tableA over the year and the length of the title. We then probed the index with the year attribute of each tuple in tableB to find the movies with the same year, and we also checked that the length of the movie's title falls within a specified range. We used a range of 10 (i.e., the lengths of the two titles being compared can differ by at most 5 characters in either direction).

Tuples with year as NULL:
-------------------------

A tuple whose year is NULL is compared with all the tuples in the other table. This ensures that we do not miss any correct matches just because the year attribute was not specified.

Title length calculation:
-------------------------

Movie titles often carry the year of release at the end to differentiate between remakes of the same movie or different movies with the same title. However, this is not done consistently. For example, the movie Scarface (1983) appears as 'Scarface' in Google Play and as 'Scarface (1983)' in iTunes. Our blocking strategy would have discarded this candidate pair, since the title lengths (8 and 15 characters) differ by more than 5 characters. To overcome this issue, we strip the year from the movie title using a regular expression before indexing and probing on title length.
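
The implementation details above can be sketched roughly as follows. This is only an illustration under stated assumptions, not our actual code: the schema (id, title, year, title_len), the names strip_year, block and idx_a_year_len, and the regular expression for a trailing "(YYYY)" are all hypothetical. The sketch also assumes that the title-length filter is not applied to tuples whose year is NULL, which the description above leaves open.

```python
import re
import sqlite3

# Assumed pattern: a four-digit year in parentheses at the end of the title,
# e.g. "Scarface (1983)". The pattern actually used may differ.
YEAR_SUFFIX = re.compile(r"\s*\(\d{4}\)\s*$")


def strip_year(title):
    """Remove a trailing '(YYYY)' so it does not distort the title length."""
    return YEAR_SUFFIX.sub("", title)


def block(table_a, table_b, max_len_diff=5):
    """Return candidate (a_id, b_id) pairs.

    table_a and table_b are lists of (id, title, year); year may be None.
    """
    conn = sqlite3.connect(":memory:")  # in-memory SQLite3 database
    for name, rows in (("tableA", table_a), ("tableB", table_b)):
        conn.execute(
            f"CREATE TABLE {name} "
            "(id INTEGER, title TEXT, year INTEGER, title_len INTEGER)"
        )
        # Title length is computed after stripping the year suffix.
        conn.executemany(
            f"INSERT INTO {name} VALUES (?, ?, ?, ?)",
            [(i, t, y, len(strip_year(t))) for i, t, y in rows],
        )

    # Index on tableA over (year, title_len), probed with tuples of tableB.
    conn.execute("CREATE INDEX idx_a_year_len ON tableA (year, title_len)")

    pairs = conn.execute(
        """
        -- Same year and title lengths within +/- max_len_diff.
        SELECT a.id, b.id
        FROM tableB AS b
        JOIN tableA AS a
          ON a.year = b.year
         AND a.title_len BETWEEN b.title_len - ? AND b.title_len + ?
        UNION
        -- A NULL year on either side: pair with every tuple of the other table.
        SELECT a.id, b.id
        FROM tableB AS b, tableA AS a
        WHERE a.year IS NULL OR b.year IS NULL
        """,
        (max_len_diff, max_len_diff),
    ).fetchall()
    conn.close()
    return sorted(pairs)
```

With this sketch, 'Scarface' and 'Scarface (1983)' both get a title length of 8 and survive blocking when their years agree, while a tuple with a NULL year is paired with every tuple in the other table.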