========
Blocking
========

Our tableA had ~9K tuples and tableB had ~17K tuples. The Cartesian product of
the two tables would contain 9K * 17K = 153M tuple pairs, far too many to match
directly. We therefore perform blocking to reduce the number of tuple pairs in
tableC (the output of the blocking stage).

Blocking based on year:
=======================
The first strategy we used was to block the movies in Google Play and iTunes
on the "year" in which the movie was released. Only movies released in the same
year were paired across the two tables to form the output tableC, the set of
candidate pairs that could plausibly match.

The number of tuple pairs in tableC after blocking on the year is ~9M.
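
The idea behind year-only blocking can be illustrated with a minimal Python
sketch. This is not the implementation we used (that was done in SQLite3). The
function name ``block_by_year``, the ``(id, title, year)`` row layout, and the
toy rows are assumptions made for illustration, and rows with a missing year
are simply skipped here.

```python
from collections import defaultdict


def block_by_year(table_a, table_b):
    """Return (id_a, id_b) pairs whose release years are equal."""
    # Group tableA ids by year so each tableB row probes only one bucket.
    by_year = defaultdict(list)
    for id_a, _title_a, year_a in table_a:
        if year_a is not None:
            by_year[year_a].append(id_a)

    pairs = []
    for id_b, _title_b, year_b in table_b:
        # Rows without a year are ignored in this simplified sketch.
        if year_b is None:
            continue
        for id_a in by_year.get(year_b, []):
            pairs.append((id_a, id_b))
    return pairs


# Hypothetical toy rows: (id, title, year)
table_a = [(1, "Scarface", 1983), (2, "Heat", 1995)]
table_b = [(10, "Scarface (1983)", 1983), (11, "Casino", 1995), (12, "Alien", 1979)]
candidate_pairs = block_by_year(table_a, table_b)
```

With these toy rows, only same-year pairs survive, so "Alien" (1979) is never
compared against anything in ``table_a``.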

Since the smaller of the sources (Google Play) has only ~9K tuples, the
candidate set can contain at most ~9K true matches. With ~9M candidate pairs,
the best-case match rate is therefore about 1 matching pair in every ~1K
candidate pairs. At that rate, we would need to generate golden data on the
order of 1K pairs (or more) to test the accuracy of our matching algorithm.
Therefore, we decided to augment this strategy with the length of the movie
title.
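
The arithmetic behind these estimates can be checked with a few lines of
Python. The variable names are illustrative, the counts are the rounded figures
quoted in this section, and the bound assumes each tuple in the smaller table
matches at most one tuple in the larger one.

```python
# Rounded table sizes quoted in this section.
table_a_rows = 9_000
table_b_rows = 17_000

# Size of the full Cartesian product (no blocking).
cross_product = table_a_rows * table_b_rows

# Candidate pairs left after blocking on year (rounded figure from the text).
year_blocked_pairs = 9_000_000

# Upper bound on true matches, assuming at most one match per tuple
# of the smaller table.
max_true_matches = min(table_a_rows, table_b_rows)

# Best-case fraction of candidate pairs that are true matches.
best_case_rate = max_true_matches / year_blocked_pairs

# Equivalent view: how many candidate pairs per true match.
pairs_per_match = year_blocked_pairs // max_true_matches
```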

Blocking based on year and length of the title:
===============================================
In the second strategy, we used both the year and the length of the movie's
title for blocking. In addition to discarding candidate pairs that differ in
the year of release, we also discarded pairs whose title lengths differ by
more than a fixed number of characters (5 in our implementation).

The number of tuple pairs in tableC after blocking on the year and the title
length is ~3.6M. The new blocking strategy reduced the candidate set to roughly
40% of the ~9M candidate pairs produced by the previous strategy involving just
the year.

Implementation Details:
=======================
We used SQLite3 as an in-memory database to perform blocking. We created an
index on tableA over the year and the length of the title. For each tuple in
tableB, we probed the index with its year to find the tableA movies released in
the same year, and kept only those whose title length fell within a specified
range. We used a range of 10 characters (i.e., the lengths of the two titles
being compared can differ by at most +/- 5).
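
A minimal sketch of this index-and-probe step, using Python's standard
``sqlite3`` module, is shown below. The schema, the column name ``title_len``,
the index name, the constant ``MAX_LEN_DIFF``, and the sample rows are all
assumptions made for illustration; the actual schema and queries may differ.

```python
import sqlite3

# In-memory database, as described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INTEGER PRIMARY KEY, title TEXT,
                         year INTEGER, title_len INTEGER);
    CREATE TABLE tableB (id INTEGER PRIMARY KEY, title TEXT,
                         year INTEGER, title_len INTEGER);
    -- Composite index on tableA: equality on year, range on title length.
    CREATE INDEX idx_a_year_len ON tableA (year, title_len);
""")

# Hypothetical sample rows: (id, title, year)
rows_a = [(1, "Scarface", 1983), (2, "Heat", 1995)]
rows_b = [(10, "Scarface", 1983), (11, "Return of the Jedi", 1983),
          (12, "Heat", 1995)]
for table, rows in (("tableA", rows_a), ("tableB", rows_b)):
    conn.executemany(
        f"INSERT INTO {table} (id, title, year, title_len) VALUES (?, ?, ?, ?)",
        [(i, t, y, len(t)) for i, t, y in rows],
    )

MAX_LEN_DIFF = 5  # titles may differ in length by at most +/- 5 characters

# Probe tableA with each tableB row's year and title-length window.
table_c = conn.execute(
    """
    SELECT a.id, b.id
    FROM tableB AS b
    JOIN tableA AS a
      ON a.year = b.year
     AND a.title_len BETWEEN b.title_len - ? AND b.title_len + ?
    ORDER BY a.id, b.id
    """,
    (MAX_LEN_DIFF, MAX_LEN_DIFF),
).fetchall()
```

In this toy data, "Return of the Jedi" shares the year 1983 with "Scarface" but
is 10 characters longer, so that pair is dropped by the length window.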

Tuples with year as NULL:
-------------------------
Tuples whose year is NULL are compared with all the tuples in the other table.
This ensures that we do not miss any correct matches merely because the year
attribute was not specified for them.
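
One way to express this rule in SQL is to let a NULL year on either side
satisfy the join condition. The sketch below is an illustration under assumed
table layouts and sample rows, not the query we actually ran; it also leaves
out the title-length filter to keep the NULL handling easy to see.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE tableB (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
""")

# Hypothetical rows; None is stored as SQL NULL.
conn.executemany("INSERT INTO tableA VALUES (?, ?, ?)",
                 [(1, "Scarface", 1983), (2, "Heat", None)])
conn.executemany("INSERT INTO tableB VALUES (?, ?, ?)",
                 [(10, "Scarface", 1983), (11, "Heat", 1995),
                  (12, "Alien", None)])

# "a.year = b.year" evaluates to NULL (not true) when either year is NULL,
# so the IS NULL checks are needed to pair such rows with every row of
# the other table.
pairs_with_null = conn.execute("""
    SELECT a.id, b.id
    FROM tableA AS a
    JOIN tableB AS b
      ON a.year = b.year OR a.year IS NULL OR b.year IS NULL
    ORDER BY a.id, b.id
""").fetchall()
```

Here the tableA row with a NULL year is paired with all three tableB rows, and
the tableB row with a NULL year is paired with both tableA rows.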

Title length calculation:
-------------------------
Movie titles often end with the year of release, to differentiate between
re-makes of the same movie or different movies with the same title. However,
this is not always done. For example, the 1983 movie Scarface appears as
'Scarface' in Google Play and as 'Scarface (1983)' in iTunes. This candidate
pair would have been discarded by our blocking strategy, since the title
lengths differ by more than 5 characters. To overcome this issue, we strip the
year from movie titles using a regular expression before indexing and probing
on title lengths.
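
A minimal sketch of the year-stripping step follows. The exact regular
expression we used is not reproduced here; the pattern below is an assumption
that only handles a four-digit year (19xx or 20xx) in parentheses at the end of
the title, and the helper names ``strip_year`` and ``blocking_length`` are
hypothetical.

```python
import re

# Assumed pattern: optional whitespace, "(YYYY)" with YYYY in 1900-2099,
# optional trailing whitespace, anchored at the end of the title.
_TRAILING_YEAR = re.compile(r"\s*\((?:19|20)\d{2}\)\s*$")


def strip_year(title):
    """Remove a trailing parenthesized year such as ' (1983)' from a title."""
    return _TRAILING_YEAR.sub("", title)


def blocking_length(title):
    """Title length used for blocking, computed after stripping the year."""
    return len(strip_year(title))


google_play_title = "Scarface"
itunes_title = "Scarface (1983)"
length_gap_before = abs(len(google_play_title) - len(itunes_title))
length_gap_after = abs(
    blocking_length(google_play_title) - blocking_length(itunes_title)
)
```

For the Scarface example, the raw lengths differ by 7 characters, which exceeds
the threshold of 5, while the stripped lengths are identical. Titles in which
the year is part of the name, such as "2001: A Space Odyssey", are left
unchanged by this pattern.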