Log of Actions for matching video games.

1. Before we created the golden data, we realized that our initial candidate set was too large. Previously, we were blocking only on the developer of the game, under the assumption that two games with a common developer could possibly be a match. This assumption was too weak on its own: some developers (e.g. Ubisoft) create many games, so our candidate set was very big.

2. We decided to revisit blocking to reduce the size of our candidate set. We kept the assumption that two games must share a developer to be a match, and we added two more. The first was that two games must share at least one word in the name to be a match, after filtering out words that are common across all games. The second was that two games cannot be a match unless they are on the same platform, such as Xbox 360 or PlayStation 3. Two games can have the same name and still be for different platforms (e.g. Assassin's Creed for Xbox 360 and Assassin's Creed for PlayStation 3), and we treat those as different games. These two assumptions drastically reduced the size of our candidate set, because they eliminated pairs that shared no words in the title and pairs that were on different platforms, which together made up the majority of the tuples.

3. We then converted both of the base tables into .csv format. We ended up doing a lot of preprocessing on the data so that it was easier to work with. For example, we removed non-alphanumeric characters from the video game titles because those characters could interfere with matching two titles. We also ran into an issue where the id attribute needed to be the first attribute when uploading the csv files to EMS, which was quickly fixed by re-ordering the attributes. We also transformed the date strings provided in the data into timestamps represented as longs so that we could compare dates more easily. This also allowed us to check whether two dates fall within a specific range of each other rather than only testing for an exact match.

4. We then created the golden data by randomly sampling the candidate set and showing each sampled tuple pair to the user. The user was presented with the names and release dates of the two games and had to decide whether the games were a match by typing 'yes' or 'no' into the console.

5. We then began creating rules for matching the video games. The first rule we came up with was to match games whose titles are exactly the same. This was an obvious choice: blocking already guarantees that the two games are for the same platform, so sharing an exact name means they must be the same game. This rule had precision equal to 1, but it had poor recall, because not all matching games in our data had the same name; some of the names were slightly different. To correct this, we decided to look at the release date. We decided that if two games were released within 4 days of each other, then they must be a match. We first tested a threshold of 2 days and found that recall was not as high as we had hoped, so we increased the threshold to 4 days. We thought this was an appropriate assumption because we already know from our blocking technique that the two games share a common word, a common developer, and a platform, and we thought it highly unlikely that two different games with those properties would be released within 4 days of each other.
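
The blocking assumptions from step 2, the preprocessing from step 3, and the two matching rules from step 5 can be sketched roughly as follows. This is only an illustrative Python sketch under stated assumptions, not the code or EMS configuration we actually used: the field names (`title`, `developer`, `platform`, `release_date`), the `COMMON_WORDS` list, the `YYYY-MM-DD` date format, and all function names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Hypothetical list of words treated as common across all games (assumption).
COMMON_WORDS = {"the", "of", "a", "and", "edition"}
DAY_MS = 24 * 60 * 60 * 1000  # milliseconds in one day


def clean_title(title):
    """Lower-case a title and drop every character that is not a letter, digit or space."""
    stripped = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return " ".join(stripped.split())  # collapse repeated spaces


def to_timestamp_ms(date_str, fmt="%Y-%m-%d"):
    """Convert a date string to a long (milliseconds since the epoch, UTC).

    The input format is an assumption; adjust fmt to the real data.
    """
    dt = datetime.strptime(date_str, fmt).replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def title_tokens(title):
    """Words of the cleaned title, excluding the common words."""
    return set(clean_title(title).split()) - COMMON_WORDS


def passes_blocking(a, b):
    """Keep a pair only if it shares developer, platform and at least one title word."""
    return (
        a["developer"] == b["developer"]
        and a["platform"] == b["platform"]
        and bool(title_tokens(a["title"]) & title_tokens(b["title"]))
    )


def is_match(a, b, max_days=4):
    """Apply the two rules to a pair that already passed blocking.

    Rule 1: exact title match after cleaning.
    Rule 2: release dates at most max_days apart.
    """
    if clean_title(a["title"]) == clean_title(b["title"]):
        return True
    gap = abs(to_timestamp_ms(a["release_date"]) - to_timestamp_ms(b["release_date"]))
    return gap <= max_days * DAY_MS
```

In this sketch `is_match` is meant to be called only on pairs for which `passes_blocking` returned true, since the release-date rule relies on the shared developer, platform and title word that blocking guarantees. Changing `max_days` from 2 to 4 corresponds to the threshold adjustment described in step 5.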