Assignment 3

Due Thursday, October 11 at 11:59pm

Overview

This assignment will increase your understanding of data analytics frameworks and the factors influencing their performance.  You will create a small MapReduce-style deployment using Hadoop in Amazon EC2 and analyze its performance under different operating scenarios.  You will produce a short report detailing your performance observations and takeaways.

Learning Outcomes

After completing this assignment, students should:

  1. Be able to estimate the performance of a MapReduce-style data analytics framework under specific operating parameters
  2. Be able to critique the parameter choices for a specific MapReduce-style deployment, given details on the data analysis being conducted and the operating environment

Using Hadoop in EC2

Launching Instances

We have provided an Amazon Machine Image (AMI) with all the software you will need to complete this assignment pre-installed.  This image includes:

  1. An installation of Hadoop 1.0.3 in /opt/hadoop-1.0.3 (this path is stored in the HADOOP_HOME environment variable)
  2. Sample data files (7 books from Project Gutenberg) in /hadoop/books
  3. An automated cluster configuration script, config_cluster.sh, in /home/ubuntu

You should use this AMI when you launch instances.  You can find the AMI by searching for “CS838 Hadoop” or the AMI ID “ami-7f299b16” on the Community AMIs tab of the Launch Instances wizard.

You should use either medium or high-cpu medium instances for this assignment.  Medium instances cost $0.16 per instance per hour and high-cpu medium instances cost $0.165 per instance per hour in the US East Region.  The maximum cluster size you should use is 6 instances: 1 master plus 5 slaves.  It costs about $1 per hour to run a cluster of 6 instances.  We expect you will run your cluster for less than 24 hours in total, so the maximum grant usage you should incur for this assignment is $25.  You should stop (to preserve disk state) or terminate instances when you are not using them.

When you launch an instance, you should change the size of the root EBS volume to 15GB to give you more space for storing data.

Also, when you launch an instance, you should add a tag to the instance to indicate whether it is a master or slave.  The key for the tag should be “Type” and the value should be “master” or “slave”.  You should tag only one instance as a master and at least one instance as a slave.
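
If you forget to add a tag when launching, you can also add it afterward, either from the Tags tab in the EC2 console or, assuming the EC2 command line tools are configured with your access keys, with a command along the lines of the following (where i-xxxxxxxx stands in for the instance ID):

ubuntu@master$ ec2-create-tags i-xxxxxxxx --tag Type=slave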

You should start with a cluster that contains only two instances: one master and one slave.

Launching the Cluster

The automated configuration script we have provided uses the EC2 command line tools.  These tools need to know the “access key” and “secret key” associated with your AWS account.  These keys are listed under the “Access Keys” tab in the “Access Credentials” section of the “Security Credentials” page (https://portal.aws.amazon.com/gp/aws/securityCredentials).  Before you run the configuration script on your master instance, you need to store the keys in the environment variables AWS_ACCESS_KEY and AWS_SECRET_KEY.  You can either export them directly on the command line or add them to the .bashrc file on the instance:

ubuntu@master$ export AWS_ACCESS_KEY=your_AWS_ACCESS_KEY_ID
ubuntu@master$ export AWS_SECRET_KEY=your_AWS_SECRET_KEY
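
If you prefer the .bashrc option, appending the same export lines and reloading the file works; for example (the key values are placeholders):

ubuntu@master$ echo 'export AWS_ACCESS_KEY=your_AWS_ACCESS_KEY_ID' >> ~/.bashrc
ubuntu@master$ echo 'export AWS_SECRET_KEY=your_AWS_SECRET_KEY' >> ~/.bashrc
ubuntu@master$ source ~/.bashrc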

Once you have launched a master instance and one or more slave instances, you can run the configuration script on the master instance:

ubuntu@master$ ./config_cluster.sh

Before running your cluster for the first time, you need to format the Hadoop Distributed File System (HDFS) by running the following on the master instance:

ubuntu@master$ hadoop namenode -format

Now, you can start HDFS and MapReduce:

ubuntu@master$ start-dfs.sh
ubuntu@master$ start-mapred.sh

You can verify things are running correctly by running the command jps on the master and slave(s).  On the master, you should see the processes NameNode, SecondaryNameNode, and JobTracker.  On the slave(s), you should see the processes DataNode and TaskTracker.
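
For example, running jps on the master should produce output along these lines (the process IDs shown here are made up and will differ on your instances):

ubuntu@master$ jps
1282 NameNode
1463 SecondaryNameNode
1551 JobTracker
1694 Jps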

Working with Data

The AMI we provided includes sample data files (7 books from Project Gutenberg) in the /hadoop/books directory.  However, before we can run MapReduce jobs using this data, we need to import the data into the Hadoop Distributed File System (HDFS):

ubuntu@master$ hadoop dfs -copyFromLocal /hadoop/books/ /books

You can confirm the data was imported successfully by listing the files in HDFS:

ubuntu@master$ hadoop dfs -ls /books

To recursively remove data from HDFS:

ubuntu@master$ hadoop dfs -rmr /books

To cat a single file in HDFS:

ubuntu@master$ hadoop dfs -cat /books/pg132.txt | less

You can change the replication level of data already stored in HDFS using the -setrep option.  The replication level is given as a numeric argument, the -w flag waits for the re-replication to complete, and the -R option applies the change recursively.  By default, data is replicated according to the value of the dfs.replication property in hdfs-site.xml.

ubuntu@master$ hadoop dfs -setrep -w 3 /books
ubuntu@master$ hadoop dfs -setrep -R -w 3 /books

If something goes wrong, or when you change a parameter in hdfs-site.xml (discussed below), you can erase all data in HDFS and start from scratch by reformatting it.  You should make sure both HDFS and MapReduce are stopped before performing the format.

ubuntu@master$ stop-mapred.sh
ubuntu@master$ stop-dfs.sh
ubuntu@master$ hadoop namenode -format

Running a Data Analysis

Hadoop includes several example jobs that we will use.  

You can run the wordcount example as follows:

ubuntu@master$ hadoop jar $HADOOP_HOME/hadoop-examples-1.0.3.jar wordcount /books /output/testrun

Note that the output directory must not already exist in HDFS.  You can either use a different output directory every time you run a MapReduce job, or you can delete the output from HDFS:

ubuntu@master$ hadoop dfs -rmr /output/testrun

Hadoop outputs its status and counters on STDERR.  To save this output to a file (in the regular file system, not HDFS) while simultaneously seeing it in your shell session, you can use the tee command:

ubuntu@master$ hadoop jar $HADOOP_HOME/hadoop-examples-1.0.3.jar wordcount /books /output/testrun 2>&1 | tee testrun.log

Additional information on the job can be accessed after it completes.  You should specify the same output directory you used when you ran the job.

ubuntu@master$ hadoop job -history all /output/testrun

An even better example is TeraSort.  

First, you need to generate data using teragen.  The parameters are how many 100-byte records to generate and where to store the data.  This generates about 1GB of data:

ubuntu@master$ hadoop jar $HADOOP_HOME/hadoop-examples-1.0.3.jar teragen 10000000 /teradata

Second, you need to sort the data using terasort.  The parameters are the directory containing the input data and the directory to store the output data.  This runs terasort on the data we just generated:

ubuntu@master$ hadoop jar $HADOOP_HOME/hadoop-examples-1.0.3.jar terasort /teradata /terasorted

Optionally, you can verify the sort using teravalidate:

ubuntu@master$ hadoop jar $HADOOP_HOME/hadoop-examples-1.0.3.jar teravalidate /terasorted /teravalidated

After each of the jobs completes, you can view more details about its execution:

ubuntu@master$ hadoop job -history /teradata
ubuntu@master$ hadoop job -history /terasorted
ubuntu@master$ hadoop job -history /teravalidated

Experimenting with MapReduce

Once you have a basic cluster running, you should use the configurations listed below to answer each of the questions.  To answer the questions, you will need to run MapReduce jobs with different sized clusters (up to 5 slaves), different sized instances (medium and high-cpu medium), different HDFS parameters, and different MapReduce parameters.  The Hadoop Configuration section that follows will help you with this process.  You should use the details reported by hadoop job -history to guide your analysis; a note after the question list suggests one simple way to also record end-to-end run times.

  1. Run wordcount on the set of provided sample data (the 7 books) using a cluster of medium instances.  Use the following parameter values: dfs.replication=1, dfs.block.size=67108864, mapred.map.tasks=1, mapred.reduce.tasks=1, mapred.reduce.parallel.copies=5, mapred.tasktracker.map.tasks.maximum=1, mapred.tasktracker.reduce.tasks.maximum=1.
     1. Use 1, 2, 3, 4, and 5 slaves.
        Does performance scale linearly with an increasing number of slaves?
     2. Use 5 slaves.  Run the job with dfs.replication values of 1 and 3.
        How does the replication level influence performance?
  2. Run teragen to generate 10 million records using a cluster with 5 slaves that are medium instances.  Use the following parameter values: dfs.replication=1, dfs.block.size=67108864, mapred.map.tasks=1, mapred.reduce.tasks=1, mapred.reduce.parallel.copies=5, mapred.tasktracker.map.tasks.maximum=1, mapred.tasktracker.reduce.tasks.maximum=1.
     1. Run the job with dfs.block.size values of 1048576, 33554432, 67108864, and 268435456.
        How does the performance change with different block sizes?
     2. Run the job with dfs.block.size=67108864 and dfs.replication values of 1 and 3.
        Does the replication level influence performance the same way it did for the wordcount job?
  3. Run terasort on 10 million records generated using teragen.  Use a cluster with 5 slaves that are medium instances.  Use the following parameter values: dfs.replication=1, dfs.block.size=67108864, mapred.map.tasks=1, mapred.reduce.tasks=1, mapred.reduce.parallel.copies=5, mapred.tasktracker.map.tasks.maximum=1, mapred.tasktracker.reduce.tasks.maximum=1.
     1. Run the job with mapred.map.tasks values of 5, 10, and 15.
        How does the number of map tasks influence performance?
     2. Run the job with mapred.map.tasks=5 and mapred.tasktracker.map.tasks.maximum values of 1, 2, and 3.
        How does the number of tasks per slave influence performance?
     3. Run the job with mapred.map.tasks=1, mapred.tasktracker.map.tasks.maximum=1, and mapred.reduce.tasks values of 1, 5, and 10.
        How does the number of reduce tasks influence performance?
  4. Predict what values you should use for each of the parameters to get the best performance when running teragen and terasort (with 10 million records) on a cluster with 5 slaves that are high-cpu medium instances.
     1. Run these jobs on a cluster with that configuration.
        How did the jobs perform?  Explain.
     2. Change one or more of the parameters to try to improve the performance, and run the jobs a second time.
        Did the performance improve?  Why or why not?
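
As mentioned above, one simple way to record the end-to-end run time of each configuration, in addition to the per-phase details from hadoop job -history, is to wrap the job in the shell's time builtin and tee its output to a descriptively named log file.  A sketch (the log file name is just a suggestion):

ubuntu@master$ time hadoop jar $HADOOP_HOME/hadoop-examples-1.0.3.jar terasort /teradata /terasorted 2>&1 | tee terasort_5slaves_rep1.log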

Hadoop Configuration

Changing Cluster Size

When you want to change the cluster’s configuration, you first need to stop MapReduce and HDFS by running the following on the master:

ubuntu@master$ stop-mapred.sh
ubuntu@master$ stop-dfs.sh

To increase the size of the cluster, launch another EC2 instance with a “Type” tag of “slave”.   Then, run the configuration script on the master instance and start HDFS and MapReduce:

ubuntu@master$ ./config_cluster.sh
ubuntu@master$ start-dfs.sh
ubuntu@master$ start-mapred.sh

Changing Data Parameters

You can change parameters related to HDFS in the configuration file: $HADOOP_HOME/conf/hdfs-site.xml.  You should make sure HDFS and MapReduce are stopped before you change this file.  After making changes, format HDFS and re-run the config_cluster.sh script before starting HDFS and MapReduce.

ubuntu@master$ hadoop namenode -format
ubuntu@master$ ./config_cluster.sh
ubuntu@master$ start-dfs.sh
ubuntu@master$ start-mapred.sh

Properties to change:

  1. dfs.replication (must be less than or equal to the number of slaves)
  2. dfs.block.size (default is 67108864 bytes)
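
Each property is set with a <property> element inside the file's <configuration> block.  For example, to request a replication level of 3 and a 32MB block size (values chosen purely for illustration), hdfs-site.xml would contain entries like the following, alongside any properties already present in the file:

  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>33554432</value>
  </property>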

Changing Compute Parameters

You can change parameters related to running MapReduce jobs in the MapReduce configuration file: $HADOOP_HOME/conf/mapred-site.xml.  You should make sure HDFS and MapReduce are stopped before you change this file.  After making changes, re-run the config_cluster.sh script before starting HDFS and MapReduce.

ubuntu@master$ ./config_cluster.sh
ubuntu@master$ start-dfs.sh
ubuntu@master$ start-mapred.sh

Properties to change:

  1. mapred.map.tasks (Hadoop may change the number of map tasks based on the input data, so the number of map tasks for a job may not always match this parameter)
  2. mapred.reduce.tasks (default is 1; a value of 0 skips the reduce step)
  3. mapred.reduce.parallel.copies (default is 5)
  4. mapred.tasktracker.map.tasks.maximum (determines the number of simultaneous map tasks running on a single slave)
  5. mapred.tasktracker.reduce.tasks.maximum (determines the number of simultaneous reduce tasks running on a single slave)
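
These properties are set the same way, with <property> elements inside the <configuration> block of mapred-site.xml.  For example, to request 5 map tasks and allow 2 simultaneous map tasks per slave (values chosen purely for illustration):

  <property>
    <name>mapred.map.tasks</name>
    <value>5</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>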

Deliverables

You should provide a brief write-up (3 to 4 pages using standard ACM SIG Proceedings Alternate format, http://www.acm.org/sigs/publications/proceedings-templates#aL2) with your answers to the questions listed above.

You should submit a PDF of your report by placing it in your assign3 hand-in directory for the course: ~cs838-1/public/handin/$USER/assign3/

Please include, at the end of your report, how many EC2 grant credits you have used.  Exclude the credits used for assignment #2 from this total.  You can find this information on the Activity Summary page for your AWS account (https://portal.aws.amazon.com/gp/aws/developer/account/index.html?ie=UTF8&action=activity-summary).  The EC2 credits remaining shown there are not updated until the end of the billing period; instead, include in your report the number of used credits listed under Details > AWS Service Charges, which is updated daily.  This will help us ensure the course assignments can reasonably be completed within the limits of the free usage tiers and grants.

Clarifications

  1. You should run the config_cluster.sh script and all hadoop commands as the ubuntu user.  You should not need to run any commands as root.
  2. Before running your cluster for the first time, you need to format the Hadoop Distributed File System (HDFS) by running the following on the master instance:
    ubuntu@master$ hadoop namenode -format