Assignment 2

Due Friday, November 4 at 11:59pm


This assignment is designed to support your in-class understanding of how in-memory data analytics stacks and stream processing engines work. You will deploy SPARK, an in-memory big data analytics framework and one of the most popular open-source projects today. You will learn how to write native Spark applications and understand the factors that influence their performance. Next, you will learn how to write streaming applications using Structured Streaming, SPARK's latest library to support development of end-to-end streaming applications. Lastly, you will deploy Apache Storm, a stream processing engine, to process real tweets. As in assignment 1, you will produce a short report detailing your observations, scripts and takeaways.

Learning Outcomes

After completing this assignment, you should:

  1. Have gained hands-on experience deploying a SPARK-style data analytics stack
  2. Be able to develop simple applications using native SPARK APIs
  3. Be able to reason about SPARK's performance under specific operating parameters
  4. Be able to write an end-to-end streaming application using Structured Streaming APIs
  5. Have gained experience using Spark contexts and sessions, Structured Streaming triggers and output modes
  6. Know how to deploy Apache Storm
  7. Be able to write a real-time stream processing application using Storm


Notes

  1. Groups are strongly encouraged to read the Piazza post with clarifications regarding the PageRank-based questions.
  2. Groups that have not yet set up apache-storm-0.9.5 can set up a more recent Storm release, apache-storm-1.0.2, instead. If you have already set up Storm, you can proceed with the assignment questions.
  3. You should operate as the ubuntu user by default. Use root permissions only when needed (e.g., to modify /etc/hosts or /etc/mysql/my.cnf).
  4. You should drop the buffer cache before each experiment run. You can run: sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches" on each slave VM.


VMs configuration

We have provided a set of 5 VMs to every group; these are all hosted in CloudLab. One VM has a public IP which is accessible from the external world.  All VMs have a private IP and are reachable from each other.  In your setup, you should designate the VM with the public IP to act as a master/slave, while the others are assigned as slaves only.  In the following, when we refer to a master, we refer either to the VM designated as the master, or to a process running on the master VM.

To prepare your VMs for the big data analytics stack deployment, we have performed the following steps:

  1. Setup passwordless connections between the master and slaves
  2. Installed the following packages on every VM
  1. sudo apt-get update --fix-missing.
  2. sudo apt-get install openjdk-7-jdk.
  3. sudo apt-get install pdsh (parallel distributed shell).

  1. Created a hierarchy of directories required during the software deployment step. This hierarchy should exist on all the VMs, and it is required by various framework-specific configurations. For example, /home/ubuntu/storage/hdfs/hdfs_dn_dirs is required by a property in hdfs-site.xml and determines where on the local filesystem an HDFS DataNode stores its blocks. /home/ubuntu/logs/apps is required by yarn.nodemanager.log-dirs to store container logs. /home/ubuntu/storage/data/spark/rdds_shuffle is required to store shuffle data and RDDs that get spilled to disk. Spark applications run in /home/ubuntu/storage/data/spark/worker. Other directories such as /home/ubuntu/conf, logs, software, storage, workload are mainly designed for convenience and easier debugging in case you face issues that require further investigation.
  1. /home/ubuntu/conf, logs, software, storage, workload.
  2. /home/ubuntu/logs/apps, hadoop.
  3. /home/ubuntu/storage/data/local/nm, tmp.
  4. /home/ubuntu/storage/hdfs/hdfs_dn_dirs, hdfs_nn_dir.
  5. /home/ubuntu/workload.
  6. /home/ubuntu/storage/data/spark/rdds_shuffle.
  7. /home/ubuntu/logs/spark.
  8. /home/ubuntu/storage/data/spark/worker.

  1. Updated /etc/hosts with information about all VM private IPs and their hostnames. This file should be updated on every VM in your setup.

Before proceeding, please ensure all the above steps have been done on your VMs.

Software deployment

The following will help you deploy all the software needed and set the proper configurations required for this assignment. You need to download the configuration archive on every VM, and place the uncompressed files in /home/ubuntu/conf.  In addition, download and deploy the script in your home directory (/home/ubuntu) on every VM. The script enables you to start and stop Hadoop services and configure various environment variables. To function properly, it requires a machines file which contains the IP addresses of all the slave VMs where you plan to deploy your big data software stack, one per line. For example, if you plan to deploy slave processes on three VMs, your machines file will contain those three private IPs, one per line.

The script used in assignment 1 has been extended with the additional information required to run the SPARK stack. It defines new environment variables and new commands to start and stop your SPARK cluster.

  1. SPARK_HOME=/home/ubuntu/software/spark-2.0.0-bin-hadoop2.6. // specify the home directory for your SPARK installation
  2. SPARK_CONF_DIR=/home/ubuntu/conf. // specify the location of the configuration files required by SPARK
  3. SPARK_MASTER_IP=ip_master (e.g. // specify the IP of the Master
  4. SPARK_LOCAL_DIRS=/home/ubuntu/storage/data/spark/rdds_shuffle. // specify the directory to store RDDs and shuffle input
  5. SPARK_LOG_DIR=/home/ubuntu/logs/spark. // specify the directory to store spark process logs
  6. SPARK_WORKER_DIR=/home/ubuntu/logs/apps_spark. // application logs directory

Be sure to replace SPARK_MASTER_IP and SPARK_MASTER_HOST with the private IP of your master VM in the script.

Next, we will go through deploying Hadoop and Hive (as done in the previous assignment), the Spark stack as well as the Storm stack. The following figure describes the Spark software stack you will deploy:

Apache Hadoop

The first step in building your big data stack is to deploy Hadoop. Hadoop mainly consists of two parts: a distributed filesystem called HDFS and a resource management framework known as YARN. As you will learn in the class, HDFS is responsible for reliable management of the data across the distributed system, while YARN handles the allocation of available resources to existing applications.

To become familiar with the terminology: HDFS consists of a master process running on the master instance, called the NameNode, and a set of slave processes called DataNodes, one running on each slave instance. The NameNode records metadata information about every block of data in HDFS, maintains the status of the DataNodes, and handles queries from clients. A DataNode manages the actual data residing on the corresponding instance. Similarly, in YARN there is a master process known as the ResourceManager, which maintains information about the available instances/resources in the system and handles applications' resource requests. A NodeManager process runs on every instance and manages the available resources, periodically "heartbeating" to the ResourceManager to update its status.

Now that you know the terminology, download hadoop-2.6.0.tar.gz. Deploy the archive in the /home/ubuntu/software directory on every VM and untar it (tar -xzvf hadoop-2.6.0.tar.gz).

You should modify core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml in your /home/ubuntu/conf directory on every VM and replace the master_ip with the private IP of your master VM.

Before launching the Hadoop daemons, you should set the proper environment variables from the script and format the Hadoop filesystem. To enable the environment variables, source the script in your shell. To format the filesystem (which simply initializes the directory used by the NameNode), run the command: hadoop namenode -format.

Finally, you can instantiate the Hadoop daemons by running start_all in your command line on the master VM. To check that the cluster is up and running you can type jps on every VM. Each VM should run the NodeManager and DataNode processes. In addition, the master VM should run the ResourceManager, NodeManager, JobHistoryServer, ApplicationHistoryServer, NameNode and DataNode. The JobHistoryServer and ApplicationHistoryServer processes are history servers designed to persist various log files for completed jobs in HDFS. Later on, you will need these logs to extract various information to answer questions in your assignment. In addition, you can check the status of your cluster at the following URLs:

  1. http://public_ip_master:50070 for HDFS.
  2. http://public_ip_master:8088 for YARN.
  3. http://public_ip_master:19888 for M/R job history server.
  4. http://public_ip_master:8188 for application history server.

To test that your Hadoop cluster is running properly, you can run a simple MapReduce application by typing the following command on the master VM:

  1. hadoop jar software/hadoop-2.6.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 1 2.

Apache Hive

Hive is a data warehouse infrastructure which provides data summarization, query, and analysis. To deploy Hive, you also need to deploy a metastore service which stores the metadata for tables and partitions in a relational database, and provides Hive access to this information via the metastore service API.

To deploy the metastore service you need to do the following:

  1. Install mysql-server which will be used to store the metastore information. On your master VM you should run the command: sudo apt-get install mysql-server.
  2. After you have installed the mysql-server, you should replace the bind-address in /etc/mysql/my.cnf to point to your eth0 IP address. Finally, restart the mysql server: sudo /etc/init.d/mysql restart.
  3. Login to your mysql console as root: mysql -u root -p .
  4. After you are logged in, create a metastore database in your mysql console: mysql> create database metastore;.
  5. Select the newly created metastore database: mysql> use metastore;.
  6. Create a new user hive identified by password hive: mysql> create user 'hive'@'IP_master' identified by 'hive';.
  7. Grant privileges on the metastore database to the newly created user: mysql> grant all privileges on metastore.* to 'hive'@'IP_master' identified by 'hive'; .
  8. Flush privileges mysql> flush privileges; .
  9. mysql> quit; .

You should modify hive-site.xml in your /home/ubuntu/conf directory on the master VM and replace the master_ip with its private IP.

Finally, download and untar hive-1.2.1.tar.gz in /home/ubuntu/software on the master VM. If you type hive in the command line, you should be able to enter the Hive console.

Apache Spark

You will use HDFS as the underlying filesystem.

To become familiar with the terminology: SPARK standalone consists of a set of daemons: a Master daemon, which is the equivalent of the ResourceManager in YARN terminology, and a set of Worker daemons, equivalent to the NodeManager processes. SPARK applications are coordinated by a SparkContext object, which connects to the Master, responsible for allocating resources across applications. Once connected, SPARK acquires Executors on every Worker node in the cluster, which are processes that run computations and store data for your applications. Finally, the application's tasks are handed to Executors for execution. You can read further about the SPARK architecture here.

Now that you are familiar with the SPARK terminology, download spark-2.0.0-bin-hadoop2.6.tgz . Deploy the archive in the /home/ubuntu/software directory on every VM and untar it (tar -xzvf spark-2.0.0-bin-hadoop2.6.tgz).

You should modify hive-site.xml and set the hive.metastore.uris property to thrift://ip_master:9083.

Finally, you can instantiate the SPARK daemons by running start_spark on your Master VM. To check that the cluster is up and running you can check that a Master process is running on your master VM, and a Worker is running on each of your slave VMs. In addition you can check the status of your SPARK cluster, at the following URLs:

  1. http://public_ip_master:8080 for SPARK.
  2. http://public_ip_master:4040 for your application context. Note that this will be available only when an application is running.

Apache Zookeeper

Apache Storm has its own stack, and it requires Apache Zookeeper for coordination in cluster mode. To install Apache Zookeeper you should follow the instructions here. First, download zookeeper-3.4.6.tar.gz and deploy it on every VM in /home/ubuntu/software. Next, create a zoo.cfg file in the /home/ubuntu/software/zookeeper-3.4.6/conf directory. You can use the provided zoo.cfg as a template; if you do so, replace the VMx_IP entries with your corresponding VMs' IPs and create the required directories specified by dataDir and dataLogDir. Finally, create a myid file in your dataDir containing the id of that VM. For example, the entry server.1=VM1_IP:2888:3888 means that VM1's myid file should contain the number 1.

Next, you can start Zookeeper by running its start script on every VM. You should see a QuorumPeerMain process running on every VM. Also, if you check the Zookeeper status on every VM, one of the VMs should be the leader, while the others are followers.

Apache Storm

A Storm cluster is similar to a Hadoop cluster. However, instead of "MapReduce" jobs, in Storm you run topologies, which process messages forever (or until you kill them). To get familiar with the terminology and the various entities involved in a Storm cluster, please read here.

To deploy a Storm cluster, please read here. For this assignment, you should use apache-storm-1.0.2.tar.gz. You can use this file as a template for the required storm.yaml. To start a Storm cluster, start a nimbus process on the master VM (storm nimbus) and a supervisor process on every slave VM (storm supervisor). You may also want to start the UI to check the status of your cluster (PUBLIC_IP:8080) and the topologies which are running (storm ui).


Part A - Apache Spark

For Part-A of this assignment, you will write a simple Spark application to compute the PageRank for a number of websites and analyze the performance of your application under different scenarios. For this part, ensure that only the relevant Spark and HDFS daemons are running. You can start the HDFS daemons using the start_hdfs command and the Spark daemons using the start_spark command, both available from your master VM. Ensure that all other daemons are stopped.

PageRank Algorithm

PageRank is an algorithm that is used by Google Search to rank websites in their search engine results. This algorithm iteratively updates a rank for each document by adding up contributions from documents that link to it. The algorithm can be summarized in the following steps -

  1. Start each page at a rank of 1.
  2. On each iteration, have page p contribute rank(p)/|neighbors(p)| to its neighbors.
  3. Set each page's rank to 0.15 + 0.85 X contributions.

For detailed information regarding PageRank please read here.
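The three steps above can be sketched in a few lines of plain Python. This is a hedged sketch, not the Spark application you must submit: a Spark version would express the same loop with RDD transformations (join, flatMap, reduceByKey). The tiny graph at the bottom is made up, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, iterations=10):
    """links: dict mapping each page to the list of its neighbors."""
    ranks = {page: 1.0 for page in links}              # step 1: start at rank 1
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        for page, neighbors in links.items():
            share = ranks[page] / len(neighbors)       # step 2: split rank
            for n in neighbors:
                contribs[n] = contribs.get(n, 0.0) + share
        ranks = {page: 0.15 + 0.85 * c                 # step 3: damping
                 for page, c in contribs.items()}
    return ranks

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Because every page here has outgoing links, the total rank mass is conserved at each iteration, which is a quick sanity check for your own implementation.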

For the purpose of this assignment, we will use the Berkeley-Stanford web graph and execute the algorithm for a total of 10 iterations. Each line in the dataset consists of a URL and one of its neighbors. You are required to copy this file to HDFS.

For the questions that follow, please ensure that your application's SparkContext or SparkSession object is overwriting the following properties:

  1. set application name to CS-838-Assignment2-PartA-<Question_Number> .
  2. set driver memory to 1 GB.
  3. enable event log.
  4. specify event log directory.
  5. set executor memory and cores to 1GB and 4 respectively. Increase executor memory only if you encounter memory issues on the cluster.
  6. set number of cpus per task to 1.

Documentation about how to create a SparkContext and set various properties is available here.
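As a sketch only (not the required solution), the properties above might be set from a PySpark application as follows. The master URL and the event-log directory are placeholders you must adapt to your own VMs.

```python
# Hedged PySpark sketch of the required properties; the master URL and the
# event-log directory below are assumptions -- adapt them to your setup.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("CS-838-Assignment2-PartA-Question1")
         .master("spark://ip_master:7077")              # standalone Master
         .config("spark.driver.memory", "1g")
         .config("spark.eventLog.enabled", "true")      # enable event log
         .config("spark.eventLog.dir", "hdfs://ip_master:9000/spark-events")
         .config("spark.executor.memory", "1g")
         .config("spark.executor.cores", "4")
         .config("spark.task.cpus", "1")
         .getOrCreate())

sc = spark.sparkContext  # the SparkContext, if you prefer the RDD API
```

Note that in practice spark.driver.memory is usually passed via spark-submit (--driver-memory 1g), since the driver JVM has already started by the time this code runs.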


Question 1.  Write a Scala/Python/Java Spark application that implements the PageRank algorithm without any custom partitioning (RDDs are not partitioned the same way) or RDD persistence. Your application should utilize the cluster resources to their full capacity. Explain how you ensured that the cluster resources are used efficiently. (Hint: Look at how the number of partitions of an RDD affects application performance.)

Question 2.  Modify the Spark application developed in Question 1 to implement the PageRank algorithm with appropriate custom partitioning. Is there any benefit of custom partitioning? Explain. (Hint: Do not assume that all operations on an RDD preserve the partitioning.)

Question 3.  Extend the Spark application developed in Question 2 to leverage the flexibility to persist the appropriate RDD as in-memory objects. Is there any benefit of persisting RDDs as in-memory objects in the context of your application? Explain.

With respect to Question 1-3, for your report you should:

  1. Report the application completion time under the three different scenarios.
  2. Compute the amount of network/storage read/write bandwidth used during the application lifetime under the three different scenarios.
  3. Compute the number of tasks for every execution.
  4. Present / reason about your findings and answer the above questions. Apart from that you should compare the applications in terms of the above metrics and reason out the difference in performance, if any.

Question 4.  Analyze the performance of CS-838-Assignment2-PartA-Question2 by varying the number of RDD partitions from 2 to 100 to 300. Does increasing the number of RDD partitions always help? If not, find a value at which it has a negative impact on performance and explain why.

Question 5.  Visually plot the lineage graph of the CS-838-Assignment2-PartA-Question3 application. Is the same lineage graph observed for all the applications developed by you? If yes/no, why? The Spark UI does provide some useful visualizations.

Question 6.  Visually plot the Stage-level Spark Application DAG (with the appropriate dependencies) for all the applications developed by you, up to the second iteration of PageRank. The Spark UI does provide some useful visualizations. Is it the same for all the applications? If yes/no, why? What impact does the DAG have on the performance of the application?

Question 7.  Analyze the performance of CS-838-Assignment2-PartA-Question3 and CS-838-Assignment2-PartA-Question1 in the presence of failures. For each application, you should trigger two types of failures on a desired Worker VM when the application reaches 25% and 75% of its lifetime.

The two failure scenarios you should evaluate are the following:

  1. Clear the memory cache using sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches".
  2. Kill the Worker process.

You should analyze the variations in application completion time. Which type of failure impacts performance more, and which application is more resilient to failures? Explain your observations.


Notes

  1. All your applications should take two command line arguments - the HDFS path to the web graph and the number of iterations the PageRank algorithm should run.
  2. For all your analysis, run the PageRank algorithm for 10 iterations.
  3. All the applications will be executed using spark-submit.

Part B - Structured Streaming

This part of the assignment is aimed at familiarizing you with the process of developing simple streaming applications on big-data frameworks. You will use Structured Streaming (Spark's latest library to build continuous applications) for building your applications. It is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data.

Structured Streaming has a new processing model that aids in developing end-to-end streaming applications. Conceptually, Structured Streaming treats all the data arriving as an unbounded input table. Each new item in the stream is like a row appended to the input table. The framework doesn't actually retain all the input, but the results will be equivalent to having all of it and running a batch job. A developer using Structured Streaming defines a query on this input table, as if it were a static table, to compute a final result table that will be written to an output sink. Spark automatically converts this batch-like query to a streaming execution plan.
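To make the model concrete, here is a toy, Spark-free sketch of event-time windowing: each record lands in every sliding window that covers its *event* timestamp, regardless of when it arrives. The window size and slide mirror the hourly/30-minute setup used later in this part; the sample records are made up.

```python
from collections import Counter

def window_counts(events, size=3600, slide=1800):
    """events: iterable of (event_time_seconds, kind) pairs.
    Returns a Counter keyed by (window_start, kind)."""
    counts = Counter()
    for ts, kind in events:
        # smallest window start s (a multiple of slide) with s <= ts < s + size
        start = max(((ts - size) // slide + 1) * slide, 0)
        while start <= ts:
            counts[(start, kind)] += 1
            start += slide
    return counts

# a record at t=1900s falls in the windows starting at 0s and at 1800s
counts = window_counts([(100, "RT"), (1900, "MT"), (1900, "RT")])
```

Structured Streaming computes the equivalent result incrementally over the unbounded input table; this sketch only shows which windows each event contributes to.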


Before developing the streaming applications, students are recommended to read about Structured Streaming here.

For this part of the assignment, you will be developing simple streaming applications that analyze the Higgs Twitter Dataset. The Higgs dataset has been built after monitoring the spreading processes on Twitter before, during and after the announcement of the discovery of a new particle with the features of the Higgs boson. Each row in this dataset is of the format <userA, userB, timestamp, interaction>, where the interaction can be a retweet (RT), mention (MT) or reply (RE).

We have split the dataset into a number of small files so that we can use the dataset to emulate streaming data. Download the split dataset onto your master VM.


Question 1.   One of the key features of Structured Streaming is support for window operations on event time (as opposed to arrival time). Leveraging this feature, you are expected to write a simple application that emits the number of retweets (RT), mentions (MT) and replies (RE) for an hourly window that is updated every 30 minutes, based on the timestamps of the tweets. You are expected to write the output of your application to the standard console. You need to take care to choose the appropriate output mode while developing your application.

In order to emulate streaming data, you are required to write a simple script that would periodically (say every 5 seconds) copy one split file of the Higgs dataset to the HDFS directory your application is listening to. To be more specific, you should do the following -

  1. Copy the entire split dataset to HDFS. This staging directory holds all of your data.
  2. Create a separate monitoring directory on HDFS; this is the directory your streaming application listens to.
  3. Periodically move files from the staging directory to the monitoring directory using the hadoop fs -mv command.
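A hedged, local-filesystem stand-in for this staging-to-monitoring loop is sketched below. On the cluster you would replace shutil.move with a `hadoop fs -mv` invocation and use a delay of around 5 seconds; all paths here are temporary stand-ins, not the real HDFS directories.

```python
import os
import shutil
import tempfile
import time

def feed(staging, monitoring, delay=0.0):
    """Move split files, one at a time, from staging to monitoring."""
    for name in sorted(os.listdir(staging)):
        shutil.move(os.path.join(staging, name), os.path.join(monitoring, name))
        time.sleep(delay)  # ~5 seconds on the real cluster

# demo: three fake split files flow from staging to monitoring
staging = tempfile.mkdtemp()
monitoring = tempfile.mkdtemp()
for i in range(3):
    open(os.path.join(staging, "split-%d.txt" % i), "w").close()
feed(staging, monitoring)
```

Moving files one at a time is what makes the monitored directory look like an unbounded stream to your application.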

Question 2.  Structured Streaming offers developers the flexibility to decide how often the data should be processed. Write a simple application that, every 10 seconds, emits the Twitter IDs of users that have been mentioned by other users. You are expected to write the output of your application to HDFS. You need to take care to choose the appropriate output mode while developing your application. You will have to emulate streaming data as you did in the previous question.

Question 3.  Another key feature of Structured Streaming is that it allows developers to mix static data and streaming computations. You are expected to write a simple application that takes as input the list of twitter user IDs and every 5 seconds, it emits the number of tweet actions of a user if it is present in the input list. You are expected to write the output of your application to the console. You will have to emulate streaming data as you did in the previous question.


Notes

  1. All your applications should take a command line argument - the HDFS path to the directory that your application will listen to.
  2. All the applications will be executed using spark-submit.
  3. All the applications should use the cluster resources appropriately.
  4. You may need to ensure that the Hive metastore is running in the background before running any of your applications: hive --service metastore &.

Part C - Apache Storm

For Part-C of this assignment, you will write simple Storm applications to process tweets. For this part, ensure that only the relevant Storm daemons are running and all other daemons are stopped.

Prepare the development environment

To write a Storm application you need to prepare the development environment. First, untar apache-storm-1.0.2.tar.gz on your local machine. If you are using Eclipse, please continue reading; otherwise skip the rest of this paragraph. First, generate the Eclipse-related projects; in the main storm directory, run: mvn eclipse:clean; mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true. (We assume you have Maven 3.0 or higher installed on your local machine.) Next, import your projects into Eclipse. Finally, create a new classpath variable called M2_REPO which points to your local Maven repository; the typical path is /home/your_username/.m2/repository.

Typical Storm application

Once you open the Storm projects, in apache-storm-1.0.2/examples/storm-starter, you will find various examples of Storm applications which will help you understand how to write / structure your own applications. The figure below shows an example of a typical Storm topology:

The core abstraction in Storm is the stream (for example, a stream of tweets), and Storm provides primitives for transforming a stream into a new stream in a distributed and reliable way. The basic primitives Storm provides for stream transformations are spouts and bolts. A spout is a source of streams, while bolts consume any number of input streams, either from spouts or from other bolts. Finally, networks of spouts and bolts are packaged into a topology that is submitted to a Storm cluster for execution. The links between nodes in your topology indicate how data should be passed around. For detailed information please read here.
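The spout/bolt/topology idea can be mimicked in a few lines of plain Python. This is only an analogy for the data flow; real Storm components are Java classes wired together with a TopologyBuilder, and the sample tweets below are made up.

```python
def tweet_spout():
    """Spout: a source of stream tuples (here, fake tweet texts)."""
    for text in ["spark is fast", "storm streams tweets", "hello storm"]:
        yield text

def filter_bolt(stream, keyword):
    """Bolt: consumes a stream and emits a new, filtered stream."""
    for tweet in stream:
        if keyword in tweet:
            yield tweet

def count_bolt(stream):
    """Terminal bolt: counts the tuples that reach it."""
    return sum(1 for _ in stream)

# the "topology": spout -> filter bolt -> count bolt
matched = count_bolt(filter_bolt(tweet_spout(), "storm"))
```

Chaining the generators mirrors how the links in a topology route tuples from one node to the next.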

We recommend you place your Storm applications in the same location as the other Storm examples (storm-starter/src/jvm/storm/starter/ in the 0.9.5 version and storm-starter/src/jvm/org/apache/storm/starter/ in the 1.0.2 version). To rebuild the examples, run: mvn clean install -DskipTests=true; mvn package -DskipTests=true in the storm-starter directory. target/storm-starter-0.9.5-jar-with-dependencies.jar will contain the "class" files of your "topologies". To run a "topology" called "CS838Assignment" in either local or cluster mode, you should execute the following command:

storm jar storm-starter-0.9.5-jar-with-dependencies.jar storm.starter.CS838Assignment [optional_arguments_if_any].

For rebuilding the examples in 1.0.2 version, look at the documentation here.

Connect to Twitter API

In this assignment, your topologies should declare a spout which connects to the Twitter API and emits a stream of tweets. To do so, you should do the following steps:

  1. You should instantiate the TwitterSampleSpout class located in storm-starter/src/jvm/storm/starter/spout/.
  2. TwitterSampleSpout instantiation requires several authentication parameters which you should get from your Twitter account. Concretely, you need to provide consumerKey, consumerSecret, accessToken, accessTokenSecret. For more information you can check here.
  3. You can start by filling in your credentials in the provided example topology. It creates a spout to emit a stream of tweets and a bolt to print every streamed tweet.


Question 1.   To answer this question, you need to extend the topology to collect tweets which are in English and match certain keywords (at least 10 distinct keywords) of your choice. You should collect at least 500,000 tweets which match the keywords and store them either in your local filesystem or HDFS. Your topology should be runnable both in local mode and in cluster mode.

Question 2.   To answer this question, you need to create a new topology which provides the following functionality: every 30 seconds, collect all the tweets which are in English, have certain hashtags and have a friendsCount field satisfying the condition mentioned below. For all the tweets collected in a time interval, find the top 50% most common words which are not stop words (a list of stop words is provided). For every time interval, store both the tweets and the top common words in separate files, either in your local filesystem or HDFS. Your topology should be runnable both in local mode and in cluster mode.

Your topology should have at least three spouts:

  1. Spout-1: connects to Twitter API and emits tweets.
  2. Spout-2: It contains a list of predefined hashtags of your choice. At every time interval, it will randomly sample and propagate a subset of hashtags.
  3. Spout-3: It contains a list of numbers. At every interval, it will randomly sample and pick a number N. If a tweet satisfies friendsCount < N, then you keep that tweet; otherwise discard it.


Notes

  1. Pick the hashtags list such that, in every time interval, at least 100 tweets satisfy the filtering conditions and are used for the final processing.
  2. You may want to increase the incoming rate of tweets which satisfy your conditions, using the same tricks as for Question 1.


You should tar all the following files/folders and put them in an archive named group-x.tar.gz -

  1. Provide a brief write-up (3-4 pages, single column) with your answers to the questions listed above (filename: group-x.pdf).
  2. Folder named Part-A should contain the 3 applications developed (filenames: PartAQuestion-<question_number>.py/java/scala) along with a README file.
  3. Folder named Part-B should contain the 3 applications developed (filenames: PartBQuestion<question_number>.py/java/scala) along with a README file.
  4. Folder named Part-C should contain the source files you developed for Part-C along with instructions to compile and run your application.
  5. You should submit the archive by placing it in your assignment2 hand-in directory for the course: ~cs838-1/F16/assignment2/group-x.

    Apart from submitting the above deliverables, you need to set up your cluster so that it can be used for grading the applications developed. More specifically, you should do the following -

    1. Create a folder called grader at the path /home/ubuntu/.
    2. Under that folder, place one shell script per application. We should be able to execute each program by running its shell script; be sure to hard-code all the inputs required for your programs in the shell script. Also ensure that the commands to start the various daemons work and that you have put all the necessary files in the aforementioned folder.

    At the end of the deadline, students will no longer be able to access the cluster; it will be used by the TAs for grading. The reason for this is that there may be a number of dependencies in each of your programs, and it is best if the programs are evaluated in the environment in which they were developed.