Assignment 2

Due Sunday, October 29 at 11:59pm

Overview

This assignment is designed to support your in-class understanding of how in-memory data analytics stacks and stream processing engines work. You will learn how to write streaming applications using Structured Streaming, Spark's latest library for developing end-to-end streaming applications. You will deploy Apache Storm, a stream processing engine, to process real tweets. Finally, you will also deploy and run Apache Flink, another popular streaming analytics system. As in assignment 1, you will produce a short report detailing your observations, scripts and takeaways.

Learning Outcomes

After completing this assignment, you should:

  1. Be able to write end-to-end streaming applications using the Structured Streaming APIs
  2. Have gained experience with Spark contexts and sessions, and with Structured Streaming triggers and output modes
  3. Know how to deploy Apache Storm and Flink
  4. Be able to write real-time stream processing applications using Storm and Flink

Clarifications

  1. You should always operate as the ubuntu user by default. Use root permissions only when needed (e.g. to modify /etc/hosts, /etc/mysql/my.cnf).
  2. You should empty the buffer cache before each experiment run. You can run sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches" on each slave VM.

Software Deployment

Apache Zookeeper

Apache Storm has its own stack, and it requires Apache Zookeeper for coordination in cluster mode. To install Apache Zookeeper, follow the instructions here. First, download zookeeper-3.4.6.tar.gz and deploy it on every VM in /home/ubuntu/software. Next, create a zoo.cfg file in the /home/ubuntu/software/zookeeper-3.4.6/conf directory. You can use zoo.cfg as a template; if you do so, replace the VMx_IP entries with the IPs of your VMs and create the directories specified by dataDir and dataLogDir. Finally, create a myid file in your dataDir containing the ID of that VM. The IDs correspond to the server entries in zoo.cfg: for example, the entry server.1=VM1_IP:2888:3888 means that VM1's myid file should contain the number 1.
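
For reference, a zoo.cfg for a three-VM cluster might look like the sketch below. This is only an illustration using ZooKeeper's standard configuration keys; the directory paths are assumptions, and the provided template remains the authoritative starting point.

  tickTime=2000
  initLimit=10
  syncLimit=5
  # create these two directories before starting ZooKeeper
  dataDir=/home/ubuntu/software/zookeeper-3.4.6/data
  dataLogDir=/home/ubuntu/software/zookeeper-3.4.6/datalog
  clientPort=2181
  # one entry per VM; the number after "server." is what goes into that VM's myid file
  server.1=VM1_IP:2888:3888
  server.2=VM2_IP:2888:3888
  server.3=VM3_IP:2888:3888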

Next, you can start Zookeeper by running zkServer.sh start on every VM. You should see a QuorumPeerMain process running on every VM. Also, if you execute zkServer.sh status on every VM, one of the VMs should be a leader, while the others are followers.

Apache Storm

A Storm cluster is similar to a Hadoop cluster. However, in Storm you run topologies instead of "MapReduce" type jobs; a topology processes messages forever (or until you kill it). To get familiar with the terminology and the various entities involved in a Storm cluster, please read here.

To deploy a Storm cluster, please read here. For this assignment, you should use apache-storm-1.0.2.tar.gz. You can use this file as a template for the required storm.yaml file. To start a Storm cluster, you need to start a nimbus process on the master VM (storm nimbus) and a supervisor process on each slave VM (storm supervisor). You may also want to start the UI (storm ui) to check the status of your cluster (PUBLIC_IP:8080) and the topologies which are running.
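
As an illustration only (the provided template is authoritative), a minimal storm.yaml using Storm's standard configuration keys might look like the following; the IPs and the local directory are placeholders:

  storm.zookeeper.servers:
      - "VM1_IP"
      - "VM2_IP"
      - "VM3_IP"
  nimbus.seeds: ["MASTER_VM_IP"]
  storm.local.dir: "/home/ubuntu/software/apache-storm-1.0.2/storm-local"
  supervisor.slots.ports:
      - 6700
      - 6701
      - 6702
      - 6703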

Apache Flink

To set up Flink on your cluster, download it from this link and untar it with tar -xzvf flink.*.gz in your software directory on the master VM. After extracting Flink, edit conf/flink-conf.yaml and set jobmanager.rpc.address to the private IP of your master node. (The conf directory here refers to the directory inside your Flink installation, not the conf directory in ~/.) Also edit conf/slaves and enter the IPs of your slave VMs. Each of these worker nodes will later run a TaskManager process.
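
For illustration, the two files you need to touch might look like the sketch below (the IPs are placeholders; all other keys in flink-conf.yaml can stay at their defaults):

  # conf/flink-conf.yaml
  jobmanager.rpc.address: MASTER_PRIVATE_IP

  # conf/slaves -- one worker IP per line; each listed host runs a TaskManager
  SLAVE1_PRIVATE_IP
  SLAVE2_PRIVATE_IP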

The Flink directory must be available on every worker under the same path, so copy the entire Flink directory to every worker node. Once you have performed the above steps, run bin/start-cluster.sh inside the Flink directory of your master node to start Flink. Once the Flink cluster is up, you can use the Flink quickstart script to set up a project for you. To do this, run the following commands in your home directory on the master.

  1. sudo apt-get install maven
  2. curl https://flink.apache.org/q/quickstart.sh | bash
  3. cd quickstart
  4. mvn package

If the build is successful, run the following example to verify your installation and project setup:
~/software/flink-1.3.2/bin/flink run -c org.myorg.quickstart.WordCount ~/quickstart/target/quickstart-0.1.jar

Questions

Part A - Structured Streaming

This part of the assignment is aimed at familiarizing you with the process of developing simple streaming applications on big-data frameworks. You will use Structured Streaming (Spark's latest library to build continuous applications) for building your applications. It is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data.

Structured Streaming has a new processing model that aids in developing end-to-end streaming applications. Conceptually, Structured Streaming treats all the data arriving as an unbounded input table. Each new item in the stream is like a row appended to the input table. The framework doesn't actually retain all the input, but the results will be equivalent to having all of it and running a batch job. A developer using Structured Streaming defines a query on this input table, as if it were a static table, to compute a final result table that will be written to an output sink. Spark automatically converts this batch-like query to a streaming execution plan.
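
To make this model concrete, here is a minimal, self-contained sketch in Java (assuming Spark 2.x's Structured Streaming API); the word-count query, console sink and paths are illustrative only and not one of the assignment applications:

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;
  import org.apache.spark.sql.streaming.StreamingQuery;

  public class StreamingSketch {
      public static void main(String[] args) throws Exception {
          SparkSession spark = SparkSession.builder().appName("StreamingSketch").getOrCreate();

          // Every new file in the monitored directory becomes rows appended to an unbounded input table.
          Dataset<Row> lines = spark.readStream().format("text").load(args[0]);

          // Define the query as if "lines" were a static table.
          Dataset<Row> counts = lines.groupBy("value").count();

          // Write the result table to the console sink; Spark turns this into a streaming execution plan.
          StreamingQuery query = counts.writeStream()
                  .outputMode("complete")
                  .format("console")
                  .start();
          query.awaitTermination();
      }
  }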

Before developing the streaming applications, students are recommended to read about Structured Streaming here.

For this part of the assignment, you will be developing simple streaming applications that analyze the Higgs Twitter Dataset. The Higgs dataset was built by monitoring the spreading processes on Twitter before, during and after the announcement of the discovery of a new particle with the features of the Higgs boson. Each row in this dataset is of the format <userA, userB, timestamp, interaction>, where the interaction can be a retweet (RT), mention (MT) or reply (RE).

We have split the dataset into a number of small files so that we can use the dataset to emulate streaming data. Download the split dataset onto your master VM.

Questions

Question 1.   One of the key features of Structured Streaming is support for window operations on event time (as opposed to arrival time). Leveraging this feature, you are expected to write a simple application that emits the number of retweets (RT), mentions (MT) and replies (RE) for an hourly window that is updated every 30 minutes, based on the timestamps of the tweets. You are expected to write the output of your application to the standard console. You need to take care of choosing the appropriate output mode while developing your application.

In order to emulate streaming data, you are required to write a simple script that periodically (say, every 5 seconds) copies one split file of the Higgs dataset to the HDFS directory your application is listening to. To be more specific, you should do the following (a minimal sketch of such a script appears after this list):

  1. Copy the entire split dataset to HDFS. This will be your staging directory, which holds all of your data.
  2. Create a monitoring directory on HDFS as well. This is the directory your streaming application listens to.
  3. Periodically move your files from the staging directory to the monitoring directory using the hadoop fs -mv command.
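
A minimal sketch of such an emulator script is shown below; the HDFS paths and the 5-second interval are assumptions you should adapt to your own setup:

  #!/bin/bash
  # Move one split file at a time from the staging directory to the monitored directory.
  STAGING=/user/ubuntu/higgs-staging      # holds the entire split dataset
  MONITORED=/user/ubuntu/higgs-stream     # directory your streaming application listens to

  for f in $(hadoop fs -ls "$STAGING" | awk '{print $NF}' | grep "^$STAGING/"); do
      hadoop fs -mv "$f" "$MONITORED/"
      sleep 5                             # release roughly one file every 5 seconds
  done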

Question 2.  Structured Streaming offers developers the flexibility to decide how often the data should be processed. Write a simple application that, every 10 seconds, emits the Twitter IDs of users that have been mentioned by other users. You are expected to write the output of your application to HDFS. You need to take care of choosing the appropriate output mode while developing your application. You will have to emulate streaming data as you did in the previous question.

Question 3.  Another key feature of Structured Streaming is that it allows developers to mix static data and streaming computations. You are expected to write a simple application that takes as input a list of Twitter user IDs and, every 5 seconds, emits the number of tweet actions of each user that is present in the input list. You are expected to write the output of your application to the console. You will have to emulate streaming data as you did in the previous question.

Notes:

  1. All your applications should take a command-line argument: the HDFS path to the directory that your application will listen to.
  2. All the applications will be executed using spark-submit (a sketch of the invocation appears after these notes).
  3. All the applications should use the cluster resources appropriately.
  4. You may need to ensure that the Hive metastore is running in the background before running any of your applications: hive --service metastore &.
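
A sketch of a spark-submit invocation is shown below; the master URL, class name, jar and HDFS path are all placeholders for your own values (for a Python application, drop the --class option and pass the .py file instead of a jar):

  spark-submit --master spark://MASTER_PRIVATE_IP:7077 \
      --class PartAQuestion1 \
      partA.jar \
      hdfs://MASTER_PRIVATE_IP:9000/user/ubuntu/higgs-stream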

Part B - Apache Storm

For Part-B of this assignment, you will write simple Storm applications to process tweets. For this part, ensure that only the relevant Storm daemons are running and all other daemons are stopped.

Prepare the development environment

To write a Storm application, you need to prepare the development environment. First, untar apache-storm-1.0.2.tar.gz on your local machine. If you are using Eclipse, please continue reading; otherwise, skip the rest of this paragraph. First, you need to generate the Eclipse-related projects; in the main Storm directory, run: mvn eclipse:clean; mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true. (We assume you have installed Maven version 3.0 or higher on your local machine.) Next, import your projects into Eclipse. Finally, create a new classpath variable called M2_REPO which points to your local Maven repository. The typical path is /home/your_username/.m2/repository.

Typical Storm application

Once you open the Storm projects, you will find various examples of Storm applications in apache-storm-1.0.2/examples/storm-starter which will help you understand how to write and structure your own applications. The figure below shows an example of a typical Storm topology:

The core abstraction in Storm is the stream (for example, a stream of tweets), and Storm provides primitives for transforming a stream into a new stream in a distributed and reliable way. The basic primitives Storm provides for stream transformations are spouts and bolts. A spout is a source of streams, while a bolt consumes any number of input streams, either from spouts or from other bolts. Finally, networks of spouts and bolts are packaged into a topology that is submitted to a Storm cluster for execution. The links between nodes in your topology indicate how data should be passed around. For detailed information please read here.
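
The sketch below shows how these pieces fit together in code using the Storm 1.0.2 API; TweetSpout and PrintBolt are hypothetical placeholders for the spout and bolt classes you write yourself:

  import org.apache.storm.Config;
  import org.apache.storm.LocalCluster;
  import org.apache.storm.StormSubmitter;
  import org.apache.storm.topology.TopologyBuilder;

  public class TopologySketch {
      public static void main(String[] args) throws Exception {
          TopologyBuilder builder = new TopologyBuilder();

          // TweetSpout (a source of tweets) and PrintBolt (a consumer that prints each tuple)
          // are placeholders for your own classes.
          builder.setSpout("tweets", new TweetSpout(), 1);
          builder.setBolt("printer", new PrintBolt(), 2)
                 .shuffleGrouping("tweets");   // this edge defines how tuples flow from spout to bolt

          Config conf = new Config();
          if (args.length > 0) {
              // Cluster mode: submit to the nimbus configured in storm.yaml.
              StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
          } else {
              // Local mode: run the whole topology in-process for quick testing.
              new LocalCluster().submitTopology("cs744Assignment", conf, builder.createTopology());
          }
      }
  }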

We recommend you place your Storm applications in the same location as the other Storm examples (storm-starter/src/jvm/storm/starter/PrintSampleStream.java in version 0.9.5 and storm-starter/src/jvm/org/apache/storm/starter/PrintSampleStream.java in version 1.0.2). To rebuild the examples, run: mvn clean install -DskipTests=true; mvn package -DskipTests=true in the storm-starter directory. target/storm-starter-0.9.5-jar-with-dependencies.jar will contain the class files of your topologies. To run a topology called cs744Assignment in either local or cluster mode, execute the following command:

storm jar storm-starter-0.9.5-jar-with-dependencies.jar storm.starter.cs744Assignment [optional_arguments_if_any].

For rebuilding the examples in 1.0.2 version, look at the documentation here.

Connect to Twitter API

In this assignment, your topologies should declare a spout which connects to the Twitter API and emits a stream of tweets. To do so, follow these steps:

  1. Instantiate the TwitterSampleSpout class located in storm-starter/src/jvm/storm/starter/spout/TwitterSampleSpout.java.
  2. Instantiating TwitterSampleSpout requires several authentication parameters, which you should get from your Twitter account. Concretely, you need to provide consumerKey, consumerSecret, accessToken and accessTokenSecret. For more information you can check here.
  3. You can start by filling in your credentials in the PrintSampleStream.java topology. It creates a spout to emit a stream of tweets and a bolt to print every streamed tweet.

Questions

Question 1.   To answer this question, you need to extend the PrintSampleStream.java topology to collect tweets which are in English and match certain keywords (at least 10 distinct keywords) of your choice. You should collect at least 200,000 tweets which match the keywords and store them either in your local filesystem or in HDFS. Your topology should be runnable both in local mode and in cluster mode.

Question 2.   To answer this question, you need to create a new topology which provides the following functionality: every 30 seconds, you should collect all the tweets which are in English, have certain hashtags and have a friendsCount field satisfying the condition mentioned below. For all the tweets collected in a time interval, you need to find the top 50% most common words which are not stop words (http://www.ranks.nl/stopwords). For every time interval, you should store both the tweets and the top common words in separate files, either in your local filesystem or in HDFS. Your topology should be runnable both in local mode and in cluster mode.

Your topology should have at least three spouts:

  1. Spout-1: connects to Twitter API and emits tweets.
  2. Spout-2: contains a list of predefined hashtags of your choice. At every time interval, it randomly samples and propagates a subset of these hashtags.
  3. Spout-3: contains a list of numbers. At every interval, it randomly samples and picks a number N. If a tweet satisfies friendsCount < N, keep that tweet; otherwise discard it.

Notes:

  1. Please pick the hashtag list such that, in every time interval, at least 100 tweets satisfy the filtering conditions and are used for the final processing.
  2. You may want to increase the incoming rate of tweets which satisfy your conditions, using the same tricks as for Question 1.

Part C - Flink

Flink is another stream processing framework for distributed, high-performance, reliable and accurate data streaming applications. It has a unified architecture for both stream and batch data processing, and it treats batch processing as a special case of stream processing in which the stream is finite. It supports event time, ingestion time and processing time semantics, and it has a flexible windowing mechanism (for example, tumbling windows and sliding windows). Look here for a detailed description of Flink's useful features.

Streaming source: we will use the same dataset as in the Structured Streaming part (here; copy this dataset to all the VMs at the same path), using a Collection-based Data Source to simulate the streaming source. Refer to this page for more detail. Read the input file and create a data stream from it (set the rate to 10 lines per millisecond; you can achieve this by sleeping 1 ms for every 10 lines). The source code to simulate the stream is provided here. You can use it to simulate a stream for Flink using the Higgs dataset.

Questions

Question 1.   Develop an application to find all the disjoint 1-minute time intervals that have more than 100 retweets (RT), mentions (MT) or replies (RE). (E.g., the intervals will be [0, 1] min, [1, 2] min, etc.) What if the time intervals do not need to be disjoint? (E.g., [0, 60s], [1s, 61s], [2s, 62s], etc.) You can assume that the data is sorted according to the timestamp. Output all the satisfying time windows according to the following format (using millisecond timestamps).

  • TimeWindow{start=1341398070000, end=1341398130000} Count: 191 Type: RT
  • TimeWindow{start=1341398070000, end=1341398130000} Count: 102 Type: MT
  • TimeWindow{start=1341398100000, end=1341398160000} Count: 189 Type: RT

Question 2.   Handle out-of-order data. Flink allows data to arrive in a different order from its real timestamps, which is a common case in real-time streaming systems. Change your disjoint 1-minute time window application from Question 1 to allow late arrivals. Change the time bound of the allowed lateness to 30 seconds, 60 seconds, 100 seconds and 500 seconds. For each of these allowance values, compute the number of intervals that stay the same (compared to your original application in Question 1). A window is the same as another window if the start, end, count and type fields of the two windows are the same. Please use the shuffled dataset, which simulates late arrivals, for your application in Question 2.

Deliverables

You should tar all the following files/folders and put them in an archive named group-x.tar.gz:

  1. A brief write-up (3-4 pages max, single column) with your answers to the questions listed above (filename: group-x.pdf).
  2. A folder named Part-A containing the 3 applications developed (filenames: PartAQuestion<question_number>.py/java/scala) along with a README file.
  3. A folder named Part-B containing the source files you developed for Part-B, along with instructions to compile and run your applications.
  4. A folder named Part-C containing the 2 applications developed (filenames: PartCQuestion-<question_number>.java/scala) along with a README file.
  5. You should submit the archive by placing it in your assignment2 hand-in directory for the course: ~cs744-1/handin/group-x/assignment2/.

Apart from submitting the above deliverables, you need to set up your cluster so that it can be used for grading the applications developed. More specifically, you should do the following:

  1. Create a folder called grader_assign2 at the path /home/ubuntu/.
  2. Under this folder, place shell scripts named as follows:
    1. PartAQuestion1.sh
    2. PartAQuestion2.sh
    3. PartAQuestion3.sh
    4. PartBQuestion1.sh
    5. PartBQuestion2.sh
    6. streaming-emulator.sh
    7. PartCQuestion1.sh
    8. PartCQuestion2.sh
  3. We should be able to execute each program by running the command ./file-name.sh. Be sure to hard-code all the inputs required for your programs in the shell scripts. Also ensure that the commands to start the various daemons work and that you have put all the necessary files in the aforementioned folder. Finally, make sure that each of these scripts prints its output on the console.
  4. At the end of the deadline, students will no longer be able to access the cluster; it will be used by the TAs for grading. The reason for this is that each of your programs may have a number of dependencies, and it is best if the programs are evaluated in the environment in which they were developed.