Assignment 3: TensorFlow and GraphX

Due Friday, December 2 at 11:59pm

Overview

In this assignment you will learn:

To use TensorFlow for numerical operations on large datasets. TensorFlow has a steep learning curve, so please start working on it early.
To use Spark's GraphX library for graph-parallel computation.

For this assignment, students are encouraged to answer others' questions on Piazza. Participation counts for 10% of your credits. Use the CloudLab clusters for running the experiments.

Part-A: TensorFlow

The TensorFlow (TF) part of the assignment is split into:

Installation
Distributed TF setup
Multiplication of large matrices
Machine Learning using Synchronous and Asynchronous Stochastic Gradient Descent (SGD)

To get you started, we have provided some sample programs and utility scripts. Please find all of them in the tarball here. Extract all the files into a working directory on any of the VMs in the cluster. Links to individual files are also provided where they are referred to in the description below.

Installation

TF has to be installed on all the VMs. To install TF, execute the following commands:

sudo apt-get install --assume-yes python-pip python-dev
export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.11.0rc2-cp27-none-linux_x86_64.whl
sudo pip install --upgrade $TF_BINARY_URL

Detailed instructions on TF installation can be found here. We have also provided a simple bash function (install_tensorflow) in tfdefs.sh. Use checkInstallation.py to see if the right version of TF is installed.

Installing TF will also install TensorBoard, a utility for visualizing the graphs that a client creates. See sampleTensorboard.sh (and exampleTensorboard.py referenced by it); this script provides an end-to-end workflow for using TensorFlow and TensorBoard. Make sure you run TensorBoard on the VM with the public IP; otherwise, you will not be able to view the web interface from your browser.

Distributed TensorFlow setup

In TensorFlow, a cluster is a set of processes running on different nodes. Jobs are submitted to the cluster through a Session. A job is a Graph of inter-dependent Operations which are executed using these processes. Individual operators are executed on one of the processes in a cluster. Unlike other big data systems, individual operators cannot be spread across multiple processes. Operators exchange information by passing tensors between them.

Read sampleDistributed.sh (and exampleDistributed.py referenced in it) to see how to start a cluster, use a client to launch jobs and terminate a cluster.

Multiplying Large Matrices

Here, you will find the trace (sum of diagonal elements) of the square of a random $n \times n$ matrix, $A$, where $n=100,000$. Unlike Spark, TF does not provide a big-data abstraction, so your program cannot just generate the matrix, square it and compute the trace; see exampleMatmulFailure.py for what we mean. The script will fail (Out-Of-Memory error) on machines with configurations similar to your CloudLab VMs.

To get around this problem, you have to break down the large matrix (or generate it in pieces), and schedule computations on the smaller pieces to achieve your end goal. We can represent a large matrix as:

\[ A = [A_{11} A_{12} \ldots A_{1m}; A_{21} A_{22} \ldots A_{2m}; \ldots ; A_{m1} \ldots A_{mm}] \]

where each $A_{ij}$ is a sub-matrix (block) of $A$ and $m$ is the number of blocks per side.

The trace of $A^2$ can now be computed as

\[ tr(A^2) = \sum_{i} \sum_{j} tr(A_{ij} A_{ji}) \]
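The identity above can be checked on a small matrix before scaling up. The following is a minimal sketch in plain Python (no TensorFlow), assuming square blocks whose size divides the matrix dimension; the helper names are hypothetical and are not taken from the provided scripts.

```python
import random

def matmul(X, Y):
    """Multiply two dense matrices stored as lists of rows."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][c] for k in range(inner)) for c in range(cols)]
            for row in X]

def trace(X):
    """Sum of the diagonal elements of a square matrix."""
    return sum(X[i][i] for i in range(len(X)))

def block(A, i, j, size):
    """Return block A_ij (size x size) of A, using 0-based block indices."""
    return [row[j * size:(j + 1) * size] for row in A[i * size:(i + 1) * size]]

def blocked_trace_of_square(A, size):
    """Compute tr(A^2) as the sum over i, j of tr(A_ij A_ji)."""
    m = len(A) // size  # blocks per side; assumes size divides len(A)
    return sum(trace(matmul(block(A, i, j, size), block(A, j, i, size)))
               for i in range(m) for j in range(m))

random.seed(0)
n, size = 6, 2
A = [[random.randint(-3, 3) for _ in range(n)] for _ in range(n)]
direct = trace(matmul(A, A))             # square the whole matrix, then take the trace
blocked = blocked_trace_of_square(A, size)
print(direct, blocked)                   # the two values should agree
```

Only two blocks need to be in memory at a time in the blocked version, which is what makes the approach workable when the full matrix does not fit.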

Refer to exampleMatmulSingle.py for an implementation on a single device. However, if all computations are scheduled on a single node, the running time can be large (up to an hour). Your task is to extend/modify exampleMatmulSingle.py to run the individual computations on processes on all 5 VMs and report the trace of $A^2$. The goal is to place computations so as to reduce the end-to-end job completion time. (We will see if we can get something like a leaderboard going; no promises though :) )

Hint: Refer to exampleDistributed.py to see how to place operations on different devices.
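One possible way to think about placement is as a plan that maps each block product to a device. The sketch below is plain Python and only builds such a plan; it does not call TensorFlow. It assumes a job named "worker" with one task per VM (the job name is an assumption, so use whatever your cluster specification defines), and it uses the fact that tr(A_ij A_ji) = tr(A_ji A_ij), so each off-diagonal pair can be computed once and counted twice.

```python
def plan_block_products(num_blocks, num_workers):
    """Assign each needed block product to a worker, round-robin.

    Returns a list of (device, i, j, weight) tuples. weight is 2 for i < j
    because tr(A_ij A_ji) == tr(A_ji A_ij), so the pair (j, i) is skipped.
    """
    plan = []
    task = 0
    for i in range(num_blocks):
        for j in range(i, num_blocks):
            # Device string follows the /job:<name>/task:<index> pattern;
            # the job name "worker" is assumed for illustration.
            device = "/job:worker/task:%d" % (task % num_workers)
            plan.append((device, i, j, 1 if i == j else 2))
            task += 1
    return plan

plan = plan_block_products(num_blocks=4, num_workers=5)
for entry in plan:
    print(entry)
```

Round-robin is only a starting point; a placement that keeps the two blocks of a product on the same process avoids shipping large tensors over the network.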

Machine Learning using SGD

TF Variables

If you have completed all prior steps in the TF assignment, you will have a good understanding of the Tensor, Operator and Graph abstractions. TF also provides a Variable abstraction that is critical for Machine Learning jobs. A TF Variable is similar to a heap variable in a normal C program. When you create a TF variable, some amount of memory is allocated for the variable in one of the processes in the cluster. The memory and the values stored in it are persisted across multiple jobs and sessions. Tensors, on the other hand, are immutable objects which can be de-allocated (read: garbage-collected) after they have been consumed by an operator or after a session is terminated. We have provided a sample TF client (exampleVariablePersistence.py) to help you understand the properties of TF variables. After starting a cluster, try running the client multiple times (even simultaneously) to observe the changes to the variables defined. Typically, TF variables are used to store the model while running ML algorithms.

Logistic Regression

In this assignment, you will implement Binary Logistic Regression (LR) to learn a model for predicting whether a user will click on an advertisement. LR works as follows: you start off with a random weight vector ($w$) as your model. Using this model, the aggregate loss on the training dataset is defined as

\[ \mathcal{L}(D_{tr}) = \sum_{(y, x) \in D_{tr}} -\log{\frac{1}{1 + e^{-y \, w^T x}}} \]

where $y$ and $x$ are the training label and feature vector, respectively. $y$ takes the value -1 if there is no click and +1 if the advertisement is clicked. $x$ is a high-dimensional vector. Using APIs similar to TF, the loss per data point can be written as:

loss = -1 * tf.log(tf.sigmoid(y * w.dot(x)))
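To make the per-point loss concrete, here is a minimal plain-Python sketch (not TensorFlow code). It assumes the model is a dense list and each feature vector is a sparse dict mapping index to value, which is a representation chosen here for illustration. It evaluates the same quantity in the equivalent form log(1 + exp(-z)), where z = y * w.dot(x), to avoid overflow for large |z|.

```python
import math

def dot(w, x):
    """Dot product of a dense weight list w with a sparse feature dict x."""
    return sum(w[i] * v for i, v in x.items())

def logistic_loss(w, x, y):
    """-log(sigmoid(y * w.x)), computed as log(1 + exp(-z)) without overflow."""
    z = y * dot(w, x)
    if z >= 0:
        return math.log1p(math.exp(-z))
    # For negative z, factor out exp(-z) so the exponent stays non-positive.
    return -z + math.log1p(math.exp(z))

w = [0.0, 0.0, 0.0, 0.0]
x = {0: 1.0, 3: 1.0}   # one-hot style sparse features: index -> value
print(logistic_loss(w, x, +1))
```

With an all-zero model the sigmoid is 0.5 for every point, so the loss equals log 2 regardless of the label.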

The optimal model that minimizes the loss is obtained by running SGD. In each iteration of SGD, we will update the model using gradient computed on a batch of samples from the training data. Given $(y, x)$, the gradient is obtained as:

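gradient = tf.Variable(tf.zeros(w.shape))  # accumulator with one entry per model weight
for y, x in sample:
  # derivative of -log(sigmoid(y * w.dot(x))) with respect to w
  gradient += y * (sigmoid(y * w.dot(x)) - 1) * x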

The model is then updated as

w -= eta * gradient

where eta is the learning rate (typically set to a small constant). Choose eta=0.01 for your experiments.
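The gradient accumulation and model update described above can be traced by hand on a tiny batch. This is a minimal plain-Python sketch, assuming a dense weight list and sparse feature dicts (a representation chosen here for illustration, not the format used by the provided scripts).

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sgd_step(w, batch, eta=0.01):
    """Apply one SGD update to w in place and return it.

    batch is a list of (y, x) pairs with y in {-1, +1} and x a sparse
    dict mapping feature index -> value.
    """
    gradient = [0.0] * len(w)
    for y, x in batch:
        margin = y * sum(w[i] * v for i, v in x.items())
        scale = y * (sigmoid(margin) - 1.0)
        for i, v in x.items():          # only non-zero features contribute
            gradient[i] += scale * v
    for i, g in enumerate(gradient):
        w[i] -= eta * g
    return w

w = [0.0, 0.0, 0.0]
batch = [(+1, {0: 1.0}), (-1, {1: 1.0, 2: 1.0})]
sgd_step(w, batch)
print(w)
```

Starting from zero weights, the weight of the feature seen with a click moves up and the weights of the features seen without a click move down, each by eta * 0.5.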

Synchronous SGD

If the dataset is spread across multiple VMs, then TF processes on each VM can read a sample from their local partition and compute gradients. The gradients can then be accumulated and used to update the model. This version of distributed SGD is called synchronous SGD.

You will implement synchronous SGD by constructing a graph with operators that are placed on different processes in the cluster and read different samples of data. A single client executes this graph over multiple iterations (each time reading a different sample of data) and updates the model variable at the end of each iteration.
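The control flow of synchronous SGD can be simulated in a single process to see where the aggregation happens. The sketch below is plain Python, not a distributed TF program: the inner loop over partitions stands in for work that the real graph would run in parallel on different processes, and the toy data is made up for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sample_gradient(w, y, x):
    """Gradient of -log(sigmoid(y * w.x)) for one sample, as a sparse dict."""
    margin = y * sum(w[i] * v for i, v in x.items())
    scale = y * (sigmoid(margin) - 1.0)
    return {i: scale * v for i, v in x.items()}

def synchronous_sgd(partitions, dim, iterations, eta=0.01):
    """Each worker contributes one sample per iteration; gradients are summed."""
    w = [0.0] * dim
    for step in range(iterations):
        total = {}
        for part in partitions:               # conceptually runs in parallel
            y, x = part[step % len(part)]
            for i, g in sample_gradient(w, y, x).items():
                total[i] = total.get(i, 0.0) + g
        for i, g in total.items():            # one model update per iteration
            w[i] -= eta * g
    return w

# Toy data: feature 0 appears with clicks, feature 1 with non-clicks.
partitions = [[(+1, {0: 1.0}), (-1, {1: 1.0})] for _ in range(5)]
w = synchronous_sgd(partitions, dim=2, iterations=200)
print(w)
```

Every worker reads the same version of the model within an iteration, which is the property that the asynchronous variant gives up.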

Asynchronous SGD

In synchronous SGD, the time taken for each iteration is determined by the slowest operator. To avoid such bottlenecks, it is common practice to run SGD in an asynchronous fashion.

For asynchronous SGD, you will create a graph and a client that will be launched multiple times, with each launch targeting a different process. The different jobs exchange information by reading from and writing to the same TF variable. Since TF variables are shared across sessions, computations launched by the different clients effectively end up reading from and writing to the same region of memory. Note that there is no aggregation of gradients computed by different processes.

See exampleSynchronousUpdate.py, exampleAsynchronousUpdate.py and exampleAsync.sh for sample clients that update a variable synchronously and asynchronously.
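The difference from the synchronous version can be illustrated with threads that update one shared list. This is only a single-process stand-in, assuming that a shared Python list plays the role of the shared TF variable and that each thread plays the role of a separately launched client; the toy data is made up for illustration.

```python
import math
import threading

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def worker(shared_w, partition, updates, eta=0.01):
    """Apply single-sample SGD steps directly to the shared model."""
    for step in range(updates):
        y, x = partition[step % len(partition)]
        # Reads may observe values that another worker is about to overwrite.
        margin = y * sum(shared_w[i] * v for i, v in x.items())
        scale = y * (sigmoid(margin) - 1.0)
        for i, v in x.items():
            shared_w[i] -= eta * scale * v    # no barrier, no gradient aggregation

shared_w = [0.0, 0.0]   # stands in for the shared TF variable
partitions = [[(+1, {0: 1.0}), (-1, {1: 1.0})] for _ in range(5)]
threads = [threading.Thread(target=worker, args=(shared_w, p, 200))
           for p in partitions]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_w)
```

No worker waits for any other, so a slow worker delays only its own updates; the price is that some updates are computed from slightly stale weights.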

Dataset

The dataset we will use is the one used for the Kaggle Display Advertising Challenge sponsored by Criteo Labs. It is available here. The categorical features have been converted to numerical features using one-hot encoding and stored in a format that can be easily read by TensorFlow programs (TFRecords). The input data is split into 23 small chunks (tfrecords00 ... tfrecords22), each with 2,000,000 data points. Place the first 5 on VM-1, the next 5 on VM-2, and so on; VM-5 will contain only 2 chunks. Use tfrecords22 for testing your model.

See exampleReadCriteoData.py for how to read data in each iteration of the model using the local partition. The scripts tarball also has tiny versions of the input files; please use them for testing your code.
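The chunk placement described above can be written down as a small mapping, which is handy when scripting the copy step. This is a minimal sketch; the function name is hypothetical, and it assumes the chunk files are named tfrecords00 through tfrecords22 exactly as listed.

```python
def chunk_placement(num_chunks=23, per_vm=5, num_vms=5):
    """Map each VM number (1-based) to its training chunks; hold out the last chunk."""
    names = ["tfrecords%02d" % i for i in range(num_chunks)]
    train, test = names[:-1], names[-1]
    placement = {vm: train[(vm - 1) * per_vm: vm * per_vm]
                 for vm in range(1, num_vms + 1)}
    return placement, test

placement, test_chunk = chunk_placement()
for vm, chunks in sorted(placement.items()):
    print("VM-%d: %s" % (vm, " ".join(chunks)))
print("test: %s" % test_chunk)
```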

When to terminate training

Terminate the experiment after 100,000,000 updates have been made to the model. Note that in synchronous-SGD, a single update is essentially an update from all workers. So if you have 5 workers, stop learning after 20,000,000 iterations.

How to calculate prediction errors

While training, find the prediction error by using the data chunks you did not use for training. For a test data point (y, x), the error is defined as

error = (y != sign(w.dot(x))) ? 1 : 0

Essentially, if w.dot(x) > 0 and y = 1, or w.dot(x) <= 0 and y = -1, the error is zero (a score of exactly 0 is treated as a prediction of -1). Plot the average error over the entire test set at every $100,000^{th}$ iteration.
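The error definition can be sketched in plain Python as follows. The representation (dense weight list, sparse feature dicts) and the tiny test set are assumptions made for illustration only.

```python
def predict(w, x):
    """Return +1 if w.x > 0, else -1 (a score of exactly 0 counts as -1)."""
    score = sum(w[i] * v for i, v in x.items())
    return 1 if score > 0 else -1

def average_error(w, test_set):
    """Fraction of (y, x) pairs whose label differs from the prediction."""
    mistakes = sum(1 for y, x in test_set if predict(w, x) != y)
    return mistakes / float(len(test_set))

w = [0.5, -0.25, 0.0]
test_set = [
    (+1, {0: 1.0}),   # score 0.5   -> predicted +1, correct
    (-1, {1: 1.0}),   # score -0.25 -> predicted -1, correct
    (+1, {1: 1.0}),   # score -0.25 -> predicted -1, wrong
    (-1, {2: 1.0}),   # score 0.0   -> predicted -1, correct
]
print(average_error(w, test_set))
```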

TensorFlow Deliverables

You should submit the following under the Part-A folder at ~cs838-1/F16/assignment3/group-x:

A TF client program (named:bigmatrixmultiplication.py) for big-matrix multiplication
A TF client program for synchronous-SGD (named: synchronoussgd.py). If you are using a separate script for batch processing, submit another file named: batchsynchronoussgd.py.
A TF client program for asynchronous-SGD (named: asyncsgd.py) along with a shell script (named: launch_asyncsgd.sh) specifying how to launch multiple clients. If you are using a separate script for batch processing, submit files: batchasyncsgd.py and launch_batchasyncsgd.sh.
A README file explaining the logic behind the 3 applications developed along with instructions to run them.
A report (named: group-x-part-a.pdf) containing the following:
- A brief summary of how ML on TF differs from implementing ML on Spark. Write what you think are the pros and cons of each framework.
- Graphs plotting the variation in test error (for every $100,000^{th}$ iteration) for both Synchronous and Asynchronous-SGD
- Explain, with supporting numbers, what the bottleneck resource is when running synchronous and asynchronous SGD. Is it disk I/O? CPU? Network?

Apart from submitting the above deliverables, you need to set up your cluster so that it can be used for grading the applications developed. More specifically, you should do the following:

Create a folder called grader/part-a/ at the path /home/ubuntu/.
Under the folder, have the following scripts named as:

      1. bigmatrixmultiplication.py
      2. synchronoussgd.py
      3. asyncsgd.py
      4. launch_asyncsgd.sh

We should be able to execute the programs by running the command ./file-name.sh or python file-name.py. Be sure to hard-code all the inputs required for your programs. Also ensure that the commands to start the various daemons work and that you have put all the necessary files in the aforementioned folder.

Extra credit: The data reader described in exampleReadCriteoData.py reads only one sample at a time. Reading batches of data with sparse features gets tricky. However, in a typical distributed SGD implementation, individual workers compute the gradient on a batch of samples. Run the SGD algorithms for different batch sizes (100, 1000, 10,000) and report your findings. See this page to get an idea of how to read batches of sparse data.

Part-B: GraphX

Prepare the development environment

GraphX is a component in Spark for graphs and graph-parallel computation. To get a better understanding of how to write simple graph processing applications using GraphX, you should check here.

Developing a GraphX application involves the same steps you followed for assignment 2. Your SparkContext or SparkSession should have the same configuration parameters you used before.

Application - 1

Question 1. Write an application that implements the PageRank algorithm using the various constructs that GraphX provides.

Note: Your application cannot use the built-in GraphX PageRank object.

We will be using the LiveJournal social network dataset for this assignment. You are required to execute the algorithm for a total of 20 iterations. Each line in the dataset consists of a FromNodeId and one of its neighbors. You are required to copy this file to HDFS.
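Before writing the GraphX version, it can help to have a small reference implementation to compare results against. The sketch below is plain Python rather than Scala/GraphX. It assumes a damping factor of 0.85, an initial rank of 1.0 and the update rule stated in the docstring (none of these are specified in this handout), and it assumes that lines starting with '#' are comments. The per-edge loop plays the role that sending messages along edges plays in GraphX.

```python
def parse_edges(lines):
    """Parse 'FromNodeId ToNodeId' lines, skipping blank and '#' comment lines."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        edges.append((int(src), int(dst)))
    return edges

def pagerank(edges, iterations=20, damping=0.85):
    """Iterative PageRank over a list of (src, dst) edges.

    Assumed update rule:
    rank(v) = (1 - damping) + damping * sum(rank(u) / outdegree(u)) over edges u -> v.
    """
    vertices = set()
    out_degree = {}
    for src, dst in edges:
        vertices.update((src, dst))
        out_degree[src] = out_degree.get(src, 0) + 1
    ranks = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:                  # one contribution per edge
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {v: (1.0 - damping) + damping * incoming[v] for v in vertices}
    return ranks

lines = ["# FromNodeId ToNodeId", "0\t1", "0\t2", "1\t2", "2\t0"]
ranks = pagerank(parse_edges(lines))
for node, rank in sorted(ranks.items()):
    print(node, rank)
```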

For the report you need to compare the performance of this application with the application named CS-838-Assignment2-PartA-Question3 for the dataset mentioned above. More specifically,

Compare the application's completion time.
Compute the amount of network/storage read/write bandwidth used during the application lifetimes.
Compute the number of tasks for every execution.
Does GraphX provide additional benefits while implementing the PageRank algorithm? Explain and reason out the difference in performance, if any.

Application - 2

Building the graph

You will run GraphX on a graph which you will build based on the output of Question 2 from Assignment 2 - Part C. More concretely, your graph should be as follows:

A vertex embodies the top most common words you found during a time interval.
Two vertices A and B have a single edge between them if they share at least one common word.

Note: Please ensure that your graph has at least 200 vertices.
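The construction rule can be prototyped on a handful of intervals before moving to GraphX. The sketch below is plain Python; the input dictionary is made-up sample data, and orienting each edge from the smaller to the larger interval id is an assumption made here for illustration, since the handout only asks for a single edge per pair.

```python
from itertools import combinations

def build_word_graph(top_words_by_interval):
    """Build vertices and edges from {interval_id: iterable of words}.

    Each vertex holds an interval's word set. One edge (a, b) with a < b is
    emitted per pair of intervals that share at least one word.
    """
    vertices = {vid: frozenset(words)
                for vid, words in top_words_by_interval.items()}
    edges = [(a, b) for a, b in combinations(sorted(vertices), 2)
             if vertices[a] & vertices[b]]     # non-empty intersection
    return vertices, edges

intervals = {
    1: ["spark", "graph", "tweet"],
    2: ["graph", "news"],
    3: ["music"],
    4: ["news", "tweet", "music", "live"],
}
vertices, edges = build_word_graph(intervals)
print(edges)
```

Comparing all pairs is quadratic in the number of vertices, which is fine for a few hundred vertices but worth keeping in mind for larger graphs.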

Questions

Given the graph you build as mentioned above, you should write Scala scripts to answer the following questions:

Find the number of edges where the number of words in the source vertex is strictly larger than the number of words in the destination vertex. Hint: to solve this question please refer to the Property Graph examples from here.
Find the most popular vertex. A vertex is the most popular if it has the largest number of edges to its neighbors and the maximum number of words. If several vertices satisfy the above criteria, pick any one of them. Hint: to solve this question please refer to the Neighborhood Aggregation examples from here. You can ignore the frequency of the words for this question.
Find the average number of words in every neighbor of a vertex. Hint: to solve this question please refer to the Neighborhood Aggregation examples from here.
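The three questions can be checked against a small hand-built graph before running them in GraphX. The sketch below is plain Python with made-up sample data. It assumes that neighbors are counted regardless of edge direction and that popularity is ranked first by number of edges and then by number of words; both are interpretations, not requirements stated in the handout.

```python
def count_edges_src_larger(vertices, edges):
    """Question 1: edges whose source has strictly more words than its destination."""
    return sum(1 for s, d in edges if len(vertices[s]) > len(vertices[d]))

def neighbors(vertices, edges):
    """Map each vertex id to the set of adjacent vertex ids, ignoring direction."""
    adj = {v: set() for v in vertices}
    for s, d in edges:
        adj[s].add(d)
        adj[d].add(s)
    return adj

def most_popular(vertices, edges):
    """Question 2: rank by (number of edges, number of words)."""
    adj = neighbors(vertices, edges)
    return max(vertices, key=lambda v: (len(adj[v]), len(vertices[v])))

def average_neighbor_words(vertices, edges):
    """Question 3: for each vertex, the mean word count over its neighbors."""
    adj = neighbors(vertices, edges)
    return {v: sum(len(vertices[n]) for n in ns) / float(len(ns))
            for v, ns in adj.items() if ns}

vertices = {
    1: {"spark", "graph", "tweet"},
    2: {"graph", "news"},
    3: {"music"},
    4: {"news", "tweet", "music", "live"},
}
edges = [(1, 2), (1, 4), (2, 4), (3, 4)]
print(count_edges_src_larger(vertices, edges))
print(most_popular(vertices, edges))
print(average_neighbor_words(vertices, edges))
```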

Extra Credit

Find the most popular word across all the tweets.
Find the size of the largest subgraph which connects any two vertices.
Find the number of time intervals which have the most popular word.

GraphX Deliverables

You should submit the following under the Part-B folder at ~cs838-1/F16/assignment3/group-x:

Provide a brief write-up (2-3 pages, single column) with your answers to the questions listed above (filename: group-x-part-b.pdf). Make sure to include your names and group number in the writeup.
Provide the various applications developed (filenames: PartBApplication<application_number>Question<question_number>.scala) along with a README file explaining the logic for each of your applications and instructions to run them.

Apart from submitting the above deliverables, you need to set up your cluster so that it can be used for grading the applications developed. More specifically, you should do the following:

Create a folder called grader/part-b/ at the path /home/ubuntu/.
Under the folder, have shell scripts named as:

      1. PartBApplication1Question1.sh
      2. PartBApplication2Question1.sh
      3. PartBApplication2Question2.sh
      4. PartBApplication2Question3.sh
      5. PartBApplication2Question4.sh
      6. PartBApplication2Question5.sh
      7. PartBApplication2Question6.sh

We should be able to execute the program by running the command - ./file-name.sh. Be sure to hard-code all the inputs required for your programs in the shell script. Also ensure that the commands to start the various daemons work and you have put all the necessary files in the aforementioned folder.
The hand-in directory ~cs838-1/F16/assignment3/group-x has a file named hand-in-consistency-checker.py. This file checks whether all the required files have been submitted with the expected names. Please run this program from your group-x folder - python hand-in-consistency-checker.py.
Groups are strongly advised to make sure that ALL deliverables are submitted in the required format. Failure to do so may lead to a reduction in the assignment score.
After the deadline, students will not be able to access the cluster, and it will be used by the TAs for grading. This is because each of your programs may have a number of dependencies, and it is best to evaluate the programs in the environment in which they were developed.

Other references

https://indico.io/blog/tensorflow-data-inputs-part1-placeholders-protobufs-queues/