CS744 Assignment 2

Due: Oct 16, 2018

Overview

In this assignment, you will learn to write TensorFlow applications. You will start by setting up the cluster and running workloads on a single machine. Next, you will modify the workloads so that they can be launched in a distributed way, and you will experiment with both synchronous and asynchronous SGD. Finally, you will examine CPU, memory, and network usage to better understand TensorFlow’s performance.

Learning Outcomes

After completing this programming assignment, you should be able to:

Write simple TensorFlow applications and launch them in the cluster.
Have a deeper understanding of TensorFlow’s performance.

Environment Setup

In this assignment, we provide a CloudLab profile called “cs744-fa18-assignment2-4node” under the “UWMadison744-F18” project for you to start your experiment. The profile is a simple 4-node cluster with Ubuntu 16 installed on each machine.

As in Assignment 1, first make sure that the VMs can ssh to each other. To install TensorFlow, run the following commands on each VM:

sudo apt-get update
sudo apt-get install --assume-yes python-pip python-dev
sudo pip install tensorflow

Part 1: Logistic Regression

In this part, you will implement a simple Logistic Regression application. Building a TensorFlow application consists of two main stages: building the computational graph, known as tf.Graph, and running it using tf.Session. A graph is a series of TensorFlow operations. The graph is submitted to the cluster through a session.

Your application should start from a randomly initialized weight vector w as your model. Then, calculate the loss of your model on the dataset using:

$\mathcal{L}(D_{tr}) = \sum_{(x, y) \in D_{tr}} -y \log \mathrm{softmax}(w^{T}x)$

x and y are the training features and labels, respectively. Using the TensorFlow API, the loss can be computed with:

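# Predicted class probabilities for each example in the batch
prediction = tf.nn.softmax(tf.matmul(x, W) + b)
# Cross-entropy per example (summed over classes), averaged over the batch
loss = tf.reduce_mean(-tf.reduce_sum(y * tf.log(prediction), reduction_indices=1))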
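As a sanity check on the loss above, the same quantity can be computed by hand without TensorFlow. The following is a minimal pure-Python sketch, not part of the provided templates; the function names softmax, cross_entropy and mean_loss are illustrative. It mirrors the batch-averaged form used in the snippet above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot, probs):
    """-sum_k y_k * log(p_k) for a single example with a one-hot label."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)

def mean_loss(batch_logits, batch_labels):
    """Average the per-example cross-entropy over the batch."""
    losses = [cross_entropy(y, softmax(z))
              for z, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)

# With all-zero logits over 10 classes, every class gets probability 0.1,
# so the loss equals log(10), roughly 2.30.
logits = [[0.0] * 10]
labels = [[1.0] + [0.0] * 9]
print(mean_loss(logits, labels))
```

A model whose logits are all zero should therefore report a loss near 2.30 on MNIST; a very different starting value from such a model may point to a bug in the graph.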

TensorFlow provides a set of standard optimization algorithms, implemented as subclasses of tf.train.Optimizer. You can choose one, for example tf.train.GradientDescentOptimizer, to minimize the loss.

The dataset used to train your model is the MNIST handwritten digits database. TensorFlow provides a convenient API for loading the input data:

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

We provide a set of scripts and templates to help you run your application, including run_code_template.sh, cluster_utils.sh and code_template.py. First modify the cluster specification part of code_template.py according to your cluster’s settings, and then put your code at the bottom of this file. After that, execute run_code_template.sh to run your application.
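For reference, a TensorFlow 1.x cluster specification is a mapping from job names to lists of host:port strings, which is the form tf.train.ClusterSpec accepts. The sketch below builds such a mapping in plain Python. The helper name build_cluster_dict, the host names node-0 to node-3, and port 2222 are hypothetical placeholders, and the layout expected by code_template.py may differ.

```python
def build_cluster_dict(hosts, port=2222, num_ps=1):
    """Return a {"ps": [...], "worker": [...]} mapping of host:port strings.

    The first `num_ps` hosts act as parameter servers; the rest are workers.
    """
    if not 0 < num_ps < len(hosts):
        raise ValueError("need at least one ps host and one worker host")
    addresses = ["{}:{}".format(h, port) for h in hosts]
    return {"ps": addresses[:num_ps], "worker": addresses[num_ps:]}

# Hypothetical host names; replace them with your CloudLab node names.
cluster = build_cluster_dict(["node-0", "node-1", "node-2", "node-3"])
print(cluster)
# A dict of this shape could then be passed to tf.train.ClusterSpec(cluster).
```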

Synchronous and Asynchronous SGD

In distributed mode, the dataset is usually spread among the VMs. In each iteration, the gradients are calculated on each worker machine using its shard of the data. In synchronous mode, the gradients from all workers are accumulated and applied to the model in a single update before the next iteration begins. In asynchronous mode, there is no accumulation step, and each worker updates the model independently.
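The difference between the two modes can be illustrated with a toy, single-process simulation. This is only a sketch under simplifying assumptions, not the TensorFlow mechanism: each "worker" holds one scalar target, the model is a single scalar w, the loss per worker is (w - target)^2 / 2, and asynchrony is modeled by letting every worker compute its gradient from a stale snapshot of w and apply it without averaging.

```python
def grad(w, target):
    """Gradient of (w - target)**2 / 2 with respect to w."""
    return w - target

def sync_sgd(targets, lr, steps, w=0.0):
    """All workers read the same w; their gradients are averaged into one update."""
    for _ in range(steps):
        grads = [grad(w, t) for t in targets]
        w -= lr * sum(grads) / len(grads)
    return w

def async_sgd(targets, lr, steps, w=0.0):
    """Each worker applies its own gradient independently.

    Staleness is modeled by computing every gradient in a round from the
    value of w at the start of that round.
    """
    for _ in range(steps):
        snapshot = w
        for t in targets:
            w -= lr * grad(snapshot, t)  # one update per worker, no averaging
    return w

targets = [1.0, 2.0, 3.0, 4.0]  # one data shard (here a single value) per worker
print(sync_sgd(targets, lr=0.1, steps=200))
print(async_sgd(targets, lr=0.1, steps=200))
```

In this toy setting both variants approach the mean of the targets, 2.5, but the asynchronous version makes one model update per worker per round, so its effective step per round is larger. With a larger learning rate the stale updates can overshoot, which is one reason the two modes may behave differently in your experiments.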

After finishing the implementation, you should also monitor the CPU, memory, and network usage of each VM during training. You can use tools such as dstat or sar, or any other system monitoring tool you prefer.
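If you prefer to collect network numbers from a script instead of dstat or sar, one option on Linux is to sample the byte counters in /proc/net/dev. The sketch below is an illustrative alternative, not a required tool; the function names and the sample text are made up for the example, and only the parsing is exercised here.

```python
def parse_net_dev(text):
    """Parse the contents of /proc/net/dev into {iface: (rx_bytes, tx_bytes)}."""
    counters = {}
    for line in text.splitlines()[2:]:      # the first two lines are headers
        if ":" not in line:
            continue
        name, rest = line.split(":", 1)
        fields = rest.split()
        # Field 0 is received bytes; field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters

def throughput(before, after, seconds):
    """Bytes per second (rx, tx) per interface between two samples."""
    return {
        iface: ((after[iface][0] - rx) / seconds, (after[iface][1] - tx) / seconds)
        for iface, (rx, tx) in before.items()
        if iface in after
    }

# Made-up sample in the /proc/net/dev layout, taken one second apart.
HEADER = ("Inter-|   Receive  |  Transmit\n"
          " face |bytes packets errs drop fifo frame compressed multicast"
          "|bytes packets errs drop fifo colls carrier compressed\n")
sample_1 = HEADER + "  eth0: 5000 40 0 0 0 0 0 0 7000 50 0 0 0 0 0 0\n"
sample_2 = HEADER + "  eth0: 9000 70 0 0 0 0 0 0 8000 60 0 0 0 0 0 0\n"

rates = throughput(parse_net_dev(sample_1), parse_net_dev(sample_2), seconds=1.0)
print(rates)
```

On a real VM, the two samples would come from reading /proc/net/dev twice with a sleep in between.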

Task 1. Implement the LR application and train the model in single-node mode. The Keras API is very easy to use; however, to help you better understand how things work in TensorFlow, we require that you not use it and stick to the original API.

Task 2. Implement the LR application in distributed mode using Sync SGD and Async SGD. Plot the performance and test error for both and explain any similarities or differences. Monitor the CPU, memory, and network usage. Report your observations and determine which resource is the bottleneck.

Task 3. Try different batch sizes and describe the differences you observe.

Task 4. Use TensorBoard to visualize the graphs you created in the LR training process you just ran. TensorBoard is a suite of visualization tools that can visualize your graph or plot quantitative metrics to help you understand, debug and optimize TensorFlow programs. See sampleTensorboard.sh and exampleTensorboard.py for an example. Take screenshots of the graphs and include them in your report.

Part 2: AlexNet

In this part, you will work with AlexNet, a well-known convolutional neural network. It consists of five convolutional layers followed by three fully connected layers, and it uses the ReLU activation function instead of tanh to add non-linearity.
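To get a feel for the size of the network, the spatial dimensions of the feature maps can be traced with the usual convolution arithmetic, output = floor((input + 2 * padding - kernel) / stride) + 1. The sketch below assumes the commonly cited AlexNet configuration (227x227 input, three max-pooling layers, 256 channels after the last convolution); these numbers are assumptions for illustration, and the implementation provided for this assignment may use different values.

```python
def out_size(size, kernel, stride, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, padding); assumed values, not read from the provided code.
LAYERS = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]

def trace_sizes(input_size=227):
    """Return a list of (layer name, output size) pairs for the layers above."""
    sizes = []
    size = input_size
    for name, kernel, stride, pad in LAYERS:
        size = out_size(size, kernel, stride, pad)
        sizes.append((name, size))
    return sizes

sizes = trace_sizes()
for name, size in sizes:
    print(name, size)

# Flattened input to the first fully connected layer, assuming 256 channels.
flattened = sizes[-1][1] ** 2 * 256
print("flattened features:", flattened)
```

Under these assumptions the feature map shrinks from 227 to 6 per side, so the first fully connected layer sees 6 * 6 * 256 = 9216 inputs. The fully connected layers hold most of the parameters, which is relevant when you reason about the network traffic between workers and the parameter server.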

We provide an implementation of AlexNet; you can find it here. It already runs on a single node. To run it, first modify the cluster specification in startserver.py. Then run startservers.sh and execute:

python -m AlexNet.scripts.train --mode single

Your first task is to implement its distributed mode. You only need to complete the distribute method in alexnetmodes.py; we have put hints about what to do in comments under the distribute method. Note that you should always use one parameter server node and multiple worker nodes.

Task 1. Redo Task 2 and Task 3 from Part 1 using AlexNet in sync mode only. You can use the given optimizer instead of SGD.

Task 2. Run AlexNet using two machines. Monitor the CPU, memory, and network usage and compare it to the four-machine scenario.

Task 3. (Optional) Run AlexNet on one GPU machine in CloudLab and compare the results to the experiments above. To use GPU-based machines, you will need to use an appropriate profile and hardware type in CloudLab. Because GPU nodes are not always available, this task is optional.

Deliverables

You should submit a tar.gz file, named assignment2.tar.gz, to your group’s folder in ~cs744-1/handin/groupx. It should contain a brief report (filename: groupx.pdf) and the code for each task. Put the code for each part and each task into separate folders and give them meaningful names. Include a README file for each task with instructions on how to run your code. Also include a run.sh script for each part of the assignment that can re-execute your code on a similar CloudLab cluster.

Acknowledgements

This assignment uses insights from Professor Aditya Akella’s assignment 3 of CS744 Fall 2017 and Professor Mosharaf Chowdhury’s assignment 2 of ECE598 Fall 2017.