In this assignment, you will learn to write TensorFlow applications. You will start by setting up the cluster and running workloads on a single machine. The next task is to modify the workloads so that they can be launched in a distributed way. You will experiment with both synchronous and asynchronous SGD. Finally, you will look closely at CPU, memory, and network usage to better understand the performance of your workloads.
Please start only one experiment per group on CloudLab. Resources are limited, so we cannot have more than one experiment running per group.
After completing this programming assignment, you should be able to:
In this assignment, we provide you with a CloudLab profile called “744-S19-assignment2-3node” under the “CS744-S19-assignment2” project for you to start your experiment. The profile is a simple 3-node cluster of machines with Ubuntu 16 installed on each.
As in assignment 1, you should first make sure that the machines can ssh to each other. To install Tensorflow, run the following commands on each machine:
sudo apt-get update
sudo apt-get install --assume-yes python-pip python-dev
sudo pip install tensorflow
In this part, you will implement a simple Logistic Regression application. Building a Tensorflow application mainly consists of two steps: building the computational graph, known as tf.Graph, and running it using tf.Session. A graph is a series of Tensorflow operations. The graph is submitted to the cluster through a session.
Your application should start from a random linear vector w as your model. Then, calculate the loss of your model on the dataset, where x and y are the training data features and labels, respectively. Using the Tensorflow API, the loss can be computed through:
prediction = tf.nn.softmax(tf.matmul(x, W) + b)
loss = tf.reduce_mean(-tf.reduce_sum(y*tf.log(prediction), reduction_indices=1))
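To make the two lines above concrete, here is a minimal pure-Python sketch of the same computation: a softmax over x W + b, followed by the mean over examples of the negative sum of y * log(prediction). The helper names and the toy data are hypothetical and exist only for illustration; your assignment code should still use the Tensorflow API.

```python
import math

def softmax(logits):
    # Subtracting the max logit avoids overflow and does not change the result.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(x, y, W, b):
    """Mean over examples of -sum_j y[j] * log(softmax(x W + b)[j])."""
    losses = []
    for features, label in zip(x, y):
        logits = [sum(f * W[i][j] for i, f in enumerate(features)) + b[j]
                  for j in range(len(b))]
        probs = softmax(logits)
        losses.append(-sum(t * math.log(p) for t, p in zip(label, probs)))
    return sum(losses) / len(losses)

# Hypothetical toy data: 2 examples, 2 features, 3 classes, all-zero model.
x = [[1.0, 2.0], [0.5, -1.0]]
y = [[1, 0, 0], [0, 0, 1]]
W = [[0.0] * 3 for _ in range(2)]
b = [0.0] * 3

loss = cross_entropy_loss(x, y, W, b)
print(loss)  # with an all-zero model every class gets probability 1/3
```

With all-zero weights the prediction is uniform, so the loss equals log(3); this is a quick sanity check you can also apply to a freshly initialized model.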
Tensorflow provides a set of standard optimization algorithms, which are implemented as sub-classes of tf.train.Optimizer. You can choose one, for example tf.train.GradientDescentOptimizer, to minimize the loss.
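As a rough illustration of what a gradient descent optimizer does when it minimizes this loss, the sketch below performs plain full-batch gradient descent on a softmax model in pure Python. It is not how tf.train.GradientDescentOptimizer is implemented; the function names, the toy data, and the learning rate are assumptions chosen for the example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def predict(features, W, b):
    logits = [sum(f * W[i][j] for i, f in enumerate(features)) + b[j]
              for j in range(len(b))]
    return softmax(logits)

def mean_loss(x, y, W, b):
    total = 0.0
    for features, label in zip(x, y):
        probs = predict(features, W, b)
        total -= sum(t * math.log(p) for t, p in zip(label, probs))
    return total / len(x)

def gradient_descent_step(x, y, W, b, learning_rate):
    """One full-batch update of W and b, applied in place."""
    n, d, k = len(x), len(W), len(b)
    grad_W = [[0.0] * k for _ in range(d)]
    grad_b = [0.0] * k
    for features, label in zip(x, y):
        probs = predict(features, W, b)
        for j in range(k):
            # For softmax + cross-entropy, d(loss)/d(logit_j) = prob_j - y_j.
            err = (probs[j] - label[j]) / n
            grad_b[j] += err
            for i in range(d):
                grad_W[i][j] += features[i] * err
    for j in range(k):
        b[j] -= learning_rate * grad_b[j]
        for i in range(d):
            W[i][j] -= learning_rate * grad_W[i][j]

# Hypothetical toy problem: 3 examples, 2 features, 2 classes.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [[1, 0], [0, 1], [1, 0]]
W = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]

loss_before = mean_loss(x, y, W, b)
for _ in range(50):
    gradient_descent_step(x, y, W, b, learning_rate=0.5)
loss_after = mean_loss(x, y, W, b)
print(loss_before, loss_after)
```

In Tensorflow the gradient computation and the update are generated for you from the graph; the sketch only shows the arithmetic that one update step performs.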
The dataset you will use to train your model is the MNIST handwritten digits database. Tensorflow provides a convenient API for loading the input data:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
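The one_hot=True argument means each label is returned as a 10-element vector rather than a single digit, which is the form the loss above expects for y. A minimal sketch of that encoding, using a hypothetical helper name:

```python
def one_hot(label, num_classes=10):
    """Encode a digit label as a vector with a single 1.0 at index `label`."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

encoded = one_hot(3)
print(encoded)
```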
We provide you with a set of scripts and templates to help you run your application, including run_code_template.sh, cluster_utils.sh and code_template.py. You first need to modify the cluster specification part of code_template.py according to your cluster’s setting and then put your code at the bottom of this file. After that, you can execute run_code_template.sh to run your application.
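A cluster specification in Tensorflow 1.x is a mapping from job names to lists of "host:port" addresses, which is the form tf.train.ClusterSpec accepts. The sketch below builds such a mapping in plain Python. The hostnames, port numbers, and the helper name are hypothetical, and the layout in code_template.py may differ, so treat this only as a picture of the shape of the data.

```python
def make_cluster_spec(hosts, ps_port=2222, worker_port=2223):
    """Build a job-name -> address-list mapping.

    Assumption: the first host runs the parameter server and every host
    runs one worker. Hostnames and ports are placeholders.
    """
    return {
        "ps": ["%s:%d" % (hosts[0], ps_port)],
        "worker": ["%s:%d" % (h, worker_port) for h in hosts],
    }

cluster = make_cluster_spec(["node0", "node1", "node2"])
print(cluster)
```

Each process in the cluster is then identified by a job name and a task index, for example worker task 1 corresponds to the second entry of the "worker" list.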
In distributed mode, the dataset is usually spread among the VMs. On each iteration, the gradients are calculated on each worker machine using its shard of data. In synchronous mode, the gradients from all workers are accumulated to update the model before moving on to the next iteration. In asynchronous mode, however, there is no accumulation step and the worker nodes update the model independently.
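The difference between the two modes can be sketched with a deterministic single-process simulation. This is a simplification under stated assumptions: the model is a single number w, each worker's loss is the mean of 0.5 * (w - t)^2 over its shard, and "asynchronous" is modeled by letting every worker compute its gradient from the value of w it read at the start of a round and then apply it on its own. Real asynchronous training interleaves workers nondeterministically; all names and data here are hypothetical.

```python
def shard_gradient(w, shard):
    # Gradient of the mean of 0.5 * (w - t)^2 over one worker's shard.
    return sum(w - t for t in shard) / len(shard)

def sync_sgd(shards, w, lr, steps):
    for _ in range(steps):
        # Every worker computes its gradient from the same model value.
        grads = [shard_gradient(w, s) for s in shards]
        # The gradients are aggregated into a single update per iteration.
        w -= lr * sum(grads) / len(grads)
    return w

def async_sgd(shards, w, lr, steps):
    for _ in range(steps):
        stale = w  # each worker read the model at the start of the round
        for s in shards:
            # Each worker applies its own update; there is no aggregation,
            # so later updates are based on a stale copy of the model.
            w -= lr * shard_gradient(stale, s)
    return w

# Three workers, each holding a shard of the (toy) dataset.
shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w_sync = sync_sgd(shards, w=0.0, lr=0.1, steps=200)
w_async = async_sgd(shards, w=0.0, lr=0.1, steps=200)
print(w_sync, w_async)  # both approach the dataset mean, 3.5
```

In this toy setting both variants reach the same minimum, but the asynchronous one applies three separate updates per round instead of one averaged update, so its effective step is larger and its intermediate values differ. On a real cluster that trade-off shows up as differences in convergence behavior and in how long workers wait for each other.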
After finishing the implementation, you should also monitor the CPU/Memory/Network usage of each VM during training. You can use tools such as dstat or sar, or any other system monitoring tool you prefer.
You can use TensorBoard to visualize the graphs you created in the LR training process you just ran. TensorBoard is a suite of visualization tools that can visualize your graph or plot quantitative metrics to help you understand, debug and optimize TensorFlow programs. See sampleTensorboard.sh and exampleTensorboard.py for an example.
Task 1. Implement the LR application and train the model in single-node mode. The Keras API is very easy to use; however, to help you better understand how things work in Tensorflow, we require you not to use it and to stick to the original API.
Task 2. Implement the LR application in distributed mode using Sync SGD and Async SGD. Plot the performance and test error for both of them and explain any similarities or differences. Monitor the CPU/Memory/Network usage. Report your observations and determine which resource is the bottleneck.
Task 3. (Optional for small EC) Try different batch sizes and see the difference.
In this part, you will work with AlexNet, a well-known convolutional neural network. It consists of five convolutional layers followed by three fully connected layers. It uses the ReLU activation function instead of tanh to add non-linearity.
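For reference, the two activation functions mentioned above behave quite differently for large inputs: tanh saturates toward 1, while ReLU passes positive values through unchanged and zeroes out negative ones. A small sketch, with a hypothetical helper name:

```python
import math

def relu(value):
    """Rectified linear unit: max(0, value)."""
    return max(0.0, value)

for v in (-2.0, 0.5, 5.0):
    print(v, relu(v), math.tanh(v))
```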
Here, we provide you with an implementation of AlexNet; you can find it here. It already runs on a single node. To run it, you may first need to modify the cluster specification in startserver.py. Then, run startservers.sh and do:
python -m AlexNet.scripts.train --mode single
Your first task is to implement its distributed mode. You only need to complete the distribute method in alexnetmodes.py. We have put some hints about what to do in the form of comments under the distribute method. Note that you should always use one parameter server node and multiple worker nodes.
Task 1. Redo Task 2 from Part 1 using AlexNet in sync mode only. You can use the given optimizer instead of SGD.
Task 2. Run AlexNet using two machines. Monitor the CPU/Memory/Network usage and compare it to the three-machine scenario. Remember that you will need to modify startservers.sh to run in the correct mode.
Task 3. (Optional - no EC) Run AlexNet on one GPU machine in CloudLab and compare the results with the experiments above. To use GPU-based machines, you will need to use an appropriate profile and hardware type in CloudLab. Because GPU nodes are not always available, this task is optional.
You should submit a tar.gz file to Canvas, which consists of a brief report (filename: groupx.pdf) and the code of each task (you are in your groups on Canvas, so only one person needs to submit it). Put the code of each part and each task into separate folders and give them meaningful names. Also include a README file for each task with instructions on how to run your code, and a run.sh script for each part of the assignment that can re-execute your code on a similar CloudLab cluster.
This assignment uses insights from Professor Mosharaf Chowdhury’s assignment 2 of ECE598 Fall 2017.