Assignment 1 FAQs and Best Practices

FAQs

  1. I have signed up for Azure using a non-wisc account. Is there a way to increase the limit on the number of cores, or do I need to create another account?
     Ans: Yes, you can upgrade your free trial to pay-as-you-go using the steps specified here. You will still have your $200 Azure credits. The only catch is that if you go over the limit, Azure will automatically charge your credit card.

  2. I am unable to log in to my Azure VM and I am getting the port 22: Connection refused error. How do I proceed?
     Ans: A few groups have encountered this error intermittently on some VMs. It can usually be resolved by running the following ACLI command for the affected VM: azure vm reset-access -g group<group_number> -n <vm name> -r. If running this command does not solve the issue, try following the detailed steps mentioned here.

  3. I am unable to access my VM using my password. What do I do?
     Ans: This may happen if your password does not meet Azure's standard requirements. You can reset your password via the Azure portal or through the ACLI using the command mentioned in the link given in Q2.

  4. On starting my VM after deallocation, I am not able to see /workspace. What needs to be done so that I can access the contents of the data disk?
     Ans: There is a minor glitch in Azure that causes the temporary disk to have 2 entries in the /proc/mounts file, so the output of df -H cannot be used to determine the correct mount point for the data disk. Instead, use sudo fdisk -l to identify the disk device of the data disk and mount it appropriately. From assignment 0, you should know that a VM in our setup has 3 disks on start-up: an OS disk (capacity ~30 GB, mounted at /), a temporary disk (capacity ~215 GB, mounted at /mnt), and a data disk (capacity ~50 GB, mounted at /workspace). Using the aforementioned command and this information, you should be able to identify the disk device corresponding to the data disk and mount it at /workspace.
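The identification step above can be scripted. The sketch below is a minimal, hedged example of picking out the ~50 GB data disk from the output of sudo fdisk -l; the exact output format varies across fdisk versions, so the regex and the tolerance are assumptions you may need to adjust.

```python
import re

def find_data_disk(fdisk_output, target_gb=50, tolerance_gb=5):
    """Return the disk device whose size is near target_gb, or None.

    Assumes `fdisk -l` lines of the form "Disk /dev/sdX: 50.0 GB, ..."
    (an assumption -- newer fdisk versions print sizes differently).
    """
    for line in fdisk_output.splitlines():
        m = re.match(r"Disk (/dev/\w+): ([\d.]+) GB", line)
        if m and abs(float(m.group(2)) - target_gb) <= tolerance_gb:
            return m.group(1)
    return None
```

Once the device is identified, you would mount its partition at /workspace (for example, sudo mount /dev/sdc1 /workspace, where the device name is illustrative, not guaranteed).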

  5. To compute the amount of read/write bandwidth for the network, should we simply inspect the counters in /proc/net/dev before and after the query, or should we use per-port tools like iftop?
     Ans: As stated in the assignment, you can extract the values of the relevant interface from /proc/net/dev before and after the query.
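A minimal sketch of the before/after approach, assuming the standard Linux /proc/net/dev layout (RX bytes in the first column after the interface name, TX bytes in the ninth); the interface name you pass in depends on your VM:

```python
def read_net_counters(dev_text, iface):
    """Return (rx_bytes, tx_bytes) for iface from /proc/net/dev text."""
    for line in dev_text.splitlines():
        if ":" not in line:
            continue  # skip the two header lines
        name, fields = line.split(":", 1)
        if name.strip() == iface:
            cols = fields.split()
            # Column 0 is cumulative RX bytes; column 8 is TX bytes.
            return int(cols[0]), int(cols[8])
    raise ValueError("interface %r not found" % iface)

def bandwidth_used(before, after):
    """Bytes received/transmitted between two counter snapshots."""
    return after[0] - before[0], after[1] - before[1]
```

In practice you would read the file once before submitting the query and once after it finishes, then subtract the two snapshots.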

  6. For disk usage, should we look at all 3 disks and sum them up?
     Ans: Yes, that would cover all the disk activity performed during the query execution.

  7. Should the statistics be collected for all 4 VMs?
     Ans: You are required to calculate the network and disk activity for the entire query execution. In the given scenario, the slaves do the actual query execution, so you should collect the network/storage bandwidth on every node where tasks can run. If you have decided to run slave instances on the master node as well, then you should collect the counters from the master node too.

  8. Should we have 4 curves in our plot for network/disk bandwidth?
     Ans: No, we are interested in the total network/disk activity for the entire query. Sum up the counters from every node before plotting.
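The summation can be sketched as follows, assuming you sample cumulative counters on every node at the same fixed interval (the alignment of samples across nodes is an assumption of this sketch):

```python
def deltas(samples):
    """Per-interval activity from cumulative counter samples."""
    return [b - a for a, b in zip(samples, samples[1:])]

def total_activity(per_node_samples):
    """One aggregate curve: per-interval deltas, summed across nodes.

    per_node_samples: one list of cumulative counter samples per node,
    all sampled the same number of times.
    """
    per_node_deltas = [deltas(s) for s in per_node_samples]
    return [sum(vals) for vals in zip(*per_node_deltas)]
```

The resulting single list is what you would plot against time, rather than one curve per VM.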

  9. In 1.c, my understanding is that tasks that aggregate data are reduce tasks and tasks that read data from HDFS are map tasks. Is this correct?
     Ans: Yes, in MapReduce the map tasks always read from HDFS and the reduce tasks are the aggregators.

  10. What is meant by task distribution over query lifetime? Is it a summary of task-to-VM assignment?
      Ans: No, you are interested in query-level distributions, not slave/VM-level ones. The task distribution over the query lifetime should have the X-axis representing time, and for any point in time during the query execution the Y-axis should show the total number of tasks running. You may additionally show, for any point in time, the number of map tasks and the number of reduce tasks running.
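Computing such a distribution can be sketched as below, assuming you have extracted a (start, end) time interval per task from the logs (the half-open [start, end) convention is a choice of this sketch):

```python
def tasks_running_over_time(intervals, sample_times):
    """For each sample time, count tasks whose [start, end) covers it.

    intervals: list of (start, end) pairs, one per task.
    sample_times: points in time at which to evaluate the curve.
    """
    return [sum(1 for s, e in intervals if s <= t < e)
            for t in sample_times]
```

Running it separately on the map-task intervals and the reduce-task intervals gives the two optional per-type curves mentioned above.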

  11. For Q2, is performance just total time or other metrics as well?
      Ans: You can measure performance in terms of total execution time. However, you may optionally also look at other metrics such as disk/network bandwidth to explain your findings.

  12. For Q3, for Hive/MR we can determine the progress of a job/app via the /cluster/app/ webpage (is this the best way?). Since a query is broken into several jobs, we are having trouble figuring out how to determine 25% progress for the entire query. We were thinking of simply checking the tez.out file periodically to determine when 25% of the total tasks were completed.
      Ans: Just to clarify, in Q3 "when the job reaches 25% and 75% of its lifetime" refers to the lifetime of the entire query, not just one particular Hadoop job. That said, there are multiple ways to determine 25% progress; one is the approach you suggested. Another simple approach is to run the query to completion once and record its total execution time, then use that time to approximate when 25% or 75% progress is reached.
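The second approach amounts to simple arithmetic on a previously measured runtime; a minimal sketch (the assumption being that the query's runtime is roughly repeatable across runs):

```python
def progress_times(start_time, total_runtime, fractions=(0.25, 0.75)):
    """Approximate wall-clock times at which the query reaches the
    given fractions of its lifetime, given a measured total runtime."""
    return [start_time + f * total_runtime for f in fractions]
```

For a query that starts at t=100 s and previously took 200 s end to end, this places the 25% and 75% points at t=150 s and t=250 s.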

Best Practices

  1. Always remember to deallocate all your VMs. Otherwise, Azure continues to charge for compute hours.
  2. In case you have set up the cluster to carry out the various experiments, ensure that no Java processes are running before you initiate the deallocation operation. This can be achieved by running the stop_all command on the master VM.
  3. We encourage students to write helper scripts where applicable in the assignment.
  4. We encourage students to copy all their data to a local machine and then process it. This ensures that you do not waste Azure credits on processing files; use the cluster only to run the experiments and do the processing offline.
  5. The contents of the temporary disk (mounted on /mnt) will be lost when VMs are started after deallocation. Students are encouraged to copy the required contents of that disk to a local machine.