Ans: Yes, you can upgrade your free trial to pay-as-you-go using the steps specified here. You will still keep your $200 Azure credits. The only catch is that in case you go over the limit, Azure will automatically charge the overage to your credit card.
Ans: A few groups have encountered this error sporadically on some VMs. The issue can usually be resolved by running the following ACLI command for the affected VM: azure vm reset-access -g group<group_number> -n <vm name> -r. If running this command does not solve the issue, try the detailed steps mentioned here.
Ans: This may happen if your password does not meet Azure's standard password requirements. You can reset your password via the Azure portal or through the ACLI using the command mentioned in the link given in Q2.
Ans: Due to a minor glitch in Azure, the temporary disk has two entries in the /proc/mounts file, so the output of df -H cannot be used to determine the appropriate mount point for the data disk. Instead, use sudo fdisk -l to infer the disk device of the data disk and mount it appropriately. On completing Assignment 0, students should know that a VM in our setup starts up with three disks associated with it: an OS disk (capacity ~30 GB, mounted at /), a temporary disk (capacity ~215 GB, mounted at /mnt), and a data disk (capacity ~50 GB, to be mounted at /workspace). Using the aforementioned command and this information, you should be able to figure out the disk device corresponding to the data disk and mount it at /workspace.
Ans: As stated in the assignment, you can read the counters for the relevant interface from /proc/net/dev before and after the query and take the difference.
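A minimal sketch of reading those counters, assuming the standard /proc/net/dev layout (receive bytes in the first field after the interface name, transmit bytes in the ninth); the interface name "eth0" is an assumption, so check your VM's actual interface first:

```python
# Parse /proc/net/dev and return cumulative (rx_bytes, tx_bytes) for one
# interface. Sample once before and once after the query; the difference
# is the network traffic during execution.

def net_bytes(text, iface):
    """Return (rx_bytes, tx_bytes) for iface from /proc/net/dev content."""
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip the two header lines
        name, counters = line.split(":", 1)
        if name.strip() == iface:
            fields = counters.split()
            # field 0 = receive bytes, field 8 = transmit bytes
            return int(fields[0]), int(fields[8])
    raise ValueError(f"interface {iface} not found")

def read_net_bytes(iface="eth0"):  # "eth0" is an assumption
    with open("/proc/net/dev") as f:
        return net_bytes(f.read(), iface)
```

Call read_net_bytes() immediately before submitting the query and immediately after it finishes, then subtract the two samples element-wise.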
Ans: Yes, that would include all the disk activity done during the query execution.
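For the disk side, one common counter source is /proc/diskstats (an assumption; the assignment may point you at another source). A sketch, assuming the data disk shows up as device "sdc" (use whatever device you identified with fdisk):

```python
# Parse /proc/diskstats and return cumulative (read_bytes, written_bytes)
# for one block device. /proc/diskstats reports 512-byte sectors: fields
# 6 and 10 of each line are sectors read and sectors written.

SECTOR_SIZE = 512  # /proc/diskstats always counts 512-byte sectors

def disk_bytes(text, device):
    """Return (read_bytes, written_bytes) for device from /proc/diskstats."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[2] == device:
            sectors_read, sectors_written = int(fields[5]), int(fields[9])
            return sectors_read * SECTOR_SIZE, sectors_written * SECTOR_SIZE
    raise ValueError(f"device {device} not found")
```

As with the network counters, sample before and after the query and take the difference.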
Ans: You are required to measure the network and disk activity for the entire query execution. In the given scenario, the slaves do the actual query execution, so you should collect the network/storage counters on every node where tasks can run. If you have decided to run slave instances on the master node as well, then you should collect the counters from the master node too.
Ans: No, we are interested in the total network/disk activity for the entire query. You should sum the counters from every node before plotting.
Ans: Yes. In MR, the mappers always read from HDFS, and the reducers act as aggregators.
Ans: No, you are interested in query-level distributions, not slave/VM-level ones. The task distribution over the query lifetime should have time on the X-axis, and the Y-axis should show, for any point in time during the query execution lifetime, the total number of tasks running at that instant. You can additionally break this down into the number of map tasks and the number of reduce tasks running at each point.
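The distribution above can be sketched as follows, assuming you have extracted per-task (start, end) timestamps (e.g. from the Hadoop job history logs; the interval values below are hypothetical):

```python
# Count how many tasks are running at each sampling instant over the
# query lifetime; plot times (X) against the resulting counts (Y).

def tasks_running(intervals, times):
    """For each t in times, count intervals [start, end) that contain t."""
    return [sum(1 for s, e in intervals if s <= t < e) for t in times]

# Hypothetical map/reduce task intervals, in seconds since query start:
maps = [(0, 40), (0, 50), (10, 60)]
reduces = [(45, 90), (50, 100)]
times = list(range(0, 101, 10))  # sample every 10 s

y_maps = tasks_running(maps, times)
y_reduces = tasks_running(reduces, times)
y_total = [m + r for m, r in zip(y_maps, y_reduces)]
```

Plot times against y_total for the required graph; y_maps and y_reduces give the optional map/reduce breakdown.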
Ans: You can measure performance in terms of total execution time. However, you may optionally also want to look at other metrics, such as disk/network bandwidth, to explain your findings.
Ans: Just to clarify, in Q3 "when the job reaches 25% and 75% of its lifetime" refers to the lifetime of the entire query, not just one particular Hadoop job. That said, there are multiple ways to determine 25% progress; one is the way you suggested. A simpler approach is to run the query completely once to learn its total execution time, and then use that time to approximate the 25% and 75% points.
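The simpler approach amounts to a one-line calculation; here is a sketch, where the 400-second total is a hypothetical value from a prior complete run:

```python
# Approximate the instants at which the query reaches given progress
# fractions, using the total execution time of a prior complete run.

def progress_times(start_time, total_seconds, fractions=(0.25, 0.75)):
    """Return the timestamps at which each progress fraction is reached."""
    return [start_time + f * total_seconds for f in fractions]

# e.g. a query starting at t=0 that took 400 s on the prior run:
print(progress_times(0, 400))  # -> [100.0, 300.0]
```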