Due October 12 at 11:59pm
This assignment is designed to support your in-class understanding of how data analytics stacks work and the factors influencing their performance. You will deploy SPARK, an in-memory big data analytics framework and currently one of the most popular open source projects. As in assignment 1, you will run a set of SQL queries (or "jobs") atop SPARK SQL, using SPARK standalone as the execution framework and HDFS as the distributed filesystem. You will learn how to use various SPARK contexts and how to write your own driver programs and queries using the SPARK API. Finally, you will produce a short report detailing your observations, scripts and takeaways.
After completing this assignment, you should:
For this assignment you will use the same set of 4 VMs we provided for assignment-1. In addition to the steps mentioned in assignment-1, you need to perform the following:
Software deployment

The following steps will help you deploy all the software needed and set the configurations required for this assignment. You need to replace the run.sh script from assignment-1 with an updated version of run.sh on every machine. The updated script incorporates additional information required to run the SPARK stack: it defines new environment variables and new commands to start and stop your SPARK cluster.
Next, we will go through deploying the SPARK stack; the figure below describes the software stack you will deploy. To run in cluster mode, you need to create a slaves file in your $SPARK_CONF_DIR directory. The file should contain the IP address of every VM that will run a Worker daemon, one per line.
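For example, with three of the four VMs acting as workers, the slaves file might look like this (the IP addresses below are placeholders; use your own VMs' addresses):

    # $SPARK_CONF_DIR/slaves -- one Worker host per line
    10.0.0.2
    10.0.0.3
    10.0.0.4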
SPARK

You will use HDFS as the underlying filesystem. You already deployed HDFS for assignment-1. If your HDFS daemons are down, you can start them using the start_hdfs command available through run.sh from your master VM.
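A quick way to confirm the HDFS daemons are up, assuming the Hadoop binaries are on your PATH as set up for assignment-1:

    # On the master VM: the NameNode daemon should be listed
    jps
    # Also from the master VM: report the live DataNodes registered with the NameNode
    hdfs dfsadmin -report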
To become familiar with the terminology: SPARK standalone consists of a set of daemons: a Master daemon, which is the equivalent of the ResourceManager in YARN terminology, and a set of Worker daemons, equivalent to the NodeManager processes. SPARK applications are coordinated by a SparkContext object, which connects to the Master, responsible for allocating resources across applications. Once connected, SPARK acquires Executors on every Worker node in the cluster; these are processes that run computations and store data for your applications. Finally, the application's tasks are handed to Executors for execution. You can read further about the SPARK architecture here.

Now that you are familiar with the SPARK terminology, download spark-1.5.0-bin-hadoop2.6.tgz. Deploy the archive in the /home/ubuntu/software directory on every VM and untar it (tar -xzvf spark-1.5.0-bin-hadoop2.6.tgz). You should also modify hive-site.xml and set the required properties. Finally, you can instantiate the SPARK daemons by running start_spark on your Master VM. To check that the cluster is up and running, verify that a Master process is running on your master VM and a Worker is running on each of your slave VMs. In addition, you can check the status of your SPARK cluster at the following URLs:

SPARK SQL

Unlike Hive, which requires a proper deployment, SPARK SQL is tightly integrated into the SPARK stack. No further action is required in order to enable SPARK SQL.

Workload

For the purpose of this assignment, you will benchmark the same set of SQL queries you used in assignment-1: queries 12, 21, 50, 85 from the TPC-DS benchmark. You should use the same workload generator as before. To run SPARK SQL queries for Question1 you need the following information:
The Hive metastore service must be running; start it on your master VM with hive --service metastore &. This will give you access to the databases you created for assignment-1 and persisted in the Hive metastore.
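If SPARK SQL cannot see those databases, a common reason is that hive-site.xml in $SPARK_CONF_DIR does not point at the metastore service. The sketch below uses the standard hive.metastore.uris property; the host and port are illustrative assumptions (9083 is the metastore's default Thrift port), not values taken from this assignment:

    <configuration>
      <property>
        <!-- Assumed: the metastore service started above listens on the
             default Thrift port 9083 on the master VM. -->
        <name>hive.metastore.uris</name>
        <value>thrift://master_IP:9083</value>
      </property>
    </configuration>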
Once you have the SPARK software stack running, you should answer each of the questions listed below. Question1 will let you evaluate query performance and will provide hands-on experience in handling the framework. Question2 allows you to evaluate performance when data is cached between consecutive runs of the same query. Question3 will give you a basic understanding of how to write simple queries using the SPARK APIs.
In this experiment, you will run each of the queries mentioned above, which are located in /home/ubuntu/workload/sample-queries-tpcds. You will run one query at a time using SPARK SQL. As a general note, please clear the memory cache (sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches") and remove any content from your $SPARK_LOCAL_DIRS on every VM before every query run, in order to avoid any potential caching in memory or on disk from previous runs. It is also recommended to restart the Thrift server every time you do an experiment.
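A convenience sketch for this cleanup step, assuming passwordless ssh between the VMs and a hypothetical vms.txt listing your four VM IP addresses (neither is part of the assignment setup):

    # Clear the OS page cache and the SPARK_LOCAL_DIRS scratch space on every VM.
    # SPARK_LOCAL_DIRS must be defined in the remote shell (e.g., exported via run.sh);
    # the :? guard aborts instead of deleting the wrong path if it is not set.
    while read vm; do
      ssh ubuntu@"$vm" 'sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"'
      ssh ubuntu@"$vm" 'rm -rf "${SPARK_LOCAL_DIRS:?not set}"/*'
    done < vms.txt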
You should start the Thrift server with the following configuration parameters:
start-thriftserver.sh --master spark://master_IP:7077 --driver-memory 1g --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/home/ubuntu/storage/logs --conf spark.executor.memory=21000m --conf spark.executor.cores=4 --conf spark.task.cpus=1 &
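Once the Thrift server is up, one way to submit a query and measure its completion time is through beeline; the JDBC URL below assumes the Thrift server's default port 10000 on the master VM, and the user name and query file name are placeholders:

    # Run one TPC-DS query file against the Thrift server and time it.
    time beeline -u jdbc:hive2://master_IP:10000 -n ubuntu \
      -f /home/ubuntu/workload/sample-queries-tpcds/query21.sql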
For every query, you should compute the following:
Pick query21. In what follows, you will try different values for a specific parameter and see whether performance improves or not. Pick and fix the best value for that parameter and move to the next.
More specifically, you have to tune the following parameters:
In this experiment you will analyze performance in the presence of failures. For query 21, you should trigger two types of failures on a Worker VM of your choice when the job reaches 25% and 75% of its lifetime.
The two failure scenarios you should evaluate are the following:
You should analyze the variation in job completion time. Which type of failure impacts performance more? Explain your observations.
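One minimal way to time the injection, assuming you have already measured an unperturbed completion time T for query 21; the failure command itself depends on the scenario you are evaluating and is left as a placeholder:

    # Hypothetical helper: inject a failure on one worker at a fraction of the job's lifetime.
    T=600            # example value; use your measured baseline completion time in seconds
    WORKER_IP=...    # the Worker VM you chose
    FRACTION=0.25    # use 0.75 for the second set of runs
    sleep "$(echo "$T * $FRACTION" | bc)"
    # Replace the next line with the failure you are injecting for this scenario.
    ssh ubuntu@"$WORKER_IP" '<failure command for this scenario>'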
By default, neither SPARK nor SPARK SQL automatically persists RDDs in memory across jobs. Your task is to evaluate performance when various RDDs are persisted across consecutive runs of the same job. Unfortunately, the beeline tool used in Question1 does not allow you to persist RDDs. You need to encapsulate your SQL query in a Scala or Python script in which you have fine-grained control over the different RDDs required for the query.
More precisely you should do the following:
For your report you should:
Note that for this question, you can use the sql method provided by SqlContext or HiveContext.
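A minimal PySpark sketch of this idea, targeting Spark 1.5 and submitted with spark-submit; the database name, query file path, and cached table below are placeholders to replace with the ones relevant to your query:

    import time
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(conf=SparkConf().setAppName("q2-caching"))
    sqlContext = HiveContext(sc)  # gives access to the databases in the Hive metastore

    sqlContext.sql("USE tpcds_db")  # placeholder database name
    # Placeholder path; assumes the file holds a single SELECT (strip any trailing ';').
    query = open("/home/ubuntu/workload/sample-queries-tpcds/query21.sql").read()

    # Persist an input table (or any intermediate DataFrame) in memory across runs.
    sqlContext.cacheTable("inventory")  # placeholder table name

    for run in range(2):
        start = time.time()
        sqlContext.sql(query).collect()  # consecutive runs of the same query
        print("run %d took %.1f s" % (run, time.time() - start))

    sc.stop()

You would launch it with spark-submit --master spark://master_IP:7077 script.py, adding the same --conf settings as in Question1 if you want comparable executor resources.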
The goal here is for you to write a query using the SPARK API that, given two tables, will generate a specific output. You will run the query using spark-submit. Note that your script should not use a SqlContext or a HiveContext.
The figure below shows the two tables involved in the query. The output of your script should contain the names of the five products with the best sales, sorted in descending order, and it should be persisted as a single file in HDFS.
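One way to structure such a script in PySpark using only RDD operations (no SqlContext or HiveContext) is sketched below. The delimiter and column layout of product.txt and sales.txt are assumptions here, since the actual schemas are given in the figure, and the HDFS paths are placeholders:

    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName("q3-top-products"))

    # Assumed layout: product.txt -> "product_id,product_name"
    #                 sales.txt   -> "product_id,amount"
    products = sc.textFile("hdfs:///data/product.txt") \
                 .map(lambda line: line.split(",")) \
                 .map(lambda f: (f[0], f[1]))            # (product_id, name)
    sales = sc.textFile("hdfs:///data/sales.txt") \
              .map(lambda line: line.split(",")) \
              .map(lambda f: (f[0], float(f[1]))) \
              .reduceByKey(lambda a, b: a + b)           # total sales per product_id

    # Join on product_id, rank by total sales, and keep the five best product names.
    top5 = products.join(sales) \
                   .map(lambda kv: (kv[1][1], kv[1][0])) \
                   .sortByKey(ascending=False) \
                   .map(lambda sv: sv[1]) \
                   .take(5)

    # take() brings the result to the driver; re-parallelize with a single
    # partition so the output is persisted as one file in HDFS.
    sc.parallelize(top5, 1).saveAsTextFile("hdfs:///output/top5_products")

    sc.stop()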
When you write your script, you can use the same values for the SPARK properties as for Question2. You need to copy the input data provided (product.txt, sales.txt) into HDFS and load it into your script. For your report you should attach the script you wrote for the query.

Deliverables

You should provide a brief write-up (single column) with your answers to the questions listed above. In addition, you should attach the source files used for questions 2 and 3. You should submit a group-X.tar.gz archive by placing it in your assign2 hand-in directory for the course: ~cs838-3/assign2.
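For example, packaging and handing in might look like the following; the report and directory names inside the archive are placeholders, and X is your group number:

    # Bundle the report and the source files for questions 2 and 3, then hand in.
    tar -czvf group-X.tar.gz report.pdf question2/ question3/
    cp group-X.tar.gz ~cs838-3/assign2/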
Clarifications