CS 368-4 (2012 Fall) — Day 7 Homework

Due Thursday, November 15, at the start of class.

Goal

Analyze a complicated log file from HTCondor and report on where jobs are being run.

Mostly, this is an exercise in using regular expressions. But the data being analyzed come from HTCondor, which is the software that manages the CHTC resources and their jobs here on campus. While it is too much to explain all of HTCondor and why these particular log entries might be interesting, at least you will get a hint of what is going on. And once we start talking about CHTC and HTCondor, the data may begin to make more sense.

Background Information

HTCondor is a distributed high-throughput computing system developed and maintained here at the University of Wisconsin–Madison in the Computer Sciences department. It is the software that runs the CHTC facility. Essentially, HTCondor:

Your script will analyze one HTCondor log file to figure out which exact execution machine each job ran on.

Tasks

Essentially, this script is similar to the word-frequency counting part of the script from homework #4. That is, the script will count how often certain events occurred and report on them.

However, there are significant differences this time:

The log file is available for download: homework-07-input.txt (4.1 MB, ~52K lines).

The precise description of the task requires many words and examples, but fortunately, the code is fairly easy to write in Python! For reference, my solution is less than 40 lines of code, and I wrote two functions to make things simpler and clearer. It takes much less than one second to run on the whole input file.

Selecting the Interesting Lines

Here are a few lines from the file, with one example of an interesting line highlighted (the line that ends with “was ACCEPTED”):

11/03/11 05:38:07 ******************************************************
11/03/11 05:38:07 ** condor_shadow (CONDOR_SHADOW) STARTING UP
11/03/11 05:38:07 ** /usr/sbin/condor_shadow
11/03/11 05:38:07 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
11/03/11 05:38:07 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
11/03/11 05:38:07 ** $CondorVersion: 7.7.2 Sep 27 2011 BuildID: 372978 PRE-RELEASE-UWCS $
11/03/11 05:38:07 ** $CondorPlatform: x86_64_rhas_3 $
11/03/11 05:38:07 ** PID = 31257
11/03/11 05:38:07 ** Log last touched 11/3 05:38:07
11/03/11 05:38:07 ******************************************************
11/03/11 05:38:07 Using config source: /etc/condor/condor_config
11/03/11 05:38:07 Using local config sources: 
11/03/11 05:38:07    /etc/condor/condor_config.local
11/03/11 05:38:07    /etc/condor/condor_config.global
11/03/11 05:38:07    /etc/condor/condor_config.special
11/03/11 05:38:07 DaemonCore: command socket at <128.104.55.9:46274?noUDP>
11/03/11 05:38:07 DaemonCore: private command socket at <128.104.55.9:46274>
11/03/11 05:38:07 Setting maximum accepts per cycle 8.
11/03/11 05:38:07 Initializing a VANILLA shadow for job 4786121.0
11/03/11 05:38:07 (4786121.0) (31257): Request to run on glidein_12311@compute-2-3.nys1 <192.168.9.242:34711?CCBID=128.104.59.136:9628#195853&noUDP> was DELAYED (previous job still being vacated)
11/03/11 05:38:08 (4786121.0) (31257): Request to run on glidein_12311@compute-2-3.nys1 <192.168.9.242:34711?CCBID=128.104.59.136:9628#195853&noUDP> was ACCEPTED
11/03/11 05:38:09 (4786116.0) (30040): Job 4786116.0 terminated: exited with status 8
11/03/11 05:38:09 (4786116.0) (30040): Reporting job exit reason 100 and attempting to fetch new job.

Notice the line immediately preceding the highlighted one. It is very similar in many respects, but ends differently. For this assignment, it is not an interesting line. For what it’s worth, the interesting lines reflect jobs that are accepted to be run on remote machines.

Below, I have extracted several of the interesting lines for comparison. Of course, there are lots of other log lines omitted between these samples.

11/03/11 08:06:50 (4744634.0) (6423): Request to run on slot3@tux-106.cae.wisc.edu <144.92.242.23:54087> was ACCEPTED
11/03/11 08:06:50 (4786880.0) (6226): Request to run on glidein_13822@compute-10-24.local <10.6.2.231:42106?CCBID=128.104.59.136:9641#191474&noUDP> was ACCEPTED
11/03/11 08:06:50 (4744653.0) (6426): Request to run on slot4@tux-106.cae.wisc.edu <144.92.242.23:54087> was ACCEPTED
11/03/11 08:06:50 (4786881.0) (6232): Request to run on glidein_16278@compute-11-14.local <10.6.2.209:34617?CCBID=128.104.59.136:9663#193511&noUDP> was ACCEPTED
11/03/11 08:06:50 (4744883.0) (6430): Request to run on slot1@tux-95.cae.wisc.edu <144.92.242.209:54084> was ACCEPTED
11/03/11 08:06:50 (4786882.0) (6227): Request to run on glidein_23769@compute-10-25.local <10.6.2.230:53123?CCBID=128.104.59.136:9649#201522&noUDP> was ACCEPTED

Notice that they all have the following format, where the parts that vary from line to line are highlighted (the placeholders DATE, TIME, JOB-ID, PID, SLOT, HOST, and HOST-ID):

DATE TIME (JOB-ID) (PID): Request to run on SLOT@HOST <HOST-ID> was ACCEPTED

Important: Your script must count information from these lines and only these lines.
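As a rough illustration of the format above, here is one way a regular expression could pick out the interesting lines and capture the SLOT and HOST parts. This is only a sketch, not the reference solution: the names ACCEPTED_RE and parse_accepted are made up for this example, and the pattern assumes that JOB-ID and PID look like the ones in the samples (digits and dots inside parentheses).

```python
import re

# Hypothetical pattern: matches only lines of the form
#   DATE TIME (JOB-ID) (PID): Request to run on SLOT@HOST <HOST-ID> was ACCEPTED
# Assumes JOB-ID is digits and dots, and PID is digits, as in the samples.
ACCEPTED_RE = re.compile(
    r'^\S+ \S+ \([\d.]+\) \(\d+\): '
    r'Request to run on ([^@\s]+)@(\S+) <[^>]*> was ACCEPTED$'
)

def parse_accepted(line):
    """Return (slot, host) for an interesting line, or None for any other line."""
    match = ACCEPTED_RE.match(line.rstrip())
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Because the pattern is anchored at the end with “was ACCEPTED”, the similar-looking DELAYED lines do not match.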

Dividing the Interesting Lines by Slot Type

The report needs only the SLOT and HOST parts of the lines above. Start with the SLOT part: Notice that some slot labels begin with “glidein_” and some do not. The final report must list the glide-in counts separately from the regular ones. Given that there are two parts to the report, how will you track the counts?
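One possible answer, among several, is to keep a separate table of counts for each slot type. The sketch below uses a dictionary of dictionaries; the names counts, slot_type, and record are invented for this example, and your own design may well differ.

```python
from collections import defaultdict

# Hypothetical structure: one host-count table per section of the report.
counts = {'GLIDE-INS': defaultdict(int), 'REGULAR': defaultdict(int)}

def slot_type(slot):
    """Classify a slot label by its prefix."""
    if slot.startswith('glidein_'):
        return 'GLIDE-INS'
    return 'REGULAR'

def record(slot, host):
    """Add one occurrence of host to the table for this slot's type."""
    counts[slot_type(slot)][host] += 1
```

Two separate dictionaries with their own variable names would work just as well; the nested form simply keeps the section title and its counts together.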

Counting Hosts

The HOST part of each interesting line identifies the machine that a job was matched to and ran on. The report counts the number of times each host occurs in these lines of the log file. But, notice that many of the hosts are very similar to each other, except for a number:

tux-48.cae.wisc.edu
tux-63.cae.wisc.edu
tux-38.cae.wisc.edu

So, before counting a host, change it as follows: Replace each sequence of one or more digits followed by a dot character (.) with one hash character and one dot character (#.). Here are some examples:

Host Before                   Host After
tux-48.cae.wisc.edu           tux-#.cae.wisc.edu
glow-c188.cs.wisc.edu         glow-c#.cs.wisc.edu
cabinet-4-4-25.t2.ucsd.edu    cabinet-4-4-#.t#.ucsd.edu
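The replacement rule maps naturally onto re.sub from the standard library. Here is a minimal sketch; the function name normalize_host is an assumption made for this example.

```python
import re

def normalize_host(host):
    """Replace every run of digits that is followed by a dot with '#.'."""
    return re.sub(r'\d+\.', '#.', host)

# The three examples from the table above:
for before in ('tux-48.cae.wisc.edu',
               'glow-c188.cs.wisc.edu',
               'cabinet-4-4-25.t2.ucsd.edu'):
    print('%s -> %s' % (before, normalize_host(before)))
```

Digits that are not directly followed by a dot, such as the 4s in “cabinet-4-4-25”, are left alone.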

That’s it! Count the occurrences of the modified hosts, separately for glide-in and regular slots.

Sample Output

Here is a sample report, run on a different input file:

GLIDE-INS
  5 c#.local
  1 cabinet-2-2-#.t#.ucsd.edu
  1 cabinet-8-8-#.t#.ucsd.edu
  2 compute-10-#.local
  1 compute-11-#.local
  2 compute-2-#.local
  1 node#.local
 11 tuscany#.med.harvard.edu

REGULAR
  7 athena.stat.wisc.edu
  2 banquo.cs.wisc.edu
  7 c#.chtc.wisc.edu
  1 curds.cs.wisc.edu
  5 darwin#.stat.wisc.edu
 39 e#.chtc.wisc.edu
 10 glow-c#.che.wisc.edu
 60 glow-c#.cs.wisc.edu
  1 gouda.cs.wisc.edu
  1 staccato.cs.wisc.edu
  8 star#.stat.wisc.edu
340 tux-#.cae.wisc.edu
  3 tux-cons.cae.wisc.edu
  3 tux-p#.cae.wisc.edu
 19 zeus#.stat.wisc.edu
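Judging from the sample alone, each section lists its hosts in sorted order with the count right-aligned in a three-character column. The sketch below formats a report under those assumptions; the function name format_report and the shape of its counts argument (section title mapped to a dictionary of host counts) are choices made for this example, not requirements.

```python
def format_report(counts):
    """Build the report text from {section title: {host: count}}."""
    lines = []
    for title in ('GLIDE-INS', 'REGULAR'):
        if lines:
            lines.append('')          # blank line between the two sections
        lines.append(title)
        for host in sorted(counts[title]):
            # Assumed layout: count right-aligned in 3 columns, then the host.
            lines.append('%3d %s' % (counts[title][host], host))
    return '\n'.join(lines)

# Small made-up input, just to show the layout:
sample = {
    'GLIDE-INS': {'node#.local': 1, 'c#.local': 5},
    'REGULAR': {'tux-#.cae.wisc.edu': 340},
}
print(format_report(sample))
```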

Testing

Now that you know about modules and functions, you *could* write Python code to test parts of your Python code. I have written a brief document to get you started, if you like. This is definitely a more advanced approach, but I am happy to answer questions about it.

Here are some specific tests to consider, for manual or automated testing:

Reminders

Start your script the right way! Here is a suggestion:

#!/usr/bin/env python

"""Homework for CS 368-4 (2012 Fall)
Assigned on Day 07, 2012-11-12
Written by <Your Name>
"""

Do the work yourself, consulting reasonable reference materials as needed. Any resource that provides a complete solution or offers significant material assistance toward a solution is not OK to use. Asking the instructor for help is OK; asking other students for help is not. All standard UW policies concerning student conduct (esp. UWS 14) and information technology apply to this course and assignment.

Hand In

All homework assignments must be turned in by email! Attach your Python script to the email as a text .py file. See the email page for more details about formatting, etc.