Due Thursday, November 15, at the start of class.
Analyze a complicated log file from HTCondor and report on where jobs are being run.
Mostly, this is an exercise in using regular expressions. But the data being analyzed come from HTCondor, which is the software that manages the CHTC resources and their jobs here on campus. While it is too much to explain all of HTCondor and why these particular log entries might be interesting, at least you will get a hint of what is going on. And once we start talking about CHTC and HTCondor, the data may begin to make more sense.
HTCondor is a distributed high-throughput computing system developed and maintained here at the University of Wisconsin–Madison in the Computer Sciences department. It is the software that runs the CHTC facility. Essentially, HTCondor:
Your script will analyze one HTCondor log file to figure out which exact execution machine each job ran on.
Essentially, this script is similar to the word-frequency counting part of the script from homework #4. That is, the script will count how often certain events occurred and report on them.
However, there are significant differences this time:
The log file is available for download: homework-07-input.txt (4.1 MB, ~52K lines).
The precise description of the task requires many words and examples, but fortunately, the code is fairly easy to write in Python! For reference, my solution is less than 40 lines of code, and I wrote two functions to make things simpler and clearer. It takes much less than one second to run on the whole input file.
Here are a few lines from the file, with one example of an interesting line highlighted (the one that ends with “was ACCEPTED”):
11/03/11 05:38:07 ******************************************************
11/03/11 05:38:07 ** condor_shadow (CONDOR_SHADOW) STARTING UP
11/03/11 05:38:07 ** /usr/sbin/condor_shadow
11/03/11 05:38:07 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
11/03/11 05:38:07 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
11/03/11 05:38:07 ** $CondorVersion: 7.7.2 Sep 27 2011 BuildID: 372978 PRE-RELEASE-UWCS $
11/03/11 05:38:07 ** $CondorPlatform: x86_64_rhas_3 $
11/03/11 05:38:07 ** PID = 31257
11/03/11 05:38:07 ** Log last touched 11/3 05:38:07
11/03/11 05:38:07 ******************************************************
11/03/11 05:38:07 Using config source: /etc/condor/condor_config
11/03/11 05:38:07 Using local config sources:
11/03/11 05:38:07 /etc/condor/condor_config.local
11/03/11 05:38:07 /etc/condor/condor_config.global
11/03/11 05:38:07 /etc/condor/condor_config.special
11/03/11 05:38:07 DaemonCore: command socket at <128.104.55.9:46274?noUDP>
11/03/11 05:38:07 DaemonCore: private command socket at <128.104.55.9:46274>
11/03/11 05:38:07 Setting maximum accepts per cycle 8.
11/03/11 05:38:07 Initializing a VANILLA shadow for job 4786121.0
11/03/11 05:38:07 (4786121.0) (31257): Request to run on glidein_12311@compute-2-3.nys1 <192.168.9.242:34711?CCBID=128.104.59.136:9628#195853&noUDP> was DELAYED (previous job still being vacated)
11/03/11 05:38:08 (4786121.0) (31257): Request to run on glidein_12311@compute-2-3.nys1 <192.168.9.242:34711?CCBID=128.104.59.136:9628#195853&noUDP> was ACCEPTED
11/03/11 05:38:09 (4786116.0) (30040): Job 4786116.0 terminated: exited with status 8
11/03/11 05:38:09 (4786116.0) (30040): Reporting job exit reason 100 and attempting to fetch new job.
Notice the line immediately preceding the highlighted one. It is very similar in many respects, but ends differently. For this assignment, it is not an interesting line. For what it’s worth, the interesting lines reflect jobs that are accepted to be run on remote machines.
Below, I have extracted several of the interesting lines for comparison. Of course, there are lots of other log lines omitted between these samples.
11/03/11 08:06:50 (4744634.0) (6423): Request to run on slot3@tux-106.cae.wisc.edu <144.92.242.23:54087> was ACCEPTED
11/03/11 08:06:50 (4786880.0) (6226): Request to run on glidein_13822@compute-10-24.local <10.6.2.231:42106?CCBID=128.104.59.136:9641#191474&noUDP> was ACCEPTED
11/03/11 08:06:50 (4744653.0) (6426): Request to run on slot4@tux-106.cae.wisc.edu <144.92.242.23:54087> was ACCEPTED
11/03/11 08:06:50 (4786881.0) (6232): Request to run on glidein_16278@compute-11-14.local <10.6.2.209:34617?CCBID=128.104.59.136:9663#193511&noUDP> was ACCEPTED
11/03/11 08:06:50 (4744883.0) (6430): Request to run on slot1@tux-95.cae.wisc.edu <144.92.242.209:54084> was ACCEPTED
11/03/11 08:06:50 (4786882.0) (6227): Request to run on glidein_23769@compute-10-25.local <10.6.2.230:53123?CCBID=128.104.59.136:9649#201522&noUDP> was ACCEPTED
Notice that they all have the following format, where the parts that vary from line to line (DATE, TIME, JOB-ID, PID, SLOT, HOST, and HOST-ID) are highlighted:
DATE TIME (JOB-ID) (PID): Request to run on SLOT@HOST <HOST-ID> was ACCEPTED
Important: Your script must count information from these lines and only these lines.
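As a rough illustration of the format above, here is one way a regular expression could pick out the interesting lines. This is only a sketch, not the reference solution: the names `ACCEPTED_LINE` and `parse_accepted` are made up for this example, and the pattern assumes that JOB-ID always looks like digits, a dot, and more digits, and that HOST-ID never contains whitespace.

```python
import re

# Hypothetical pattern; group 1 captures SLOT and group 2 captures HOST.
# Assumes JOB-ID is "digits.digits" and PID is all digits.
ACCEPTED_LINE = re.compile(
    r'^\S+ \S+ \(\d+\.\d+\) \(\d+\): '
    r'Request to run on ([^@\s]+)@(\S+) <\S+> was ACCEPTED$'
)


def parse_accepted(line):
    """Return (slot, host) for an interesting line, or None otherwise."""
    match = ACCEPTED_LINE.match(line.rstrip())
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Anchoring the pattern at both ends with `^` and `$` is what keeps the similar-looking “was DELAYED” lines out of the counts.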
The report needs only the SLOT and HOST parts of the lines above. Start with the SLOT part: Notice that some slot labels begin with “glidein_” and some do not. The final report must list the glide-in counts separately from the regular ones. Given that there are two parts to the report, how will you track the counts?
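One possible answer to that question, sketched under assumptions: keep one dictionary of counts per kind of slot. The helper names `new_counts` and `record` and the keys `'glidein'` and `'regular'` are invented for this example; other designs (two separate variables, or a single dictionary keyed by pairs) would work just as well.

```python
from collections import defaultdict


def new_counts():
    """Create one empty host-count dictionary per kind of slot."""
    return {'glidein': defaultdict(int), 'regular': defaultdict(int)}


def record(counts, slot, host):
    """Add one occurrence of host under the kind implied by slot."""
    kind = 'glidein' if slot.startswith('glidein_') else 'regular'
    counts[kind][host] += 1


counts = new_counts()
record(counts, 'glidein_12311', 'compute-2-3.nys1')
record(counts, 'slot3', 'tux-106.cae.wisc.edu')
record(counts, 'slot4', 'tux-106.cae.wisc.edu')
```

After these three calls, `counts['regular']['tux-106.cae.wisc.edu']` is 2 and `counts['glidein']['compute-2-3.nys1']` is 1.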
The HOST part of each interesting line identifies the machine that a job was matched to and ran on. The report counts the number of times each host occurs in these lines of the log file. But, notice that many of the hosts are very similar to each other, except for a number:
tux-48.cae.wisc.edu
tux-63.cae.wisc.edu
tux-38.cae.wisc.edu
So, before counting a host, change it as follows: Replace every sequence of one or more digits followed by a dot character (.) with one hash character and one dot character (#.). Here are some examples:
Host Before | Host After |
---|---|
tux-48.cae.wisc.edu | tux-#.cae.wisc.edu |
glow-c188.cs.wisc.edu | glow-c#.cs.wisc.edu |
cabinet-4-4-25.t2.ucsd.edu | cabinet-4-4-#.t#.ucsd.edu |
That’s it! Count the occurrences of the modified hosts, separately for glide-in and regular slots.
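The replacement rule maps naturally onto `re.sub`. The following is a minimal sketch of that idea, with `normalize_host` as a made-up function name; it is one way to express the rule, not necessarily how the reference solution does it.

```python
import re


def normalize_host(host):
    """Replace each run of digits that is followed by a dot with '#.'."""
    return re.sub(r'\d+\.', '#.', host)


for before in ('tux-48.cae.wisc.edu',
               'glow-c188.cs.wisc.edu',
               'cabinet-4-4-25.t2.ucsd.edu'):
    print(before + ' -> ' + normalize_host(before))
```

Digits that are not directly followed by a dot are left alone, which is why the `4-4` in the last example survives and why a host such as `compute-2-3.nys1` keeps its trailing `1`.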
Here is a sample report, run on a different input file:
GLIDE-INS
5 c#.local
1 cabinet-2-2-#.t#.ucsd.edu
1 cabinet-8-8-#.t#.ucsd.edu
2 compute-10-#.local
1 compute-11-#.local
2 compute-2-#.local
1 node#.local
11 tuscany#.med.harvard.edu
REGULAR
7 athena.stat.wisc.edu
2 banquo.cs.wisc.edu
7 c#.chtc.wisc.edu
1 curds.cs.wisc.edu
5 darwin#.stat.wisc.edu
39 e#.chtc.wisc.edu
10 glow-c#.che.wisc.edu
60 glow-c#.cs.wisc.edu
1 gouda.cs.wisc.edu
1 staccato.cs.wisc.edu
8 star#.stat.wisc.edu
340 tux-#.cae.wisc.edu
3 tux-cons.cae.wisc.edu
3 tux-p#.cae.wisc.edu
19 zeus#.stat.wisc.edu
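The sample report appears to list hosts in plain string-sorted order within each section, with the count before the host. Assuming that reading is right, a report like it could be produced along these lines. The function name `format_report` is hypothetical, and the single space between count and host is a guess, since the exact column layout of the report is not specified here.

```python
def format_report(glidein_counts, regular_counts):
    """Build the two-part report as one string, hosts sorted by name."""
    lines = []
    sections = (('GLIDE-INS', glidein_counts), ('REGULAR', regular_counts))
    for title, counts in sections:
        lines.append(title)
        for host in sorted(counts):
            # Assumed layout: count, one space, host.
            lines.append('%d %s' % (counts[host], host))
    return '\n'.join(lines)


print(format_report({'node#.local': 1, 'c#.local': 5},
                    {'tux-#.cae.wisc.edu': 340, 'athena.stat.wisc.edu': 7}))
```

Building the report as a string and printing it at the end also makes the formatting easy to check on its own.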
Now that you know about modules and functions, you *could* write Python code to test parts of your Python code. I have written a brief document to get you started, if you like. This is definitely a more advanced approach, but I am happy to answer questions about it.
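To give a feel for what testing code with code can look like, here is a tiny sketch that uses plain `assert` statements. It is not the document mentioned above, and the helper `is_glidein` is invented for this example; the point is only the pattern of putting a small piece of logic in a function and checking it against inputs whose answers you already know.

```python
def is_glidein(slot):
    """Return True if the slot label marks a glide-in slot."""
    return slot.startswith('glidein_')


def test_is_glidein():
    # Slot labels taken from the sample log lines.
    assert is_glidein('glidein_12311')
    assert not is_glidein('slot3')
    # The prefix must be at the start of the label.
    assert not is_glidein('my_glidein_1')


if __name__ == '__main__':
    test_is_glidein()
    print('all tests passed')
```

If an assertion fails, Python stops with an `AssertionError` that points at the failing line.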
Here are some specific tests to consider, for manual or automated testing:
Start your script the right way! Here is a suggestion:
#!/usr/bin/env python
"""Homework for CS 368-4 (2012 Fall)
Assigned on Day 07, 2012-11-12
Written by <Your Name>
"""
Do the work yourself, consulting reasonable reference materials as needed. Any resource that provides a complete solution or offers significant material assistance toward a solution is not OK to use. Asking the instructor for help is OK; asking other students for help is not. All standard UW policies concerning student conduct (esp. UWS 14) and information technology apply to this course and assignment.
All homework assignments must be turned in by email! Attach your Python script to the email as a text .py file. See the email page for more details about formatting, etc.