A few tips and tricks
Objective of this exercise
This exercise will teach you a few nifty commands to help you use Condor more easily.
You've been using condor_q, but it shows the jobs of everyone who submitted from the computer you are running the command on. If you want to see just your jobs, try this:
% condor_q -sub YOUR-LOGIN-NAME
Curious where your jobs are running? Use the -run option. (Idle jobs are not shown.)
% condor_q -run
-- Submitter: treinamento01.ncc.unesp.br : <200.145.46.65:56001> : treinamento01.ncc.unesp.br
ID OWNER SUBMITTED RUN_TIME HOST(S)
14.0 zmiller 12/7 05:44 0+00:00:01 slot1@treinamento10.ncc.unesp.br
14.1 zmiller 12/7 05:44 0+00:00:01 slot2@treinamento10.ncc.unesp.br
14.2 zmiller 12/7 05:44 0+00:00:01 slot3@treinamento10.ncc.unesp.br
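When many jobs are running, a per-host count can be easier to read than the raw list. The following is a minimal sketch, assuming the -run output keeps the column layout shown above (job ID first, slot@host last). jobs_per_host is a hypothetical helper name, not a Condor command.

```shell
# jobs_per_host: read `condor_q -run` output on stdin and print
# "<count> <host>" for every host that is running at least one job.
# Hypothetical helper; assumes the job ID is the first column and
# slot@host is the last column.
jobs_per_host() {
    awk '
        # Only lines that start with a job ID such as 14.0 are counted;
        # the "-- Submitter" banner and the column header are skipped.
        $1 ~ /^[0-9]+\.[0-9]+$/ {
            host = $NF
            sub(/^[^@]*@/, "", host)   # drop the "slotN@" prefix
            count[host]++
        }
        END { for (h in count) print count[h], h }
    ' | sort -k2
}
```

You would use it as: condor_q -run | jobs_per_host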
condor_q can also show you your job's ClassAd. Recall the discussion of ClassAds from the lecture. For instance, you can look at the ClassAd for a single job:
% condor_q -l 14
-- Submitter: treinamento01.ncc.unesp.br : <200.145.46.65:56001> : treinamento01.ncc.unesp.br
ClusterId = 16
QDate = 1291707958
CompletionDate = 0
Owner = "zmiller"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
... output trimmed ...
There are some interesting parts you can check out. For instance, where is the job running?
% condor_q -l 14 | grep RemoteHost
RemoteHost = "slot1@treinamento10.ncc.unesp.br"
How many times has this job run? (It might be more than one if there were recoverable errors.)
% condor_q -l 14 | grep JobRunCount
JobRunCount = 1
Where is the user log for this job? This is helpful when you are helping someone else debug and they're not sure where it is.
% condor_q -l 14 | grep UserLog
UserLog = "/home/zmiller/condor-test/simple.0.log"
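Running grep once per attribute works, but a plain grep also matches any attribute whose name merely contains the pattern. The following is a minimal sketch of a filter that pulls several attributes in one pass and matches names exactly. It assumes each ClassAd line has the form Name = value, as in the output above. classad_get is a hypothetical helper name, not a Condor command.

```shell
# classad_get ATTR...: read `condor_q -l` output on stdin and print
# only the lines whose attribute name is exactly one of the arguments.
# Hypothetical helper; assumes lines of the form "Name = value".
classad_get() {
    awk -v names="$*" '
        BEGIN {
            # Turn the space-separated argument list into a lookup table.
            n = split(names, want, " ")
            for (i = 1; i <= n; i++) keep[want[i]] = 1
        }
        # Exact match on the attribute name, so asking for RemoteHost
        # does not also print LastRemoteHost.
        ($1 in keep) && $2 == "=" { print }
    '
}
```

You would use it as: condor_q -l 14 | classad_get RemoteHost JobRunCount UserLog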
What are the job's requirements? Condor automatically fills some in for you, to make sure your job runs on a reasonable computer in our cluster, but you can override any of these. I've broken the output into multiple lines to explain it to you.
% condor_q -l 64049 | grep Requirements
Requirements =
(Arch == "X86_64") # Make sure you run on the same architecture.
&& (OpSys == "LINUX") # Make sure you run on Linux
&& (Disk >= DiskUsage) # Make sure the default disk Condor is on has enough disk space for your executable. Question: What is DiskUsage?
&& (((Memory * 1024) >= ImageSize) # Make sure there is enough memory for your executable. Question: What is ImageSize?
&& ((RequestMemory * 1024) >= ImageSize)) # Question: What is RequestMemory?
&& (HasFileTransfer) # Only run on a computer that can accept your files.
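To make the memory clause concrete, here is a minimal sketch of the same comparison in shell arithmetic. It assumes Memory is reported in megabytes and ImageSize in kilobytes, which is what the multiplication by 1024 suggests. memory_ok is a hypothetical helper, not a Condor command, and the numbers are made up for illustration.

```shell
# memory_ok MEMORY_MB IMAGESIZE_KB
# Hypothetical helper mirroring the clause (Memory * 1024) >= ImageSize.
# Assumption: Memory is in megabytes and ImageSize is in kilobytes,
# so Memory is converted to kilobytes before comparing.
memory_ok() {
    [ $(( $1 * 1024 )) -ge "$2" ]
}

# Made-up numbers: a slot with 2048 MB and a job image of 500000 KB.
if memory_ok 2048 500000; then
    echo "slot has enough memory"
else
    echo "slot is too small"
fi
```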
Curious about what jobs are being run right now?
% condor_status -claimed
Name OpSys Arch LoadAv RemoteUser ClientMachine
slot1@treinam LINUX X86_64 0.000 zmiller@ncc.unesp.br treinamento01.n
Machines MIPS KFLOPS AvgLoadAvg
X86_64/LINUX 1 9486 1502574 0.000
Total 1 9486 1502574 0.000
Curious about who has submitted jobs?
% condor_status -submitters
Name Machine Running IdleJobs HeldJobs
zmiller@ncc.unesp.br treinament 0 1 0
RunningJobs IdleJobs HeldJobs
zmiller@ncc.unesp.br 0 1 0
Total 0 1 0
Or perhaps you want to know all the machines from which you can submit jobs:
% condor_status -schedd
Name Machine TotalRunningJobs TotalIdleJobs TotalHeldJobs
treinamento01.ncc.un treinament 0 0 0
treinamento02.ncc.un treinament 0 0 0
treinamento03.ncc.un treinament 0 0 0
treinamento04.ncc.un treinament 0 0 0
...
TotalRunningJobs TotalIdleJobs TotalHeldJobs
Total 0 0 0
Just as we looked at the ClassAd for a job, we can also look at the ClassAd for a computer.
% condor_status -l slot1@treinamento01.ncc.unesp.br
MyType = "Machine"
TargetType = "Job"
Name = "slot1@treinamento01.ncc.unesp.br"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
SlotWeight = Cpus
Unhibernate = MY.MachineLastMatchTime =!= UNDEFINED
MyCurrentTime = 1291708590
Machine = "treinamento01.ncc.unesp.br"
PublicNetworkIpAddr = "<200.145.46.65:46864>"
COLLECTOR_HOST_STRING = "treinamento02.ncc.unesp.br"
CondorVersion = "$CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $"
... output trimmed ...
Some features of interest:
# The computer's name
% condor_status -l slot1@treinamento01.ncc.unesp.br | grep -i Name
Name = "slot1@treinamento01.ncc.unesp.br"
# The computer's other name.
% condor_status -l slot1@treinamento01.ncc.unesp.br | grep -i ^Machine
Machine = "treinamento01.ncc.unesp.br"
# The state of the computer
% condor_status -l slot1@treinamento01.ncc.unesp.br | grep -i ^State
State = "Unclaimed"
# The version of Condor this computer is running.
% condor_status -l slot1@treinamento01.ncc.unesp.br | grep CondorVersion
CondorVersion = "$CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $"
# How many CPUs this computer has
% condor_status -l slot1@treinamento01.ncc.unesp.br | grep TotalCpus
TotalCpus = 4
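If you want just the value of one attribute, for example to use it in a script, the grep output still needs trimming. The following is a minimal sketch, assuming ClassAd lines of the form Name = value as shown above. classad_value is a hypothetical helper name, not a Condor command.

```shell
# classad_value ATTR: read `condor_status -l` (or `condor_q -l`) output
# on stdin and print only the value of ATTR, without surrounding quotes.
# Hypothetical helper; assumes lines of the form "Name = value".
classad_value() {
    awk -v attr="$1" '
        $1 == attr && $2 == "=" {
            sub(/^[^=]*= /, "")   # remove "Name = " from the front
            gsub(/^"|"$/, "")     # strip the quotes around string values
            print
        }
    '
}
```

You would use it as: condor_status -l slot1@treinamento01.ncc.unesp.br | classad_value State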
What else can you find that's interesting in the ClassAd?
If you submit a job that you realize has a problem, you can remove it with condor_rm. For example:
% condor_q
-- Submitter: treinamento01.ncc.unesp.br : <200.145.46.65:56001> : treinamento01.ncc.unesp.br
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
19.0 zmiller 12/7 06:01 0+00:00:04 R 0 0.0 simple 60 10
1 jobs; 0 idle, 1 running, 0 held
% condor_rm 19
Cluster 19 has been marked for removal.
% condor_q
-- Submitter: treinamento01.ncc.unesp.br : <200.145.46.65:56001> : treinamento01.ncc.unesp.br
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
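Before removing anything, it can help to list exactly which jobs are in a given state. The following is a minimal sketch, assuming the default condor_q layout shown above, where the job ID is the first column and the state (ST) is the sixth. job_ids_in_state is a hypothetical helper name, not a Condor command.

```shell
# job_ids_in_state STATE: read default `condor_q` output on stdin and
# print the IDs of jobs whose ST column equals STATE (for example R or I).
# Hypothetical helper; assumes ID is column 1 and ST is column 6
# (SUBMITTED takes two columns: date and time).
job_ids_in_state() {
    awk -v st="$1" '
        # Skip the banner, the header and the summary line: only lines
        # beginning with a job ID such as 19.0 are considered.
        $1 ~ /^[0-9]+\.[0-9]+$/ && $6 == st { print $1 }
    '
}
```

You would use it as: condor_q | job_ids_in_state R. Each ID it prints, such as 19.0, can be given to condor_rm to remove that single job rather than the whole cluster.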
A few tips:
You can see information about jobs that completed and are no longer in the queue with the condor_history command. It's rare that you want to see all the jobs, so try looking at jobs for just you:
condor_history USERNAME
For example:
% condor_history zmiller
ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD
60244.0 zmiller 6/24 17:25 0+00:00:00 X ??? /home/zmiller/test/
60244.1 zmiller 6/24 17:25 0+00:00:00 X ??? /home/zmiller/test/
60244.2 zmiller 6/24 17:25 0+00:00:00 X ??? /home/zmiller/test/
60244.3 zmiller 6/24 17:25 0+00:00:00 X ??? /home/zmiller/test/
60244.4 zmiller 6/24 17:25 0+00:00:00 X ??? /home/zmiller/test/
60244.5 zmiller 6/24 17:25 0+00:00:00 X ??? /home/zmiller/test/
60246.0 zmiller 6/25 09:12 0+00:00:00 X ??? /home/zmiller/test/
60246.1 zmiller 6/25 09:12 0+00:00:00 X ??? /home/zmiller/test/
60246.2 zmiller 6/25 09:12 0+00:00:00 X ??? /home/zmiller/test/
60246.3 zmiller 6/25 09:12 0+00:00:00 X ??? /home/zmiller/test/
60246.4 zmiller 6/25 09:12 0+00:00:00 X ??? /home/zmiller/test/
60246.5 zmiller 6/25 09:12 0+00:00:00 X ??? /home/zmiller/test/
60251.0 zmiller 6/25 17:53 0+00:00:36 C 6/25 17:56 /home/zmiller/test/
60252.0 zmiller 6/25 17:57 0+00:00:36 C 6/25 18:06 /home/zmiller/test/
60253.4 zmiller 6/25 18:08 0+00:00:01 C 6/25 18:08 /home/zmiller/test/
60253.1 zmiller 6/25 18:08 0+00:00:36 C 6/25 18:11 /home/zmiller/test/
60253.5 zmiller 6/25 18:08 0+00:00:42 C 6/25 18:11 /home/zmiller/test/
...
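A long history is easier to digest as a count per final state. In the listing above, X marks jobs that were removed and C marks jobs that completed. The following is a minimal sketch, assuming the condor_history layout shown above with the state (ST) in the sixth column. history_summary is a hypothetical helper name, not a Condor command.

```shell
# history_summary: read `condor_history` output on stdin and print
# "<state> <count>" for each value seen in the ST column.
# Hypothetical helper; assumes ID is column 1 and ST is column 6.
history_summary() {
    awk '
        # Count only lines that begin with a job ID such as 60244.0,
        # which skips the column header and any trailing "..." line.
        $1 ~ /^[0-9]+\.[0-9]+$/ { count[$6]++ }
        END { for (s in count) print s, count[s] }
    ' | sort
}
```

You would use it as: condor_history zmiller | history_summary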