A few tips and tricks

Objective of this exercise

This exercise will teach you a few nifty commands to help you use Condor more easily.

Tips for condor_q

You've been using condor_q, but it shows the jobs of everyone who submitted from the computer you are running the command on. If you want to see just your own jobs, try this:


% condor_q -sub YOUR-LOGIN-NAME

Curious where your jobs are running? Use the -run option. (Idle jobs are not shown.)

% condor_q -run

-- Submitter: treinamento01.ncc.unesp.br : <200.145.46.65:56001> : treinamento01.ncc.unesp.br
 ID      OWNER            SUBMITTED     RUN_TIME HOST(S)         
  14.0   zmiller        12/7  05:44   0+00:00:01 slot1@treinamento10.ncc.unesp.br
  14.1   zmiller        12/7  05:44   0+00:00:01 slot2@treinamento10.ncc.unesp.br
  14.2   zmiller        12/7  05:44   0+00:00:01 slot3@treinamento10.ncc.unesp.br

condor_q can also show you a job's ClassAd. Recall the discussion of ClassAds from the lecture. For instance, you can look at the ClassAd for a single job:

% condor_q -l 14

-- Submitter: treinamento01.ncc.unesp.br : <200.145.46.65:56001> : treinamento01.ncc.unesp.br
ClusterId = 14
QDate = 1291707958
CompletionDate = 0
Owner = "zmiller"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
... output trimmed ... 
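Attributes like QDate and CompletionDate are Unix timestamps (seconds since the epoch). A quick way to make them readable, using the QDate value from the sample output above (the `-d @SECONDS` syntax is GNU date; -u prints UTC so the result doesn't depend on your timezone):

```shell
# Convert the QDate from the sample ClassAd above to a readable date.
date -u -d @1291707958
# → Tue Dec  7 07:45:58 UTC 2010
```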

The ClassAd contains some interesting attributes you can check out. For instance, where is the job running?

% condor_q -l 14 | grep RemoteHost
RemoteHost = "slot1@treinamento10.ncc.unesp.br"

How many times has this job run? (It might be more than one if there were recoverable errors.)

% condor_q -l 14 | grep JobRunCount
JobRunCount = 1

Where is the user log for this job? This is helpful when you are assisting someone else with debugging and they are not sure where their log file is.

% condor_q -l 14 | grep UserLog
UserLog = "/home/zmiller/condor-test/simple.0.log"
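Rather than eyeballing the grep output, you can extract just an attribute's value. Here is a small sketch using awk on ClassAd-style `Attribute = Value` lines; it runs against a saved sample of the output shown above, so nothing here assumes a live pool (in real use you would pipe `condor_q -l 14` instead of the here-document):

```shell
# Sample lines as printed by "condor_q -l" (saved so the example is
# self-contained; normally you would pipe live condor_q output).
cat > /tmp/sample-classad.txt <<'EOF'
RemoteHost = "slot1@treinamento10.ncc.unesp.br"
JobRunCount = 1
UserLog = "/home/zmiller/condor-test/simple.0.log"
EOF

# Print just the value of one attribute: split on " = ", strip quotes.
awk -F' = ' '$1 == "UserLog" { gsub(/"/, "", $2); print $2 }' /tmp/sample-classad.txt
# → /home/zmiller/condor-test/simple.0.log
```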

What are the job's requirements? Condor automatically fills some in for you, to make sure your job runs on a reasonable computer in our cluster, but you can override any of them. The output below has been broken into multiple lines, with comments added to explain each part.

% condor_q -l 64049 | grep Requirements
Requirements =
    (Arch == "X86_64")                         #  Make sure you run on the same architecture.
    && (OpSys == "LINUX")                      #  Make sure you run on Linux.
    && (Disk >= DiskUsage)                     #  Make sure the default disk Condor is on has enough disk space for your executable. Question: What is DiskUsage?
    && (((Memory * 1024) >= ImageSize)         #  Make sure there is enough memory for your executable. Question: What is ImageSize?
    && ((RequestMemory * 1024) >= ImageSize))  #  Question: What is RequestMemory?
    && (HasFileTransfer)                       #  Only run on a computer that can accept your files.
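If you want to override the defaults, you can set requirements yourself in your submit description file. A hypothetical sketch (the file name, executable name, and the 1024 MB memory threshold are illustrative values, not from the exercise):

```shell
# Write a hypothetical submit file that adds an explicit requirements
# expression; Condor still appends its automatic clauses to it.
cat > /tmp/simple.sub <<'EOF'
executable   = simple
requirements = (OpSys == "LINUX") && (Arch == "X86_64") && (Memory >= 1024)
log          = simple.log
queue
EOF
```

You would then submit it as usual with condor_submit.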

Tips for condor_status

Curious about what jobs are being run right now?

% condor_status -claimed


Name          OpSys       Arch   LoadAv RemoteUser           ClientMachine  

slot1@treinam LINUX       X86_64 0.000  zmiller@ncc.unesp.br treinamento01.n
                     Machines         MIPS       KFLOPS   AvgLoadAvg

        X86_64/LINUX        1         9486      1502574   0.000

               Total        1         9486      1502574   0.000

Curious about who has submitted jobs?

% condor_status -submitters

Name                 Machine      Running IdleJobs HeldJobs

zmiller@ncc.unesp.br treinament         0        1        0
                           RunningJobs           IdleJobs           HeldJobs

zmiller@ncc.unesp.br                 0                  1                  0

               Total                 0                  1                  0

Or perhaps you want to know all the machines from which you can submit jobs:

% condor_status -schedd

Name                 Machine    TotalRunningJobs TotalIdleJobs TotalHeldJobs 

treinamento01.ncc.un treinament                0             0              0
treinamento02.ncc.un treinament                0             0              0
treinamento03.ncc.un treinament                0             0              0
treinamento04.ncc.un treinament                0             0              0
...
                      TotalRunningJobs      TotalIdleJobs      TotalHeldJobs
                    
               Total                 0                  0                  0

Just like we could look at the ClassAd for a job, we can also look at them for computers.

% condor_status -l slot1@treinamento01.ncc.unesp.br
MyType = "Machine"
TargetType = "Job"
Name = "slot1@treinamento01.ncc.unesp.br"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
SlotWeight = Cpus
Unhibernate = MY.MachineLastMatchTime =!= UNDEFINED
MyCurrentTime = 1291708590
Machine = "treinamento01.ncc.unesp.br"
PublicNetworkIpAddr = "<200.145.46.65:46864>"
COLLECTOR_HOST_STRING = "treinamento02.ncc.unesp.br"
CondorVersion = "$CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $"
... output trimmed ...

Some features of interest:

# The computer's name 
% condor_status -l slot1@treinamento01.ncc.unesp.br | grep -i ^Name
Name = "slot1@treinamento01.ncc.unesp.br"

# The computer's other name. 
% condor_status -l slot1@treinamento01.ncc.unesp.br | grep -i ^Machine
Machine = "treinamento01.ncc.unesp.br"

# The state of the computer. 
% condor_status -l slot1@treinamento01.ncc.unesp.br | grep -i ^State
State = "Unclaimed"

# The version of Condor this computer is running. 
% condor_status -l slot1@treinamento01.ncc.unesp.br | grep CondorVersion
CondorVersion = "$CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $"

# How many CPUs this computer has 
% condor_status -l slot1@treinamento01.ncc.unesp.br | grep  TotalCpus
TotalCpus = 4

What else can you find that's interesting in the ClassAd?
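Once you can grep attributes out of a machine ClassAd, you can also aggregate them. A sketch that sums TotalCpus values, again run against a small saved sample so it works without a pool (in real use you would pipe `condor_status -l` instead; note that TotalCpus is reported once per slot, so a real sum over all slots would count each machine more than once):

```shell
# Three sample TotalCpus lines, as a machine ClassAd would print them.
cat > /tmp/sample-machines.txt <<'EOF'
TotalCpus = 4
TotalCpus = 4
TotalCpus = 8
EOF

# Sum the values: 4 + 4 + 8.
awk -F' = ' '$1 == "TotalCpus" { sum += $2 } END { print sum }' /tmp/sample-machines.txt
# → 16
```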

Removing jobs

If you submit a job that you realize has a problem, you can remove it with condor_rm. For example:

% condor_q
-- Submitter: treinamento01.ncc.unesp.br : <200.145.46.65:56001> : treinamento01.ncc.unesp.br
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  19.0   zmiller        12/7  06:01   0+00:00:04 R  0   0.0  simple 60 10      

1 jobs; 0 idle, 1 running, 0 held

% condor_rm 19
Cluster 19 has been marked for removal.

% condor_q
-- Submitter: treinamento01.ncc.unesp.br : <200.145.46.65:56001> : treinamento01.ncc.unesp.br
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

A few tips: condor_rm 19 removes all jobs in cluster 19, while condor_rm 19.0 removes only job 0 from that cluster. You can also remove all of your own jobs at once with condor_rm -all.

Historical information

You can see information about jobs that have completed and are no longer in the queue with the condor_history command. It's rare that you want to see all the jobs, so try looking at just your own:

condor_history USERNAME

For example:

% condor_history zmiller
 ID      OWNER            SUBMITTED     RUN_TIME ST   COMPLETED CMD            
60244.0   zmiller             6/24 17:25   0+00:00:00 X   ???        /home/zmiller/test/
60244.1   zmiller             6/24 17:25   0+00:00:00 X   ???        /home/zmiller/test/
60244.2   zmiller             6/24 17:25   0+00:00:00 X   ???        /home/zmiller/test/
60244.3   zmiller             6/24 17:25   0+00:00:00 X   ???        /home/zmiller/test/
60244.4   zmiller             6/24 17:25   0+00:00:00 X   ???        /home/zmiller/test/
60244.5   zmiller             6/24 17:25   0+00:00:00 X   ???        /home/zmiller/test/
60246.0   zmiller             6/25 09:12   0+00:00:00 X   ???        /home/zmiller/test/
60246.1   zmiller             6/25 09:12   0+00:00:00 X   ???        /home/zmiller/test/
60246.2   zmiller             6/25 09:12   0+00:00:00 X   ???        /home/zmiller/test/
60246.3   zmiller             6/25 09:12   0+00:00:00 X   ???        /home/zmiller/test/
60246.4   zmiller             6/25 09:12   0+00:00:00 X   ???        /home/zmiller/test/
60246.5   zmiller             6/25 09:12   0+00:00:00 X   ???        /home/zmiller/test/
60251.0   zmiller             6/25 17:53   0+00:00:36 C   6/25 17:56 /home/zmiller/test/
60252.0   zmiller             6/25 17:57   0+00:00:36 C   6/25 18:06 /home/zmiller/test/
60253.4   zmiller             6/25 18:08   0+00:00:01 C   6/25 18:08 /home/zmiller/test/
60253.1   zmiller             6/25 18:08   0+00:00:36 C   6/25 18:11 /home/zmiller/test/
60253.5   zmiller             6/25 18:08   0+00:00:42 C   6/25 18:11 /home/zmiller/test/
...
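In the ST column above, X means the job was removed and C means it completed. You can tally those states with awk; this sketch runs against a few saved lines of the sample output above (in real use you would pipe `condor_history zmiller` instead):

```shell
# Three lines of sample condor_history output, saved so the example is
# self-contained. The state (ST) is the sixth whitespace-separated field.
cat > /tmp/sample-history.txt <<'EOF'
60244.0   zmiller   6/24 17:25   0+00:00:00 X   ???        /home/zmiller/test/
60244.1   zmiller   6/24 17:25   0+00:00:00 X   ???        /home/zmiller/test/
60251.0   zmiller   6/25 17:53   0+00:00:36 C   6/25 17:56 /home/zmiller/test/
EOF

# Count how many jobs ended in each state.
awk '{ count[$6]++ } END { for (s in count) print s, count[s] }' /tmp/sample-history.txt
```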