Coordinating set of jobs: A simple DAG

Objective of this exercise

The objective of this exercise is to learn the very basics of running a set of jobs, where our set is just one job.

What is DAGMan?

Your tutorial leader will introduce you to DAGMan and DAGs. In short, DAGMan lets you submit complex sequences of jobs, as long as they can be expressed as a directed acyclic graph (DAG). For example, you may wish to run a large parameter sweep, but before the sweep runs you need to prepare your data, and after it finishes you need to collate the results. Assuming you want to sweep over five parameters, the workflow might look like this:

(Figure SimpleDag.gif: a data-preparation node feeding five sweep nodes, which in turn feed a collation node)

DAGMan has many abilities, such as throttling jobs, recovering from failures, and more. More information about DAGMan can be found in the Condor manual.
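To make the sweep example concrete, here is a sketch of what its DAG input file might look like. The node and submit-file names are assumptions for illustration; only the JOB and PARENT/CHILD keywords are real DAGMan syntax:

```
# Hypothetical DAG for the sweep described above.
# Each JOB line names a node and the submit file that describes its job.
JOB Prepare  prepare.submit
JOB Sweep1   sweep1.submit
JOB Sweep2   sweep2.submit
JOB Sweep3   sweep3.submit
JOB Sweep4   sweep4.submit
JOB Sweep5   sweep5.submit
JOB Collate  collate.submit

# Prepare must finish before any sweep node starts;
# all sweep nodes must finish before Collate starts.
PARENT Prepare CHILD Sweep1 Sweep2 Sweep3 Sweep4 Sweep5
PARENT Sweep1 Sweep2 Sweep3 Sweep4 Sweep5 CHILD Collate
```

DAGMan reads the PARENT/CHILD lines to decide ordering, so the five sweep nodes can run in parallel once Prepare completes.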

Submitting a simple DAG

We're going to go back to the "simple" example that we did first (the one with the job that slept and then multiplied a number by 2). Make sure that your submit file has only one queue command in it, as when we first wrote it. We will just run vanilla universe jobs for now, though we could equally well run standard universe jobs.

Universe                = vanilla
Executable              = simple
Arguments               = 4 10
Log                     = simple.log
Output                  = simple.out
Error                   = simple.error
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
Queue

Make sure you've built the simple program. If you need to, go back to the instructions for your first job to do it again.

We are going to get a bit more sophisticated in submitting our jobs now. Let's have three windows open. In one window, you'll submit the job. In another you will watch the queue, and in the third you will watch what DAGMan does.

First we will create the most minimal DAG possible: a DAG with just one node. Put the following line into a file named simple.dag. It defines a single node named Simple whose job is described by the submit file named submit:

Job Simple submit

In your first window, submit the DAG:

% condor_submit_dag simple.dag

-----------------------------------------------------------------------
File for submitting this DAG to Condor           : simple.dag.condor.sub
Log of DAGMan debugging messages                 : simple.dag.dagman.out
Log of Condor library output                     : simple.dag.lib.out
Log of Condor library error messages             : simple.dag.lib.err
Log of the life of condor_dagman itself          : simple.dag.dagman.log

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 60482.
-----------------------------------------------------------------------

% condor_reschedule    (may be necessary, but only if things seem a bit sluggish)

Sent "Reschedule" command to local schedd

In the second window, watch the queue:

% watch -n 10 condor_q -sub YOUR-LOGIN-NAME

-- Submitter: treinamento01.ncc.unesp.br : <200.145.46.65:56001> : treinamento01.ncc.unesp.br
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
60482.0   zmiller         7/9  18:14   0+00:14:19 R  0   7.3  condor_dagman -f -
60483.0   zmiller         7/9  18:15   0+00:00:02 R  0   0.7  simple 20 10      

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: treinamento01.ncc.unesp.br : <200.145.46.65:56001> : treinamento01.ncc.unesp.br
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
60482.0   zmiller         7/9  18:14   0+00:14:28 R  0   7.3  condor_dagman -f -
60483.0   zmiller         7/9  18:15   0+00:00:11 R  0   0.7  simple 20 10      

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: treinamento01.ncc.unesp.br : <200.145.46.65:56001> : treinamento01.ncc.unesp.br
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
60482.0   zmiller         7/9  18:14   0+00:14:37 R  0   7.3  condor_dagman -f -
60483.0   zmiller         7/9  18:15   0+00:00:20 R  0   0.7  simple 20 10      

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: treinamento01.ncc.unesp.br : <200.145.46.65:56001> : treinamento01.ncc.unesp.br
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

Press Ctrl-C to stop watching once the queue is empty.

In the third window, watch what DAGMan does:

%  tail -f --lines=500 simple.dag.dagman.out
12/07 07:27:47 ******************************************************
12/07 07:27:47 ** condor_scheduniv_exec.20.0 (CONDOR_DAGMAN) STARTING UP
12/07 07:27:47 ** /opt/condor/bin/condor_dagman
12/07 07:27:47 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)
12/07 07:27:47 ** Configuration: subsystem:DAGMAN local: class:DAEMON
12/07 07:27:47 ** $CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $
12/07 07:27:47 ** $CondorPlatform: X86_64-LINUX_DEBIAN50 $
12/07 07:27:47 ** PID = 3998
12/07 07:27:47 ** Log last touched time unavailable (No such file or directory)
12/07 07:27:47 ******************************************************
12/07 07:27:47 Using config source: /opt/condor/etc/condor_config
12/07 07:27:47 Using local config sources: 
12/07 07:27:47    /opt/condor/scratch/condor_config.local
12/07 07:27:47 DaemonCore: Command Socket at <200.145.46.65:56466>
12/07 07:27:47 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880
12/07 07:27:47 DAGMAN_DEBUG_CACHE_ENABLE setting: False
12/07 07:27:47 DAGMAN_SUBMIT_DELAY setting: 0
12/07 07:27:47 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
12/07 07:27:47 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
12/07 07:27:47 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
12/07 07:27:47 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5
12/07 07:27:47 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
12/07 07:27:47 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
12/07 07:27:47 DAGMAN_RETRY_NODE_FIRST setting: 0
12/07 07:27:47 DAGMAN_MAX_JOBS_IDLE setting: 0
12/07 07:27:47 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
12/07 07:27:47 DAGMAN_MUNGE_NODE_NAMES setting: 1
12/07 07:27:47 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
12/07 07:27:47 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
12/07 07:27:47 DAGMAN_ABORT_DUPLICATES setting: 1
12/07 07:27:47 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
12/07 07:27:47 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
12/07 07:27:47 DAGMAN_AUTO_RESCUE setting: 1
12/07 07:27:47 DAGMAN_MAX_RESCUE_NUM setting: 100
12/07 07:27:47 DAGMAN_DEFAULT_NODE_LOG setting: null
12/07 07:27:47 ALL_DEBUG setting: 
12/07 07:27:47 DAGMAN_DEBUG setting: 
12/07 07:27:47 argv[0] == "condor_scheduniv_exec.20.0"
12/07 07:27:47 argv[1] == "-Debug"
12/07 07:27:47 argv[2] == "3"
12/07 07:27:47 argv[3] == "-Lockfile"
12/07 07:27:47 argv[4] == "simple.dag.lock"
12/07 07:27:47 argv[5] == "-AutoRescue"
12/07 07:27:47 argv[6] == "1"
12/07 07:27:47 argv[7] == "-DoRescueFrom"
12/07 07:27:47 argv[8] == "0"
12/07 07:27:47 argv[9] == "-Dag"
12/07 07:27:47 argv[10] == "simple.dag"
12/07 07:27:47 argv[11] == "-CsdVersion"
12/07 07:27:47 argv[12] == "$CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $"
12/07 07:27:47 Default node log file is: 
12/07 07:27:47 DAG Lockfile will be written to simple.dag.lock
12/07 07:27:47 DAG Input file is simple.dag
12/07 07:27:47 Parsing 1 dagfiles
12/07 07:27:47 Parsing simple.dag ...
12/07 07:27:47 Dag contains 1 total jobs
12/07 07:27:47 Sleeping for 12 seconds to ensure ProcessId uniqueness
12/07 07:27:59 Bootstrapping...
12/07 07:27:59 Number of pre-completed nodes: 0
12/07 07:27:59 Registering condor_event_timer...
12/07 07:28:00 Sleeping for one second for log file consistency
12/07 07:28:01 MultiLogFiles: macros not allowed in log file name in DAG node submit files
12/07 07:28:01 Unable to get log file from submit file submit (node Simple); using default (/home/zmiller/condor-test/simple.dag.nodes.log)
12/07 07:28:01 Submitting Condor Node Simple job(s)...
12/07 07:28:01 submitting: condor_submit -a dag_node_name' '=' 'Simple -a +DAGManJobId' '=' '20 -a DAGManJobId' '=' '20 -a submit_event_notes' '=' 'DAG' 'Node:' 'Simple -a log' '=' '/home/zmiller/condor-test/simple.dag.nodes.log -a +DAGParentNodeNames' '=' '"" submit
12/07 07:28:01 From submit: Submitting job(s).
12/07 07:28:01 From submit: Logging submit event(s).
12/07 07:28:01 From submit: 1 job(s) submitted to cluster 21.
12/07 07:28:01  assigned Condor ID (21.0)
12/07 07:28:01 Just submitted 1 job this cycle...
12/07 07:28:01 Currently monitoring 1 Condor log file(s)
12/07 07:28:01 Event: ULOG_SUBMIT for Condor Node Simple (21.0)
12/07 07:28:01 Number of idle job procs: 1
12/07 07:28:01 Of 1 nodes total:
12/07 07:28:01  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/07 07:28:01   ===     ===      ===     ===     ===        ===      ===
12/07 07:28:01     0       0        1       0       0          0        0
12/07 07:28:11 Currently monitoring 1 Condor log file(s)
12/07 07:28:11 Event: ULOG_EXECUTE for Condor Node Simple (21.0)
12/07 07:28:11 Number of idle job procs: 0
12/07 07:29:11 Currently monitoring 1 Condor log file(s)
12/07 07:29:11 Event: ULOG_JOB_TERMINATED for Condor Node Simple (21.0)
12/07 07:29:11 Node Simple job proc (21.0) completed successfully.
12/07 07:29:11 Node Simple job completed
12/07 07:29:11 Number of idle job procs: 0
12/07 07:29:11 Of 1 nodes total:
12/07 07:29:11  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/07 07:29:11   ===     ===      ===     ===     ===        ===      ===
12/07 07:29:11     1       0        0       0       0          0        0
12/07 07:29:11 All jobs Completed!
12/07 07:29:11 Note: 0 total job deferrals because of -MaxJobs limit (0)
12/07 07:29:11 Note: 0 total job deferrals because of -MaxIdle limit (0)
12/07 07:29:11 Note: 0 total job deferrals because of node category throttles
12/07 07:29:11 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
12/07 07:29:11 Note: 0 total POST script deferrals because of -MaxPost limit (0)
12/07 07:29:11 **** condor_scheduniv_exec.20.0 (condor_DAGMAN) pid 3998 EXITING WITH STATUS 0

Now verify your results:

% cat simple.log

000 (010.000.000) 12/07 05:33:52 Job submitted from host: <200.145.46.65:56001>
...
001 (010.000.000) 12/07 05:33:54 Job executing on host: <200.145.46.74:49619>
...
005 (010.000.000) 12/07 05:33:58 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        56  -  Run Bytes Sent By Job
        8619  -  Run Bytes Received By Job
        56  -  Total Bytes Sent By Job
        8619  -  Total Bytes Received By Job
...
% cat simple.out
Thinking really hard for 4 seconds...
We calculated: 20
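That output matches what we expect from the arguments 4 10: the simple program sleeps for the first argument's number of seconds and prints twice the second. As a sanity check, here is a minimal shell sketch of that arithmetic (the program's internals are an assumption based on this tutorial, and the sleep is skipped so the sketch runs instantly):

```shell
# Sketch of what the 'simple' job does when run with arguments "4 10".
sleep_secs=4
value=10
# A real run would do: sleep "$sleep_secs"
echo "Thinking really hard for $sleep_secs seconds..."
echo "We calculated: $((value * 2))"
```

With value=10 this prints "We calculated: 20", agreeing with simple.out above.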

Looking at DAGMan's various files, we see that DAGMan itself ran as a Condor job (specifically, a scheduler universe job).

% ls simple.dag.*
simple.dag.condor.sub  simple.dag.dagman.log  simple.dag.dagman.out  simple.dag.lib.err  simple.dag.lib.out

% cat simple.dag.condor.sub
# Filename: simple.dag.condor.sub
# Generated by condor_submit_dag simple.dag 
universe        = scheduler
executable      = /opt/condor/bin/condor_dagman
getenv          = True
output          = simple.dag.lib.out
error           = simple.dag.lib.err
log             = simple.dag.dagman.log
remove_kill_sig = SIGUSR1
# Note: default on_exit_remove expression:
# ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
# attempts to ensure that DAGMan is automatically
# requeued by the schedd if it exits abnormally or
# is killed (e.g., during a reboot).
on_exit_remove  = ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
copy_to_spool   = False
arguments       = "-f -l . -Debug 3 -Lockfile simple.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag simple.dag -CsdVersion $CondorVersion:' '7.4.4' 'Oct' '13' '2010' 'BuildID:' '279383' '$"
environment     = _CONDOR_DAGMAN_LOG=simple.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
queue

Clean up some of these files:

% rm simple.dag.*

On your own

Challenge