COW Job Scheduler
The COW Job Scheduler is similar to DJM, which the CM-5 uses.
With it you can submit jobs, and they will be run on the COW nodes.
You can monitor their progress, or kill them if you wish.
The Job Scheduler can run your job on all or part of an existing partition,
or create a temporary "one-job" partition for the duration of your job.
You should add "/p/cow/bin" to your path to use these commands.
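A minimal sketch of one way to do this in a Bourne-style shell follows. The guard against adding the directory twice is an optional nicety, not something the Job Scheduler requires, and csh users would use their own shell's syntax instead.

```shell
# Append /p/cow/bin to PATH unless it is already there.
case ":$PATH:" in
  *:/p/cow/bin:*) ;;                          # already present, nothing to do
  *) PATH="$PATH:/p/cow/bin"; export PATH ;;  # add it at the end
esac
```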
csub & crun
csub submits a batch job.
crun submits an interactive job that is
attached to your terminal.
csub and crun accept the following options:
- -nodes num
- Run on num nodes. num need not be a multiple of 4.
- -range first-last
- Run on a specific range of nodes.
- -time timeval
- timeval is the maximum amount of time that the
job can run. If it exceeds this amount, it
might be killed.
An integer normally means minutes, but you can add
an h, m, or s to the end to specify hours, minutes,
or seconds. E.g. 2 means 2 minutes. 2h means
2 hours.
NOTE: This is wall clock time, not CPU time, as
on the CM-5.
- -after time
- -at time
- Start the job at or after the specified time.
The default is to start the job as soon as possible.
If the time contains embedded spaces, it must
be enclosed in quotes, for example:
-at "Sept 4 12:30pm"
See the -after option in the jsub.1 man page for more info.
- -cube
- If the job is run on a subpartition (part of an
existing partition) or a "one-job" partition,
getcube will be called to create the appropriate
cube. This is only needed for jobs that use the
Partition Manager. You can also select this
behavior by setting the NEED_CUBE environment
variable before calling csub or crun.
- -part name
- The job must run on the named partition.
If not specified, the Job Scheduler will pick
a partition for you. It can even create a partition
for your job, and delete it when the job finishes.
- -pvm
- PVM will be started on the partition for the
duration of the job.
- -spread
- The job will be started on all nodes of the
partition. The default is for the job to
run only on the first node of the partition.
The job can then spread itself to the rest of
the partition as needed.
*** THIS OPTION IS NOT YET IMPLEMENTED. Please
let me know if you want to use it, and I will
make it a priority. --Glen
- filename [args]
- The name of your program, with optional arguments.
- -- filename [args]
- If the program's arguments begin with dashes, use this
form, and put it after all Job Scheduler options.
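To make the -time rules above concrete, here is a small sketch that converts a timeval into seconds the way the description reads: a bare integer is minutes, and a trailing h, m, or s selects hours, minutes, or seconds. The function name timeval_to_seconds is hypothetical and not part of the Job Scheduler, and the real parser may accept forms this sketch rejects.

```shell
# timeval_to_seconds is a hypothetical helper, not a Job Scheduler command.
# Leading zeros (e.g. 08) are not handled: the shell would read them as octal.
timeval_to_seconds() {
  num=${1%[hms]}                      # strip one trailing unit letter, if any
  case "$num" in
    ''|*[!0-9]*) echo "bad timeval: $1" >&2; return 1 ;;
  esac
  case "$1" in
    *h) echo $((num * 3600)) ;;       # hours
    *s) echo $((num)) ;;              # seconds
    *)  echo $((num * 60)) ;;         # "m" suffix or a bare integer: minutes
  esac
}
```

For example, `timeval_to_seconds 2h` prints 7200, and `timeval_to_seconds 2` prints 120.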
If you omit the file name, run time, or number of nodes (unless
you specified a range), csub/crun will prompt you for them. Except for
the file name, the prompt will include a default value, which you
can accept by hitting the return key.
The Job Scheduler will create a file for your program's output.
Note that this only applies to the process(es) that are started
by the JS. If you start a process on another node, you will have to
handle your own output. Talk with me if you want to discuss ways to
handle this better in the future.
Note that AFS is not supported by the Job Scheduler yet, so you will
have to run your job from a public read/write directory to allow your
job to write its output file.
Many more options are available, inherited from DJM. "csub -help" and
"man jsub" give long but unreliable lists of options, since the
Job Scheduler is still under development. Please tell me (glen@cs)
if anything you try has particularly bad results, and I will try to
either fix it or disable it. Also tell me if there is a feature that
is not implemented yet that you would find useful.
Examples:
% csub -nodes 4 -time 30 ./runshort
Job submitted successfully. Job id is 160.
% csub -range 8-11 -after 15:30 ./runshort
Estimated run (wall clock) time (5min)?
Job submitted successfully. Job id is 161.
% csub -nodes 4 -time 30 -- ./runshort arg1 arg2 arg3
Job submitted successfully. Job id is 166.
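If a script needs the job id later (for cdel, say), it can be pulled out of the confirmation message shown in the examples above. The sketch below assumes the message keeps the form "Job id is N." and that csub writes it to standard output; extract_job_id is a hypothetical helper name.

```shell
# extract_job_id is a hypothetical helper; it reads csub's output on stdin
# and prints the number from a line of the form "... Job id is 160."
extract_job_id() {
  sed -n 's/.*Job id is \([0-9][0-9]*\).*/\1/p'
}

# Possible usage (not run here; assumes the message goes to stdout):
#   jid=$(csub -nodes 4 -time 30 ./runshort | extract_job_id)
```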
cstat
Shows queued and running jobs.
% cstat
USER  JID  PART   PROCS  TIME  CPU   STATUS   COMMAND
glen  166  sched  4      0:05  0:00  Running  pubwrite/./runshort
glen  158  j158   4      5:00  0:00  Que prt  pubwrite/./runshort
Here is a list of the codes that indicate why a particular job is not running:
- 1j
- one-job partition is for a different job
- bsy
- non-timeshared partition is busy
- cnf
- reconfigure is imminent
- ded
- dedicated job
- dra
- draining partition(s) to run other job(s)
- dwn
- partition is down or halted
- gat
- blocked by gate
- mem
- not enough memory available
- nde
- not a dedicated job
- npr
- requested number of nodes not available
- opr
- operator has held this job
- pme
- wrong memory for partition
- pmr
- wrong mem * runtime for partition
- pnj
- partition run limit exceeded
- pqu
- wrong queue for partition
- prm
- no exec permission on partition
- prt
- partition not available
- pru
- wrong runtime for partition
- qnj
- too many running jobs in queue
- qnm
- not enough memory available (queue)
- spr
- specific nodes not available
- stp
- queue is stopped
- tkt
- need Kerberos ticket
- tmj
- too many running jobs (global)
- ts
- wait for timesharing
- unj
- too many running jobs by user
- usr
- user has held this job
- wai
- wait until indicated time
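The reason code can be picked out of cstat's output with a one-line awk filter. This is only a sketch: it assumes the column layout of the sample cstat output above, where a queued job's STATUS is "Que" followed by the code. queued_reasons is a hypothetical helper name.

```shell
# queued_reasons is a hypothetical helper; it reads cstat output on stdin
# and prints "jobid code" for each queued job. It assumes STATUS starts in
# the seventh whitespace-separated field, as in the sample output.
queued_reasons() {
  awk '$7 == "Que" { print $2, $8 }'
}

# Possible usage (not run here):
#   cstat | queued_reasons
```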
cstat proc
Lists the user processes on free nodes and the nodes of scheduled partitions.
Options ...
- nodenum
- List user processes only on the given node.
- partname
- List user processes only on the given partition.
- -wide
- Show full command. Default is to cut it off at 80 columns.
% cstat proc 11 -wide
USER  JID  NODE  CPU   CHLDCPU  MEM  PID   COMMAND
glen  -    11    0:00  0:00     0    1775  /usr/psup/pvm/lib/pvmd -s -d0 -ncowe11 1 80691a6c:8171 4096 4 80691a6f:0000
% cstat proc
USER  JID    NODE  CPU   CHLDCPU  MEM  PID   COMMAND
glen  00963  8     0:00  0:00     0    1403  cstat proc
glen  00963  8     0:01  0:00     0    1389  /usr/psup/pvm/lib/pvmd
glen  00963  8     0:01  0:16     0    1377  -sh
glen  -      9     0:00  0:00     0    1718  /usr/psup/pvm/lib/pvmd -s -
glen  -      10    0:00  0:00     0    1498  /usr/psup/pvm/lib/pvmd -s -
glen  -      11    0:00  0:00     0    1775  /usr/psup/pvm/lib/pvmd -s -
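When the listing is long, a per-node process count can be easier to scan. The sketch below assumes NODE is the third whitespace-separated column, as in the sample output above; procs_per_node is a hypothetical helper name.

```shell
# procs_per_node is a hypothetical helper; it reads "cstat proc" output on
# stdin and prints one "node count" pair per line, sorted by node number.
procs_per_node() {
  awk 'NF >= 3 && $1 != "USER" { n[$3]++ }
       END { for (k in n) print k, n[k] }' | sort -n
}

# Possible usage (not run here):
#   cstat proc | procs_per_node
```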
cstat ticket
Reports the status of your Kerberos tickets on the COW.
This is primarily to diagnose problems. If the master has a ticket for
you, so should all of the free nodes and scheduled partitions.
Submitting a new job will forward your current ticket to the master, unless
the master already has a ticket with an expiration time at least as late as
the current ticket's.
Options ...
- user
- Given a username or usernum, report on that user.
The default is to report on the current user.
% cstat ticket
Kerberos tickets for glen:
Master: Ticket expires at May 13 14:55
Nodes 0 - 0: No ticket
Nodes 1 - 1: Ticket expires at May 13 14:55
Nodes 2 - 2: No ticket
Nodes 3 - 7: Ticket expires at May 13 14:55
Nodes 8 - 35: No ticket
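When diagnosing a ticket problem, it can help to list only the node ranges that lack a ticket. This sketch assumes the "Nodes A - B: No ticket" line format of the sample output above; nodes_without_ticket is a hypothetical helper name.

```shell
# nodes_without_ticket is a hypothetical helper; it reads "cstat ticket"
# output on stdin and prints each ticketless node range as "first-last".
nodes_without_ticket() {
  awk '$1 == "Nodes" && /No ticket/ { sub(/:$/, "", $4); print $2 "-" $4 }'
}

# Possible usage (not run here):
#   cstat ticket | nodes_without_ticket
```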
cdel
Deletes a job. You must be the job's owner or an operator.
% cdel 161
Job deleted.
Environment Variables
The following environment variables are created in the running job:
- CUBE=glen.4
- The name of the partition, and of the cube, if any.
- PARTITION=glen.4
- The name of the partition, and of the cube, if any.
- PART_BASEPROC=4
- The number of the first node in the partition (cowe04).
- PART_SIZE=2
- The number of nodes in the partition (cowe04-cowe05).
- PART_PROCNUM=0
- The relative node number within the partition.
- DJM_JOBID=2756
- The job number.
- JOB_DIR=/afs/cs.wisc.edu/u/staff/glen/cow/test
- The directory for this job.
- DJM_JOBNAME=runsho_2756
- The job name.
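A job that starts on the first node and spreads itself across the partition needs the names of the other nodes. The sketch below derives them from PART_BASEPROC and PART_SIZE. It assumes node names are "cowe" followed by a two-digit node number, which is inferred from the cowe04 and cowe05 examples above and is not a documented rule. partition_nodes is a hypothetical helper name.

```shell
# partition_nodes is a hypothetical helper. It assumes node names are
# "cowe" plus a zero-padded two-digit node number (an inference from the
# cowe04/cowe05 examples, not a documented naming rule).
partition_nodes() {
  i=0
  while [ "$i" -lt "$PART_SIZE" ]; do
    printf 'cowe%02d\n' $((PART_BASEPROC + i))
    i=$((i + 1))
  done
}
```

With PART_BASEPROC=4 and PART_SIZE=2, as in the list above, this prints cowe04 and cowe05 on separate lines.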
bolo (Josef Burger)
<bolo@cs.wisc.edu>