Subsections

Interfacing with Existing Batch Systems

CL-MW is designed to work with existing batch systems. CL-MW has no provision for starting up slaves on remote machines, moving files between machines, detecting and killing run away slaves, managing user identity, storing or transmitting credentials, or enforcing authorization policies for resource use. These, and many other features, are general features which batch systems usually provide. In general, batch systems provide very robust mechanisms for each of these features and are well suited to handle many edge cases which show up in practice. Usually a CL-MW application will be sharing resources with other applications across a common cluster of machines.

What CL-MW does provide is a means for communicating to a batch system called the resource file. The master process, when configured to do so, periodically writes information into the resource file, including liveness of the master process, if the computation is still in progress, how many slaves the master desires, and how to start up those slaves. The master process adjusts the resource requirement information in the file based upon the outstanding workload and the slaves successfully started by the batch system.

The resource file is written when the command line option
--mw-resource-file file-name is used with the master process. By default the master will rewrite it every 300 seconds (5 minutes) with new information. This can be controlled by the command line option
--mw-resource-file-update-interval integral-seconds.

The resource file contains information in Lisp form and is intended to be read by another process specific to the batch scheduler whose responsibility it is to acquire resources from said batch scheduler. The batch scheduler is directly notified in the resource file if the master algorithm 's computation is in progress or finished.

The slave process can also read the resource file in order to determine the master host, port, and member-id to which it should connect and if the computation is still in progress or finished. Depending upon the batch scheduler, having the slave read the resource file directly can be a very helpful because the dynamic connection information needed by the slave is available in the file. There is no requirement to adjust meta-information for the slave (e.g., its command line arguments which may specify a new master ip:port combination) through the batch system itself.

The internals of the resource file are described on page [*].

Condor

Condor is a versatile, robust, and free batch scheduling system available from http://www.cs.wisc.edu/condor. It can maintain high job throughput for tens of thousands of jobs on tens of thousands of resources. Condor's built in mechanisms for file transfer, job execution policies, and robustness mechanisms make it a good distributed computing platform on which to run CL-MW applications. We describe a simple application of Condor which provides a reliable execution platform for a CL-MW application.

Interfacing CL-MW with Condor

Condor can be told to transfer the job's input files to the remote execute node just before the job is about to start. Condor will do this each time a job tries to run when it has been sitting idle in the queue. We use this fact to copy over the resource file written by the master over to the slave so the slave knows where to connect.

We use a Condor feature that can control when a terminated job is to removed from the queue. We forbid the slave job to be removed from the queue unless the slave gets told to shutdown by the master or otherwise exits with the return code of zero. If the master dies without producing a completed answer, the slaves, having noticed that the master closed the connection without having been told to explicitly shut down, will exit with a non-zero status. The slave jobs will remain in the job queue due to the Condor enforced job policy. If the slave runs again and tries to connect using a stale resource file (due to the master not running for an extended period of time), the slave will again exit with a non-zero status value, again remaining in the queue.

We also submit the master process itself as a job into the Condor system. The master will always run on the submit node as opposed to being shipped to an execute node. This allows the master access to the needed input files or other resources usually only found on the submission machine and owned by the submitter of the job. We do not perform the same type of job policy for the master as we did for the slaves. If the master exits with any return code, it will be removed from the job queue. However, in the event of machine reboot (or many other types of failures), Condor will, when it starts running again, know that the master and slave processes were present in the job queue. It will restart the master and again find more resources for the slave jobs. When the master restarts it will write a new resource file. When a slave runs again, the new resource file gets moved to it and it will connect with the currently running master. Finally, if the master has finished and written the final update into the resource file stating the computation is finished, then any slaves that start up with that particular resource file will immediately exit with a status zero. This allows the slaves to be removed from the queue. If a slave is told to shutdown properly by the master, and does so, then it will exit with an exit code of 0 and also be removed from the queue.

The combination of the slave job policy and the restart robustness of the master makes CL-MW jobs reliable. In the event of a restart of the master process, the entire computation will restart. However, the entire computation will reliably restart and run until it finally completes.

The particular interface described in this section is: simple, doesn't make full use of the information the resource file, and wastes compute time on the execution nodes during the time the master process is down. It could be extended with an actual process outside of CL-MW and managed by Condor which watches the resource file and actively submits and removes slaves from Condor based on the master's current resource needs.

A Simple Interface to Condor

This example describes the Condor interface for the ping example supplied with CL-MW.

There are two Condor submit files, one for a single master process, and one for a static cluster of slave processes. The master will write a resource file and the slaves, using file transfer, will read it and know where to connect. No dynamic adjustment of resource acquisition is done in this example.

The Master Process

The following submit file details how the master process is to be run. This file is given to Condor's condor_submit which submits the job into Condor's job queue on the local machine.

# Begin master.sub
universe = scheduler
executable = ./ping
arguments = --mw-master \
--mw-slave-task-group 100 \
--mw-slave-result-group 100 \
--mw-resource-file resource-file.rsc \
--mw-slave-executable ping \
--mw-audit-file master.$(CLUSTER).$(PROCESS).audit \
--mw-member-id $(CLUSTER).$(PROCESS)

output = master.$(CLUSTER).$(PROCESS).out
error = master.$(CLUSTER).$(PROCESS).err
log = master.$(CLUSTER).$(PROCESS).log

notification = NEVER

queue 1
# End master.sub

Each line will be described as to its effect on the job submission.

universe = scheduler

The job is marked to be a scheduler universe job. It will start immediately and only on the machine where this job is submitted.

executable = ./ping

The executable Condor will use when executing this job. Condor and the ping executable will assume the needed libraries, if any, associated with the ping program are present in the same directory as the executable.

arguments = -mw-master \
-mw-slave-task-group 100 \
-mw-slave-result-group 100 \
-mw-resource-file resource-file.rsc \
-mw-slave-executable ping \
-mw-audit-file master.$(CLUSTER).$(PROCESS).audit \
-mw-member-id $(CLUSTER).$(PROCESS)

This is the complete list of command line arguments supplied to the master process when it is executed by Condor.

The meaning of the -mw-* arguments in numerical order are:
  1. Must be first and specifies that this invocation of the ping binary is the master process.
  2. Number of tasks that the master will pack into one network packet to a slave.
  3. Number of results that the slave will pack into one network packet to the master.
  4. Specify the resource file. The resource file path must be unique to each CL-MW computation submitted to the same scheduler daemon in Condor. The path must match between the master's submit file and the slave's submit file.
  5. We explicitly specify the slave executable name. Otherwise, the master would try to determine the name of itself when it is running in order to find its own executable to use as the slave executable in the resource file. The explicit specification of the slave executable is necessary because Condor specifies a different name for the executable when executing it.
  6. An audit file is specified based upon the cluster and process id of the job. It will be filled with information about who and how the slaves connect and what work is given to them. We use Condor's $(CLUSTER) and $(PROCESS) macros, which are unique per job submitted, to assign a unique identifier to this file.
  7. Using Condor's $(CLUSTER) and $(PROCESS) mechanism, we assign a unique identifier to the master. This identifier will be written into the resource file so that the slaves can authenticate themselves to the master.

output = master.$(CLUSTER).$(PROCESS).out

Any standard output written by the master algorithm will be written here.

output = master.$(CLUSTER).$(PROCESS).out

Any standard error output written by the master algorithm will be written here.

log = master.$(CLUSTER).$(PROCESS).log

This file is written by Condor and is a sequential record of a job's lifetime in Condor. A sample of the events which can happen to a job are: submission, execution, termination, held, released, etc. This file is a very useful debugging and tracking tool to find out the state in which a job may be.

notification = NEVER

No matter how this job completes, do not send an email to the account which submitted this job. Valid options are ALWAYS, COMPLETE (the default if notification is not specified), ERROR, and NEVER.

queue 1

This will submit one cluster of jobs into Condor with only one job in the cluster.

When this job is submitted, it should start immediately and create the resource file resource.rsc. After this file is in existence, the slaves can be submitted.

The Slave Processes

The following submit file submits a static cluster of vanilla jobs which are the slaves.

# Begin slaves.sub
universe = vanilla
executable = ./ping
arguments = --mw-slave \
--mw-resource-file resource-file.rsc

output = slaves.$(PROCESS).out
error = slaves.$(PROCESS).err
# All slaves will share a log file.
log = slaves.log

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = libiolib-syscalls.so,resource-file.rsc

notification = NEVER

on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

queue 1000
# End slaves.sub

We describe the interesting lines in this submit file.

universe = vanilla

This fixates a set of features for this job in the Condor system which state that the job can run on any suitable execute machine in the pool.

arguments = -mw-slave \
-mw-resource-file resource-file.rsc

The meaning of the -mw-* arguments in numerical order are:
  1. Must be first and specifies that this invocation of the ping binary is a slave process.
  2. Specifies a resource file the slave will use to contact the master process and identify itself.

should_transfer_files = YES

This states that Condor is responsible for moving any files the job needs or produces during its execution. If not specified it means there is a common file system between the submit machine and the execute machine that the job can access.

when_to_transfer_output = ON_EXIT

This specifies that Condor only cares about the output a job produces when a job completes.

transfer_input_files = libiolib-syscalls.so,resource-file.rsc

Any input files needed by the job are specified here. We specify the shared libraries the slave executable needs in addition to the resource file. Condor will transfer the most recent versions of these input files each time the job starts.

on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

This implements the job policy previously mentioned. Only allow the job be removed from the queue when it hasn't exited by a signal and the exit code of the job is zero. If the job exits for any other reason, it will remain in the job queue and be eventually restarted.

queue 1000

Submit one cluster with 1000 jobs in it into Condor.

In summary, one thousand jobs will be submitted into this cluster. The :slaves-needed setting in the resource file is ignored, as there is no overseer watching the resource file and managing resource acquisition on behalf of the master algorithm. The resource file is transmitted as an input file and under Condor this means to transmit it anew every time the job is rerun. Condor also send over any shared libraries or other files the executable needs. These slaves will stay in the queue until they either connect to a master and the master tells it to shut down, or because the master wrote :computation-status :finished into the resource file and the slave reads the file. In addition, we specify the on_exit_remove policy for the job; the slave will only be removed from the queue if the slave had not exited with a signal and the exit code was zero as requested by the master. The output of the slaves will be the standard out of the slave algorithm. Since no audit file is specified, the auditing information will go to the standard out of the process.

Environmental Requirements

Running binaries across a host of machines which differ in OS revisions, physical capabilities, and network will bring to the forefront scalability and binary compatibility problems. Here we describe a few common problems and their solutions.

Memory Requirements

Depending upon the memory capabilities of the resource slots in the pool (suppose they only have 384MB each), a slave may run once, consume 512MB which Condor records as the memory usage for the job, be preempted for some reason, and then never run again because no slot in the pool will accept a 512MB job anymore.

This is fixed by either adding Requirements = Memory > 0 (or whatever fits your needs) to the slave's submit file or adjusting the fixed runtime heap size with command line arguments to SBCL when you create the executable. The latter choice is safer since it models the true resource requirements your application needs and does a better job of preventing thrashing. The former is more useful for testing purposes.

Network or Disk Bandwidth

Another environmental concern is the network or disk bandwidth of the submit machine as potentially thousands of slaves simultaneously have their executables, shared libraries, and other files transferred from the submit machine and onto the execute slots. In practice this often isn't a problem, but it is good to know what to do if it becomes one.

The condor_schedd config file entries JOB_START_DELAY and
JOB_START_COUNT can be used to limit the job start rate to restrict network and disk bandwidth when bursts of jobs begin running.

Dynamic Linking

Since a CL-MW application is a dynamically linked binary, it will need to find and load its required libraries at run time. When the binary is moved from the submit host to the execute host, the execute host may not have a required dynamic library available, or more rarely, a required kernel syscall interface the job needs. In this event, the job (in our case, the slave) will (often) die with a SIGKILL and go back to idle (due to our on_exit_remove policy). It could be possible for a slave to continuously match to a machine upon which it cannot run. In this situation the slave will accumulate runtime but make no forward progress. The preferred solution is to package your libraries with your job. In the rare cases where this will not be sufficient, you may have to restrict the set of machine upon which your job runs. Please read section 2.5 in the Condor manual for how to specify this kind of a requirement for your job.

Peter Keller 2012-03-27