7.2 Setting up Condor

How do I set up a central manager on a machine with multiple network interfaces?

Please see section 3.7.3.

How do I get more than one job to run on my SMP machine?

Condor will automatically recognize an SMP machine and advertise each CPU of the machine separately. For more details, see section 3.13.9.

How do I configure a separate policy for the CPUs of an SMP machine?

Please see section 3.13.9 for a lengthy discussion on this topic.


How do I set up my machines so that only specific users' jobs will run on them?

Restrictions on what jobs will run on a given resource are enforced by only starting jobs that meet specific constraints, and these constraints are specified as part of the configuration.

To specify that a given machine should only run certain users' jobs, and always run the jobs regardless of other activity on the machine, load average, etc., place the following entry in the machine's Condor configuration file:

START = ( (User == "userfoo@baz.edu") || \
          (User == "userbar@baz.edu") )

A more likely scenario is that the machine is restricted to run only specific users' jobs, contingent on the machine not having other interactive activity and not being heavily loaded. For that policy, place the following entries in the machine's Condor configuration file. Note that extra configuration variables are defined to make the START variable easier to read.

# Only start jobs if:
# 1) the job is owned by the allowed users, AND
# 2) the keyboard has been idle long enough, AND
# 3) the load average is low enough OR the machine is currently
#    running a Condor job, and would therefore accept running
#    a different one
AllowedUser    = ( (User == "userfoo@baz.edu") || \
                   (User == "userbar@baz.edu") )
KeyboardUnused = (KeyboardIdle > $(StartIdleTime))
NoOwnerLoad    = ($(CPUIdle) || (State != "Unclaimed" && State != "Owner"))
START          = $(AllowedUser) && $(KeyboardUnused) && $(NoOwnerLoad)

To apply this policy to multiple machines, place these entries in a common configuration file that the machines share.
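
To see how the three sub-expressions combine, the following Python sketch mimics the START policy above for a few sample machine states. It is an illustration only and not Condor code: the function name start_allowed, the 15-minute idle threshold standing in for $(StartIdleTime), and the 0.3 load threshold standing in for $(CPUIdle) are assumptions. The real values come from your own configuration.

```python
# Illustrative model of the START policy above; not part of Condor.
ALLOWED_USERS = {"userfoo@baz.edu", "userbar@baz.edu"}
START_IDLE_TIME = 15 * 60   # assumed stand-in for $(StartIdleTime), in seconds
BACKGROUND_LOAD = 0.3       # assumed load threshold behind $(CPUIdle)

def start_allowed(user, keyboard_idle, non_condor_load, state):
    """Return True if the modeled START expression would accept the job."""
    allowed_user = user in ALLOWED_USERS
    keyboard_unused = keyboard_idle > START_IDLE_TIME
    cpu_idle = non_condor_load <= BACKGROUND_LOAD
    # Either the CPU is idle, or the machine is already running a Condor job
    no_owner_load = cpu_idle or state not in ("Unclaimed", "Owner")
    return allowed_user and keyboard_unused and no_owner_load

# Allowed user, idle keyboard, idle CPU
print(start_allowed("userfoo@baz.edu", 1200, 0.1, "Unclaimed"))
# User not on the list
print(start_allowed("other@baz.edu", 1200, 0.1, "Unclaimed"))
# Loaded CPU, but the machine is already claimed by Condor
print(start_allowed("userbar@baz.edu", 1200, 0.9, "Claimed"))
```

Only the second call is rejected, because the user is not in the allowed list. The third is accepted despite the high load, since the machine is already running a Condor job.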

How do I configure Condor to run my jobs only on machines that have the right packages installed?

This is a two-step process. First, you need to tell the machines to report that they have special software installed, and second, you need to tell the jobs to require machines that have that software.

To tell the machines to report the presence of special software, first add a parameter to their configuration files like so:

HAS_MY_SOFTWARE = True

Then make sure the attribute is listed in STARTD_ATTRS. If STARTD_ATTRS is already defined in that file, add HAS_MY_SOFTWARE to it; if not, add the line:

STARTD_ATTRS = HAS_MY_SOFTWARE, $(STARTD_ATTRS)

NOTE: For these changes to take effect, each condor_startd you update needs to be reconfigured with condor_reconfig -startd.

Next, to tell your jobs to only run on machines that have this software, add a requirements statement to their submit files like so:

Requirements = (HAS_MY_SOFTWARE =?= True)

NOTE: Be sure to use =?= instead of ==. With =?=, a machine that doesn't have the HAS_MY_SOFTWARE parameter defined makes the comparison evaluate to False rather than to "undefined"; an undefined Requirements expression would prevent the job from running anywhere!
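
The difference between the two operators can be sketched with a toy model in Python. This is not Condor's ClassAd implementation. The names UNDEFINED, eq, and meta_eq are hypothetical, and the model covers only the behavior described in the note above: == yields undefined when an operand is undefined, while =?= always yields True or False.

```python
# Toy model of ClassAd comparison semantics; not Condor code.
UNDEFINED = object()   # stand-in for the ClassAd value "undefined"

def eq(a, b):
    """Model of ==: the result is undefined if either operand is undefined."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b

def meta_eq(a, b):
    """Model of =?=: always True or False, never undefined."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    # Require matching types as well as matching values
    return type(a) is type(b) and a == b

machine_ad = {}  # a machine that does not advertise HAS_MY_SOFTWARE
attr = machine_ad.get("HAS_MY_SOFTWARE", UNDEFINED)

print(eq(attr, True) is UNDEFINED)   # the == comparison is undefined
print(meta_eq(attr, True))           # the =?= comparison is simply False
```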


How do I configure Condor to only run jobs at night?

A commonly requested policy for running batch jobs is to only allow them to run at night, or at other pre-specified times of the day. Condor allows you to configure this policy with the use of the ClockMin and ClockDay condor_startd attributes. A complete example of how to use these attributes for this kind of policy is discussed in section 3.5.9.
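
As a rough illustration of what these two attributes represent, the Python sketch below derives them from a timestamp and applies an "after hours" test. ClockMin is the number of minutes since midnight, and ClockDay is the day of the week with Sunday as 0. The working day of 08:00 to 17:00, the choice to treat weekends as after hours, and the function names are assumptions made for this example. The actual policy belongs in the configuration file, as described in section 3.5.9.

```python
from datetime import datetime

WORK_START_MIN = 8 * 60    # assumed start of the working day (08:00)
WORK_END_MIN = 17 * 60     # assumed end of the working day (17:00)

def clock_min(now):
    """Minutes since midnight, like the ClockMin attribute."""
    return now.hour * 60 + now.minute

def clock_day(now):
    """Day of week with Sunday = 0, like the ClockDay attribute."""
    # Python's weekday() counts from Monday = 0, so shift by one
    return (now.weekday() + 1) % 7

def after_hours(now):
    """True outside the assumed working hours or on a weekend."""
    weekend = clock_day(now) in (0, 6)
    minute = clock_min(now)
    outside = minute < WORK_START_MIN or minute >= WORK_END_MIN
    return weekend or outside

print(after_hours(datetime(2024, 1, 3, 14, 30)))  # Wednesday afternoon
print(after_hours(datetime(2024, 1, 3, 23, 0)))   # Wednesday night
print(after_hours(datetime(2024, 1, 6, 12, 0)))   # Saturday noon
```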


How do I configure Condor such that all machines do not produce checkpoints at the same time?

If machines are configured to produce checkpoints at fixed intervals, a large number of jobs are queued (submitted) at the same time, and these jobs start on machines at about the same time, then all these jobs will be trying to write out their checkpoints at the same time. This is likely to cause rather poor performance during the burst of writing.

The RANDOM_INTEGER() macro can help in this instance. Instead of defining PERIODIC_CHECKPOINT to be a fixed interval, each machine is configured to randomly choose one of a set of intervals. For example, to set a machine's interval for producing checkpoints to within the range of two to three hours, use the following configuration:

PERIODIC_CHECKPOINT = $(LastCkpt) > ( 2 * $(HOUR) + \
      $RANDOM_INTEGER(0,60,10) * $(MINUTE) )

The interval used is set at configuration time. Each machine randomly picks one of the possible intervals (2 hours, 2 hours and 10 minutes, 2 hours and 20 minutes, and so on up to 3 hours) at which to produce checkpoints. Therefore, the various machines will not all attempt to produce checkpoints at the same time.
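
The set of intervals this configuration can produce may be enumerated with a short Python sketch. It only models the arithmetic of the expression above. The function name pick_checkpoint_interval is hypothetical, and random.randrange stands in for the $RANDOM_INTEGER(0,60,10) macro, which selects one of 0, 10, 20, ..., 60.

```python
import random

HOUR = 3600    # seconds, mirroring $(HOUR)
MINUTE = 60    # seconds, mirroring $(MINUTE)

def pick_checkpoint_interval(rng):
    """Model one machine choosing its checkpoint interval, in seconds."""
    # Stand-in for $RANDOM_INTEGER(0,60,10): a multiple of 10 from 0 to 60
    extra_minutes = rng.randrange(0, 61, 10)
    return 2 * HOUR + extra_minutes * MINUTE

# Every interval a machine could end up with, expressed in minutes
possible = [2 * HOUR + m * MINUTE for m in range(0, 61, 10)]
print([seconds // MINUTE for seconds in possible])

# Each machine draws once, when its configuration is read
rng = random.Random()
print(pick_checkpoint_interval(rng) in possible)
```

With seven possible values, some machines will still share an interval, but the checkpoint writes are spread over an hour rather than landing all at once.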

Why will the condor_master not run when a local configuration file is missing?

If a LOCAL_CONFIG_FILE is specified in the global configuration file, but the specified file does not exist, the condor_master will not start up, and it prints a variation of the following example message.

ERROR: Can't read config file /mnt/condor/hosts/bagel/condor_config.local

This is not a bug; it is a feature! Condor has always worked this way on purpose. There is a potentially large security hole if Condor is configured to read from a file that does not exist. By creating that file, a malicious user could change all sorts of Condor settings. On a machine where the daemons are running as root, this is an easy way to gain root access.

The intent is that if you've set up your global configuration file to read from a local configuration file, and the local file is not there, then something is wrong. It is better for the condor_master to exit right away and log an error message than to start up.

If the condor_master continued with the local configuration file missing, either A) someone could breach security or B) you would have potentially important configuration information missing. Consider the example where the local configuration file is on an NFS partition and the server is down. The local configuration file may contain important settings, and Condor might do bad things if it started without them.

If it is supplied with an empty file, the condor_master works fine.


condor-admin@cs.wisc.edu