<H2>FLEXlm Monitor Server</H2>
This page contains software to help determine if a program exited because
it was denied a license. There's a <A href="#intro">long introduction</A>, and 
then the <a href="#software">actual software.</A>
<HR>

<H3><A NAME="intro"></A>Introduction</H3>

Many jobs require more than just a CPU to execute successfully. Some
examples of resources include a CPU, network bandwidth, and storage
space. 

Another important resource, particularly for commercial users, is
a software license. 

In some fields, the license can cost significantly more than
the hardware the software runs on.  Most software license
management is done with a package called FLEXlm from Macrovision.

FLEXlm allows programs to contact a license server
and verify that the number of copies in use in an organization has not
exceeded some count. 

 <P>

Using a resource in Condor is a two-step process. First, a job "matches" 
with a resource. Second, the job turns this match into an actual execution with a
"claiming" step. For fault-tolerance, the resource may be matched multiple
times, but the claiming process guarantees that only one job will succeed in
executing on a resource. 

<P>

FLEXlm does not share this notion of matching and claiming. Matching is 
implicit, and claiming is done automatically by the program. Care must
therefore be taken to not run more jobs than available resources. 
Two ways to do this in Condor today are to limit the number
of jobs a schedd will run with the MAX_JOBS_RUNNING configuration 
setting, or to use custom attributes in the condor_startd
to "assign" a license to some set of machines in a pool. Neither
of these solutions is ideal, and the Condor Team is working
on better ones, but that is not the focus of this page. 
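<P>
As a rough sketch of the two workarounds just described, the configuration
might look something like the following. The attribute name HAS_MYAPP_LICENSE
and the value 10 are made up for illustration. STARTD_EXPRS is the setting
older Condor releases use to advertise extra attributes (newer releases call
it STARTD_ATTRS), so check the manual for your version.

<PRE>
# On the submit machine: cap how many jobs this schedd runs at once.
# 10 is a placeholder for the number of licenses you own.
MAX_JOBS_RUNNING = 10

# On each execute machine that is "assigned" a license: advertise a
# custom attribute. HAS_MYAPP_LICENSE is a hypothetical name.
HAS_MYAPP_LICENSE = True
STARTD_EXPRS = HAS_MYAPP_LICENSE
</PRE>

A job would then ask for one of those machines in its submit file:

<PRE>
requirements = (HAS_MYAPP_LICENSE =?= True)
</PRE>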

<P>

It will be difficult to guarantee that Condor never attempts to run
more jobs than there are available licenses. For one, Condor may not be in charge
of managing all of the licenses. Because a license is only claimed
when the job starts, there is also a window of vulnerability between
the time Condor schedules a resource and the time the job starts running, during
which another user can "steal" the license Condor intended for the job. Additionally, in
the trade-off between performance and fault-tolerance, different parts of
the system may have different ideas about the number of available licenses.
In any distributed system, the mantra must be "be prepared to deal with
failures", so we must be prepared for a job that we expected to
run with a license to fail instead. When we detect such a failure,
we must reschedule the job and try again later.

<P>

Unfortunately, the only information we can get from the job is the exit code.
We must be able to distinguish between a successful exit and an exit because
of a failure.  Tradition holds that a successful run of a job exits with
status 0 and an unsuccessful run exits with non-zero, but in reality
an application fails or succeeds in an application-specific way. 
The <a href="http://www.cse.nd.edu/~dthain/papers/error-scope.pdf">error 
scope</A> is much more complicated, and zero versus non-zero alone is not sufficient. 
To determine success, Condor must either
understand the application's exit status or learn from other sources whether
the job was able to check out a license.

<P>

Condor gives the user several tools to determine if the job has truly completed
or if it should be run again. If the job fails with a well-known exit
code when it cannot get a license, the on_exit_remove expression can be
used to put the job back into the queue when that exit code is seen, and to
leave the queue when the job exits with something else. If the proper exit 
code is not known, other characteristics of the job may be used. For example,
on_exit_remove can cause any job that runs for less than five minutes to 
requeue, because no successful run of the job will complete in less than 
five minutes. Another source of information could be to parse the output
of the job and look for status information. This was the approach initially
taken by <a href="http://www.bio-itworld.com/archive/031704/box.html">SGE
users</A> who were using a commercial application that used the same
exit code for all errors. This technique works in Condor as well, for
example in the POST script of a DAGMan job.
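<P>
The two on_exit_remove policies described above might be written roughly as
follows in a submit file. The exit code 42 is a placeholder for whatever code
your application uses when it cannot get a license. The second expression
assumes the CurrentTime and JobCurrentStartDate attributes are available in
your version of Condor. Use one expression or the other, not both.

<PRE>
# Requeue when the application exits with its "no license" code.
# 42 is a hypothetical value; substitute your application's code.
# A job killed by a signal also stays in the queue with this expression.
on_exit_remove = (ExitBySignal == FALSE) && (ExitCode != 42)

# Alternative: requeue any run that lasted less than five minutes
# (300 seconds), on the assumption that no successful run is that short.
on_exit_remove = (CurrentTime - JobCurrentStartDate) >= 300
</PRE>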

<P>

If the exit status and output are not sufficient to determine license 
failure, there is another source of information: the FLEXlm server's
log. We have written a simple network server that follows the log
of the FLEXlm server and keeps track of license denials. After a job 
completes, a script can contact the FLEXlm monitor and ask "were there
any licenses denied to the host the job ran on in the time period between
the start of the job and the exit of the job?" The script must use this
information conservatively: if the execute machine
can run more than one job at a time, FLEXlm does not have enough information
to tell which job on that machine was denied the license. If no
licenses were denied on the execute machine, however, Condor can be reasonably
certain that the job was able to get any license it needed, and that any failure
was not caused by a denied license.
<P>
<HR>
<H3><A NAME="software"></A>Software</H3>
The FLEXlm monitor is meant to be run on the same machine as your FLEXlm
server.  It comes in two parts: <a href="flexlm-monitor">a Perl server</A> and a <a href="config">config file</A>.
<P>
It needs two Perl modules that aren't normally in a Perl installation.

<PRE>
File::Tail
Date::Calc
</PRE>

The monitor reads in the FLEXlm logfile and plays back the journal to
see what has happened with licenses. Then it opens up a network connection
and listens for commands.
<P>
It can currently only manage one product in one logfile. That might 
change someday, but for now you have to run a separate monitor on a different
port for each product (though they can all share the same FLEXlm logfile).
<P>
To check if a license has failed, use <a href="client">this client</A>. It 
takes a couple of arguments:
<PRE>
--host &lt;hostname&gt;     # What host is running the FLEXlm monitor?
--port &lt;portnumber&gt;   # What port is the FLEXlm monitor listening on?
--starttime &lt;time&gt;    # lower bound on the range, defaults to 1
--endtime &lt;time&gt;      # upper bound on the range, defaults to current time
--user  &lt;name&gt;        # which user@host are we asking about?
</PRE>
<P>
A couple of notes:
<UL>
<li>The values for starttime and endtime are in UNIX epoch seconds.
<li>user is user@host. It's the value that the FLEXlm library inside the
application you're running constructs. The easiest way to get it seems to be
calling hostname to find out what machine you're on, and using id to get
your username (or you could call something like getpwuid(geteuid())).
</UL>
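<P>
A minimal Perl sketch of building that user@host value with
getpwuid(geteuid()) and the standard Sys::Hostname module is shown below.
Whether FLEXlm records the short or the fully qualified hostname is an
assumption here, so compare the result against what appears in your FLEXlm log.

<PRE>
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(geteuid);
use Sys::Hostname qw(hostname);

# In scalar context, getpwuid returns the login name for the given UID.
my $user = getpwuid(geteuid());
die "cannot determine username\n" unless defined $user;

# Hostname as reported by the local machine. Assumption: this matches
# the form FLEXlm writes to its log.
my $host = hostname();

my $user_at_host = $user . '@' . $host;
print "$user_at_host\n";
</PRE>

The printed value is what you would pass to the client's --user argument.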
<P>
To use it, wrap your job in a script like so (also available <a href="wrapper">here</A>):

<PRE>
#!/usr/bin/perl

my $starttime = time;


# Actually run the job.
# In this example, our job is a FLEXlm-wrapped version of
# perl (because it was easy to use SAMwrap to create it, and
# by having it be a perl script we can have the executable do anything
# we want).
`/home/epaulson/license/bin/perl /home/epaulson/license/bin/foo.pl`;

# FLEXlm will only log about once every 20 seconds or so - give the
# license server "long enough" to write its state out to disk
sleep(30);

my $now = time;
chmod 0755, "client";
my $denied = `./client --host flexlm.cs.wisc.edu --starttime $starttime --endtime $now`;

chomp($denied);

if ($denied eq "NO") {
#This is the successful case, exit with zero 
exit 0;
}

exit 1;

</PRE>

<P>
Submit your job like this (also available <a href="example.sub">here</A>):
<PRE>
universe = vanilla
executable = wrapper 

transfer_files = always
transfer_input_files = client

output = test.out.$(PROCESS)
log = test.log

notification = never

on_exit_remove = (ExitBySignal==FALSE) && (ExitCode == 0)

queue 100
</PRE>