FLEXlm Monitor Server

This page contains software to help determine if a program exited because it was denied a license. There's a long introduction, and then the actual software.

Introduction

Many jobs require more than just a CPU to execute successfully. Examples of resources include CPU time, network bandwidth, and storage space. Another important resource, particularly for commercial users, is a software license. In some fields, the license can cost significantly more than the hardware the software runs on. Most software license management is done with a package called FLEXlm from Macrovision. FLEXlm allows programs to contact a license server and verify that the number of copies in use in an organization has not exceeded some count.

Using a resource in Condor is a two-step process. First, a job "matches" with a resource. The job then turns this match into a concrete execution with a "claiming" step. For fault tolerance, the resource may be matched multiple times, but the claiming process guarantees that only one job will succeed in executing on a resource.

FLEXlm does not share this notion of matching and claiming. Matching is implicit, and claiming is done automatically by the program when it starts. Care must therefore be taken not to run more jobs than there are available licenses. There are two ways to do this in Condor today. One is to limit the number of jobs a schedd will run with the MAX_JOBS_RUNNING configuration setting. The other is to use custom attributes in the condor_startd to "assign" a license to some set of machines in a pool. Neither solution is ideal, and the Condor Team is working on better ones, but that is not the focus of this page.

It will be difficult to guarantee that Condor never attempts to run more jobs than there are available licenses. For one, Condor may not be in charge of managing all of the licenses. Because a license is not claimed until the job starts, there is a window of vulnerability between the time Condor schedules a resource and the time the job starts running, during which another user can "steal" the license Condor intended for the job. Additionally, in the trade-off between performance and fault tolerance, different parts of the system may have different ideas about the number of available licenses. In any distributed system, the mantra must be "be prepared to deal with failures", so we must be prepared for a job that we expected to run with a license to fail instead. When we detect such a failure, we must reschedule the job and try again later.

Unfortunately, the only information we can get from the job is the exit code. We must be able to distinguish between a successful exit and an exit caused by a failure. Tradition holds that a successful run of a job exits with status 0 and an unsuccessful run exits with a non-zero status, but in reality the application fails or succeeds in an application-specific way. The error scope is much more complicated, and zero versus non-zero is not enough. To determine success, Condor must either understand the application's exit status or learn from other sources whether the job was able to check out a license.

Condor gives the user several tools to determine if the job has truly completed or if it should be run again. If the job fails with a well-known exit code when it cannot get a license, the on_exit_remove expression can be used to put the job back into the queue when that exit code is seen, and to let it leave the queue when it exits with anything else. If the proper exit code is not known, other characteristics of the job may be used. For example, if no successful run of the job can complete in less than five minutes, on_exit_remove can requeue any job that runs for less than five minutes. Another option is to parse the output of the job and look for status information. This was the approach initially taken by SGE users who were running a commercial application that used the same exit code for all errors. This technique works in Condor as well, for example in the POST script of a DAGMan job.
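The output-parsing approach could look something like the following. This is only a minimal Python sketch, not part of the software on this page: the patterns in DENIAL_PATTERNS and the function names are assumptions, and you would substitute whatever message your application actually prints when it cannot check out a license.

```python
import re

# Hypothetical patterns: replace these with the text your application
# really prints when a license checkout fails.
DENIAL_PATTERNS = [
    re.compile(r"licensed number of users already reached", re.IGNORECASE),
    re.compile(r"cannot connect to license server", re.IGNORECASE),
]


def output_shows_license_denial(lines):
    """Return True if any output line matches a known denial pattern."""
    return any(p.search(line) for line in lines for p in DENIAL_PATTERNS)


def post_script_exit_code(lines):
    """Exit code for a POST script: 1 if a denial was seen, otherwise 0.

    A non-zero exit from a DAGMan POST script marks the node as failed,
    which lets the DAG retry it.
    """
    return 1 if output_shows_license_denial(lines) else 0
```

A POST script built on this would read the job's output file, pass its lines to post_script_exit_code, and exit with the returned value.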

If the exit status and output are not sufficient to determine license failure, there is another source of information: the FLEXlm server's logs. We have written a simple network server that follows the log of the FLEXlm server and keeps track of license failures. After a job completes, a script can contact the FLEXlm monitor and ask "were there any licenses denied to the host the job ran on in the time period between the start of the job and the exit of the job?" The script must use this information conservatively: if the execute machine can run more than one job at a time, the FLEXlm log does not contain enough information to tell which job on that machine was denied a license. If no licenses were denied on the execute machine, however, Condor can be reasonably certain that the job got every license it needed and that the failure has some other cause.
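The question the script asks can be sketched in a few lines of Python. This illustrates the conservative logic only; the Denial record and both function names are hypothetical and do not describe the monitor's actual data structures or protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Denial:
    """One denied checkout as the monitor might remember it (hypothetical)."""
    timestamp: int      # UNIX epoch seconds
    user_at_host: str   # e.g. "alice@node1.example.com"


def denied_on_host(denials, host, start, end):
    """True if any denial was logged for `host` with start <= time <= end."""
    for d in denials:
        denied_host = d.user_at_host.rpartition("@")[2]
        if denied_host == host and start <= d.timestamp <= end:
            return True
    return False


def classify_failure(denials, host, start, end):
    """Conservative verdict for a failed job.

    Any denial on the host during the run means the job *may* have been
    the one denied, so it is retried. No denial means the failure was
    not caused by a missing license.
    """
    if denied_on_host(denials, host, start, end):
        return "retry"
    return "not-a-license-failure"
```

Note that "retry" can be a false positive when several jobs share one execute machine, while "not-a-license-failure" is the answer that can be trusted.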


Software

The FLEXlm monitor is meant to be run on the same machine as your FLEXlm server. It comes in two parts: a Perl server and a config file.

It needs two Perl modules that aren't normally in a Perl installation:

  • File::Tail
  • Date::Calc

The monitor reads in the FLEXlm logfile and plays back the journal to see what has happened with licenses. Then it opens up a network connection and listens for commands.
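Playing back the journal amounts to scanning the logfile for denial records. The Python sketch below shows the idea, not the monitor's actual code. It assumes log lines of the form `HH:MM:SS (daemon) DENIED: "feature" user@host (reason)`; check your own logfile, since the format may differ. The function name is hypothetical, and turning the time of day into epoch seconds (which needs the date from elsewhere in the log) is left out.

```python
import re

# Assumed line shape: HH:MM:SS (daemon) DENIED: "feature" user@host (reason)
DENIED_RE = re.compile(
    r'^\s*(\d{1,2}):(\d{2}):(\d{2})\s+\((\S+)\)\s+DENIED:\s+"([^"]+)"\s+(\S+@\S+)'
)


def parse_denials(lines):
    """Return (seconds_since_midnight, feature, user_at_host) per DENIED line."""
    denials = []
    for line in lines:
        match = DENIED_RE.match(line)
        if match is None:
            continue  # not a denial record; ignore it
        hh, mm, ss, _daemon, feature, user_at_host = match.groups()
        seconds = int(hh) * 3600 + int(mm) * 60 + int(ss)
        denials.append((seconds, feature, user_at_host))
    return denials
```

A long-running monitor would apply the same matching to each new line as the logfile grows, which is the job File::Tail does in the Perl server.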

It can currently manage only one product in one logfile. That might change someday, but for now you have to run a separate monitor on a different port for each product (though they can all share the same FLEXlm logfile).

To check whether a license was denied, use this client. It takes a few arguments:

--host       # What host is running the FLEXlm monitor?
--port       # What port is the FLEXlm monitor listening on?
--starttime  # Start of the time period to check
--endtime    # End of the time period to check

A couple of notes:

  • The values given to starttime and endtime are in UNIX epoch seconds.
  • user is user@host. This is the value that the FLEXlm library inside the application you're running constructs. The easiest way to build it seems to be calling hostname to find out what machine you're on and id to get your username (or you could call something like getpwuid(geteuid())).
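Building the user@host value could be sketched in Python as follows. The function name is hypothetical, and the result only matches if FLEXlm sees the same hostname form (short or fully qualified) that the operating system reports, so compare it against an entry in your logfile.

```python
import os
import pwd
import socket


def flexlm_user_at_host():
    """Build "user@host" from the effective UID and the local hostname."""
    uid = os.geteuid()
    try:
        # Equivalent of getpwuid(geteuid()) in C or Perl.
        username = pwd.getpwuid(uid).pw_name
    except KeyError:
        # No passwd entry for this UID; fall back to the number itself.
        username = str(uid)
    return f"{username}@{socket.gethostname()}"
```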

    To use the client, wrap your job in a script like so (also available here):

    #!/usr/bin/perl
    use strict;
    use warnings;
    
    # Record when the job started, in UNIX epoch seconds.
    my $starttime = time;
    
    # Actually run the job.
    # In this example, our job is a FLEXlm-wrapped version of
    # perl (because it was easy to use SAMwrap to create it, and
    # by having it be a perl script we can have the executable do
    # anything we want).
    `/home/epaulson/license/bin/perl /home/epaulson/license/bin/foo.pl`;
    
    # FLEXlm will only log about once every 20 seconds or so. Give the
    # license server "long enough" to write its state out to disk.
    sleep(30);
    
    my $now = time;
    
    # The client is shipped with the job via transfer_input_files;
    # make sure it is executable before running it.
    chmod 0755, "client";
    my $denied = `./client --host flexlm.cs.wisc.edu --starttime $starttime --endtime $now`;
    chomp($denied);
    
    # "NO" means no license was denied: this is the successful case.
    if ($denied eq "NO") {
        exit 0;
    }
    
    # Otherwise exit non-zero so that on_exit_remove requeues the job.
    exit 1;
    

    Submit your job like this (also available here):

    universe = vanilla
    executable = wrapper 
    
    transfer_files = always
    transfer_input_files = client
    
    output = test.out.$(PROCESS)
    log = test.log
    
    notification = never
    
    on_exit_remove = (ExitBySignal==FALSE) && (ExitCode == 0)
    
    queue 100