These are my personal notes and do not represent anything official.
Working with Paolo Desiati at UW-Physics.
A typical run is 100,000 events and takes about 1.5 CPU hours.
Results from intermediate steps are not needed and are deleted when the next step is done with them.
Working on Intel/Linux machines. Using the ~100-CPU PBS cluster at Physics. Trying to use the CS HTCondor pool.
Originally made each step a separate HTCondor job. This swamped the submit box with I/O on the big files. Now using a single Perl script that runs the entire stream. The script takes a static 15 MB tarball (expands to 500 MB) and a small amount of job-specific information.
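A submit description along these lines would match that setup (a sketch only; the executable and file names are hypothetical, and it uses the transfer_files syntax mentioned in these notes):

```
# One job = one full stream; ship the static tarball plus per-job config.
universe             = vanilla
executable           = run_stream.pl        # hypothetical wrapper script name
arguments            = job_0001.cfg
transfer_files       = ALWAYS
transfer_input_files = static_tools.tar.gz, job_0001.cfg
queue
```

Because the big intermediate files are created and deleted on the execute machine, only the tarball and the small config cross the submit box per job.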
Want to change the Perl script to recognize whether step 2 (Corama) is already done and, if so, start with step 3. Coupled with transfer_files=always, this means that if a job is evicted, it may be able to recover some completed work. The current plan is to have the Perl script first move the output file into its final position, then create a "step 2 done" marker file.
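The move-then-mark ordering can be sketched as follows (in Python for illustration, though the actual script is Perl; the output and marker filenames are hypothetical). Creating the marker only after the output is in its final position means an eviction between the two operations costs at worst a redo of step 2, and step 3 never starts from a half-written file:

```python
import os
import shutil

STEP2_OUTPUT = "corama_output.dat"  # hypothetical final output name
STEP2_MARKER = "step2_done"         # hypothetical marker file name

def finish_step2(tmp_output):
    """Publish step 2's result: move first, then mark done."""
    # Move the file into final position before creating the marker,
    # so the marker's existence always implies a complete output file.
    shutil.move(tmp_output, STEP2_OUTPUT)
    open(STEP2_MARKER, "w").close()

def first_step_to_run():
    """On (re)start, skip to step 3 if the marker survived eviction."""
    return 3 if os.path.exists(STEP2_MARKER) else 2
```

The same check generalizes to later steps by adding one marker per step.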
Steps in the process: