Condor Configuration

Move EXECUTE to /export/scratch/condor/execute
- More space for sandbox directories that jobs will run in
- Note Jobs wont operate with this change unless the EXECUTE directory has the correct permissions (755 a.k.a. drwxr-xr-x)
- the rest of the files condor reads/writes are kept in /var which is large enough if we keep the root partition O(100GB)
Separate roles
- head node is only the head node (condor: use ROLE : CentralManager)
- Login/submit node is only login/submit node (condor: use ROLE : Submit)
- worker nodes are only worker nodes (condor: use ROLE : Execute)
Default allocation keeps 1 cpu per slot
- leave this as is since all of our multi-threaded apps are moved to other clusters like MSI
Pretend the data storage space (/local/cms/) is not mounted to worker nodes
- allows condor to handle file transfer to scratch space
- centrally managed file transfer means the system can slow down jobs if the storage node is getting too many read calls (aka "throttling")
Increase job throttling limit FILE_TRANSFER_DISK_LOAD_THROTTLE = 5.0
- Default FILE_TRANSFER_DISK_LOAD_THROTTLE is 2.0 which is pretty conservative, we tried a value of 10.0 but that seemed to cause the submit node to lose connection with the NFS host so we settled on the moderate value of 5.0.
Set JOB_START_DELAY to 5
- This makes sure no two jobs start within 5s of each other. This staggering is helpful to avoid overwhelming the shared data storage location.
Maximum number of jobs kept in any single schedulers history is 10000
- This is the default set by HTCondor and means we should keep our clusters of jobs to less than 10k so we can inspect them via the history.

UMN SPA Cluster

Condor Configuration