You can configure queues so that LSF detects exceptional conditions while jobs are running and takes appropriate action automatically. You can customize which exceptions are detected and the corresponding actions. By default, LSF does not detect any exceptions.
If you configure job exception handling in your queues, LSF detects the following job exceptions:
Job underrun - job ends too soon (run time is less than expected). Underrun jobs are detected when a job exits abnormally.
Job overrun - job runs too long (run time is longer than expected). By default, LSF checks for overrun jobs every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for job overrun.
Idle job - running job consumes less CPU time than expected (in terms of CPU time/runtime). By default, LSF checks for idle jobs every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for idle jobs.
You can configure your queues to detect job exceptions. Use the following parameters:
JOB_IDLE - Specify a threshold for idle jobs. The value should be a number between 0.0 and 1.0 representing CPU time/runtime. If the job idle factor is less than the specified threshold, LSF invokes eadmin to trigger the action for a job idle exception.
JOB_OVERRUN - Specify a threshold for job overrun. If a job runs longer than the specified run time (in minutes), LSF invokes eadmin to trigger the action for a job overrun exception.
JOB_UNDERRUN - Specify a threshold for job underrun. If a job exits before the specified number of minutes, LSF invokes eadmin to trigger the action for a job underrun exception.
The following queue defines thresholds for all types of job exceptions:
Begin Queue
...
JOB_UNDERRUN = 2
JOB_OVERRUN = 5
JOB_IDLE = 0.10
...
End Queue
For this queue:
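A job underrun exception is triggered for jobs running less than 2 minutes.
A job overrun exception is triggered for jobs running longer than 5 minutes.
A job idle exception is triggered for jobs with an idle factor (CPU time/runtime) less than 0.10. For example, a job that has used 30 seconds of CPU time over 10 minutes of run time has an idle factor of 30/600 = 0.05, which is below the threshold.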
By default, LSF checks for job exceptions every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for overrun, underrun, and idle jobs.
Tune EADMIN_TRIGGER_DURATION carefully. Shorter values may raise false alarms; longer values may not trigger exceptions frequently enough.
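As a minimal sketch, the following lsb.params fragment changes the checking interval. The value 5 is illustrative only, and it is assumed here that the value is given in minutes, consistent with the 1-minute default:
Begin Parameters
...
EADMIN_TRIGGER_DURATION = 5
...
End Parameters
Under that assumption, LSF would check for job exceptions every 5 minutes instead of every minute.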