Handle job exceptions in queues

You can configure queues so that LSF detects exceptional conditions while jobs are running and takes appropriate action automatically. You can customize which exceptions are detected and the corresponding actions. By default, LSF does not detect any exceptions.

Job exceptions LSF can detect

If you configure job exception handling in your queues, LSF detects the following job exceptions:

  • Job underrun - job ends too soon (run time is less than expected). Underrun jobs are detected when a job exits abnormally.

  • Job overrun - job runs too long (run time is longer than expected). By default, LSF checks for overrun jobs every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for job overrun.

  • Idle job - running job consumes less CPU time than expected (in terms of CPU time/runtime). By default, LSF checks for idle jobs every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for idle jobs.

Configure job exception handling (lsb.queues)

You can configure your queues to detect job exceptions. Use the following parameters:

JOB_IDLE

Specify a threshold for idle jobs. The value should be a number between 0.0 and 1.0 representing CPU time/runtime. If the job idle factor is less than the specified threshold, LSF invokes eadmin to trigger the action for a job idle exception.

JOB_OVERRUN

Specify a threshold for job overrun, in minutes. If a job runs longer than the specified run time, LSF invokes eadmin to trigger the action for a job overrun exception.

JOB_UNDERRUN

Specify a threshold for job underrun. If a job exits before the specified number of minutes, LSF invokes eadmin to trigger the action for a job underrun exception.

Example

The following queue defines thresholds for all types of job exceptions:

Begin Queue 
... 
JOB_UNDERRUN = 2 
JOB_OVERRUN  = 5 
JOB_IDLE     = 0.10 
... 
End Queue

For this queue:

  • A job underrun exception is triggered for jobs that run for less than 2 minutes.

  • A job overrun exception is triggered for jobs that run for longer than 5 minutes.

  • A job idle exception is triggered for jobs with an idle factor (CPU time/runtime) of less than 0.10.
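
The threshold comparisons described above can be sketched in Python. This is only an illustration of the arithmetic, not how LSF implements detection: the names QueueThresholds, idle_factor, and detect_exceptions are hypothetical, run time and CPU time are assumed to be given in seconds, and the sketch assumes that underrun is evaluated only for jobs that have exited abnormally while overrun and idle are evaluated only for running jobs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueueThresholds:
    """Hypothetical container mirroring the lsb.queues parameters."""
    job_underrun_min: Optional[float] = None  # JOB_UNDERRUN, in minutes
    job_overrun_min: Optional[float] = None   # JOB_OVERRUN, in minutes
    job_idle: Optional[float] = None          # JOB_IDLE, between 0.0 and 1.0


def idle_factor(cpu_time_s: float, run_time_s: float) -> float:
    """Idle factor as defined in the text: CPU time divided by run time."""
    return cpu_time_s / run_time_s


def detect_exceptions(t: QueueThresholds, run_time_s: float, cpu_time_s: float,
                      finished: bool, exited_abnormally: bool = False) -> List[str]:
    """Return the exception names that the configured thresholds would flag."""
    found: List[str] = []
    run_time_min = run_time_s / 60.0

    if finished:
        # Assumption: underrun applies only to jobs that exited abnormally.
        if (t.job_underrun_min is not None and exited_abnormally
                and run_time_min < t.job_underrun_min):
            found.append("underrun")
        return found

    # Running job: compare against the overrun and idle thresholds.
    if t.job_overrun_min is not None and run_time_min > t.job_overrun_min:
        found.append("overrun")
    if t.job_idle is not None and run_time_s > 0:
        if idle_factor(cpu_time_s, run_time_s) < t.job_idle:
            found.append("idle")
    return found


# Thresholds taken from the example queue above.
example_queue = QueueThresholds(job_underrun_min=2, job_overrun_min=5, job_idle=0.10)
```

With these thresholds, a job that has been running for 10 minutes and has used 30 seconds of CPU time has an idle factor of 30/600 = 0.05, so the sketch reports both an overrun and an idle exception. A queue with no thresholds set reports nothing, which matches the default behavior.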

Configure thresholds for job exception handling

By default, LSF checks for job exceptions every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for overrun, underrun, and idle jobs.
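
As an illustration, the following lsb.params fragment sets the interval to 5 minutes. The value 5 is an arbitrary example rather than a recommendation, and the fragment assumes that the parameter belongs in the Parameters section of lsb.params:

Begin Parameters
...
EADMIN_TRIGGER_DURATION = 5
...
End Parameters

Changes to lsb.params typically take effect only after the batch system is reconfigured, for example with badmin reconfig.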

Tuning

Tip:

Tune EADMIN_TRIGGER_DURATION carefully. Shorter values may raise false alarms; longer values may not trigger exceptions frequently enough.