Handle job exceptions

You can configure hosts and queues so that LSF detects exceptional conditions while jobs are running, and take appropriate action automatically. You can customize what exceptions are detected and their corresponding actions. By default, LSF does not detect any exceptions.

Run bjobs -d -m host_name to see exited jobs for a particular host.

Job exceptions LSF can detect

If you configure job exception handling in your queues, LSF detects the following job exceptions:

Job underrun - jobs end too soon (run time is less than expected). Underrun jobs are detected when a job exits abnormally
Job overrun - job runs too long (run time is longer than expected). By default, LSF checks for overrun jobs every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for job overrun.
Job estimated run time exceeded— the job’s actual run time has exceeded the estimated run time.
Idle job - running job consumes less CPU time than expected (in terms of CPU time/runtime). By default, LSF checks for idle jobs every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for idle jobs.

Host exceptions LSF can detect

If you configure host exception handling, LSF can detect jobs that exit repeatedly on a host. The host can still be available to accept jobs, but some other problem prevents the jobs from running. Typically jobs dispatched to such “black hole”, or “job-eating” hosts exit abnormally. By default, LSF monitors the job exit rate for hosts, and closes the host if the rate exceeds a threshold you configure (EXIT_RATE in lsb.hosts).

If EXIT_RATE is not specified for the host, LSF invokes eadmin if the job exit rate for a host remains above the configured threshold for longer than 5 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change how frequently LSF checks the job exit rate.

Use GLOBAL_EXIT_RATE in lsb.params to set a cluster-wide threshold in minutes for exited jobs. If EXIT_RATE is not specified for the host in lsb.hosts, GLOBAL_EXIT_RATE defines a default exit rate for all hosts in the cluster. Host-level EXIT_RATE overrides the GLOBAL_EXIT_RATE value.

Customize job exception actions with the eadmin script

When an exception is detected, LSF takes appropriate action by running the script LSF_SERVERDIR/eadmin on the master host.

You can customize eadmin to suit the requirements of your site. For example, eadmin could find out the owner of the problem jobs and use bstop -u to stop all jobs that belong to the user.

In some environments, a job running 1 hour would be an overrun job, while this may be a normal job in other environments. If your configuration considers jobs running longer than 1 hour to be overrun jobs, you may want to close the queue when LSF detects a job that has run longer than 1 hour and invokes eadmin.