You can configure hosts so that LSF detects exceptional conditions while jobs are running, and take appropriate action automatically. You can customize what exceptions are detected, and the corresponding actions. By default, LSF does not detect any exceptions.
If you configure host exception handling, LSF can detect jobs that exit repeatedly on a host. The host can still be available to accept jobs, but some other problem prevents the jobs from running. Typically jobs that are dispatched to such “black hole”, or “job-eating” hosts exit abnormally. LSF monitors the job exit rate for hosts, and closes the host if the rate exceeds a threshold you configure (EXIT_RATE in lsb.hosts).
If EXIT_RATE is specified for the host, LSF invokes eadmin if the job exit rate for a host remains above the configured threshold for longer than 5 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change how frequently LSF checks the job exit rate.
Use GLOBAL_EXIT_RATE in lsb.params to set a cluster-wide threshold in minutes for exited jobs. If EXIT_RATE is not specified for the host in lsb.hosts, GLOBAL_EXIT_RATE defines a default exit rate for all hosts in the cluster. Host-level EXIT_RATE overrides the GLOBAL_EXIT_RATE value.
Specify a threshold for exited jobs. If the job exit rate is exceeded for 5 minutes or the period specified by JOB_EXIT_RATE_DURATION in lsb.params, LSF invokes eadmin to trigger a host exception.
The following Host section defines a job exit rate of 20 jobs for all hosts, and an exit rate of 10 jobs on hostA.
Begin Host
HOST_NAME MXJ EXIT_RATE # Keywords
Default ! 20
hostA ! 10
End Host
By default, LSF checks the number of exited jobs every 5 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change this default.
Tune JOB_EXIT_RATE_DURATION carefully. Shorter values may raise false alarms, longer values may not trigger exceptions frequently enough.