MAX_SBD_FAIL

Syntax

MAX_SBD_FAIL=integer

Description

The maximum number of times mbatchd retries contacting a non-responding slave batch daemon (sbatchd).

The minimum interval between retries is MBD_SLEEP_TIME/10. If mbatchd still cannot reach a host after retrying MAX_SBD_FAIL times, it considers the host unavailable or unreachable and reports the host status accordingly.
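As a rough worked example, the retry count and the minimum retry interval bound how long mbatchd keeps retrying before it gives up on a host. The sketch below is a hypothetical helper, not part of LSF. It assumes MBD_SLEEP_TIME is in seconds and that every retry waits exactly the minimum interval. Real retries may be spaced further apart, so the result is only a lower bound. The value MBD_SLEEP_TIME=10 is used purely as an illustrative input.

```python
def min_unreachable_delay(max_sbd_fail: int = 3, mbd_sleep_time: float = 10.0) -> float:
    """Hypothetical helper: lower bound, in seconds, on how long mbatchd
    retries an unresponsive sbatchd before reporting the host unavailable.

    Assumption: each retry waits exactly the minimum interval,
    MBD_SLEEP_TIME / 10; actual spacing can be longer.
    """
    if max_sbd_fail < 0:
        raise ValueError("MAX_SBD_FAIL must be non-negative")
    min_interval = mbd_sleep_time / 10  # minimum seconds between retries
    return max_sbd_fail * min_interval  # total minimum retry time


# Default MAX_SBD_FAIL=3 with an example MBD_SLEEP_TIME of 10 seconds
print(min_unreachable_delay())  # 3 * (10 / 10) = 3.0
```

Raising MAX_SBD_FAIL makes mbatchd more tolerant of brief network or daemon hiccups, at the cost of detecting a genuinely failed host more slowly.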

When a host becomes unavailable, mbatchd assumes that all jobs running on that host have exited, and schedules all rerunnable jobs (jobs submitted with the bsub -r option) to be rerun on another host.
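The job handling described above can be modeled in a few lines. This is a hypothetical illustration, not LSF code. The Job class and handle_unavailable_host function are invented names, and the rerunnable flag stands in for a job submitted with bsub -r. Every job on the lost host is treated as exited, and only the rerunnable ones are also selected to run again elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    rerunnable: bool  # hypothetical flag: True if submitted with bsub -r


def handle_unavailable_host(jobs):
    """Hypothetical model of mbatchd's view once a host is unavailable:
    all jobs are assumed to have exited, and rerunnable jobs are
    additionally picked to be rerun on another host."""
    exited = [job.job_id for job in jobs]
    to_rerun = [job.job_id for job in jobs if job.rerunnable]
    return exited, to_rerun


jobs = [Job(101, rerunnable=False), Job(102, rerunnable=True)]
print(handle_unavailable_host(jobs))  # ([101, 102], [102])
```

Job 101 is simply lost with the host, while job 102 survives the failure because it was submitted as rerunnable.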

Default

3