LSF_HPC_EXTENSIONS="extension_name ..."
Enables LSF HPC extensions.
After adding or changing LSF_HPC_EXTENSIONS, use badmin mbdrestart and badmin hrestart to reconfigure your cluster.
The following extension names are supported:
CUMULATIVE_RUSAGE: When a parallel job script runs multiple commands, resource usage is collected for jobs in the job script, rather than being overwritten when each command is executed.
DISP_RES_USAGE_LIMITS: bjobs displays resource usage limits configured in the queue as well as job-level limits.
HOST_RUSAGE: For parallel jobs, reports the correct rusage based on each host’s usage and the total rusage being charged to the execution host. This host rusage breakdown applies to the blaunch framework, the pam framework, and vendor MPI jobs.
For a running job, you will see run time, memory, swap, utime, stime, and pids and pgids on all hosts that a parallel job spans. For finished jobs, you will see memory, swap, utime, and stime on all hosts that a parallel job spans.
The host-based rusage is reported in the JOB_FINISH record of lsb.acct and lsb.stream, and the JOB_STATUS record of lsb.events if the job status is done or exit. Also for finished jobs, bjobs -l shows CPU time, bhist -l shows CPU time, and bacct -l shows utime, stime, memory, and swap.
In the MultiCluster lease model, the parallel job must run on hosts that are all in the same cluster. If you use the jobFinishLog API, all external tools must use jobFinishLog built with LSF 9.1 or later, or host-based rusage will not work.
If you add or remove this extension, you must restart mbatchd, sbatchd, and res on all hosts. The behaviour used to be controlled by HOST_RUSAGE prior to LSF 9.1.
NO_HOST_RUSAGE: Turn on this parameter if you do not want to see host-based job resource usage details. However, mbatchd will continue to report job rusage with bjobs for all running jobs even if you configure NO_HOST_RUSAGE and restart all the daemons.
LSB_HCLOSE_BY_RES: If res is down, the host is closed with the following message:
Host is closed because RES is not available.
The status of the closed host is closed_Adm. No new jobs are dispatched to this host, but currently running jobs are not suspended.
RESERVE_BY_STARTTIME: LSF selects the reservation that gives the job the earliest predicted start time.
By default, if multiple host groups are available for reservation, LSF chooses the largest possible reservation based on number of slots.
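The difference between the default policy and RESERVE_BY_STARTTIME can be sketched as a choice between two selection keys. This is only an illustration of the rule stated above, not LSF's scheduler code; the Reservation type, its fields, and the group names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    host_group: str         # hypothetical label for a candidate host group
    slots: int              # number of slots the reservation would hold
    predicted_start: float  # predicted job start time, in epoch seconds


def choose_reservation(candidates, reserve_by_starttime=False):
    """Pick one candidate according to the policy described above."""
    if reserve_by_starttime:
        # RESERVE_BY_STARTTIME: earliest predicted start time wins.
        return min(candidates, key=lambda r: r.predicted_start)
    # Default: largest reservation by slot count wins.
    return max(candidates, key=lambda r: r.slots)


candidates = [
    Reservation("groupA", slots=64, predicted_start=1_700_000_900.0),
    Reservation("groupB", slots=32, predicted_start=1_700_000_300.0),
]
print(choose_reservation(candidates).host_group)
print(choose_reservation(candidates, reserve_by_starttime=True).host_group)
```

With these made-up candidates, the default picks the larger groupA reservation, while the start-time policy picks groupB because it is predicted to start sooner.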
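SHORT_EVENTFILE: Compresses long host name lists when event records are written to lsb.events and lsb.acct for large parallel jobs. The short host string has the format:
number_of_hosts*real_host_name
When SHORT_EVENTFILE is enabled, older daemons and commands (pre-LSF Version 7) cannot recognize the lsb.acct and lsb.events file format.
For example, if the original host list record is:
6 "hostA" "hostA" "hostA" "hostA" "hostB" "hostC"
redundant host names are removed and the short host list record becomes:
3 "4*hostA" "hostB" "hostC"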
When LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is set, and LSF reads the host list from lsb.events or lsb.acct, the compressed host list is expanded into a normal host list.
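SHORT_EVENTFILE affects the following events and fields:
JOB_START in lsb.events when a normal job is dispatched:
numExHosts (%d)
execHosts (%s)
JOB_CHUNK in lsb.events when a job is inserted into a job chunk:
numExHosts (%d)
execHosts (%s)
JOB_FORWARD in lsb.events when a job is forwarded to a MultiCluster leased host:
numReserHosts (%d)
reserHosts (%s)
JOB_FINISH in lsb.acct:
numExHosts (%d)
execHosts (%s)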
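The number_of_hosts*real_host_name format used by SHORT_EVENTFILE can be illustrated with a small round-trip sketch. It assumes that only consecutive repeats of a host name are merged, which matches the example in this section but is not stated outright; the function names are hypothetical and this is not LSF's own implementation.

```python
from itertools import groupby


def compress_hosts(hosts):
    """Collapse runs of identical names into "count*name" entries."""
    short = []
    for name, run in groupby(hosts):
        count = len(list(run))
        # A single occurrence is kept as the bare host name.
        short.append(f"{count}*{name}" if count > 1 else name)
    return short


def expand_hosts(short):
    """Expand "count*name" entries back into a normal host list."""
    hosts = []
    for entry in short:
        count, sep, name = entry.partition("*")
        if sep and count.isdigit():
            hosts.extend([name] * int(count))
        else:
            hosts.append(entry)
    return hosts


full = ["hostA", "hostA", "hostA", "hostA", "hostB", "hostC"]
short = compress_hosts(full)
print(len(short), short)
print(expand_hosts(short) == full)
```

The compressed list has three entries, matching the short host list record in the example, and expanding it restores the original six.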
SHORT_PIDLIST: Shortens the output from bjobs to omit all but the first process ID (PID) for a job. bjobs displays only the first ID and a count of the process group IDs (PGIDs) and process IDs for the job.
Without SHORT_PIDLIST, bjobs -l displays all the PGIDs and PIDs for the job. With SHORT_PIDLIST set, bjobs -l displays a count of the PGIDs and PIDs.
TASK_MEMLIMIT: Enables enforcement of a memory limit (bsub -M, bmod -M, or MEMLIMIT in lsb.queues) for individual tasks in a parallel job. If any parallel task exceeds the memory limit, LSF terminates the entire job.
TASK_SWAPLIMIT: Enables enforcement of a virtual memory (swap) limit (bsub -v, bmod -v, or SWAPLIMIT in lsb.queues) for individual tasks in a parallel job. If any parallel task exceeds the swap limit, LSF terminates the entire job.
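The per-task rule shared by TASK_MEMLIMIT and TASK_SWAPLIMIT can be sketched as follows: one task over the limit is enough to terminate the whole job. The function names, the task labels, and the choice of KB as the unit are assumptions made for illustration; this is not how LSF implements enforcement.

```python
def tasks_over_limit(task_usage, limit):
    """Return the tasks whose usage exceeds the limit (same units for both)."""
    return [task for task, used in task_usage.items() if used > limit]


def job_should_terminate(task_usage, limit):
    """With per-task enforcement, any single offending task ends the job."""
    return bool(tasks_over_limit(task_usage, limit))


# Hypothetical per-task memory usage in KB for a three-task parallel job.
usage_kb = {"hostA:task0": 4200, "hostA:task1": 5100, "hostB:task0": 3900}
print(tasks_over_limit(usage_kb, 5000))
print(job_should_terminate(usage_kb, 5000))
```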
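For a job submitted with the following command:
bsub -n 64 -R "span[ptile=32]" sleep 100
Without SHORT_EVENTFILE, a JOB_START event like the following is logged in lsb.events:
"JOB_START" "9.1.2" 1058989891 710 4 0 0 10.3 64 "hostA" "hostA" "hostA" "hostA" "hostA"
"hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA"
"hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA"
"hostA" "hostA" "hostA" "hostA" "hostA" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB"
"hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB"
"hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB"
"hostB" "hostB" "hostB" "hostB" "" "" 0 "" 0
With SHORT_EVENTFILE, the JOB_START event is logged in lsb.events with the number of execution hosts (numExHosts field) changed from 64 to 2 and the execution host list (execHosts field) shortened to "32*hostA" and "32*hostB":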
"JOB_START" "9.1.2" 1058998174 812 4 0 0 10.3 2 "32*hostA" "32*hostB" "" "" 0 "" 0 ""
For a job submitted with the following command:
bsub -n 64 -R "span[ptile=32]" sleep 100
Without SHORT_EVENTFILE, a JOB_FINISH record like the following is logged in lsb.acct:
"JOB_FINISH" "9.1.2" 1058990001 710 33054 33816578 64 1058989880 0 0 1058989891 "user1"
"normal" "span[ptile=32]" "" "" "hostA" "/scratch/user1/work" "" "" "" "1058989880.710"
0 64 "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA"
"hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA"
"hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA" "hostA"
"hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB"
"hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB"
"hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" "hostB" 64 10.3
"" "sleep 100" 0.079999 0.270000 0 0 -1 0 0 0 0 0 0 0 -1 0 0 0 0 0 -1 "" "default" 0 64
"" "" 0 4304 6024 "" "" "" 0
With SHORT_EVENTFILE, the JOB_FINISH record is logged in lsb.acct with the number of execution hosts (numExHosts field) changed from 64 to 2 and the execution host list (execHosts field) shortened to "32*hostA" and "32*hostB":
"JOB_FINISH" "9.1.2" 1058998282 812 33054 33816578 64 1058998163 0 0 1058998174 "user1"
"normal" "span[ptile=32]" "" "" "hostA" "/scratch/user1/work" "" "" "" "1058998163.812"
0 2 "32*hostA" "32*hostB" 64 10.3 "" "sleep 100" 0.039999 0.259999 0 0 -1 0 0 0 0 0 0 0
-1 0 0 0 0 0 -1 "" "default" 0 64 "" "" 0 4304 6024 "" "" "" "" 0 0
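A reader of lsb.events has to cope with both host list forms. The sketch below pulls numExHosts and execHosts out of a JOB_START record and expands any count*host entries. The field positions (numExHosts as the ninth token, execHosts right after it) are read off the example records in this section rather than taken from a format specification, and the function name is hypothetical. A record wrapped across several lines must be joined into one string first.

```python
import shlex


def exec_hosts_from_job_start(record):
    """Return (numExHosts, expanded execHosts) for one JOB_START record."""
    fields = shlex.split(record)  # handles the quoted fields, including ""
    if fields[0] != "JOB_START":
        raise ValueError("not a JOB_START record")
    num_ex_hosts = int(fields[8])  # position assumed from the examples above
    hosts = []
    for entry in fields[9:9 + num_ex_hosts]:
        count, sep, name = entry.partition("*")
        if sep and count.isdigit():
            hosts.extend([name] * int(count))  # compressed "32*hostA" form
        else:
            hosts.append(entry)                # plain host name
    return num_ex_hosts, hosts


record = ('"JOB_START" "9.1.2" 1058998174 812 4 0 0 10.3 2 '
          '"32*hostA" "32*hostB" "" "" 0 "" 0 ""')
num, hosts = exec_hosts_from_job_start(record)
print(num, len(hosts), hosts[0], hosts[-1])
```

For the compressed record, numExHosts is 2 while the expanded list again holds 64 entries, 32 per host.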
Without SHORT_PIDLIST, bjobs -l displays all the PGIDs and PIDs for the job:
bjobs -l
Job <109>, User <user3>, Project <default>, Status <RUN>, Queue <normal>, Inte
ractive mode, Command <./myjob.sh>
Mon Jul 21 20:54:44 2009: Submitted from host <hostA>, CWD <$HOME/LSF/jobs>;
RUNLIMIT
10.0 min of hostA
STACKLIMIT CORELIMIT MEMLIMIT
5256 K 10000 K 5000 K
Mon Jul 21 20:54:51 2009: Started on <hostA>;
Mon Jul 21 20:55:03 2009: Resource usage collected.
MEM: 2 Mbytes; SWAP: 15 Mbytes
PGID: 256871; PIDs: 256871
PGID: 257325; PIDs: 257325 257500 257482 257501 257523
257525 257531
SCHEDULING PARAMETERS:
            r15s   r1m  r15m    ut    pg    io    ls    it   tmp   swp   mem
loadSched      -     -     -     -     -     -     -     -     -     -     -
loadStop       -     -     -     -     -     -     -     -     -     -     -
            cpuspeed  bandwidth
loadSched          -          -
loadStop           -          -
<< Job <109> is done successfully. >>
With SHORT_PIDLIST, bjobs -l displays a count of the PGIDs and PIDs:
bjobs -l
Job <109>, User <user3>, Project <default>, Status <RUN>, Queue <normal>, Inte
ractive mode, Command <./myjob.sh>
Mon Jul 21 20:54:44 2009: Submitted from host <hostA>, CWD <$HOME/LSF/jobs>;
RUNLIMIT
10.0 min of hostA
STACKLIMIT CORELIMIT MEMLIMIT
5256 K 10000 K 5000 K
Mon Jul 21 20:54:51 2009: Started on <hostA>;
Mon Jul 21 20:55:03 2009: Resource usage collected.
MEM: 2 Mbytes; SWAP: 15 Mbytes
PGID(s): 256871:1 PID, 257325:7 PIDs
SCHEDULING PARAMETERS:
            r15s   r1m  r15m    ut    pg    io    ls    it   tmp   swp   mem
loadSched      -     -     -     -     -     -     -     -     -     -     -
loadStop       -     -     -     -     -     -     -     -     -     -     -
            cpuspeed  bandwidth
loadSched          -          -
loadStop           -          -
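The two bjobs -l outputs differ only in how the process lists are summarized. The sketch below reproduces the shortened line from the full PGID and PID lists in the first output. It imitates the format shown here and is not bjobs code; the function name is hypothetical.

```python
def short_pidlist(groups):
    """Summarize {pgid: [pids]} as one line of per-group PID counts."""
    parts = []
    for pgid, pids in groups.items():
        noun = "PID" if len(pids) == 1 else "PIDs"
        parts.append(f"{pgid}:{len(pids)} {noun}")
    return "PGID(s): " + ", ".join(parts)


# PGIDs and PIDs copied from the output without SHORT_PIDLIST.
groups = {
    256871: [256871],
    257325: [257325, 257500, 257482, 257501, 257523, 257525, 257531],
}
print(short_pidlist(groups))  # PGID(s): 256871:1 PID, 257325:7 PIDs
```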
By default, set to "CUMULATIVE_RUSAGE HOST_RUSAGE LSB_HCLOSE_BY_RES SHORT_EVENTFILE" at time of installation for the PARALLEL configuration template. If otherwise undefined, the default is "HOST_RUSAGE".
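Because the parameter takes a space-separated list of names inside one quoted string, a typo in a name is easy to miss. A minimal checker, assuming the set of supported names is exactly the one listed in this section, might look like this; the function name is hypothetical and the check is not part of LSF.

```python
# Extension names as listed in this section.
SUPPORTED = {
    "CUMULATIVE_RUSAGE", "DISP_RES_USAGE_LIMITS", "HOST_RUSAGE",
    "NO_HOST_RUSAGE", "LSB_HCLOSE_BY_RES", "RESERVE_BY_STARTTIME",
    "SHORT_EVENTFILE", "SHORT_PIDLIST", "TASK_MEMLIMIT", "TASK_SWAPLIMIT",
}


def parse_hpc_extensions(line):
    """Split an LSF_HPC_EXTENSIONS line into (names, unknown_names)."""
    key, sep, value = line.strip().partition("=")
    if not sep or key.strip() != "LSF_HPC_EXTENSIONS":
        raise ValueError("not an LSF_HPC_EXTENSIONS line")
    names = value.strip().strip('"').split()
    unknown = [name for name in names if name not in SUPPORTED]
    return names, unknown


line = ('LSF_HPC_EXTENSIONS="CUMULATIVE_RUSAGE HOST_RUSAGE '
        'LSB_HCLOSE_BY_RES SHORT_EVENTFILE"')
names, unknown = parse_hpc_extensions(line)
print(names)
print(unknown)
```

The sample line is the PARALLEL template default, so the list of unknown names comes back empty.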