Configuring LSF with SGI Cpusets

Automatic configuration at installation and upgrade

During installation and upgrade, lsfinstall adds the schmod_cpuset external scheduler plugin module name to the PluginModule section of lsb.modules:

Begin PluginModule
SCH_PLUGIN           RB_PLUGIN       SCH_DISABLE_PHASES 
schmod_default          ()                      ()
schmod_cpuset           ()                      ()
End PluginModule

The schmod_cpuset plugin name must be configured after the standard LSF plugin names in the PluginModule list. For upgrade, lsfinstall comments out the schmod_topology external scheduler plugin name in the PluginModule section of lsb.modules.

During installation and upgrade, lsfinstall sets the following parameters in lsf.conf:

  • LSF_ENABLE_EXTSCHEDULER=Y: LSF uses an external scheduler for cpuset allocation.

  • LSB_CPUSET_BESTCPUS=Y: LSF schedules jobs based on the shortest CPU radius in the processor topology using a best-fit algorithm for cpuset allocation.

  • LSB_SHORT_HOSTLIST=1: Displays an abbreviated list of hosts in bjobs and bhist for a parallel job where multiple processes of a job are running on a host. Multiple processes are displayed in the following format:

    processes*hostA
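
Taken together, the installer-set lines in lsf.conf look like the following fragment:

```
LSF_ENABLE_EXTSCHEDULER=Y
LSB_CPUSET_BESTCPUS=Y
LSB_SHORT_HOSTLIST=1
```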

For upgrade, lsfinstall comments out the following obsolete parameters in lsf.conf, and sets the corresponding RLA configuration:

  • LSF_TOPD_PORT=port_number, replaced by LSB_RLA_PORT=port_number, using the same value as LSF_TOPD_PORT. The port_number is the TCP port used for communication between the LSF topology adapter (RLA) and sbatchd. The default port number is 6883.

  • LSF_TOPD_WORKDIR=directory parameter, replaced by LSB_RLA_WORKDIR=directory parameter, using the same value as LSF_TOPD_WORKDIR. The directory is the location of the status files for RLA, which allows RLA to recover its original state when it restarts. When RLA first starts, it creates the directory defined by LSB_RLA_WORKDIR if it does not exist, then creates subdirectories for each host.
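
For example, an upgraded lsf.conf might contain the following migration. The port number shown is the default; the work directory path is illustrative:

```
# Obsolete parameters, commented out by lsfinstall:
# LSF_TOPD_PORT=6883
# LSF_TOPD_WORKDIR=/usr/share/lsf/rla_workdir

# Replacement RLA parameters, set to the same values:
LSB_RLA_PORT=6883
LSB_RLA_WORKDIR=/usr/share/lsf/rla_workdir
```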

During installation and upgrade, lsfinstall defines the cpuset Boolean resource in lsf.shared:

Begin Resource
RESOURCENAME   TYPE     INTERVAL  INCREASING   DESCRIPTION
...
cpuset         Boolean  ()        ()           (cpuset host)
...
End Resource

You should add the cpuset resource name under the RESOURCES column of the Host section of lsf.cluster.cluster_name. Hosts without the cpuset resource specified are not considered for scheduling cpuset jobs. For each cpuset host, hostsetup adds the cpuset Boolean resource to the Host section of lsf.cluster.cluster_name.
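
For example, a Host section of lsf.cluster.cluster_name with the cpuset resource assigned to one host might look like the following sketch (host names and the other column values are illustrative):

```
Begin Host
HOSTNAME   model   type   server  r1m  mem  swp  RESOURCES
hosta      !       !      1       -    -    -    (cpuset)
hostb      !       !      1       -    -    -    ()
End Host
```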

Optional configuration

When configuring lsb.queues:

  • MANDATORY_EXTSCHED=CPUSET[cpuset_options] sets required cpuset properties for the queue. MANDATORY_EXTSCHED options override -extsched options used at job submission.

  • DEFAULT_EXTSCHED=CPUSET[cpuset_options] sets default cpuset properties for the queue if the -extsched option is not used at job submission. -extsched options override the options set in DEFAULT_EXTSCHED.

  • In some pre-defined LSF queues, such as normal, the default MEMLIMIT is set to 5000 (5 MB). However, if ULDB is enabled (LSF_ULDB_DOMAIN is defined), the MEMLIMIT should be set greater than 8000.
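
For example, a queue definition in lsb.queues that raises MEMLIMIT for a ULDB-enabled cluster might look like the following sketch (the value 9000 is illustrative; any value greater than 8000 satisfies the requirement):

```
Begin Queue
QUEUE_NAME = normal
MEMLIMIT   = 9000
...
End Queue
```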

When configuring lsf.conf:

  • LSB_RLA_UPDATE=seconds specifies how often the LSF scheduler refreshes cpuset information from RLA. The default is 600 seconds.

  • LSB_RLA_WORKDIR=directory specifies the directory where the status files for RLA are located. This allows RLA to recover its original state when it restarts. When RLA first starts, it creates the directory defined by LSB_RLA_WORKDIR if it does not exist, then creates subdirectories for each host.

    Avoid using /tmp or any other directory that is automatically cleaned up by the system. Unless your installation has restrictions on the LSB_SHAREDIR directory, you should use the default:

    LSB_SHAREDIR/cluster_name/rla_workdir

    Do not use a CXFS file system for LSB_RLA_WORKDIR.

  • LSF_PIM_SLEEPTIME_UPDATE=Y: This parameter reduces communication traffic between sbatchd and PIM on the same host. When this parameter is defined:

    • sbatchd does not query PIM immediately when it needs information; it queries PIM only every LSF_PIM_SLEEPTIME seconds.

    • sbatchd may be intermittently unable to retrieve process information for jobs whose run time is smaller than LSF_PIM_SLEEPTIME.

    • It may take longer to view resource usage with bjobs -l.
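
For example, the following lsf.conf fragment enables the reduced-traffic behavior (the LSF_PIM_SLEEPTIME value shown is illustrative):

```
LSF_PIM_SLEEPTIME_UPDATE=Y
LSF_PIM_SLEEPTIME=15
```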

By default, Linux sets the maximum file descriptor limit to 1024. This value is too small for jobs using more than 200 processes. To avoid MPI job failure, specify a larger file descriptor limit. For example:

# /etc/init.d/lsf stop
# ulimit -n 16384
# /etc/init.d/lsf start

Any host with more than 200 CPUs should start the LSF daemons with the larger file descriptor limit.

Resources for dynamic and static cpusets

If your environment uses both static and dynamic cpusets, or you have more than one static cpuset configured, you must configure decreasing numeric resources to represent the cpuset count, and use -R "rusage" in job submission. This allows preemption, and also lets you control the number of jobs running on static and dynamic cpusets or on each static cpuset.

To configure cpuset resources:

  1. Edit lsf.shared and configure numeric resources for static and non-static cpusets. For example:

    Begin Resource
    RESOURCENAME  TYPE  INTERVAL  INCREASING  DESCRIPTION  #  Keywords
    ...
    dcpus         Numeric ()       N
    scpus         Numeric ()       N
    End Resource

    Where:

    • dcpus is the number of CPUs outside static cpusets (that is, the total number of CPUs minus the number of CPUs in static cpusets).

    • scpus is the number of CPUs in static cpusets. For static cpusets, configure a separate resource for each static cpuset. You should use the cpuset name as the resource name.

    The resource names dcpus and scpus are examples; you can use any names.

  2. Edit lsf.cluster.cluster_name to map the resources to hosts. For example:
    Begin ResourceMap
    RESOURCENAME LOCATION
    dcpus (4@[hostA]) # total CPUs - CPUs in static cpusets
    scpus (8@[hostA]) # static cpusets
    End ResourceMap

    For dynamic cpuset resources, the value of the resource should be the number of free CPUs on the host; that is, the number of CPUs outside of any static cpusets on the host.

    For static cpuset resources, the value of the resource should be the number of CPUs in the static cpuset.

  3. Edit lsb.params and configure your cpuset resources as preemptable. For example:

    Begin Parameters
    ...
    PREEMPTABLE_RESOURCES = scpus dcpus
    End Parameters
  4. Edit lsb.hosts and set MXJ greater than or equal to the total number of CPUs in static and dynamic cpusets for which you have configured resources.
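
For example, with the 4 dynamic and 8 static cpuset CPUs configured above, hostA needs an MXJ of at least 12 in lsb.hosts (the other column values shown are illustrative):

```
Begin Host
HOST_NAME   MXJ   r1m   pg   ls   tmp   DISPATCH_WINDOW
hostA       12    ()    ()   ()   ()    ()
End Host
```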

Use the following commands to verify your configuration:

bhosts -s
RESOURCE   TOTAL   RESERVED   LOCATION
dcpus      4.0      0.0       hostA
scpus      8.0      0.0       hostA
 
lshosts -s
RESOURCE   VALUE   LOCATION
dcpus      4       hostA
scpus      8       hostA
 
bhosts
HOST_NAME  STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
hostA      ok       -     -     1     1    0       0     0

When submitting jobs, include -R "rusage" in the submission. This allows preemption, and also lets you control the number of jobs running on static and dynamic cpusets or on each static cpuset.
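
For example, the following submissions reserve the cpuset resources configured above (the job command, job sizes, and the CPUSET_NAME value are illustrative):

```
# Job for a dynamic cpuset: reserve the dcpus resource
bsub -n 4 -R "rusage[dcpus=4]" -ext "CPUSET[CPUSET_TYPE=dynamic]" myjob

# Job for a static cpuset: reserve the scpus resource
bsub -n 8 -R "rusage[scpus=8]" -ext "CPUSET[CPUSET_TYPE=static;CPUSET_NAME=myset]" myjob
```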

Configuring default cpuset options

Use the DEFAULT_EXTSCHED queue parameter in lsb.queues to configure default cpuset options. Use the keyword CPUSET[] to identify the external scheduler parameters.

DEFAULT_EXTSCHED=[SGI_]CPUSET[cpuset_options] specifies default cpuset external scheduling options for the queue. -extsched options on the bsub command are merged with DEFAULT_EXTSCHED options, and -extsched options override any conflicting queue-level options set by DEFAULT_EXTSCHED.

For example, if the queue specifies:

DEFAULT_EXTSCHED=CPUSET[CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE]

and a job is submitted with:

-extsched "CPUSET[CPUSET_TYPE=dynamic;CPU_LIST=1,5,7-12;CPUSET_OPTIONS=CPUSET_MEMORY_LOCAL]"

LSF uses the resulting external scheduler options for scheduling:

CPUSET[CPUSET_TYPE=dynamic;CPU_LIST=1,5,7-12;CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE CPUSET_MEMORY_LOCAL]

DEFAULT_EXTSCHED can be used in combination with MANDATORY_EXTSCHED in the same queue. For example, if the job specifies:

-extsched "CPUSET[CPU_LIST=1,5,7-12;MAX_CPU_PER_NODE=4]"

and the queue specifies:

Begin Queue
...
DEFAULT_EXTSCHED=CPUSET[CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE]
MANDATORY_EXTSCHED=CPUSET[CPUSET_TYPE=dynamic;MAX_CPU_PER_NODE=2]
...
End Queue

LSF uses the resulting external scheduler options for scheduling:

CPUSET[CPUSET_TYPE=dynamic;MAX_CPU_PER_NODE=2;CPU_LIST=1,5,7-12;CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE]

If cpuset options are set in DEFAULT_EXTSCHED, and you do not want to specify values for these options, use the keyword with no value in the -extsched option of bsub. For example, if DEFAULT_EXTSCHED=CPUSET[MAX_RADIUS=2], and you do not want to specify any radius option at all, use -extsched "CPUSET[MAX_RADIUS=]".

Configuring mandatory cpuset options

Use the MANDATORY_EXTSCHED queue parameter in lsb.queues to configure mandatory cpuset options. Use the keyword CPUSET[] to identify the external scheduler parameters.

-extsched options on the bsub command are merged with MANDATORY_EXTSCHED options, and MANDATORY_EXTSCHED options override any conflicting job-level options set by -extsched.

For example, if the queue specifies:

MANDATORY_EXTSCHED=CPUSET[CPUSET_TYPE=dynamic;MAX_CPU_PER_NODE=2]

and a job is submitted with:

-extsched "CPUSET[MAX_CPU_PER_NODE=4;CPU_LIST=1,5,7-12]"

LSF uses the resulting external scheduler options for scheduling:

CPUSET[CPUSET_TYPE=dynamic;MAX_CPU_PER_NODE=2;CPU_LIST=1,5,7-12]

MANDATORY_EXTSCHED can be used in combination with DEFAULT_EXTSCHED in the same queue. For example, if the job specifies:

-extsched "CPUSET[CPU_LIST=1,5,7-12;MAX_CPU_PER_NODE=4]"

and the queue specifies:

Begin Queue
...
DEFAULT_EXTSCHED=CPUSET[CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE]
MANDATORY_EXTSCHED=CPUSET[CPUSET_TYPE=dynamic;MAX_CPU_PER_NODE=2]
...
End Queue

LSF uses the resulting external scheduler options for scheduling:

CPUSET[CPUSET_TYPE=dynamic;MAX_CPU_PER_NODE=2;CPU_LIST=1,5,7-12;CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE]

If you want to prevent users from setting certain cpuset options in the -extsched option of bsub, use the keyword with no value. For example, if the job is submitted with -extsched "CPUSET[MAX_RADIUS=2]", use MANDATORY_EXTSCHED=CPUSET[MAX_RADIUS=] to override this setting.

Priority of topology scheduling options

The options set by -extsched can be combined with the queue-level MANDATORY_EXTSCHED or DEFAULT_EXTSCHED parameters. If -extsched and MANDATORY_EXTSCHED set the same option, the MANDATORY_EXTSCHED setting is used. If -extsched and DEFAULT_EXTSCHED set the same options, the -extsched setting is used.

Topology scheduling options are applied in the following priority order, from highest to lowest:

  1. Queue-level MANDATORY_EXTSCHED options, which override

  2. Job-level -extsched options, which override

  3. Queue-level DEFAULT_EXTSCHED options

For example, if the queue specifies:

DEFAULT_EXTSCHED=CPUSET[MAX_CPU_PER_NODE=2]

and the job is submitted with:

bsub -n 4 -ext "CPUSET[MAX_CPU_PER_NODE=1]" myjob

The cpuset option in the job submission overrides DEFAULT_EXTSCHED, so the job runs in a cpuset allocated with a maximum of one CPU per node, honoring the job-level MAX_CPU_PER_NODE option.

If the queue specifies:

MANDATORY_EXTSCHED=CPUSET[MAX_CPU_PER_NODE=2]

and the job is submitted with:

bsub -n 4 -ext "CPUSET[MAX_CPU_PER_NODE=1]" myjob

The job will run in a cpuset allocated with a maximum of two CPUs per node, honoring the MAX_CPU_PER_NODE option in the queue.
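
The precedence rules above can be sketched as a simple merge, with the highest-priority source applied last. This is an illustrative model, not LSF source code; it assumes the CPUSET option strings are represented as key/value pairs:

```python
def merge_extsched(default_opts, job_opts, mandatory_opts):
    """Combine CPUSET options with LSF precedence:
    MANDATORY_EXTSCHED > job-level -extsched > DEFAULT_EXTSCHED."""
    merged = dict(default_opts)    # lowest priority: DEFAULT_EXTSCHED
    merged.update(job_opts)        # job-level -extsched overrides defaults
    merged.update(mandatory_opts)  # MANDATORY_EXTSCHED wins any conflict
    return merged

# DEFAULT_EXTSCHED=CPUSET[MAX_CPU_PER_NODE=2], job uses MAX_CPU_PER_NODE=1:
print(merge_extsched({"MAX_CPU_PER_NODE": "2"},
                     {"MAX_CPU_PER_NODE": "1"}, {}))
# -> {'MAX_CPU_PER_NODE': '1'}  (job-level option wins over the default)

# MANDATORY_EXTSCHED=CPUSET[MAX_CPU_PER_NODE=2], same job:
print(merge_extsched({}, {"MAX_CPU_PER_NODE": "1"},
                     {"MAX_CPU_PER_NODE": "2"}))
# -> {'MAX_CPU_PER_NODE': '2'}  (mandatory queue option wins)
```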