External load indices behavior

How LSF manages multiple elim executables

The LSF administrator can write one elim executable to collect multiple external load indices, or divide external load index collection among multiple elim executables. On each host, the load information manager (LIM) starts a master elim (MELIM), which manages all elim executables on the host and reports the external load index values to the LIM. Specifically, the MELIM:
  • Starts elim executables on the host. The LIM checks the ResourceMap section LOCATION settings (default, all, or host list) and directs the MELIM to start elim executables on the corresponding hosts.
    Note: If the ResourceMap section contains even one resource mapped as default, and if there are multiple elim executables in LSF_SERVERDIR, the MELIM starts all of the elim executables in LSF_SERVERDIR on all hosts in the cluster. Not all of the elim executables continue to run, however. Those that use a checking header could exit with ELIM_ABORT_VALUE if they are not programmed to report values for the resources listed in LSF_RESOURCES.

  • Restarts an elim if the elim exits. To prevent system-wide problems in case of a fatal error in the elim, the maximum restart frequency is once every 90 seconds. The MELIM does not restart any elim that exits with ELIM_ABORT_VALUE.

  • Collects the load information reported by the elim executables.

  • Checks the syntax of load update strings before sending the information to the LIM.

  • Merges the load reports from each elim and sends the merged load information to the LIM. If there is more than one value reported for a single resource, the MELIM reports the latest value.

  • Logs its activities and data into the log file LSF_LOGDIR/melim.log.host_name

  • Increases system reliability by buffering output from multiple elim executables; failure of one elim does not affect other elim executables running on the same host.
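
The checking and merging steps above can be sketched in shell. This is only an illustration of the described behavior, not the MELIM's actual implementation: the function names check_report and merge_reports are hypothetical, and the sketch assumes each load update string has the form number_indices index1_name index1_value... with one report per line, in arrival order.

```shell
#!/bin/sh
# Hypothetical sketch of two MELIM duties: syntax checking and merging.
# Assumed report format: number_indices name1 value1 [name2 value2 ...]

# Succeed if the leading count matches the number of name/value pairs.
check_report() {
    echo "$1" | awk '{ exit !(NF >= 1 && $1 ~ /^[0-9]+$/ && NF == 2 * $1 + 1) }'
}

# Read already-checked reports on stdin (one per line, oldest first) and
# print one merged report; a later value for a resource replaces an earlier one.
merge_reports() {
    awk '
    {
        for (i = 2; i < NF; i += 2) {
            name = $i
            if (!(name in val)) order[++n] = name   # remember first-seen order
            val[name] = $(i + 1)                    # latest value wins
        }
    }
    END {
        out = n + 0
        for (k = 1; k <= n; k++) out = out " " order[k] " " val[order[k]]
        print out
    }'
}
```

For example, merging the reports "2 cpu_temp 40 lic 3" and "1 lic 5" (hypothetical resource names) yields a single report in which lic has the later value, 5.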

How LSF determines which hosts should run an elim executable

LSF provides configuration options to ensure that your elim executables run only when they can report the resource values expected on a host. This maximizes system performance and simplifies the implementation of external load indices. To control which hosts run elim executables, you:
  • Must map external resource names to locations in lsf.cluster.cluster_name

  • Optionally, use the environment variables LSF_RESOURCES, LSF_MASTER, and ELIM_ABORT_VALUE in your elim executables
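
As a rough illustration of the required mapping, a ResourceMap section in lsf.cluster.cluster_name might look like the following fragment. The resource names (lic_shared, scratch_free, proj_space) are hypothetical placeholders, and hostA through hostF are example host names; substitute the external resources and hosts defined for your own cluster.

```text
Begin ResourceMap
RESOURCENAME   LOCATION
lic_shared     ([all])
scratch_free   [default]
proj_space     ([hostA hostB hostC] [hostD hostE hostF])
End ResourceMap
```

Each line maps one external resource to one of the three LOCATION forms: all, default, or a host list.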

How resource mapping determines elim hosts

The following table shows how the resource mapping defined in lsf.cluster.cluster_name determines the hosts on which your elim executables start.

Each entry lists a LOCATION setting, followed by the hosts on which the elim executables start.

  • ([all]) | ([all ~host_name])

    The master host, because all hosts in the cluster (except those identified by the not operator [~]) share a single instance of the external resource.

  • [default]

    Every host in the cluster, because the default setting identifies the external resource as host-based.

    If you use the default keyword for any external resource, all elim executables in LSF_SERVERDIR run on all hosts in the cluster. For information about how to program an elim to exit when it cannot collect information about resources on a host, see How environment variables determine elim hosts.

  • ([host_name]) | ([host_name] [host_name])

    The specified hosts.

    If you specify a set of hosts, the elim executables start on the first host in the list. For example, if the LOCATION in the ResourceMap section of lsf.cluster.cluster_name is ([hostA hostB hostC] [hostD hostE hostF]):
    • LSF starts the elim executables on hostA and hostD to report values for the resources shared by that set of hosts.

    • If the host reporting the external load index values becomes unavailable, LSF starts the elim executables on the next available host in the list. In this example, if hostA becomes unavailable, LSF starts the elim executables on hostB.

    • If hostA becomes available again, LSF starts the elim executables on hostA and shuts down the elim executables on hostB.
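
The host-list failover rule can be modeled with a small shell function. This is a sketch of the selection rule as described here, not LSF code: the function pick_elim_host and its arguments are hypothetical, and it assumes both the host set and the list of currently available hosts are passed as space-separated strings.

```shell
#!/bin/sh
# Hypothetical model of the host-list rule: within one host set, the elim
# executables run on the first host, in LOCATION order, that is available.
# $1 = hosts of one set in LOCATION order, $2 = currently available hosts
pick_elim_host() {
    for h in $1; do
        case " $2 " in
            *" $h "*) echo "$h"; return 0 ;;
        esac
    done
    return 1   # no host in the set is available
}
```

With the set "hostA hostB hostC", the function selects hostB while hostA is unavailable and selects hostA again once hostA is back in the available list, which mirrors the start and shutdown sequence in the example.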

How environment variables determine elim hosts

If you use the default keyword for any external resource in lsf.cluster.cluster_name, all elim executables in LSF_SERVERDIR run on all hosts in the cluster. You can control the hosts on which your elim executables run by using the environment variables LSF_MASTER, LSF_RESOURCES, and ELIM_ABORT_VALUE. These environment variables provide a way to ensure that elim executables run only when they are programmed to report the values for resources expected on a host.

  • LSF_MASTER—You can program your elim to check the value of the LSF_MASTER environment variable. The value is Y on the master host and N on all other hosts. An elim executable can use this parameter to check the host on which the elim is currently running.

  • LSF_RESOURCES—When the LIM starts an MELIM on a host, the LIM checks the resource mapping defined in the ResourceMap section of lsf.cluster.cluster_name. Based on the mapping location (default, all, or a host list), the LIM sets LSF_RESOURCES to the list of resources expected on the host.

    When the location of the resource is defined as default, the resource is listed in LSF_RESOURCES on the server hosts. When the location of the resource is defined as all, the resource is only listed in LSF_RESOURCES on the master host.

    Use LSF_RESOURCES in a checking header to verify that an elim is programmed to collect values for at least one of the resources listed in LSF_RESOURCES.

  • ELIM_ABORT_VALUE—An elim should exit with ELIM_ABORT_VALUE if the elim is not programmed to collect values for at least one of the resources listed in LSF_RESOURCES. The MELIM does not restart an elim that exits with ELIM_ABORT_VALUE. The default value is 97.

The following sample code shows how to use a header to verify that an elim is programmed to collect load indices for the resources expected on the host. If the elim is not programmed to report on the requested resources, the elim does not need to run on the host.
#!/bin/sh
# list the resources that the elim can report to lim
my_resource="myrsc"
# do the check when $LSF_RESOURCES is defined by lim
if [ -n "$LSF_RESOURCES" ]; then
    # check if the resources elim can report are listed in $LSF_RESOURCES
    res_ok=`echo " $LSF_RESOURCES " | /bin/grep " $my_resource "`
    # exit with $ELIM_ABORT_VALUE if the elim cannot report on at least
    # one resource listed in $LSF_RESOURCES
    if [ "$res_ok" = "" ]; then
        exit $ELIM_ABORT_VALUE
    fi
fi
while true; do
    # set the value for resource "myrsc"
    val="1"
    # create an output string in the format:
    # number_indices index1_name index1_value...
    reportStr="1 $my_resource $val"
    echo "$reportStr"
    # wait for 30 seconds before reporting again
    sleep 30
done
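
The sample header checks a single resource and does not use LSF_MASTER. A possible variation, shown here only as a sketch, handles an elim that reports several resources and adds a master-host check. The helper names can_report_any and on_master_host and the resource names myrsc1 and myrsc2 are hypothetical, and the sketch assumes, as the sample does, that LSF_RESOURCES holds a space-separated list of resource names.

```shell
#!/bin/sh
# Hypothetical helpers for an elim checking header; names are illustrative only.

# Succeed if at least one resource named in $1 (space-separated) appears in $2,
# which is expected to hold the value of LSF_RESOURCES.
can_report_any() {
    for rsc in $1; do
        case " $2 " in
            *" $rsc "*) return 0 ;;
        esac
    done
    return 1
}

# Succeed only on the master host, where LSF_MASTER is Y.
on_master_host() {
    [ "$LSF_MASTER" = "Y" ]
}

# Example header (left commented out so the helpers can be tried on their own):
# my_resources="myrsc1 myrsc2"
# if [ -n "$LSF_RESOURCES" ] && ! can_report_any "$my_resources" "$LSF_RESOURCES"; then
#     exit $ELIM_ABORT_VALUE
# fi
```

An elim that collects only resources mapped as all could call on_master_host in the same header and exit with ELIM_ABORT_VALUE on any other host, so that the MELIM does not restart it there.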