Logging and troubleshooting

LSF log files

LSF event and account log location

LSF uses directories for temporary work files, log files, transaction files, and spooling.

LSF keeps track of all jobs in the system by maintaining a transaction log in the work subtree. The LSF log files are found in the directory LSB_SHAREDIR/cluster_name/logdir.

The following files maintain the state of the LSF system:

lsb.events: LSF uses the lsb.events file to keep track of the state of all jobs. Each job is a transaction from job submission to job completion. The LSF system records everything associated with the job in the lsb.events file.
lsb.events.n: The events file is automatically trimmed and old job events are stored in lsb.events.n files. When mbatchd starts, it refers only to the lsb.events file, not the lsb.events.n files. The bhist command can refer to these files.
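
As an illustration of the layout described above, the following minimal sketch lists the live events file followed by the trimmed lsb.events.n files in numeric order. The function name and the sorting approach are assumptions made for this example, not part of LSF.

```python
import re
from pathlib import Path

def list_event_files(sharedir, cluster_name):
    """Return lsb.events first, then lsb.events.n files ordered by n."""
    # Directory layout taken from the text: LSB_SHAREDIR/cluster_name/logdir
    logdir = Path(sharedir) / cluster_name / "logdir"
    pattern = re.compile(r"^lsb\.events(?:\.(\d+))?$")
    found = []
    for entry in logdir.iterdir():
        match = pattern.match(entry.name)
        if match and entry.is_file():
            # The live file gets index 0 so that it sorts before the trimmed files.
            index = int(match.group(1)) if match.group(1) else 0
            found.append((index, entry))
    return [path for _, path in sorted(found)]
```

Sorting on the numeric suffix rather than on the file name keeps lsb.events.10 after lsb.events.2.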

LSF error log location

If the optional LSF_LOGDIR parameter is defined in lsf.conf, error messages from LSF servers are logged to files in this directory.

If LSF_LOGDIR is defined, but the daemons cannot write to files there, the error log files are created in /tmp.

If LSF_LOGDIR is not defined, errors are logged to the system error logs (syslog) using the LOG_DAEMON facility. syslog messages are highly configurable, and the default configuration varies widely from system to system. Start by looking for the file /etc/syslog.conf, and read the man pages for syslog(3) and syslogd(1).

If the error log is managed by syslog, it is probably already being automatically cleared.

If LSF daemons cannot find lsf.conf when they start, they will not find the definition of LSF_LOGDIR. In this case, error messages go to syslog. If you cannot find any error messages in the log files, they are likely in the syslog.
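
The decision rules above can be summarized in a short sketch. It assumes that lsf.conf consists of KEY=value lines with # comments. The helper names read_conf and error_log_destination are hypothetical, and the writability check is only an approximation of what the daemons do.

```python
import os

def read_conf(text):
    """Parse KEY=value lines, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings

def error_log_destination(conf_text, is_writable=lambda d: os.access(d, os.W_OK)):
    """conf_text is the contents of lsf.conf, or None if the file was not found."""
    if conf_text is None:
        return "syslog"      # lsf.conf not found, so LSF_LOGDIR is unknown
    logdir = read_conf(conf_text).get("LSF_LOGDIR")
    if not logdir:
        return "syslog"      # LSF_LOGDIR not defined
    # Defined but not writable: the text says the logs are created in /tmp.
    return logdir if is_writable(logdir) else "/tmp"
```

Calling error_log_destination(None) returns "syslog", which matches the case where the daemons cannot find lsf.conf.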

LSF daemon error logs

LSF log files are reopened each time a message is logged, so if you rename or remove a daemon log file, the daemons will automatically create a new log file.

The LSF daemons log messages when they detect problems or unusual situations.

The daemons can be configured to put these messages into files.

The error log file names for the LSF system daemons are:

res.log.host_name
sbatchd.log.host_name
mbatchd.log.host_name
mbschd.log.host_name

LSF daemons log error messages in different levels so that you can choose to log all messages, or only log messages that are deemed critical. Message logging for LSF daemons is controlled by the parameter LSF_LOG_MASK in lsf.conf. Possible values for this parameter can be any log priority symbol that is defined in /usr/include/sys/syslog.h. The default value for LSF_LOG_MASK is LOG_WARNING.

LSF log directory permissions and ownership

Ensure that the LSF_LOGDIR directory is writable by root. The LSF administrator must own LSF_LOGDIR.

EGO log files

Log files contain important run-time information about the general health of EGO daemons, workload submissions, and other EGO system events. Log files are an essential troubleshooting tool during production and testing.

The naming convention for most EGO log files is the name of the daemon plus the host name the daemon is running on.

The following table outlines the daemons and their associated log file names. Log files on Windows hosts have a .txt extension.

Daemon	Log file name
ESC (EGO Service Controller)	esc.log.`hostname`
named	named.log.`hostname`
PEM (Process Execution Manager)	pem.log.`hostname`
VEMKD (Platform LSF Kernel Daemon)	vemkd.log.`hostname`
WSG (Web Service Gateway)	wsg.log

Most log entries are informational in nature. It is not uncommon to have a large (and growing) log file and still have a healthy cluster.

EGO log file locations

By default, most Platform LSF log files are found in LSF_LOGDIR.

The service controller log files are found in LSF_LOGDIR/ego/cluster_name/eservice/esc/log (Linux) or LSF_LOGDIR\ego\cluster_name\eservice\esc\log (Windows).

Web service gateway log files are found in LSF_LOGDIR/ego/cluster_name/eservice/wsg/log (Linux) or LSF_LOGDIR\ego\cluster_name\eservice\wsg\log (Windows).

The service director log files, logged by BIND, are found in LSF_LOGDIR/ego/cluster_name/eservice/esd/conf/named/namedb/named.log.hostname (Linux) or LSF_LOGDIR\ego\cluster_name\eservice\esd\conf\named\namedb\named.log.hostname (Windows).

EGO log entry format

Log file entries follow the format

date time_zone log_level [process_id:thread_id] action:description/message

where the date is expressed in YYYY-MM-DD hh:mm:ss.sss.

For example, 2006-03-14 11:02:44.000 Eastern Standard Time ERROR [2488:1036] vemkdexit: vemkd is halting.
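
A minimal sketch of parsing one entry in this format follows. The LogEntry type, the parse_entry function, and the regular expression are assumptions written for this illustration and are based only on the sample line above. In particular, the pattern assumes that the time uses colon separators, that the log level is a single uppercase word, and that the time zone may contain spaces.

```python
import re
from typing import NamedTuple, Optional

class LogEntry(NamedTuple):
    timestamp: str
    time_zone: str
    level: str
    process_id: int
    thread_id: int
    action: str
    message: str

ENTRY_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<tz>.+?) "                      # time zone, possibly several words
    r"(?P<level>[A-Z_]+) "               # log level, such as ERROR
    r"\[(?P<pid>\d+):(?P<tid>\d+)\] "    # [process_id:thread_id]
    r"(?P<action>[^:]+):\s*(?P<msg>.*)$"
)

def parse_entry(line: str) -> Optional[LogEntry]:
    """Return the parsed fields, or None if the line does not match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    return LogEntry(
        timestamp=match["ts"],
        time_zone=match["tz"],
        level=match["level"],
        process_id=int(match["pid"]),
        thread_id=int(match["tid"]),
        action=match["action"],
        message=match["msg"],
    )
```

Lines that do not follow the format, such as continuation lines, yield None rather than raising an error, so a caller can skip them.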

EGO log classes

Every log entry belongs to a log class. You can use log class as a mechanism to filter log entries by area. Log classes in combination with log levels allow you to troubleshoot using log entries that only address, for example, configuration.

Log classes are adjusted at run time using egosh debug.

Valid logging classes are as follows:

Class	Description
LC_ALLOC	Logs messages related to the resource allocation engine
LC_AUTH	Logs messages related to users and authentication
LC_CLIENT	Logs messages related to clients
LC_COMM	Logs messages related to communications
LC_CONF	Logs messages related to configuration
LC_CONTAINER	Logs messages related to activities
LC_EVENT	Logs messages related to the event notification service
LC_MEM	Logs messages related to memory allocation
LC_PEM	Logs messages related to the process execution manager (pem)
LC_PERF	Logs messages related to performance
LC_QUERY	Logs messages related to client queries
LC_RECOVER	Logs messages related to recovery and data persistence
LC_RSRC	Logs messages related to resources, including host status changes
LC_SYS	Logs messages related to system calls
LC_TRACE	Logs the steps of the program

EGO log levels

There are nine log levels that allow administrators to control the level of event information that is logged.

When you are troubleshooting, increase the log level to obtain as much detailed information as you can. When you are finished troubleshooting, decrease the log level to prevent the log files from becoming too large.

Valid logging levels are as follows:

Number	Level	Description
0	LOG_EMERG	Log only those messages in which the system is unusable.
1	LOG_ALERT	Log only those messages for which action must be taken immediately.
2	LOG_CRIT	Log only those messages that are critical.
3	LOG_ERR	Log only those messages that indicate error conditions.
4	LOG_WARNING	Log only those messages that are warnings or more serious messages. This is the default level of debug information.
5	LOG_NOTICE	Log those messages that indicate normal but significant conditions or warnings and more serious messages.
6	LOG_INFO	Log all informational messages and more serious messages.
7	LOG_DEBUG	Log all debug-level messages.
8	LOG_TRACE	Log all available messages.
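
The table implies a threshold rule: a message is recorded when its level number is less than or equal to the configured level number. The sketch below expresses that rule. The dictionary and function names are illustrative only and are not part of EGO.

```python
# Level numbers copied from the table above.
LEVELS = {
    "LOG_EMERG": 0,
    "LOG_ALERT": 1,
    "LOG_CRIT": 2,
    "LOG_ERR": 3,
    "LOG_WARNING": 4,
    "LOG_NOTICE": 5,
    "LOG_INFO": 6,
    "LOG_DEBUG": 7,
    "LOG_TRACE": 8,
}

def is_logged(message_level, configured_level):
    """True if a message at message_level is recorded under configured_level."""
    return LEVELS[message_level] <= LEVELS[configured_level]

def levels_logged(configured_level):
    """List every level that is recorded under configured_level, most severe first."""
    limit = LEVELS[configured_level]
    return [name for name, number in sorted(LEVELS.items(), key=lambda kv: kv[1])
            if number <= limit]
```

For example, under LOG_WARNING a LOG_ERR message is recorded, but a LOG_INFO message is not.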

EGO log level and class information retrieved from configuration files

When EGO is enabled, the pem and vemkd daemons read ego.conf to retrieve the following information (as corresponds to the particular daemon):

EGO_LOG_MASK: The log level used to determine the amount of detail logged.
EGO_DEBUG_PEM: The log class setting for pem.
EGO_DEBUG_VEMKD: The log class setting for vemkd.
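
As a rough sketch of how these parameters relate to each daemon, the following function extracts the log level and the daemon-specific log class from the text of ego.conf. It assumes KEY=value lines with # comments, treats parameter values as opaque strings because their syntax is not described here, and falls back to LOG_WARNING, the default level listed above. The function name daemon_log_settings is hypothetical.

```python
# Which ego.conf parameter carries the log class for each daemon (from the list above).
CLASS_PARAM = {"pem": "EGO_DEBUG_PEM", "vemkd": "EGO_DEBUG_VEMKD"}

def daemon_log_settings(conf_text, daemon):
    """Return the log level and log class that apply to the given daemon."""
    settings = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return {
        # EGO_LOG_MASK is shared by both daemons; assume the documented default if unset.
        "log_level": settings.get("EGO_LOG_MASK", "LOG_WARNING"),
        # The log class is daemon-specific and may be absent.
        "log_class": settings.get(CLASS_PARAM[daemon]),
    }
```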

The wsg daemon reads wsg.conf to retrieve the following information:

WSG_PORT: The port on which the Web service gateway (WebServiceGateway) should run.
WSG_SSL: Whether the daemon should use Secure Socket Layer (SSL) for communication.
WSG_DEBUG_DETAIL: The log level used to determine the amount of detail logged for debugging purposes.
WSG_LOGDIR: The directory location where wsg.log files are written.

The service director daemon (named) reads named.conf to retrieve the following information:

logging, severity: The configured severity log class controlling the level of event information that is logged (critical, error, warning, notice, info, debug, or dynamic). In the case of a log class set to debug, a log level is required to determine the amount of detail logged for debugging purposes.

Why do log files grow so quickly?

Every time an EGO system event occurs, a log file entry is added to a log file. Most entries are informational in nature, except when there is an error condition. If your log levels provide entries for all information (for example, if you have set them to LOG_DEBUG), the files will grow quickly.

Suggested settings:

During regular EGO operation, set your log levels to LOG_WARNING. With this setting, critical errors are logged but informational entries are not, keeping the log file size to a minimum.
For troubleshooting purposes, set your log level to LOG_DEBUG. Because of the quantity of messages you will receive when subscribed to this log level, change the level back to LOG_WARNING as soon as you are finished troubleshooting.

Note:

If your log files grow too large, you can rename them for archive purposes. New log files are then created and record all subsequent events.

How often should I maintain log files?

The growth rate of the log files is dependent on the log level and the complexity of your cluster. If you have a large cluster, daily log file maintenance may be required.

We recommend using a log file rotation utility to do unattended maintenance of your log files. Failure to do timely maintenance could result in a full file system, which hinders system performance and operation.
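
Where no rotation utility is available, the rename approach can be scripted. The sketch below is one possible approach under stated assumptions, not a recommended tool: it renames a log file once it exceeds a size threshold and relies on the daemon creating a new file afterward. The function name, the threshold parameter, and the timestamp suffix format are all choices made for this example.

```python
import time
from pathlib import Path

def archive_if_larger(path, max_bytes, now=None):
    """Rename the log file with a timestamp suffix if it exceeds max_bytes.

    Returns the new path, or None if nothing was renamed.
    """
    src = Path(path)
    if not src.is_file() or src.stat().st_size <= max_bytes:
        return None
    # now=None means the current time; a fixed value makes the name predictable.
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    dest = src.with_name(f"{src.name}.{stamp}")
    src.rename(dest)
    return dest
```

A scheduled job could call this function for each log file in LSF_LOGDIR and then compress or delete archives past a retention period.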

Troubleshoot using multiple EGO log files

EGO log file locations and content

If a service does not start as expected, open the appropriate service log file and review the run-time information contained within it to discover the problem. Look for relevant entries such as insufficient disk space, lack of memory, or network problems that result in unavailable hosts.

Log file	Default location	What it contains
esc.log	Linux: LSF_LOGDIR/esc.log.`host_name` Windows: LSF_LOGDIR\esc.log.`host_name`	Logs service failures and service instance restarts based on availability plans.
named.log	Linux: LSF_LOGDIR/named.log.`host_name` Windows: LSF_LOGDIR\named.log.`host_name`	Logs information gathered during the updating and querying of service instance location; logged by BIND, a DNS server.
pem.log	Linux: LSF_LOGDIR/pem.log.`host_name` Windows: LSF_LOGDIR\pem.log.`host_name`	Logs remote operations (start, stop, control activities, failures). Logs tracked results for resource utilization of all processes associated with the host, and information for accounting or chargeback.
vemkd.log	Linux: LSF_LOGDIR/vemkd.log.`host_name` Windows: LSF_LOGDIR\vemkd.log.`host_name`	Logs aggregated host information about the state of individual resources, the status of allocation requests, the consumer hierarchy, resource assignments to consumers, and started operating system-level processes.
wsg.log	Linux: LSF_LOGDIR/wsg.log.`host_name` Windows: LSF_LOGDIR\wsg.log.`host_name`	Logs service failures surrounding web services interfaces for web service clients (applications).
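
When a problem could involve several daemons, it can help to search all of the default log files for a host in one pass. The following sketch builds the Linux paths from the table above and collects matching lines. The function names and the case-insensitive substring match are assumptions made for this illustration.

```python
from pathlib import Path

# Daemon log prefixes taken from the table above.
DAEMONS = ("esc", "named", "pem", "vemkd", "wsg")

def default_log_paths(logdir, host_name):
    """Map each daemon to its default log file path (Linux layout)."""
    return {d: Path(logdir) / f"{d}.log.{host_name}" for d in DAEMONS}

def search_logs(logdir, host_name, needle):
    """Return (daemon, line) pairs for every line containing needle."""
    hits = []
    for daemon, path in default_log_paths(logdir, host_name).items():
        if not path.is_file():
            continue  # a daemon may not run on this host
        with path.open(errors="replace") as handle:
            for line in handle:
                if needle.lower() in line.lower():
                    hits.append((daemon, line.rstrip("\n")))
    return hits
```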

Match service error messages and corresponding log files

If you receive this message…	This may be the problem…	Review this log file
`failed to create vem working directory`	Cannot create work directory during startup	vemkd
`failed to open lock file`	Cannot get lock file during startup	vemkd
`failed to open host event file`	Cannot recover during startup because the event file cannot be opened	vemkd
`lim port is not defined`	EGO_LIM_PORT in ego.conf is not defined	lim
`master candidate can not set GET_CONF=lim`	Wrong parameter defined for master candidate host (for example, EGO_GET_CONF=LIM)	lim
`there is no valid host in EGO_MASTER_LIST`	No valid host in master list	lim
`ls_getmyhostname fails`	Cannot get local host name during startup	pem
`temp directory (%s) not exist or not accessible, exit`	Tmp directory does not exist	pem
`incorrect EGO_PEM_PORT value %s, exit`	EGO_PEM_PORT is a negative number	pem
`chdir(%s) fails`	Tmp directory does not exist	esc
`cannot initialize the listening TCP port %d`	Socket error	esc
`cannot log on`	Log on to vemkd failed	esc
`vem_register: error in invoking vem_register function`	VEM service registration failed	wsg
`you are not authorized to unregister a service`	Either you are not authorized to unregister a service, or there is no registry client	wsg
`request has invalid signature: TSIG service.ego: tsig verify failure (BADTIME)`	Resource record updating failed	named
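
The table above can also be expressed as a lookup that suggests which log file to review for a given message. The sketch below covers a subset of the rows and converts the %s and %d placeholders into wildcards. The names KNOWN_MESSAGES, template_to_regex, and which_log are hypothetical, and the matching is a simple heuristic rather than how EGO itself classifies messages.

```python
import re

# (message template, log file to review), copied from rows of the table above.
KNOWN_MESSAGES = [
    ("failed to create vem working directory", "vemkd"),
    ("failed to open lock file", "vemkd"),
    ("failed to open host event file", "vemkd"),
    ("lim port is not defined", "lim"),
    ("there is no valid host in EGO_MASTER_LIST", "lim"),
    ("ls_getmyhostname fails", "pem"),
    ("temp directory (%s) not exist or not accessible, exit", "pem"),
    ("incorrect EGO_PEM_PORT value %s, exit", "pem"),
    ("chdir(%s) fails", "esc"),
    ("cannot initialize the listening TCP port %d", "esc"),
    ("vem_register: error in invoking vem_register function", "wsg"),
    ("you are not authorized to unregister a service", "wsg"),
]

def template_to_regex(template):
    """Escape the literal text and turn %s / %d placeholders into wildcards."""
    pieces = []
    for part in re.split(r"(%s|%d)", template):
        if part == "%s":
            pieces.append(r".+?")
        elif part == "%d":
            pieces.append(r"-?\d+")
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces), re.IGNORECASE)

PATTERNS = [(template_to_regex(t), log) for t, log in KNOWN_MESSAGES]

def which_log(message):
    """Return the log file to review for this message, or None if unknown."""
    for pattern, log in PATTERNS:
        if pattern.search(message):
            return log
    return None
```

For example, which_log("chdir(/tmp/ego) fails") returns "esc", and a message that matches no template returns None.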