Troubleshooting

Use any of the following methods to troubleshoot your Session Scheduler jobs.

ssched environment variables

Before submitting the ssched command, you can set the following environment variables to enable additional debugging information:
SSCHED_DEBUG_LOG_MASK=[LOG_INFO | LOG_DEBUG | LOG_DEBUG1 | ...]

  • Controls the amount of logging

SSCHED_DEBUG_CLASS=ALL or SSCHED_DEBUG_CLASS=[LC_TRACE] [LC_FILE] [...]
  • Filters out some log classes, or shows all log classes

  • By default, no log classes are shown

SSCHED_DEBUG_MODULES=ALL or SSCHED_DEBUG_MODULES=[ssched] [libvem.so] [sservice] [sschild]
  • Enables logging on some or all components

  • By default, logging is disabled on all components

  • libvem.so controls logging by the libvem.so loaded by the SD, SSM and ssched

  • Enabling debugging of the Session Scheduler automatically enables logging by the libvem.so loaded by the Session Scheduler

SSCHED_DEBUG_REMOTE_HOSTS=ALL or SSCHED_DEBUG_REMOTE_HOSTS=[hostname1] [hostname2] [...]
  • Enables logging on some or all remote hosts

  • By default, logging is disabled on all remote hosts

SSCHED_DEBUG_REMOTE_FILE=Y
  • Directs logging to /tmp/ssched/job_ID.job_index/ instead of stderr on each remote host

  • Useful if too much debugging info is slowing down the network connection

  • By default, debugging info is sent to stderr
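As a minimal sketch, the variables above might be set together in a Bourne-style shell before running ssched. The values are picked from the lists above purely for illustration, the task command reuses the hostname example that appears later in this section, and the guard around ssched is an addition that only keeps the snippet harmless on a machine where ssched is not installed.

```shell
# Illustrative values only; each one is taken from the lists above.
export SSCHED_DEBUG_LOG_MASK=LOG_DEBUG    # amount of logging
export SSCHED_DEBUG_CLASS=ALL             # show all log classes
export SSCHED_DEBUG_MODULES=ALL           # log from all components
export SSCHED_DEBUG_REMOTE_HOSTS=ALL      # log on all remote hosts
export SSCHED_DEBUG_REMOTE_FILE=Y         # remote logs go to a file, not stderr

# Run a task only if ssched is on the PATH (guard added for this sketch).
if command -v ssched >/dev/null 2>&1; then
    ssched -o my.out "hostname; ls"
fi
```

Because the variables are exported, they apply to every ssched command started from the same shell until they are unset.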

ssched debug options

The ssched options -1, -2, and -3 are shortcuts for the following environment variables.
ssched -1
Is a shortcut for:
  • SSCHED_DEBUG_LOG_MASK=LOG_WARNING

  • SSCHED_DEBUG_CLASS=ALL

  • SSCHED_DEBUG_MODULES=ALL

ssched -2
Is a shortcut for:
  • SSCHED_DEBUG_LOG_MASK=LOG_INFO

  • SSCHED_DEBUG_CLASS=ALL

  • SSCHED_DEBUG_MODULES=ALL

ssched -3
Is a shortcut for:
  • SSCHED_DEBUG_LOG_MASK=LOG_DEBUG

  • SSCHED_DEBUG_CLASS=ALL

  • SSCHED_DEBUG_MODULES=ALL
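The mapping above can be written out as a small shell helper that prints the settings each shortcut stands for. This is only a sketch: ssched_debug_env is a hypothetical function name and is not part of the Session Scheduler.

```shell
# Hypothetical helper: print the environment settings that the
# ssched shortcut -1, -2, or -3 stands for, per the lists above.
ssched_debug_env() {
    case "$1" in
        -1) mask=LOG_WARNING ;;
        -2) mask=LOG_INFO ;;
        -3) mask=LOG_DEBUG ;;
        *)  echo "usage: ssched_debug_env -1|-2|-3" >&2; return 1 ;;
    esac
    # Only the log mask differs between the three shortcuts.
    printf 'SSCHED_DEBUG_LOG_MASK=%s\n' "$mask"
    printf 'SSCHED_DEBUG_CLASS=ALL\n'
    printf 'SSCHED_DEBUG_MODULES=ALL\n'
}

ssched_debug_env -2
```

The three shortcuts differ only in the log mask, so -3 gives the most detail and -1 the least.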

Example output of ssched -2:

Nov 22 22:22:45 2012 18275 6 9.1.2 SSCHED_UPDATE_SUMMARY_INTERVAL = 1
Nov 22 22:22:45 2012 18275 6 9.1.2 SSCHED_UPDATE_SUMMARY_BY_TASK = 0
Nov 22 22:22:45 2012 18275 6 9.1.2 SSCHED_REQUEUE_LIMIT = 1
Nov 22 22:22:45 2012 18275 6 9.1.2 SSCHED_RETRY_LIMIT = 1
Nov 22 22:22:45 2012 18275 6 9.1.2 SSCHED_MAX_TASKS = 10
Nov 22 22:22:45 2012 18275 6 9.1.2 SSCHED_MAX_RUNLIMIT = 600
Nov 22 22:22:45 2012 18275 6 9.1.2 SSCHED_ACCT_DIR = /home/user1/ssched
Nov 22 22:22:45 2012 18275 6 9.1.2 Task <1> parsed.
Nov 22 22:22:45 2012 18275 6 9.1.2 Task <2> parsed.
Nov 22 22:22:45 2012 18275 6 9.1.2 Task <3> parsed.
Nov 22 22:22:45 2012 18275 6 9.1.2 Task <4> parsed.
Nov 22 22:22:45 2012 18275 6 9.1.2 Task <5> parsed.
Nov 22 22:22:47 2012 18275 6 9.1.2 Task <1> submitted. Command <sleep 0>;
Nov 22 22:22:47 2012 18275 6 9.1.2 Task <2> submitted. Command <sleep 0>;
Nov 22 22:22:47 2012 18275 6 9.1.2 Task <3> submitted. Command <sleep 0>;
Nov 22 22:22:47 2012 18275 6 9.1.2 Task <4> submitted. Command <sleep 0>;
Nov 22 22:22:47 2012 18275 6 9.1.2 Task <5> submitted. Command <sleep 0>;
Nov 22 22:22:54 2012 18275 6 9.1.2 Task <1> done successfully. 
Nov 22 22:22:54 2012 18275 6 9.1.2 Task <2> done successfully.
Nov 22 22:22:54 2012 18275 6 9.1.2 Task <4> done successfully. 
Nov 22 22:22:54 2012 18275 6 9.1.2 Task <3> done successfully. 
Nov 22 22:22:54 2012 18275 6 9.1.2 Task <5> done successfully. 

Task Summary
Submitted:                  5
Done:                       5

Example output of ssched -2 with requeue:

Nov 22 22:28:36 2012 19409 6 9.1.2 SSCHED_UPDATE_SUMMARY_INTERVAL = 1
Nov 22 22:28:36 2012 19409 6 9.1.2 SSCHED_UPDATE_SUMMARY_BY_TASK = 0
Nov 22 22:28:36 2012 19409 6 9.1.2 SSCHED_REQUEUE_LIMIT = 1
Nov 22 22:28:36 2012 19409 6 9.1.2 SSCHED_RETRY_LIMIT = 1
Nov 22 22:28:36 2012 19409 6 9.1.2 SSCHED_MAX_TASKS = 10
Nov 22 22:28:36 2012 19409 6 9.1.2 SSCHED_MAX_RUNLIMIT = 600
Nov 22 22:28:36 2012 19409 6 9.1.2 SSCHED_ACCT_DIR = /home/user1/ssched
Nov 22 22:28:36 2012 19409 6 9.1.2 Task <1> parsed.
Nov 22 22:28:38 2012 19409 6 9.1.2 Task <1> submitted. Command <exit 1>;
Nov 22 22:28:43 2012 19409 6 9.1.2 Task <1> exited with code 1.
Nov 22 22:28:43 2012 19409 6 9.1.2 Task <1> submitted. Command <exit 1>;
Nov 22 22:28:43 2012 19409 6 9.1.2 Task <1> exited with code 1.

Task Summary
Submitted:                  1
Requeued:                   1

Done:                       0
Exited:                     2
  Execution Errors: 2
  Dispatch Errors:  0
  Other Errors:     0

Task Error Summary

Execution Error
Task ID:                 1
Submit Time:             Thu Nov 22 22:28:38 2012
Start Time:              Thu Nov 22 22:28:43 2012
End Time:                Thu Nov 22 22:28:43 2012
Exit Code:               1
Exit Reason:             Normal exit
Exec Hosts:              hostA
Exec Home:               /home/user1/
Exec Dir:                /home/user1/src/lsf9.1ss/ssched
Command:                 exit 1
Action:                  Requeue exit value match; task will be requeued

Execution Error
Task ID:                 1
Submit Time:             Thu Nov 22 22:28:43 2012
Start Time:              Thu Nov 22 22:28:43 2012
End Time:                Thu Nov 22 22:28:43 2012
Exit Code:               1
Exit Reason:             Normal exit
Exec Hosts:              hostA
Exec Home:               /home/user1/
Exec Dir:                /home/user1/src/lsf9.1ss/ssched
Command:                 exit 1
Action:                  Task requeue limit reached; task will not be requeued

Example output of ssched -2 with retry:

Nov 22 22:35:40 2012 20769 6 9.1.2 SSCHED_UPDATE_SUMMARY_INTERVAL = 1
Nov 22 22:35:40 2012 20769 6 9.1.2 SSCHED_UPDATE_SUMMARY_BY_TASK = 0
Nov 22 22:35:40 2012 20769 6 9.1.2 SSCHED_REQUEUE_LIMIT = 1
Nov 22 22:35:40 2012 20769 6 9.1.2 SSCHED_RETRY_LIMIT = 1
Nov 22 22:35:40 2012 20769 6 9.1.2 SSCHED_MAX_TASKS = 10
Nov 22 22:35:40 2012 20769 6 9.1.2 SSCHED_MAX_RUNLIMIT = 600
Nov 22 22:35:40 2012 20769 6 9.1.2 SSCHED_ACCT_DIR = /home/user1/ssched
Nov 22 22:35:40 2012 20769 6 9.1.2 Task <1> parsed.
Nov 22 22:35:42 2012 20769 6 9.1.2 Task <1> submitted. Command <sleep 0>;
Nov 22 22:35:47 2012 20769 6 9.1.2 Task <1> had a dispatch error. Task will be retried.
Nov 22 22:35:47 2012 20769 6 9.1.2 Task <1> submitted. Command <sleep 0>;
Nov 22 22:35:47 2012 20769 6 9.1.2 Task <1> had a dispatch error. Retry limit reached.

Task Summary
Submitted:                  1
Done:                       0
Exited:                     1
  Execution Errors: 0
  Dispatch Errors:  1
  Other Errors:     0

Task Error Summary

Dispatch Error
Task ID:                 1
Submit Time:             Thu Nov 22 22:35:47 2012
Failure Reason:          Pre-execution command failed
Command:                 sleep 0
Pre-Exec:                exit 1
Start time:              Thu Nov 22 22:35:47 2012
Execution host:          hostA
Action:                  Task retry limit reached; task will not be retried

Note:

The "Task Summary" and "Task Error Summary" sections are sent to stdout. All other output is sent to stderr.
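Because the summaries and the log messages go to different streams, they can be captured in separate files with ordinary shell redirection. The sketch below assumes a Bourne-style shell; run_split and the file names are hypothetical, and the guard around ssched is only there so the snippet does nothing where ssched is not installed.

```shell
# Hypothetical helper: run a command with stdout and stderr
# written to two separate files.
# Usage: run_split SUMMARY_FILE LOG_FILE COMMAND [ARGS...]
run_split() {
    summary=$1
    log=$2
    shift 2
    "$@" >"$summary" 2>"$log"
}

# Keep the summaries in summary.txt and the debug log in debug.log.
if command -v ssched >/dev/null 2>&1; then
    run_split summary.txt debug.log ssched -2 "hostname; ls"
fi
```

Keeping the streams apart makes it easier to read the summaries without scrolling past the per-task log lines.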

Send SIGUSR1 signal

After the tasks have been submitted to the Session Scheduler and started, you can enable additional debugging in Session Scheduler components by sending a SIGUSR1 signal.

To enable additional debugging by the ssched and libvem components, send a SIGUSR1 to the ssched_real process. This enables the following:
  • SSCHED_DEBUG_LOG_MASK=LOG_DEBUG

  • SSCHED_DEBUG_CLASS=ALL

  • SSCHED_DEBUG_MODULES=ALL

The additional log messages are sent to stderr.

To enable additional debugging by the sservice and sschild components, send a SIGUSR1 signal to the sservice process on the remote host. This enables the following:
  • SSCHED_DEBUG_LOG_MASK=LOG_DEBUG

  • SSCHED_DEBUG_CLASS=ALL

  • SSCHED_DEBUG_MODULES=ALL

  • SSCHED_DEBUG_REMOTE_HOSTS=ALL

  • SSCHED_DEBUG_REMOTE_FILE=Y

The debug messages are saved to a file in /tmp/ssched/. You are responsible for deleting this file when it is no longer needed.

Send SIGUSR2 signal

After a SIGUSR1 signal has been sent, sending a SIGUSR2 signal restores debugging to its original level.
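A minimal sketch of sending these signals from a shell, using the standard kill command, is shown here. toggle_ssched_debug is a hypothetical helper name, and the process ID must be looked up separately (for example with ps) on the host where the ssched_real or sservice process runs.

```shell
# Hypothetical helper: turn additional debugging on (SIGUSR1)
# or restore the original level (SIGUSR2) for a given process ID.
# Usage: toggle_ssched_debug on|off PID
toggle_ssched_debug() {
    case "$1" in
        on)  sig=USR1 ;;
        off) sig=USR2 ;;
        *)   echo "usage: toggle_ssched_debug on|off PID" >&2; return 1 ;;
    esac
    kill -s "$sig" "$2"
}
```

For example, toggle_ssched_debug on PID followed later by toggle_ssched_debug off PID, where PID is the process ID of ssched_real, brackets the period for which extra messages are wanted.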

Known issues and limitations

General issues

  • The Session Scheduler caches host information from LIM. If the host factor of a host changes after the Session Scheduler starts, the Session Scheduler does not see the updated value. The host factor is used in the task accounting log.

  • Session Scheduler does not support per-task memory or swap utilization tracking in ssacct. Run bacct to see aggregate memory and swap utilization.

  • When specifying a multiline command line as a ssched command line parameter, you must enclose the command in quotes. A multiline command line is any command containing a semicolon (;). For example:

    ssched -o my.out "hostname; ls"

    When specifying a multiline command line as a parameter in a task definition file, you must NOT use quotes. For example:

    cat my.tasks
    -o my.out hostname; ls

  • If you submit a shell script containing multiple ssched commands, bjobs -l only shows the task summary for the currently running ssched instance. Enable task accounting and examine the accounting file to see information for tasks from all ssched instances in the shell script.

  • Submitting a large number of tasks as part of one session may cause a slight delay between when the Session Scheduler starts and when tasks are dispatched to execution agents. The Session Scheduler must parse and submit each task before it begins dispatching any tasks. Parsing 50,000 tasks can take up to 2 minutes before dispatching starts.

  • After all tasks have completed, the Session Scheduler will take some time to terminate all execution agents and to clean up temporary files. A minimum of 20 seconds is normal, longer for larger allocations.

  • Session Scheduler handles the following signals: SIGINT, SIGTERM, SIGUSR1, SIGSTOP, SIGTSTP, and SIGCONT. All other signals cause ssched to exit immediately. No summary is output and task accounting information is not saved. The signals Session Scheduler handles will be expanded in future releases.
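The quoting rule for multiline commands described in the list above can be sketched as follows. The file name my.tasks comes from the example in the list. The -tasks option used to pass the file to ssched is not described in this section, so treat that invocation as an assumption to verify against the ssched command reference. The guard around ssched only keeps the snippet harmless where ssched is not installed.

```shell
# Task definition file: the multiline command is written WITHOUT quotes.
# The quoted 'EOF' delimiter stops the shell from altering the line.
cat > my.tasks <<'EOF'
-o my.out hostname; ls
EOF

if command -v ssched >/dev/null 2>&1; then
    # Command-line form: the multiline command MUST be quoted, otherwise
    # the shell treats "; ls" as a separate command.
    ssched -o my.out "hostname; ls"

    # Task-file form (assumes the -tasks option; see the lead-in).
    ssched -tasks my.tasks
fi
```

Without the quotes in the command-line form, the shell splits the line at the semicolon, so ssched receives only hostname and ls runs locally afterwards.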