Use any of the following methods to troubleshoot your Session Scheduler jobs.
SSCHED_DEBUG_LOG_MASK: Controls the amount of logging.
SSCHED_DEBUG_CLASS: Filters out some log classes, or shows all log classes.
By default, no log classes are shown.
SSCHED_DEBUG_MODULES: Enables logging on some or all components.
By default, logging is disabled on all components.
The libvem.so value controls logging by the libvem.so library loaded by the SD, SSM, and ssched.
Enabling debugging of the Session Scheduler automatically enables logging by the libvem.so library that the Session Scheduler loads.
SSCHED_DEBUG_REMOTE_HOSTS: Enables logging on some or all remote hosts.
By default, logging is disabled on all remote hosts.
SSCHED_DEBUG_REMOTE_FILE: Directs logging to /tmp/ssched/job_ID.job_index/ instead of stderr on each remote host.
Useful if too much debugging information is slowing down the network connection.
By default, debugging information is sent to stderr.
ssched -1 is a shortcut for:
SSCHED_DEBUG_LOG_MASK=LOG_WARNING
SSCHED_DEBUG_CLASS=ALL
SSCHED_DEBUG_MODULES=ALL
ssched -2 is a shortcut for:
SSCHED_DEBUG_LOG_MASK=LOG_INFO
SSCHED_DEBUG_CLASS=ALL
SSCHED_DEBUG_MODULES=ALL
ssched -3 is a shortcut for:
SSCHED_DEBUG_LOG_MASK=LOG_DEBUG
SSCHED_DEBUG_CLASS=ALL
SSCHED_DEBUG_MODULES=ALL
Example output of ssched -2:
Nov 22 22:22:45 2012 18275 6 9.1.2 SSCHED_UPDATE_SUMMARY_INTERVAL = 1
Nov 22 22:22:45 2012 18275 6 9.1.2 SSCHED_UPDATE_SUMMARY_BY_TASK = 0
Nov 22 22:22:45 2012 18275 6 9.1.2 SSCHED_REQUEUE_LIMIT = 1
Nov 22 22:22:45 2012 18275 6 9.1.2 SSCHED_RETRY_LIMIT = 1
Nov 22 22:22:45 2012 18275 6 9.1.2 SSCHED_MAX_TASKS = 10
Nov 22 22:22:45 2012 18275 6 9.1.2 SSCHED_MAX_RUNLIMIT = 600
Nov 22 22:22:45 2012 18275 6 9.1.2 SSCHED_ACCT_DIR = /home/user1/ssched
Nov 22 22:22:45 2012 18275 6 9.1.2 Task <1> parsed.
Nov 22 22:22:45 2012 18275 6 9.1.2 Task <2> parsed.
Nov 22 22:22:45 2012 18275 6 9.1.2 Task <3> parsed.
Nov 22 22:22:45 2012 18275 6 9.1.2 Task <4> parsed.
Nov 22 22:22:45 2012 18275 6 9.1.2 Task <5> parsed.
Nov 22 22:22:47 2012 18275 6 9.1.2 Task <1> submitted. Command <sleep 0>;
Nov 22 22:22:47 2012 18275 6 9.1.2 Task <2> submitted. Command <sleep 0>;
Nov 22 22:22:47 2012 18275 6 9.1.2 Task <3> submitted. Command <sleep 0>;
Nov 22 22:22:47 2012 18275 6 9.1.2 Task <4> submitted. Command <sleep 0>;
Nov 22 22:22:47 2012 18275 6 9.1.2 Task <5> submitted. Command <sleep 0>;
Nov 22 22:22:54 2012 18275 6 9.1.2 Task <1> done successfully.
Nov 22 22:22:54 2012 18275 6 9.1.2 Task <2> done successfully.
Nov 22 22:22:54 2012 18275 6 9.1.2 Task <4> done successfully.
Nov 22 22:22:54 2012 18275 6 9.1.2 Task <3> done successfully.
Nov 22 22:22:54 2012 18275 6 9.1.2 Task <5> done successfully.
Task Summary
Submitted: 5
Done: 5
Nov 22 22:28:36 2012 19409 6 9.1.2 SSCHED_UPDATE_SUMMARY_INTERVAL = 1
Nov 22 22:28:36 2012 19409 6 9.1.2 SSCHED_UPDATE_SUMMARY_BY_TASK = 0
Nov 22 22:28:36 2012 19409 6 9.1.2 SSCHED_REQUEUE_LIMIT = 1
Nov 22 22:28:36 2012 19409 6 9.1.2 SSCHED_RETRY_LIMIT = 1
Nov 22 22:28:36 2012 19409 6 9.1.2 SSCHED_MAX_TASKS = 10
Nov 22 22:28:36 2012 19409 6 9.1.2 SSCHED_MAX_RUNLIMIT = 600
Nov 22 22:28:36 2012 19409 6 9.1.2 SSCHED_ACCT_DIR = /home/user1/ssched
Nov 22 22:28:36 2012 19409 6 9.1.2 Task <1> parsed.
Nov 22 22:28:38 2012 19409 6 9.1.2 Task <1> submitted. Command <exit 1>;
Nov 22 22:28:43 2012 19409 6 9.1.2 Task <1> exited with code 1.
Nov 22 22:28:43 2012 19409 6 9.1.2 Task <1> submitted. Command <exit 1>;
Nov 22 22:28:43 2012 19409 6 9.1.2 Task <1> exited with code 1.
Task Summary
Submitted: 1
Requeued: 1
Done: 0
Exited: 2
Execution Errors: 2
Dispatch Errors: 0
Other Errors: 0
Task Error Summary
Execution Error
Task ID: 1
Submit Time: Thu Nov 22 22:28:38 2012
Start Time: Thu Nov 22 22:28:43 2012
End Time: Thu Nov 22 22:28:43 2012
Exit Code: 1
Exit Reason: Normal exit
Exec Hosts: hostA
Exec Home: /home/user1/
Exec Dir: /home/user1/src/lsf9.1ss/ssched
Command: exit 1
Action: Requeue exit value match; task will be requeued
Execution Error
Task ID: 1
Submit Time: Thu Nov 22 22:28:43 2012
Start Time: Thu Nov 22 22:28:43 2012
End Time: Thu Nov 22 22:28:43 2012
Exit Code: 1
Exit Reason: Normal exit
Exec Hosts: hostA
Exec Home: /home/user1/
Exec Dir: /home/user1/src/lsf9.1ss/ssched
Command: exit 1
Action: Task requeue limit reached; task will not be requeued
Nov 22 22:35:40 2012 20769 6 9.1.2 SSCHED_UPDATE_SUMMARY_INTERVAL = 1
Nov 22 22:35:40 2012 20769 6 9.1.2 SSCHED_UPDATE_SUMMARY_BY_TASK = 0
Nov 22 22:35:40 2012 20769 6 9.1.2 SSCHED_REQUEUE_LIMIT = 1
Nov 22 22:35:40 2012 20769 6 9.1.2 SSCHED_RETRY_LIMIT = 1
Nov 22 22:35:40 2012 20769 6 9.1.2 SSCHED_MAX_TASKS = 10
Nov 22 22:35:40 2012 20769 6 9.1.2 SSCHED_MAX_RUNLIMIT = 600
Nov 22 22:35:40 2012 20769 6 9.1.2 SSCHED_ACCT_DIR = /home/user1/ssched
Nov 22 22:35:40 2012 20769 6 9.1.2 Task <1> parsed.
Nov 22 22:35:42 2012 20769 6 9.1.2 Task <1> submitted. Command <sleep 0>;
Nov 22 22:35:47 2012 20769 6 9.1.2 Task <1> had a dispatch error. Task will be retried.
Nov 22 22:35:47 2012 20769 6 9.1.2 Task <1> submitted. Command <sleep 0>;
Nov 22 22:35:47 2012 20769 6 9.1.2 Task <1> had a dispatch error. Retry limit reached.
Task Summary
Submitted: 1
Done: 0
Exited: 1
Execution Errors: 0
Dispatch Errors: 1
Other Errors: 0
Task Error Summary
Dispatch Error
Task ID: 1
Submit Time: Thu Nov 22 22:35:47 2012
Failure Reason: Pre-execution command failed
Command: sleep 0
Pre-Exec: exit 1
Start time: Thu Nov 22 22:35:47 2012
Execution host: hostA
Action: Task retry limit reached; task will not be retried
The "Task Summary" and "Task Error Summary" sections are sent to stdout. All other output is sent to stderr.
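Because the summaries and the log lines go to different streams, they can be captured in separate files and the log can then be filtered for failures. The following is a minimal sketch, assuming the message wording shown in the example output above. The function name failed_task_ids and the file names are hypothetical, and the ssched arguments in the usage comment are only illustrative.

```shell
#!/bin/sh
# Hypothetical helper: print the unique IDs of tasks that the log reports as
# failed, one per line. Reads ssched stderr log lines on standard input.
# Assumes the "exited with code N." and "had a dispatch error." wording
# shown in the example output.
failed_task_ids() {
    sed -n \
        -e 's/.*Task <\([0-9][0-9]*\)> exited with code [0-9][0-9]*\..*/\1/p' \
        -e 's/.*Task <\([0-9][0-9]*\)> had a dispatch error\..*/\1/p' |
        sort -n -u
}

# Possible usage (not run here): keep the summaries and the log apart, then
# list the failed tasks.
#   ssched -2 "hostname; ls" >summary.txt 2>debug.log
#   failed_task_ids <debug.log
```

A task that is requeued or retried appears several times in the log, so the output is de-duplicated.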
After the tasks have been submitted to the Session Scheduler and started, you can enable additional debugging of the Session Scheduler components by sending a SIGUSR1 signal to the Session Scheduler.
SSCHED_DEBUG_LOG_MASK=LOG_DEBUG
SSCHED_DEBUG_CLASS=ALL
SSCHED_DEBUG_MODULES=ALL
The additional log messages are sent to stderr.
SSCHED_DEBUG_LOG_MASK=LOG_DEBUG
SSCHED_DEBUG_CLASS=ALL
SSCHED_DEBUG_MODULES=ALL
SSCHED_DEBUG_REMOTE_HOSTS=ALL
SSCHED_DEBUG_REMOTE_FILE=Y
The debug messages are saved to a file in /tmp/ssched/. You are responsible for deleting this file when it is no longer needed.
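Since cleanup is left to you, a small helper can remove the debug directories for one job. This is a sketch only, assuming the /tmp/ssched/job_ID.job_index/ layout described in this section. The function name clean_ssched_debug and its optional base-directory argument are hypothetical, and the helper would need to be run on each remote host that wrote debug files.

```shell
#!/bin/sh
# Hypothetical helper: remove the debug directories left for one job.
# Assumes the layout <base>/<job_ID>.<job_index>/ with <base> = /tmp/ssched.
#   clean_ssched_debug job_ID [base_dir]
clean_ssched_debug() {
    job_id=$1
    base=${2:-/tmp/ssched}
    # Refuse anything that is not a plain numeric job ID.
    case $job_id in
        ''|*[!0-9]*)
            echo "usage: clean_ssched_debug job_ID [base_dir]" >&2
            return 2 ;;
    esac
    for dir in "$base/$job_id".*; do
        # An unmatched glob stays literal, so check that the path exists.
        [ -d "$dir" ] && rm -rf -- "$dir"
    done
    return 0
}
```

The trailing dot in the glob keeps job 123 from matching the directories of job 1234.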
If a SIGUSR1 signal is sent, SIGUSR2 restores debugging to its original level.
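The two signals can be wrapped in a small helper so that the debug level is raised and restored by name. This is a minimal sketch, assuming you already know the process ID of the running ssched. The function name ssched_debug is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: raise or restore the debug level of a running ssched.
#   ssched_debug on  PID   sends SIGUSR1 (enable additional debugging)
#   ssched_debug off PID   sends SIGUSR2 (restore the original level)
ssched_debug() {
    case $1 in
        on)  sig=USR1 ;;
        off) sig=USR2 ;;
        *)   echo "usage: ssched_debug on|off PID" >&2
             return 2 ;;
    esac
    kill -s "$sig" "$2"
}
```

The helper only sends the signal; how the process reacts is up to the Session Scheduler.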
The Session Scheduler caches host info from LIM. If the host factor of a host is changed after the Session Scheduler starts, the Session Scheduler will not see the updated host factor. The host factor is used in the task accounting log.
Session Scheduler does not support per-task memory or swap utilization tracking from ssacct. Run bacct to see aggregate memory and swap utilization.
When specifying a multiline command line as an ssched command line parameter, you must enclose the command in quotes. A multiline command line is any command that contains a semicolon (;). For example:
ssched -o my.out "hostname; ls"
When specifying a multiline command line as a parameter in a task definition file, you must NOT use quotes. For example:
cat my.tasks
-o my.out hostname; ls
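The two quoting rules are easy to mix up when task files are generated by a script. The sketch below appends tasks to a task definition file and writes each command as-is, without quotes. The function name add_task is hypothetical, and the -o option is taken from the examples above.

```shell
#!/bin/sh
# Hypothetical helper: append one task to a task definition file.
#   add_task FILE OUTPUT_FILE COMMAND
# The command is written as-is, without surrounding quotes, even when it
# contains a semicolon.
add_task() {
    printf '%s %s %s\n' '-o' "$2" "$3" >>"$1"
}

# Possible usage (not run here):
#   add_task my.tasks my.out 'hostname; ls'
#   add_task my.tasks other.out 'sleep 0'
```

The quotes in the calling shell only protect the semicolon from that shell; they are not written to the file.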
If you submit a shell script containing multiple ssched commands, bjobs -l only shows the task summary for the currently running ssched instance. Enable task accounting and examine the accounting file to see information for tasks from all ssched instances in the shell script.
Submitting a large number of tasks as part of one session may cause a slight delay between when the Session Scheduler starts and when tasks are dispatched to execution agents. The Session Scheduler must parse and submit each task before it begins dispatching any tasks. Parsing 50,000 tasks can take up to 2 minutes before dispatching starts.
After all tasks have completed, the Session Scheduler takes some time to terminate all execution agents and to clean up temporary files. A delay of at least 20 seconds is normal, and larger allocations take longer.
Session Scheduler handles the following signals: SIGINT, SIGTERM, SIGUSR1, SIGSTOP, SIGTSTP, and SIGCONT. All other signals cause ssched to exit immediately. No summary is output and task accounting information is not saved. The signals Session Scheduler handles will be expanded in future releases.