Running Jobs

Slurm is the utility used for batch processing support, so all jobs must be run through it. This section provides the information needed to get started with job execution on the cluster.

caution

The maximum number of queued jobs (running or not) is 366. You can check this limit with bsc_queues.

Submitting jobs

The method for submitting jobs is to use the SLURM sbatch directives directly.

A job is the execution unit for SLURM. A job is defined by a text file containing a set of directives describing the job's requirements, and the commands to execute.

To ensure the proper scheduling of jobs, there are limits on the number of nodes and CPUs that can be used at the same time by a group. You can check those limits with the bsc_queues command. If you need to run an execution bigger than the limits already granted, you may contact us.

Important accounting changes

To ensure fair and reliable CPU usage accounting, we enforce the use of at least 40 threads for each GPU requested. In your job scripts, make sure that the number of threads requested meets the requirement for the GPUs you need. Note that Slurm refers to each thread as if it were a physical CPU.

The value of "cpus-per-task" x "ntasks-per-node" should amount to at least 40 threads for each GPU requested on the node. Remember that, by default, the value of "cpus-per-task" is 1.

If you can't change the number of tasks in your job, you can change the number of CPUs per task instead (#SBATCH --cpus-per-task=). So that this does not affect your executions, you can set the number of threads your application actually uses through the environment variable OMP_NUM_THREADS (this variable may not work for every application).

If the requirement is not met, an error message pointing out the issue will be displayed:

sbatch: error: Minimum cpus requested should be (nodes * gpus/node * 40). 
Cpus requested: X. Gpus: Y, Required cpus: Z
sbatch: error: Batch job submission failed: CPU count specification invalid
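The minimum quoted in the error message can be checked before submitting. The following is a minimal sketch, assuming the request is described by the number of nodes, GPUs per node, tasks per node and CPUs per task; the function names (required_cpus, requested_cpus, check_request) are illustrative and not part of Slurm.

```shell
#!/bin/sh
# Sketch: check a resource request against the "40 threads per GPU" rule
# before calling sbatch. All function and variable names are illustrative.

THREADS_PER_GPU=40   # minimum threads per GPU, as stated in this section

# required_cpus NODES GPUS_PER_NODE -> minimum CPUs the job must request
required_cpus() {
    echo $(( $1 * $2 * THREADS_PER_GPU ))
}

# requested_cpus NODES TASKS_PER_NODE CPUS_PER_TASK -> CPUs actually requested
requested_cpus() {
    echo $(( $1 * $2 * $3 ))
}

# check_request NODES GPUS_PER_NODE TASKS_PER_NODE CPUS_PER_TASK
check_request() {
    need=$(required_cpus "$1" "$2")
    have=$(requested_cpus "$1" "$3" "$4")
    if [ "$have" -ge "$need" ]; then
        echo "OK: requested $have CPUs, required $need"
    else
        echo "TOO FEW: requested $have CPUs, required $need"
        return 1
    fi
}

check_request 1 2 20 4          # 20 tasks x 4 CPUs = 80, needs 2 x 40 = 80
check_request 1 1 1 1 || true   # default of 1 CPU per task with 1 GPU
```

The second call fails the check because a single task with the default of one CPU per task requests far fewer than the 40 CPUs needed for one GPU.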

SBATCH commands

These are the basic commands for submitting and managing jobs:

sbatch <job_script>

submits a “job script” to the queue system (see Job directives).

squeue

shows all the submitted jobs.

scancel <job_id>

removes the job from the queue system, canceling the execution of its processes if they were still running.

srun --x11

For an allocating srun command, setting the --x11 flag makes the job be handled as graphical (it sets up X11 forwarding on the allocation), so you will be able to execute a graphical command. As long as you do not close the current terminal, you will get a graphical window.

salloc -J interactive --x11

Also, X11 forwarding can be set through interactive sessions.
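The commands above are usually chained through the job ID. As a rough sketch, assuming sbatch prints its usual confirmation line ("Submitted batch job" followed by the ID), the ID can be extracted and reused. The helper name job_id_from_sbatch and the sample line are hypothetical, and the snippet only prints the follow-up commands instead of running them.

```shell
#!/bin/sh
# job_id_from_sbatch LINE -> the last word of sbatch's confirmation line,
# which normally looks like "Submitted batch job 12345".
job_id_from_sbatch() {
    echo "${1##* }"
}

sample="Submitted batch job 12345"   # hypothetical sbatch output
job_id=$(job_id_from_sbatch "$sample")

# Follow-up commands for that job (printed here, not executed).
echo "squeue -j $job_id"
echo "scancel $job_id"
```

In a real session the sample line would be replaced by the captured output of the sbatch call for your job script.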

Interactive Sessions

Allocation of an interactive session has to be done through SLURM:

Interactive session for 10 minutes, 1 task, 40 CPUs (cores) per task:

salloc -t 00:10:00 -n 1 -c 40 [ -J {job_name} ]

Interactive session for 48 CPUs (cores) + 2 GPUs:

salloc -c 48 --gres=gpu:2

Also, X11 forwarding can be set through interactive sessions.

salloc -J interactive --x11

Job directives

A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script and have to conform to the sbatch syntax.

sbatch syntax is of the form:

#SBATCH --directive=value

Additionally, the job script may contain a set of commands to execute. If not, an external script may be provided with the 'executable' directive. These are the most common directives:

#SBATCH --qos=debug

This QoS is only intended for small tests.

#SBATCH --time=HH:MM:SS

The limit of wall clock time. This is a mandatory field and you must set it to a value greater than real execution time for your application and smaller than the time limits granted to the user. Notice that your job will be killed after the time has passed.
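One way to pick a value greater than the real execution time is to start from a measured run time and add a safety margin. The sketch below assumes the run time is known in seconds; the helper name walltime and the default margin of 20% are arbitrary choices for illustration, and the result must still stay below the time limits granted to you.

```shell
#!/bin/sh
# walltime SECONDS [MARGIN_PERCENT] -> HH:MM:SS including a safety margin
# (the default margin of 20% is an arbitrary, illustrative choice).
walltime() {
    secs=$(( $1 * (100 + ${2:-20}) / 100 ))
    printf '%02d:%02d:%02d\n' $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
}

# A run measured at 100 seconds gets a 120 second limit.
echo "#SBATCH --time=$(walltime 100)"
```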

#SBATCH -D pathname

The working directory of your job (i.e. where the job will run). If not specified, it is the current working directory at the time the job was submitted.

#SBATCH --error=file

The name of the file to collect the standard error output (stderr) of the job.

#SBATCH --output=file

The name of the file to collect the standard output (stdout) of the job.

#SBATCH --ntasks=number

The number of processes to start.

Optionally, you can specify how many threads each process would open with the directive:

#SBATCH --cpus-per-task=number

The number of cpus assigned to the job will be the total_tasks number * cpus_per_task number.
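Inside a running job the same product can be computed from Slurm's environment. This is a minimal sketch, assuming SLURM_NTASKS and SLURM_CPUS_PER_TASK are available (the latter is normally only set when --cpus-per-task is given, so both fall back to 1 here); the function name total_cpus is illustrative.

```shell
#!/bin/sh
# total_cpus: CPUs assigned to the job = number of tasks x CPUs per task.
# Both variables fall back to 1 when they are not set.
total_cpus() {
    echo $(( ${SLURM_NTASKS:-1} * ${SLURM_CPUS_PER_TASK:-1} ))
}

# Let OpenMP applications use as many threads as CPUs were given to each task.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

echo "threads per task: $OMP_NUM_THREADS, total CPUs: $(total_cpus)"
```

With --ntasks=20 and --cpus-per-task=4, for instance, the job is assigned 80 CPUs.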

#SBATCH --ntasks-per-node=number

The number of tasks assigned to a node.

#SBATCH --gres=gpu:number

The number of GPUs requested on each node.

#SBATCH --exclusive

To request an exclusive use of a compute node without sharing the resources with other users.

#SBATCH --reservation=reservation_name

The reservation where your jobs will be allocated (assuming that your account has access to that reservation). On some occasions, node reservations can be granted for executions where only a set of accounts can run jobs. This is useful for courses.

#SBATCH --mail-type=[begin|end|all|none]
#SBATCH --mail-user=<your_email>

#Fictional example (notified at the end of the job execution):
#SBATCH --mail-type=end
#SBATCH --mail-user=dannydevito@bsc.es

These two directives are presented as a set because they need to be used together. They enable e-mail notifications that are triggered when a job starts its execution (begin), ends its execution (end) or both (all). The "none" option doesn't trigger any e-mail; it is the same as omitting the directives. The only requirement is that the e-mail address specified is valid and is the same one that you use for the HPC User Portal (what is the HPC User Portal, you ask? Excellent question, check it out here).

#SBATCH --switches=number@timeout

By default, Slurm schedules a job so that it uses the minimum number of switches. However, users can request a specific network topology for their job. Slurm will try to schedule the job on the requested number of switches for up to timeout minutes. If it is not possible to obtain number switches (from 1 to 14) after timeout minutes, Slurm will schedule the job as it would by default.

Variable	Meaning
SLURM_JOBID	Specifies the job ID of the executing job
SLURM_NPROCS	Specifies the total number of processes in the job
SLURM_NNODES	Is the actual number of nodes assigned to run your job
SLURM_PROCID	Specifies the MPI rank (or relative process ID) for the current process. The range is from 0-(SLURM_NPROCS-1)
SLURM_NODEID	Specifies relative node ID of the current job. The range is from 0-(SLURM_NNODES-1)
SLURM_LOCALID	Specifies the node-local task ID for the process within a job
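The variables in the table can be read from any command started inside a job. The sketch below prints them, falling back to "unset" so that it also runs outside of a job; the function name describe_task is illustrative.

```shell
#!/bin/sh
# describe_task: print the Slurm variables from the table above,
# or "unset" when the script is run outside of a job.
describe_task() {
    echo "job=${SLURM_JOBID:-unset} nodes=${SLURM_NNODES:-unset} procs=${SLURM_NPROCS:-unset}"
    echo "rank=${SLURM_PROCID:-unset} node_id=${SLURM_NODEID:-unset} local_id=${SLURM_LOCALID:-unset}"
}

describe_task
```

Launched through srun inside a job, each task would be expected to report its own rank and node-local ID.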

Examples

sbatch examples

Example for a sequential job:

#!/bin/bash
#SBATCH --job-name="test_serial"
#SBATCH -D .
#SBATCH --output=serial_%j.out
#SBATCH --error=serial_%j.err
#SBATCH --ntasks=1
#SBATCH --time=00:02:00
./serial_binary > serial.out

The job would be submitted using:

sbatch ptest.cmd

Example for a parallel job:

#!/bin/bash
#SBATCH --job-name=test_parallel
#SBATCH -D .
#SBATCH --output=mpi_%j.out
#SBATCH --error=mpi_%j.err
#SBATCH --ntasks=20
#SBATCH --cpus-per-task=4
#SBATCH --time=00:02:00
#SBATCH --gres=gpu:2
mpirun ./parallel_binary > parallel.output

Interpreting job status and reason codes

When using squeue, Slurm will report back the status of your launched jobs. If they are still waiting to enter execution, they will be followed by the reason. Slurm uses codes to display this information, so in this section we will be covering the meaning of the most relevant ones.

Job state codes

This list contains the usual state codes for jobs that have been submitted:

COMPLETED (CD): The job has completed the execution.
COMPLETING (CG): The job is finishing, but some processes are still active.
FAILED (F): The job terminated with a non-zero exit code.
PENDING (PD): The job is waiting for resource allocation. This is the most common state right after running "sbatch"; the job will run eventually.
PREEMPTED (PR): The job was terminated because of preemption by another job.
RUNNING (R): The job is allocated and running.
SUSPENDED (S): A running job has been stopped with its cores released to other jobs.
STOPPED (ST): A running job has been stopped with its cores retained.
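When scripting around squeue, the compact codes can be mapped back to the names in this list. The following is a small sketch; the function name state_name and the sample lines are hypothetical, and it assumes input with one job per line in the form "job ID, state code, reason".

```shell
#!/bin/sh
# state_name CODE -> long name of a compact state code from the list above
state_name() {
    case "$1" in
        CD) echo "COMPLETED" ;;
        CG) echo "COMPLETING" ;;
        F)  echo "FAILED" ;;
        PD) echo "PENDING" ;;
        PR) echo "PREEMPTED" ;;
        R)  echo "RUNNING" ;;
        S)  echo "SUSPENDED" ;;
        ST) echo "STOPPED" ;;
        *)  echo "UNKNOWN($1)" ;;
    esac
}

# Hypothetical sample input: job ID, state code, reason.
while read -r id st reason; do
    echo "job $id is $(state_name "$st") (reason: $reason)"
done <<'EOF'
1001 R None
1002 PD Priority
EOF
```

On the cluster, lines in this shape could be produced with something like squeue -h -o "%i %t %r" and piped into the same loop instead of the sample data.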

Job reason codes

This list contains the most common reason codes of the jobs that have been submitted and are still not in the running state:

Priority: One or more higher priority jobs are in the queue ahead of yours. Your job will eventually run.
Dependency: This job is waiting for a dependent job to complete and will run afterwards.
Resources: The job is waiting for resources to become available and will eventually run.
InvalidAccount: The job’s account is invalid. Cancel the job and resubmit it with the correct account.
InvalidQOS: The job’s QoS is invalid. Cancel the job and resubmit it with the correct QoS.
QOSGrpCpuLimit: All CPUs assigned to your job’s specified QoS are in use; the job will run eventually.
QOSGrpMaxJobsLimit: The maximum number of jobs for your job’s QoS has been reached; the job will run eventually.
QOSGrpNodeLimit: All nodes assigned to your job’s specified QoS are in use; the job will run eventually.
PartitionCpuLimit: All CPUs assigned to your job’s specified partition are in use; the job will run eventually.
PartitionMaxJobsLimit: The maximum number of jobs for your job’s partition has been reached; the job will run eventually.
PartitionNodeLimit: All nodes assigned to your job’s specified partition are in use; the job will run eventually.
AssociationCpuLimit: All CPUs assigned to your job’s specified association are in use; the job will run eventually.
AssociationMaxJobsLimit: The maximum number of jobs for your job’s association has been reached; the job will run eventually.
AssociationNodeLimit: All nodes assigned to your job’s specified association are in use; the job will run eventually.
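The reason codes above fall into two groups: those where the job only has to wait, and those where it has to be cancelled and resubmitted. As a sketch of that distinction, assuming the reason strings are spelled as Slurm prints them (for example InvalidQOS), the hypothetical helper reason_action could classify them:

```shell
#!/bin/sh
# reason_action REASON -> "wait" if the job will run eventually,
# "resubmit" if it must be cancelled and resubmitted,
# "check" for any reason not covered by the list above.
reason_action() {
    case "$1" in
        InvalidAccount|InvalidQOS) echo "resubmit" ;;
        Priority|Dependency|Resources) echo "wait" ;;
        QOSGrp*Limit|Partition*Limit|Association*Limit) echo "wait" ;;
        *) echo "check" ;;
    esac
}

for r in Priority InvalidAccount QOSGrpCpuLimit SomethingElse; do
    echo "$r -> $(reason_action "$r")"
done
```

Any reason that is not in the list is reported as "check", since this section does not say how it should be handled.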
