MareNostrum 5

Running jobs

Slurm is the utility used for batch processing support, so all jobs must be run through it. This section provides information for getting started with job execution at the cluster.

Submitting jobs

The method for submitting jobs is to use the Slurm batch directives.

To access more detailed information:

man sbatch
man srun
man salloc

SBATCH commands

The basic Slurm commands for submitting and overseeing jobs include the following (refer to Job directives for additional options):

  • Submit a job script to the queue system:

    sbatch -A, --account={account} -q, --qos={qos} {job_script} [args]
    Example
    sbatch -A bsc21 -q acc_bsccase myjob.sh
    WARNING

    The system will generate specific error messages if an attempt is made to submit a job without specifying a Slurm account and/or queue (QoS):

    salloc: error: No account specified, please specify an account
    salloc: error: Job submit/allocate failed: Unspecified error
    salloc: error: No QoS specified, please specify a QoS
    salloc: error: Job submit/allocate failed: Unspecified error
  • Display all submitted jobs (from all your current accounts/projects):

    squeue
  • Display all submitted jobs from a specific account or project:

    squeue -A, --account={account}
  • Remove a job from the queue system, canceling the execution of the processes (if they were still running):

    scancel {jobid}
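
    For example, to cancel the job with ID 123456 (a hypothetical job ID taken from the squeue output), or to cancel all of your own queued and running jobs:

    scancel 123456
    scancel -u {username}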

Queues (QoS)

Several queues are available on the machine, and users may have access to different ones. Each queue has its own limits on the number of cores and the duration of jobs.

You can check all the queues you have access to, along with their limits, at any time by running:

bsc_queues
Standard queues

Standard queues (QoS) limits are as follows:

  • GPP:

    Queue       | Max. number of nodes (cores) | Wallclock | Slurm QoS name         | Slurm partition name
    ------------|------------------------------|-----------|------------------------|---------------------
    BSC         | 125 (14,000)                 | 48h       | gp_bsc{case,cs,es,ls}  | gpp
    Debug       | 32 (3,584)                   | 2h        | gp_debug               | gpp
    EuroHPC     | 800 (89,600)                 | 72h       | gp_ehpc                | gpp
    HBM         | 50 (5,600)                   | 72h       | gp_hbm                 | hbm
    Interactive | 1 (32)                       | 2h        | gp_interactive         | gpinteractive
    RES Class A | 200 (22,400)                 | 72h       | gp_resa                | gpp
    RES Class B | 200 (22,400)                 | 48h       | gp_resb                | gpp
    RES Class C | 50 (5,600)                   | 24h       | gp_resc                | gpp
    Training    | 32 (3,584)                   | 24h       | gp_training            | gpp
  • ACC:

    Queue       | Max. number of nodes (cores) | Wallclock | Slurm QoS name          | Slurm partition name
    ------------|------------------------------|-----------|-------------------------|---------------------
    BSC         | 25 (2,000)                   | 48h       | acc_bsc{case,cs,es,ls}  | acc
    Debug       | 8 (640)                      | 2h        | acc_debug               | acc
    EuroHPC     | 100 (8,000)                  | 72h       | acc_ehpc                | acc
    Interactive | 1 (40)                       | 2h        | acc_interactive         | accinteractive
    RES Class A | 50 (4,000)                   | 72h       | acc_resa                | acc
    RES Class B | 50 (4,000)                   | 48h       | acc_resb                | acc
    RES Class C | 10 (800)                     | 24h       | acc_resc                | acc
    Training    | 4 (320)                      | 24h       | acc_training            | acc

Special queues

Additionally, special queues can be provided for longer or larger executions. Access is subject to demonstrating the scalability and performance of the application, and depends on factors such as demand and the current workload of the machine, among other conditions.

To request access to these specialized queues, please get in touch with us.

Interactive jobs

Allocation of an interactive session has to be done through Slurm:

salloc -A, --account={account} -q, --qos={qos} [ OPTIONS ]

Here are some of the parameters you can use with salloc (see also Job directives):

-J, --job-name={name}
-q, --qos={name}
-p, --partition={name}

-t, --time={time}
-n, --ntasks={number}
-c, --cpus-per-task={number}
-N, --nodes={number}

--exclusive
--x11
Examples
  • Request an interactive session for 10 minutes, 1 task, 4 CPUs (cores) per task, in the "GPP Interactive" partition (gpinteractive):

    salloc -A bsc21 -t 00:10:00 -n 1 -c 4 -q gp_interactive -J myjob
  • Request an interactive session on a GPP compute node in exclusive mode (without sharing resources):

    salloc -A bsc21 -q gp_debug --exclusive
    salloc -A bsc21 -q gp_bsccase --exclusive
  • Request an interactive session for 40 CPUs (cores) + 2 GPUs:

    salloc -A bsc21 -q acc_bsccase -n 2 -c 20 --gres=gpu:2
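
    As an additional illustration (not part of the examples above), an interactive session with X11 forwarding could be requested by adding the --x11 flag listed among the salloc options, assuming X11 forwarding is also enabled in your SSH connection to the login node:

    salloc -A bsc21 -q gp_interactive -t 01:00:00 -n 1 -c 4 --x11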

Job directives

A job script must include a set of directives to convey the job's characteristics to the batch system. These directives appear as comments within the job script and must adhere to the syntax specified for sbatch. Additionally, the job script may contain a set of commands to execute.

sbatch syntax is of the form:

#SBATCH --directive=value

Below are some of the most frequently used directives.

Resource reservation

  • Request the queue for the job:

    #SBATCH --qos={qos}
    REMARK
    • Remember that this parameter is mandatory when submitting jobs in the current MareNostrum 5 configuration.
  • Set a time limit for the total runtime of the job:

    #SBATCH --time={time} #DD-HH:MM:SS
    caution
    • This field is mandatory, and you should set it to a value greater than the execution time needed for your application while staying within the time limit of the chosen queue.
    • Please be aware that your job will be terminated once the specified time has elapsed.
  • Set the number of processes to start:

    #SBATCH --ntasks={number}
  • Optionally, you can specify the number of threads each process will launch with the following directive:

    #SBATCH --cpus-per-task={number}
    REMARK
    • The total number of cores allocated to the job will be calculated as the product of ntasks and cpus-per-task.
  • Set the number of tasks assigned to a node:

    #SBATCH --ntasks-per-node={number}
  • Set the number of tasks assigned to a socket:

    #SBATCH --ntasks-per-socket={number}
  • Request exclusive use of assigned nodes without sharing resources with other users:

    #SBATCH --exclusive
    REMARKS
    • This condition applies only to jobs that request a single node.
    • However, multi-node jobs will automatically utilize all requested nodes in exclusive mode.
  • Specify the number of nodes requested:

    #SBATCH --nodes={number}
    REMARK
    • Remember that if you request more than one node, this parameter will apply exclusivity to all nodes.
    Example 1

    For instance, suppose we request a single node with this parameter, but we only want to run two tasks that use one core each:

    #SBATCH -N 1
    #SBATCH -n 2
    #SBATCH -c 1

    This will request just two cores in total. The remaining resources of the node will be left available for other users.

    Example 2

    If we request more than one node (keeping the other parameters unchanged), as follows:

    #SBATCH -N 2
    #SBATCH -n 2
    #SBATCH -c 1

    It will demand exclusivity for both nodes, implying a request for the total resources of both nodes without sharing them with other users. In MareNostrum 5, each GPP node has 112 cores, so in this example, it will request 224 cores.

    IMPORTANT

    It's essential to be aware of this behaviour because it will allocate the CPU time of all 224 cores to your computation time budget (if applicable), even if you specify the usage of only two cores.

  • Sometimes, node reservations may be approved for executions exclusive to specific accounts, which is particularly beneficial for educational courses.

    To specify the reservation name for job allocation, assuming your account has access to that reservation:

    #SBATCH --reservation={name}
  • Choose a specific configuration, such as running the job on a 'HighMem' node:

    #SBATCH --constraint=highmem
    REMARKS
    • The accounting for one core hour on both standard and highmem nodes is identical, equating to 1 core-hour per core per hour allocated to your budget.
    • Without this directive, jobs will be assigned to standard nodes, which provide about 2.3 GB of RAM per core. The availability of high-memory nodes is limited, with only 216 out of 6,408 nodes. Consequently, when requesting these nodes, expect significantly longer queueing times before the resource request can be fulfilled and your job can run.
    • To expedite queue turnaround times, you can choose standard nodes and reduce the number of processes per node. To achieve this, specify the flag #SBATCH --cpus-per-task={number}; note that your budget will be charged for all requested cores.
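
    As an illustrative sketch (the account, job name, resource figures and binary are placeholders), a job script header requesting a HighMem node could look as follows:

    #!/bin/bash
    #SBATCH --job-name=highmem_job
    #SBATCH --account=bsc21
    #SBATCH --qos=gp_bsccase
    #SBATCH --constraint=highmem
    #SBATCH --ntasks=112
    #SBATCH --time=01:00:00

    srun ./memory_intensive_binary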

Job arrays

  • Submit a job array for the execution of multiple jobs with identical parameters:

    #SBATCH --array={indexes}
    REMARKS
    • The index specification determines which array index values to utilize.
    • Multiple values can be indicated using a comma-separated list and/or a range of values with a "-" separator.

    Job arrays will include the configuration of two additional environment variables:

    1. SLURM_ARRAY_JOB_ID: will be assigned the initial job ID of the array.
    2. SLURM_ARRAY_TASK_ID: will be assigned the job array index value.
    Example
    sbatch --array=1-3 job.cmd
    Submitted batch job 36

    Will create a job array consisting of three jobs, and subsequently, the environment variables will be configured as follows:

    # Job 1
    SLURM_JOB_ID=36
    SLURM_ARRAY_JOB_ID=36
    SLURM_ARRAY_TASK_ID=1

    # Job 2
    SLURM_JOB_ID=37
    SLURM_ARRAY_JOB_ID=36
    SLURM_ARRAY_TASK_ID=2

    # Job 3
    SLURM_JOB_ID=38
    SLURM_ARRAY_JOB_ID=36
    SLURM_ARRAY_TASK_ID=3
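
    As a minimal sketch of how the array index can be used inside the job script (the input file names and binary are hypothetical), each array task selects its own input based on SLURM_ARRAY_TASK_ID; the %A and %a patterns in the output file names expand to the array job ID and the array task index, respectively:

    #!/bin/bash
    #SBATCH --job-name=array_job
    #SBATCH --output=array_%A_%a.out
    #SBATCH --error=array_%A_%a.err
    #SBATCH --ntasks=1
    #SBATCH --time=00:10:00
    #SBATCH --array=1-3

    # Each array task processes the input file matching its index
    ./my_binary input_${SLURM_ARRAY_TASK_ID}.dat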

Working directory and job output/error files

  • Establish the working directory for your job, indicating the location where the job will be executed:

    #SBATCH --chdir={pathname}
    caution
    • If not explicitly specified, it defaults to the current working directory when the job is submitted.
  • Specify the filenames to which to redirect the job's standard output (stdout) and standard error output (stderr):

    #SBATCH --output={filename}
    #SBATCH --error={filename}
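
    In these filenames, the %j pattern expands to the job ID (as used in the job examples below), so consecutive runs do not overwrite each other's output; the names shown here are only illustrative:

    #SBATCH --output=myjob_%j.out
    #SBATCH --error=myjob_%j.err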

Email notifications

  • These two directives are presented together as they should be utilized simultaneously. They activate email notifications, which are triggered when a job commences its execution (begin), concludes its execution (end), or both (all):

    #SBATCH --mail-type={begin|end|all|none}
    #SBATCH --mail-user={email}
    Example (notified at the end of the job execution)
    #SBATCH --mail-type=end
    #SBATCH --mail-user=brucespringsteen@bsc.es
    REMARKS
    • The none option doesn't trigger any email; it is equivalent to omitting these directives.
    • The only requirement is that the specified email address is valid and matches the one you use for the HPC User Portal (you may wonder, what is the HPC User Portal? An excellent question; you can find more information in the HPC User Portal documentation).

Performance configuration

  • Set the number of threads per core (SMT configuration):

    #SBATCH --threads-per-core={1|2}
    REMARK
    • Achieves the same outcome as using the --hint={nomultithread|multithread} option.
  • Indicate whether to utilize additional threads within in-core multi-threading (SMT configuration):

    #SBATCH --hint={nomultithread|multithread}
    REMARK
    • Produces identical results as employing the --threads-per-core={1|2} option.
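
    As a brief illustration, either of the following lines in a job script restricts the job to a single hardware thread per physical core (only one of the two is needed, since they are equivalent):

    #SBATCH --threads-per-core=1
    #SBATCH --hint=nomultithread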

Some useful Slurm environment variables

Variable       | Meaning
---------------|--------
SLURM_JOBID    | Specifies the job ID of the executing job
SLURM_NPROCS   | Specifies the total number of processes in the job
SLURM_NNODES   | Is the actual number of nodes assigned to run your job
SLURM_PROCID   | Specifies the MPI rank (or relative process ID) for the current process. The range is from 0 to (SLURM_NPROCS-1)
SLURM_NODEID   | Specifies the relative node ID of the current job. The range is from 0 to (SLURM_NNODES-1)
SLURM_LOCALID  | Specifies the node-local task ID for the process within a job
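
A minimal sketch showing how some of these variables could be inspected from within a job script (the job parameters are placeholders):

    #!/bin/bash
    #SBATCH --job-name=env_check
    #SBATCH --ntasks=4
    #SBATCH --time=00:05:00

    # Print the per-process Slurm variables from every task launched by srun
    srun bash -c 'echo "job ${SLURM_JOBID}: rank ${SLURM_PROCID} of ${SLURM_NPROCS} on node ${SLURM_NODEID}"'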

Job examples

Sequential job

  • Executing a sequential job:

    #!/bin/bash
    #SBATCH --job-name=seq_job
    #SBATCH --chdir=.
    #SBATCH --output=serial_%j.out
    #SBATCH --error=serial_%j.err
    #SBATCH --ntasks=1
    #SBATCH --time=00:02:00

    ./serial_binary

Parallel jobs

  • Running an OpenMP job on a single node utilizing 112 cores within the 'gp_debug' queue:

    #!/bin/bash
    #SBATCH --job-name=omp_job
    #SBATCH --chdir=.
    #SBATCH --output=omp_%j.out
    #SBATCH --error=omp_%j.err
    #SBATCH --cpus-per-task=112
    #SBATCH --ntasks=1
    #SBATCH --time=00:10:00
    #SBATCH --qos=gp_debug

    export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}

    ./openmp_binary
  • Running a pure MPI job on two complete nodes:

    #!/bin/bash
    #SBATCH --job-name=mpi_job
    #SBATCH --output=mpi_%j.out
    #SBATCH --error=mpi_%j.err
    #SBATCH --ntasks=224

    srun ./mpi_binary
  • Running a hybrid MPI+OpenMP job on two complete nodes with 56 MPI tasks (28 per node), each using 4 cores via OpenMP:

    #!/bin/bash
    #SBATCH --job-name=hybrid_job
    #SBATCH --chdir=.
    #SBATCH --output=mpi_%j.out
    #SBATCH --error=mpi_%j.err
    #SBATCH --ntasks=56
    #SBATCH --cpus-per-task=4
    #SBATCH --tasks-per-node=28
    #SBATCH --time=00:02:00

    export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}

    srun ./hybrid_binary
  • Running with GPUs on a complete ACC node:

    #!/bin/bash
    #SBATCH --job-name=gpu_job
    #SBATCH -D .
    #SBATCH --output=mpi_%j.out
    #SBATCH --error=mpi_%j.err
    #SBATCH --ntasks=80
    #SBATCH --cpus-per-task=2
    #SBATCH --time=00:02:00
    #SBATCH --gres=gpu:4

    export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}

    srun ./gpu_binary

Understanding job status and reason codes

When using the squeue command, Slurm provides information about the status of your submitted jobs. If a job is still waiting before execution, it will be accompanied by a reason. Slurm employs specific codes to show this information, and the following section explains the significance of the most relevant ones.

Job state codes

Common state codes for submitted jobs include:

  • COMPLETED (CD): The job has completed its execution.
  • COMPLETING (CG): The job is finishing, but some processes are still active.
  • FAILED (F): The job terminated with a non-zero exit code.
  • PENDING (PD): The job is waiting for resource allocation. This is the most common state after running sbatch; the job will run eventually.
  • PREEMPTED (PR): The job was terminated because of preemption by another job.
  • RUNNING (R): The job is allocated and running.
  • SUSPENDED (S): A running job has been stopped with its cores released to other jobs.
  • STOPPED (ST): A running job has been stopped with its cores retained.

Job reason codes

The following list outlines the most frequently encountered reason codes for jobs that have been submitted but have not yet entered the running state:

  • Priority: One or more higher-priority jobs are queued ahead of yours. Your job will eventually run.
  • Dependency: This job is waiting for a dependent job to complete and will run afterwards.
  • Resources: The job is waiting for resources to become available and will eventually run.
  • InvalidAccount: The job's account is invalid. Cancel the job and resubmit with the correct account.
  • InvalidQoS: The job's QoS is invalid. Cancel the job and resubmit with the correct QoS.
  • QOSGrpCpuLimit: All CPUs assigned to your job’s specified QoS are in use; job will run eventually.
  • QOSGrpMaxJobsLimit: Maximum number of jobs for your job’s QoS have been met; job will run eventually.
  • QOSGrpNodeLimit: All nodes assigned to your job’s specified QoS are in use; job will run eventually.
  • PartitionCpuLimit: All CPUs assigned to your job’s specified partition are in use; job will run eventually.
  • PartitionMaxJobsLimit: Maximum number of jobs for your job’s partition have been met; job will run eventually.
  • PartitionNodeLimit: All nodes assigned to your job’s specified partition are in use; job will run eventually.
  • AssociationCpuLimit: All CPUs assigned to your job’s specified association are in use; job will run eventually.
  • AssociationMaxJobsLimit: Maximum number of jobs for your job’s association have been met; job will run eventually.
  • AssociationNodeLimit: All nodes assigned to your job’s specified association are in use; job will run eventually.

Resource usage and job priorities

Projects will be allocated a specific amount of compute hours or core hours for utilization. A single core hour represents the computational time of one core over one hour. For instance, if a full GPP node with 112 cores runs a job for one hour, it will consume 112 core hours from the allocated budget.

The accounting relies solely on the utilization of compute hours.

Various factors determine a job's priority and subsequent scheduling, the most significant being job size, queue waiting time, and fair share among groups:

  • MareNostrum 5 is designed to prioritize and favour larger executions, giving higher priority to jobs utilizing more cores.
  • The waiting time in queues is also considered, and jobs progressively gain more priority the longer they wait.
  • Additionally, our queue system incorporates a fair-share policy among groups. Users with fewer executed jobs and consumed compute hours receive higher priority for their jobs compared to groups with increased usage. This ensures a fair distribution of computing time, allowing users to run jobs without favouring one group over another.

You can review your current fair-share score using the command:

sshare -la

Notifications

Currently, receiving email notifications about job status is not supported. To monitor the execution or completion of your jobs, you must connect to the system and manually check their status. Automatic notifications are planned to be enabled in the future.