MareNostrum 5
Running jobs
Slurm is the utility used for batch processing support, so all jobs must be run through it. This section provides information for getting started with job execution at the cluster.
Submitting jobs
The method for submitting jobs is to write a job script containing Slurm batch directives and submit it with sbatch.
To access more detailed information:
man sbatch
man srun
man salloc
SBATCH commands
The basic Slurm directives for submitting and overseeing jobs using sbatch include the following (refer to Job directives for additional options):
Submit a job script to the queue system:
sbatch -A, --account={account} -q, --qos={qos} {job_script} [args]
Example
sbatch -A bsc21 -q acc_bsccase myjob.sh
WARNING
The system will generate specific error messages if an attempt is made to submit a job without specifying a Slurm account and/or queue (QoS):
salloc: error: No account specified, please specify an account
salloc: error: Job submit/allocate failed: Unspecified error
salloc: error: No QoS specified, please specify a QoS
salloc: error: Job submit/allocate failed: Unspecified error
INFO
You can get your available account/s (unix group) by running:
bsc_project list
And your QoS with:
bsc_queues
Display all submitted jobs (from all your current accounts/projects):
squeue
Display all submitted jobs from a specific account or project:
squeue -A, --account={account}
Remove a job from the queue system, canceling the execution of the processes (if they were still running):
scancel {jobid}
Queues (QoS)
Several queues are present in the machines, and users may have access to different queues. Queues have different limits regarding the number of cores and the duration of the jobs.
You can check at any time all the queues you have access to and their limits by using:
bsc_queues
Standard queues
Standard queues (QoS) limits are as follows:
GPP:
Queue | Max. number of nodes (cores) | Wallclock | Slurm QoS name | Slurm partition name |
---|---|---|---|---|
BSC | 125 (14,000) | 48h | gp_bsc{case,cs,es,ls} | gpp |
Debug | 32 (3,584) | 2h | gp_debug | gpp |
EuroHPC | 800 (89,600) | 72h | gp_ehpc | gpp |
HBM | 50 (5,600) | 72h | gp_hbm | hbm |
Interactive | 1 (32) | 2h | gp_interactive | gpinteractive |
RES Class A | 200 (22,400) | 72h | gp_resa | gpp |
RES Class B | 200 (22,400) | 48h | gp_resb | gpp |
RES Class C | 50 (5,600) | 24h | gp_resc | gpp |
Training | 32 (3,584) | 24h | gp_training | gpp |
ACC:
Queue | Max. number of nodes (cores) | Wallclock | Slurm QoS name | Slurm partition name |
---|---|---|---|---|
BSC | 25 (2,000) | 48h | acc_bsc{case,cs,es,ls} | acc |
Debug | 8 (640) | 2h | acc_debug | acc |
EuroHPC | 100 (8,000) | 72h | acc_ehpc | acc |
Interactive | 1 (40) | 2h | acc_interactive | accinteractive |
RES Class A | 50 (4,000) | 72h | acc_resa | acc |
RES Class B | 50 (4,000) | 48h | acc_resb | acc |
RES Class C | 10 (800) | 24h | acc_resc | acc |
Training | 4 (320) | 24h | acc_training | acc |
Special queues
Additionally, special/extra queues can be provided for extended or larger executions, subject to demonstrating the scalability and performance of the application, taking into account factors such as demand and the current workload of the machine, among other conditions.
To request access to these specialized queues, please get in touch with us.
Interactive jobs
Allocation of an interactive session has to be done through Slurm:
salloc -A, --account={account} -q, --qos={qos} [ OPTIONS ]
Here are some of the parameters you can use with salloc (see also Job directives):
-J, --job-name={name}
-q, --qos={name}
-p, --partition={name}
-t, --time={time}
-n, --ntasks={number}
-c, --cpus-per-task={number}
-N, --nodes={number}
--exclusive
--x11
Examples
Request an interactive session for 10 minutes, 1 task, 4 CPUs (cores) per task, in the "GPP Interactive" partition (gpinteractive):
salloc -A bsc21 -t 00:10:00 -n 1 -c 4 -q gp_interactive -J myjob
Request an interactive session on a GPP compute node in exclusive mode (without sharing resources):
salloc -A bsc21 -q gp_debug --exclusive
salloc -A bsc21 -q gp_bsccase --exclusive
Request an interactive session for 40 CPUs (cores) + 2 GPUs:
salloc -A bsc21 -q acc_bsccase -n 2 -c 20 --gres=gpu:2
Job directives
A job script must include a set of directives to convey the job's characteristics to the batch system. These directives appear as comments within the job script and must adhere to the syntax specified for sbatch. Additionally, the job script may contain a set of commands to execute.
sbatch syntax is of the form:
#SBATCH --directive=value
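As an illustration, a minimal job script combining directives and commands could look like the sketch below (the account, script and binary names are placeholders; use the account and QoS reported by bsc_project list and bsc_queues):
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account={account}
#SBATCH --qos=gp_debug
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --output=example_%j.out
#SBATCH --error=example_%j.err
./my_binary
It would then be submitted with sbatch {job_script}, with the directives read directly from the script.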
Below are some of the most frequently used directives.
Resource reservation
Request the queue for the job:
#SBATCH --qos={qos}
REMARK- Remember that this parameter is mandatory when submitting jobs in the current MareNostrum 5 configuration.
Set a time limit for the total runtime of the job:
#SBATCH --time={time} #DD-HH:MM:SS
caution- This field is mandatory, and you should set it to a value greater than the execution time needed for your application while staying within the time limit of the chosen queue.
- Please be aware that your job will be terminated once the specified time has elapsed.
Set the number of processes to start:
#SBATCH --ntasks={number}
Optionally, you can specify the number of threads each process will launch with the following directive:
#SBATCH --cpus-per-task={number}
REMARK- The total number of cores allocated to the job will be calculated as the product of ntasks and cpus-per-task.
Set the number of tasks assigned to a node:
#SBATCH --ntasks-per-node={number}
Set the number of tasks assigned to a socket:
#SBATCH --ntasks-per-socket={number}
Request exclusive use of assigned nodes without sharing resources with other users:
#SBATCH --exclusive
REMARKS- This condition applies only to jobs that request a single node.
- However, multi-node jobs will automatically utilize all requested nodes in exclusive mode.
Specify the number of nodes requested:
#SBATCH --nodes={number}
REMARK- Remember that if you request more than one node, this parameter will apply exclusivity to all nodes.
Example 1
For instance, let's say we specify only one node with this parameter, but we only want to use two tasks that use only one core each:
#SBATCH -N 1
#SBATCH -n 2
#SBATCH -c 1
This will request only two cores in total. The remaining resources of the used node will be left available for other users.
Example 2
If we request more than one node (keeping the other parameters unchanged), as follows:
#SBATCH -N 2
#SBATCH -n 2
#SBATCH -c 1
It will demand exclusivity for both nodes, implying a request for the total resources of both nodes without sharing them with other users. In MareNostrum 5, each GPP node has 112 cores, so in this example, it will request 224 cores.
IMPORTANT
It's essential to be aware of this behaviour because it will allocate the CPU time of all 224 cores to your computation time budget (if applicable), even if you specify the usage of only two cores.
Sometimes, node reservations may be approved for executions exclusive to specific accounts, which is particularly beneficial for educational courses.
To specify the reservation name for job allocation, assuming your account has access to that reservation:
#SBATCH --reservation={name}
Choose a specific configuration, such as running the job on a 'HighMem' node:
#SBATCH --constraint=highmem
REMARKS- The accounting for one core hour on both standard and highmem nodes is identical, equating to 1 core-hour per core per hour allocated to your budget.
- Lacking this directive, jobs will be assigned to standard nodes with 2.3 GB of RAM per core. The availability of high-memory nodes is limited, with only 216 out of 6,408 nodes. Consequently, when requesting these nodes, expect significantly longer queueing times to fulfil the resource request before your job can run.
- To expedite queue turnaround times, you can choose standard nodes and reduce the number of processes per node. To achieve this, specify the flag #SBATCH --cpus-per-task={number}, and your budget will be charged for all requested cores (see the sketch below).
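For illustration, a possible way to give each process more memory on the standard nodes instead of waiting for a highmem node is to under-populate them; the figures below are only a sketch based on the 2.3 GB of RAM per core mentioned above:
#SBATCH --ntasks-per-node=56
#SBATCH --cpus-per-task=2
With 56 tasks per node and 2 cores reserved per task, each task can use roughly 2 × 2.3 GB ≈ 4.6 GB of memory, and the budget is charged for all 112 cores of each requested node.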
Job arrays
Submit a job array for the execution of multiple jobs with identical parameters:
#SBATCH --array={indexes}
REMARKS- The index specification determines which array index values to utilize.
- Multiple values can be indicated using a comma-separated list and/or a range of values with a "-" separator.
Job arrays will include the configuration of two additional environment variables:
- SLURM_ARRAY_JOB_ID: will be assigned the initial job ID of the array.
- SLURM_ARRAY_TASK_ID: will be assigned the job array index value.
Example
sbatch --array=1-3 job.cmd
Submitted batch job 36
Will create a job array consisting of three jobs, and subsequently, the environment variables will be configured as follows:
# Job 1
SLURM_JOB_ID=36
SLURM_ARRAY_JOB_ID=36
SLURM_ARRAY_TASK_ID=1
# Job 2
SLURM_JOB_ID=37
SLURM_ARRAY_JOB_ID=36
SLURM_ARRAY_TASK_ID=2
# Job 3
SLURM_JOB_ID=38
SLURM_ARRAY_JOB_ID=36
SLURM_ARRAY_TASK_ID=3
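As a sketch of what the submitted script (job.cmd in the example above) could contain, each array task can use SLURM_ARRAY_TASK_ID to select its own input; the input file naming is hypothetical. In the output/error filenames, %A expands to the array's master job ID and %a to the array index:
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --output=array_%A_%a.out
#SBATCH --error=array_%A_%a.err
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --array=1-3
# Each array task processes a different (hypothetical) input file
./my_binary input_${SLURM_ARRAY_TASK_ID}.dat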
Working directory and job output/error files
Establish the working directory for your job, indicating the location where the job will be executed:
#SBATCH --chdir={pathname}
caution- If not explicitly specified, it defaults to the current working directory when the job is submitted.
Specify the filenames to which to redirect the job's standard output (stdout) and standard error output (stderr):
#SBATCH --output={filename}
#SBATCH --error={filename}
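Slurm replaces certain patterns in these filenames; for example, %j expands to the job ID (as used in the job examples at the end of this section), which prevents consecutive runs from overwriting each other's output:
#SBATCH --output=myjob_%j.out
#SBATCH --error=myjob_%j.err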
Email notifications
These two directives are presented together as they should be utilized simultaneously. They activate email notifications, which are triggered when a job commences its execution (begin), concludes its execution (end), or both (all):
#SBATCH --mail-type={begin|end|all|none}
#SBATCH --mail-user={email}
Example (notified at the end of the job execution)
#SBATCH --mail-type=end
#SBATCH --mail-user=brucespringsteen@bsc.es
REMARKS- The none option doesn't trigger any e-mail; it is the same as not using these directives.
- The only requirement is that the specified email is valid and matches the one you use for the HPC User Portal (you may wonder, what is the HPC User Portal? An excellent question; you can find more information here).
Performance configuration
Set the number of threads per core (SMT configuration):
#SBATCH --threads-per-core={1|2}
REMARK- Achieves the same outcome as using the --hint={nomultithread|multithread} option.
Indicate whether to utilize additional threads within in-core multi-threading (SMT configuration):
#SBATCH --hint={nomultithread|multithread}
REMARK- Produces identical results as employing the --threads-per-core={1|2} option.
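For example, to ensure that tasks are placed on physical cores only (no SMT), add one of these two equivalent directives to the job script:
#SBATCH --threads-per-core=1
#SBATCH --hint=nomultithread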
Heterogeneous jobs
It's possible to run jobs with multiple components, which support independent partitions, accounts and queues. This may be useful for jobs containing diverse tasks that perform better with different numbers of cores, or jobs which contain both GPU-accelerated and non-accelerated tasks.
Heterogeneous jobs may be requested with both salloc and sbatch commands, following a slightly different structure:
Single-line jobs must have the resources of each component separated by a colon:
salloc -A bscXX --qos=gp_bench -n 112 --partition=gpp : --qos=acc_bench -n 80 --partition=acc --gres=gpu:4 [...]
Jobscripts also separate both parts, but now using a line that contains '#SBATCH hetjob':
...
#SBATCH --qos=gp_bsccs --cpus-per-task=112 --partition=gpp
#SBATCH hetjob
#SBATCH --qos=acc_bsccs --cpus-per-task=80 --partition=acc --gres=gpu:4
...
To use modules from a different partition you may modify the MODULEPATH environment variable, which contains the different paths where modulefiles are stored.
- Be careful not to overwrite your MODULEPATH variable, otherwise you won't be able to see or load the available modules.
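A minimal sketch of extending MODULEPATH without overwriting it; the path of the other partition's modulefiles is a placeholder, so check the value of MODULEPATH on a node of that partition to obtain the real one:
export MODULEPATH=${MODULEPATH}:/path/to/other/partition/modulefiles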
For more detailed information visit the Slurm documentation.
Special considerations
Not following these directives will most likely result in your job failing.
- When using srun, the SLURM_CPUS_PER_TASK variable is no longer taken into consideration (and neither is the --cpus-per-task value inherited from the job allocation), so you have to either explicitly specify --cpus-per-task in your srun command, or set the following environment variable in your job (this only applies to srun, not mpirun):
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
- When using mpirun instead of srun, the SLURM_CPU_BIND variable must be set to "none":
export SLURM_CPU_BIND=none
- When using the Nvidia HPC SDK in the accelerated partition, MPI binaries must be run using mpirun, rather than srun. This is due to the Slurm support bundled with the Nvidia SDK not being entirely compatible with MareNostrum 5's Slurm configuration, causing it to fail.
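Putting the previous points together, a job in the accelerated partition launched with mpirun (for example when using the Nvidia HPC SDK) could look like the following sketch; the module and binary names are placeholders:
#!/bin/bash
#SBATCH --job-name=nvhpc_job
#SBATCH --qos=acc_debug
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=20
#SBATCH --gres=gpu:4
#SBATCH --time=00:10:00
# Required when launching with mpirun instead of srun
export SLURM_CPU_BIND=none
module load {nvhpc_module}   # placeholder; load the actual Nvidia HPC SDK module
mpirun ./mpi_gpu_binary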
Some useful Slurm environment variables
Variable | Meaning |
---|---|
SLURM_JOBID | Specifies the job ID of the executing job |
SLURM_NPROCS | Specifies the total number of processes in the job |
SLURM_NNODES | Specifies the actual number of nodes assigned to run your job |
SLURM_PROCID | Specifies the MPI rank (or relative process ID) for the current process. The range is from 0-(SLURM_NPROCS-1) |
SLURM_NODEID | Specifies relative node ID of the current job. The range is from 0-(SLURM_NNODES-1) |
SLURM_LOCALID | Specifies the node-local task ID for the process within a job |
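As a quick illustration of these variables, the following sketch launches a few tasks that print their rank and placement (one output line per task):
#!/bin/bash
#SBATCH --job-name=env_demo
#SBATCH --qos=gp_debug
#SBATCH --ntasks=4
#SBATCH --time=00:02:00
# Each task reports its MPI rank, node-local ID and node ID
srun bash -c 'echo "rank=$SLURM_PROCID local=$SLURM_LOCALID node=$SLURM_NODEID of $SLURM_NNODES nodes (job $SLURM_JOBID)"'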
Job examples
Sequential job
Executing a sequential job:
#!/bin/bash
#SBATCH --job-name=seq_job
#SBATCH --chdir=.
#SBATCH --output=serial_%j.out
#SBATCH --error=serial_%j.err
#SBATCH --ntasks=1
#SBATCH --time=00:02:00
./serial_binary
Parallel jobs
Running an OpenMP job on a single node utilizing 112 cores within the 'gp_debug' queue:
#!/bin/bash
#SBATCH --job-name=omp_job
#SBATCH --chdir=.
#SBATCH --output=omp_%j.out
#SBATCH --error=omp_%j.err
#SBATCH --cpus-per-task=112
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --qos=gp_debug
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
./openmp_binary
Running a pure MPI job on two complete nodes:
#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --output=mpi_%j.out
#SBATCH --error=mpi_%j.err
#SBATCH --ntasks=224
srun ./mpi_binary
Running a hybrid MPI+OpenMP job on two complete nodes with 56 MPI tasks (28 per node), each using 4 cores via OpenMP:
#!/bin/bash
#SBATCH --job-name=hybrid_job
#SBATCH --chdir=.
#SBATCH --output=mpi_%j.out
#SBATCH --error=mpi_%j.err
#SBATCH --ntasks=56
#SBATCH --cpus-per-task=4
#SBATCH --tasks-per-node=28
#SBATCH --time=00:02:00
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
srun ./hybrid_binary
Running with GPUs on a complete ACC node:
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH -D .
#SBATCH --output=mpi_%j.out
#SBATCH --error=mpi_%j.err
#SBATCH --ntasks=80
#SBATCH --cpus-per-task=1
#SBATCH --time=00:02:00
#SBATCH --gres=gpu:4
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
srun ./gpu_binary
Understanding job status and reason codes
When using the squeue command, Slurm provides information about the status of your submitted jobs. If a job is still waiting before execution, it will be accompanied by a reason. Slurm employs specific codes to show this information, and the following section explains the significance of the most relevant ones.
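For reference, squeue output typically looks like the following (values are illustrative); the ST column shows the state code, and the last column shows the allocated node list or, for pending jobs, the reason code in parentheses:
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 1234       gpp my_job_1   {user}  R      10:32      2 {nodelist}
 1235       gpp my_job_2   {user} PD       0:00      4 (Priority)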
Job state codes
Common state codes for submitted jobs include:
- COMPLETED (CD): The job has completed the execution.
- COMPLETING (CG): The job is finishing, but some processes are still active.
- FAILED (F): The job terminated with a non-zero exit code.
- PENDING (PD): The job is waiting for resource allocation. It is the most common state after running sbatch; the job will run eventually.
- PREEMPTED (PR): The job was terminated because of preemption by another job.
- RUNNING (R): The job is allocated and running.
- SUSPENDED (S): A running job has been stopped with its cores released to other jobs.
- STOPPED (ST): A running job has been stopped with its cores retained.
Job reason codes
The following list outlines the most frequently encountered reason codes for jobs that have been submitted but have not yet entered the running state:
- Priority: One or more higher-priority jobs are queued ahead of yours. Your job will eventually run.
- Dependency: This job is waiting for a dependent job to complete and will run afterwards.
- Resources: The job is waiting for resources to become available and will eventually run.
- InvalidAccount: The job’s account is invalid. Cancel the job and resubmit with the correct account.
- InvalidQoS: The job’s QoS is invalid. Cancel the job and resubmit with the correct QoS.
- QOSGrpCpuLimit: All CPUs assigned to your job’s specified QoS are in use; job will run eventually.
- QOSGrpMaxJobsLimit: Maximum number of jobs for your job’s QoS have been met; job will run eventually.
- QOSGrpNodeLimit: All nodes assigned to your job’s specified QoS are in use; job will run eventually.
- PartitionCpuLimit: All CPUs assigned to your job’s specified partition are in use; job will run eventually.
- PartitionMaxJobsLimit: Maximum number of jobs for your job’s partition have been met; job will run eventually.
- PartitionNodeLimit: All nodes assigned to your job’s specified partition are in use; job will run eventually.
- AssociationCpuLimit: All CPUs assigned to your job’s specified association are in use; job will run eventually.
- AssociationMaxJobsLimit: Maximum number of jobs for your job’s association have been met; job will run eventually.
- AssociationNodeLimit: All nodes assigned to your job’s specified association are in use; job will run eventually.
Resource usage and job priorities
Projects will be allocated a specific amount of compute hours or core hours for utilization. A single core hour represents the computational time of one core over one hour. For instance, if a full GPP node with 112 cores runs a job for one hour, it will consume 112 core hours from the allocated budget.
The accounting relies solely on the utilization of compute hours.
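To relate this to the exclusivity behaviour described earlier: the two-node request from 'Example 2' above allocates 224 cores in exclusive mode, so running it for 10 hours would charge 224 × 10 = 2,240 core hours to the budget, even though only two tasks actually compute.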
Various factors determine a job's priority and subsequent scheduling, the most significant being job size, queue waiting time, and fair share among groups:
- MareNostrum 5 is designed to prioritize and favour larger executions, giving higher priority to jobs utilizing more cores.
- The waiting time in queues is also considered, and jobs progressively gain more priority the longer they wait.
- Additionally, our queue system incorporates a fair-share policy among groups. Users with fewer executed jobs and consumed compute hours receive higher priority for their jobs compared to groups with increased usage. This ensures a fair distribution of computing time, allowing users to run jobs without favouring one group over another.
You can review your current fair-share score using the command:
sshare -la
Notifications
Currently, receiving email notifications about job status is not supported. To monitor the execution or completion of your jobs, you must connect to the system and manually check their status. Automatic notifications are planned to be enabled in the future.