MareNostrum 5

Running jobs

Slurm is the utility used for batch processing support, so all jobs must be run through it. This section provides information for getting started with job execution at the cluster.

Submitting jobs

The method for submitting jobs is to use the Slurm batch directives.

To access more detailed information:

man sbatch
man srun
man salloc

SBATCH commands

The basic Slurm commands for submitting and overseeing jobs include the following (refer to Job directives for additional options):

  • Submit a job script to the queue system:

    sbatch -A, --account={account} -q, --qos={qos} {job_script} [args]
    Example
    sbatch -A bsc21 -q acc_bsccase myjob.sh
    WARNING

    The system will generate specific error messages if an attempt is made to submit a job without specifying a Slurm account and/or queue (QoS):

    salloc: error: No account specified, please specify an account
    salloc: error: Job submit/allocate failed: Unspecified error
    salloc: error: No QoS specified, please specify a QoS
    salloc: error: Job submit/allocate failed: Unspecified error
  • Display all submitted jobs (from all your current accounts/projects):

    squeue
  • Display all submitted jobs from a specific account or project:

    squeue -A, --account={account}
  • Remove a job from the queue system, canceling the execution of the processes (if they were still running):

    scancel {jobid}
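
    For example, to cancel the job with ID 123456 (a hypothetical job ID taken from the squeue output), or to cancel all of your own queued and running jobs:

    scancel 123456
    scancel -u {username}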

Queues (QoS)

Several queues are available on the machine, and users may have access to different ones. Each queue has its own limits on the number of cores and the duration of jobs.

You can check all the queues you have access to, along with their limits, at any time by running:

bsc_queues
Standard queues

Standard queues (QoS) limits are as follows:

  • GPP:

    Queue       | Max. number of nodes (cores) | Wallclock | Slurm QoS name         | Slurm partition name
    ------------|------------------------------|-----------|------------------------|---------------------
    BSC         | 125 (14,000)                 | 48h       | gp_bsc{case,cs,es,ls}  | gpp
    Debug       | 32 (3,584)                   | 2h        | gp_debug               | gpp
    EuroHPC     | 800 (89,600)                 | 72h       | gp_ehpc                | gpp
    HBM         | 50 (5,600)                   | 72h       | gp_hbm                 | hbm
    Interactive | 1 (32)                       | 2h        | gp_interactive         | gpinteractive
    RES Class A | 200 (22,400)                 | 72h       | gp_resa                | gpp
    RES Class B | 200 (22,400)                 | 48h       | gp_resb                | gpp
    RES Class C | 50 (5,600)                   | 24h       | gp_resc                | gpp
    Training    | 32 (3,584)                   | 24h       | gp_training            | gpp
  • ACC:

    Queue       | Max. number of nodes (cores) | Wallclock | Slurm QoS name          | Slurm partition name
    ------------|------------------------------|-----------|-------------------------|---------------------
    BSC         | 25 (2,000)                   | 48h       | acc_bsc{case,cs,es,ls}  | acc
    Debug       | 8 (640)                      | 2h        | acc_debug               | acc
    EuroHPC     | 100 (8,000)                  | 72h       | acc_ehpc                | acc
    Interactive | 1 (40)                       | 2h        | acc_interactive         | accinteractive
    RES Class A | 50 (4,000)                   | 72h       | acc_resa                | acc
    RES Class B | 50 (4,000)                   | 48h       | acc_resb                | acc
    RES Class C | 10 (800)                     | 24h       | acc_resc                | acc
    Training    | 4 (320)                      | 24h       | acc_training            | acc

Special queues

Additionally, special queues can be provided for longer or larger executions. Access is subject to demonstrating the scalability and performance of the application, and depends on factors such as demand and the current workload of the machine, among other conditions.

To request access to these specialized queues, please get in touch with us.

Interactive jobs

Allocation of an interactive session has to be done through Slurm:

salloc -A, --account={account} -q, --qos={qos} [ OPTIONS ]

Here are some of the parameters you can use with salloc (see also Job directives):

-J, --job-name={name}
-q, --qos={name}
-p, --partition={name}

-t, --time={time}
-n, --ntasks={number}
-c, --cpus-per-task={number}
-N, --nodes={number}

--exclusive
--x11
Examples
  • Request an interactive session for 10 minutes, 1 task, 4 CPUs (cores) per task, in the "GPP Interactive" partition (gpinteractive):

    salloc -A bsc21 -t 00:10:00 -n 1 -c 4 -q gp_interactive -J myjob
  • Request an interactive session on a GPP compute node in exclusive mode (without sharing resources):

    salloc -A bsc21 -q gp_debug --exclusive
    salloc -A bsc21 -q gp_bsccase --exclusive
  • Request an interactive session for 40 CPUs (cores) + 2 GPUs:

    salloc -A bsc21 -q acc_bsccase -n 2 -c 20 --gres=gpu:2
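
    As an additional illustration (not part of the examples above), an interactive session with X11 forwarding could be requested by adding the --x11 flag listed among the salloc options, assuming X11 forwarding is also enabled in your SSH connection to the login node:

    salloc -A bsc21 -q gp_interactive -t 01:00:00 -n 1 -c 4 --x11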

Job directives

A job script must include a set of directives to convey the job's characteristics to the batch system. These directives appear as comments within the job script and must adhere to the syntax specified for sbatch. Additionally, the job script may contain a set of commands to execute.

sbatch syntax is of the form:

#SBATCH --directive=value

Below are some of the most frequently used directives.

Resource reservation

  • Request the queue for the job:

    #SBATCH --qos={qos}
    REMARK
    • Remember that this parameter is mandatory when submitting jobs in the current MareNostrum 5 configuration.
  • Set a time limit for the total runtime of the job:

    #SBATCH --time={time} #DD-HH:MM:SS
    caution
    • This field is mandatory, and you should set it to a value greater than the execution time needed for your application while staying within the time limit of the chosen queue.
    • Please be aware that your job will be terminated once the specified time has elapsed.
  • Set the number of processes to start:

    #SBATCH --ntasks={number}
  • Optionally, you can specify the number of threads each process will launch with the following directive:

    #SBATCH --cpus-per-task={number}
    REMARK
    • The total number of cores allocated to the job will be calculated as the product of ntasks and cpus-per-task.
  • Set the number of tasks assigned to a node:

    #SBATCH --ntasks-per-node={number}
  • Set the number of tasks assigned to a socket:

    #SBATCH --ntasks-per-socket={number}
  • Request exclusive use of assigned nodes without sharing resources with other users:

    #SBATCH --exclusive
    REMARKS
    • This condition applies only to jobs that request a single node.
    • However, multi-node jobs will automatically utilize all requested nodes in exclusive mode.
  • Specify the number of nodes requested:

    #SBATCH --nodes={number}
    REMARK
    • Remember that if you request more than one node, this parameter will apply exclusivity to all nodes.
    Example 1

    For instance, suppose we request a single node with this parameter, but we only want to run two tasks that use one core each:

    #SBATCH -N 1
    #SBATCH -n 2
    #SBATCH -c 1

    This will request just two cores in total. The remaining resources of the node will be left available for other users.

    Example 2

    If we request more than one node (keeping the other parameters unchanged), as follows:

    #SBATCH -N 2
    #SBATCH -n 2
    #SBATCH -c 1

    It will demand exclusivity for both nodes, implying a request for the total resources of both nodes without sharing them with other users. In MareNostrum 5, each GPP node has 112 cores, so in this example, it will request 224 cores.

    IMPORTANT

    It's essential to be aware of this behaviour because it will allocate the CPU time of all 224 cores to your computation time budget (if applicable), even if you specify the usage of only two cores.

  • Sometimes, node reservations may be approved for executions exclusive to specific accounts, which is particularly beneficial for educational courses.

    To specify the reservation name for job allocation, assuming your account has access to that reservation:

    #SBATCH --reservation={name}
  • Choose a specific configuration, such as running the job on a 'HighMem' node:

    #SBATCH --constraint=highmem
    REMARKS
    • The accounting for one core hour on both standard and highmem nodes is identical, equating to 1 core-hour per core per hour allocated to your budget.
    • Without this directive, jobs will be assigned to standard nodes, which provide about 2.3 GB of RAM per core. The availability of high-memory nodes is limited, with only 216 out of 6,408 nodes. Consequently, when requesting these nodes, expect significantly longer queueing times before the resource request can be fulfilled and your job can run.
    • To expedite queue turnaround times, you can choose standard nodes and reduce the number of processes per node. To achieve this, specify the flag #SBATCH --cpus-per-task={number}; note that your budget will be charged for all requested cores.
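
    As an illustrative sketch (the account, job name, resource figures and binary are placeholders), a job script header requesting a HighMem node could look as follows:

    #!/bin/bash
    #SBATCH --job-name=highmem_job
    #SBATCH --account=bsc21
    #SBATCH --qos=gp_bsccase
    #SBATCH --constraint=highmem
    #SBATCH --ntasks=112
    #SBATCH --time=01:00:00

    srun ./memory_intensive_binary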

Job arrays

  • Submit a job array for the execution of multiple jobs with identical parameters:

    #SBATCH --array={indexes}
    REMARKS
    • The index specification determines which array index values to utilize.
    • Multiple values can be indicated using a comma-separated list and/or a range of values with a "-" separator.

    Job arrays will include the configuration of two additional environment variables:

    1. SLURM_ARRAY_JOB_ID: will be assigned the initial job ID of the array.
    2. SLURM_ARRAY_TASK_ID: will be assigned the job array index value.
    Example
    sbatch --array=1-3 job.cmd
    Submitted batch job 36

    Will create a job array consisting of three jobs, and subsequently, the environment variables will be configured as follows:

    # Job 1
    SLURM_JOB_ID=36
    SLURM_ARRAY_JOB_ID=36
    SLURM_ARRAY_TASK_ID=1

    # Job 2
    SLURM_JOB_ID=37
    SLURM_ARRAY_JOB_ID=36
    SLURM_ARRAY_TASK_ID=2

    # Job 3
    SLURM_JOB_ID=38
    SLURM_ARRAY_JOB_ID=36
    SLURM_ARRAY_TASK_ID=3
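
    As a minimal sketch of how the array index can be used inside the job script (the input file names and binary are hypothetical), each array task selects its own input based on SLURM_ARRAY_TASK_ID; the %A and %a patterns in the output file names expand to the array job ID and the array task index, respectively:

    #!/bin/bash
    #SBATCH --job-name=array_job
    #SBATCH --output=array_%A_%a.out
    #SBATCH --error=array_%A_%a.err
    #SBATCH --ntasks=1
    #SBATCH --time=00:10:00
    #SBATCH --array=1-3

    # Each array task processes the input file matching its index
    ./my_binary input_${SLURM_ARRAY_TASK_ID}.dat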

Working directory and job output/error files

  • Establish the working directory for your job, indicating the location where the job will be executed:

    #SBATCH --chdir={pathname}
    caution
    • If not explicitly specified, it defaults to the current working directory when the job is submitted.
  • Specify the filenames to which to redirect the job's standard output (stdout) and standard error output (stderr):

    #SBATCH --output={filename}
    #SBATCH --error={filename}
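
    In these filenames, the %j pattern expands to the job ID (as used in the job examples below), so consecutive runs do not overwrite each other's output; the names shown here are only illustrative:

    #SBATCH --output=myjob_%j.out
    #SBATCH --error=myjob_%j.err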

Email notifications

  • These two directives are presented together as they should be utilized simultaneously. They activate email notifications, which are triggered when a job commences its execution (begin), concludes its execution (end), or both (all):

    #SBATCH --mail-type={begin|end|all|none}
    #SBATCH --mail-user={email}
    Example (notified at the end of the job execution)
    #SBATCH --mail-type=end
    #SBATCH --mail-user=brucespringsteen@bsc.es
    REMARKS
    • The none option doesn't trigger any email; it is equivalent to omitting these directives.
    • The only requirement is that the specified email address is valid and matches the one you use for the HPC User Portal (you may wonder, what is the HPC User Portal? An excellent question; you can find more information in the HPC User Portal documentation).

Performance configuration

  • Set the number of threads per core (SMT configuration):

    #SBATCH --threads-per-core={1|2}
    REMARK
    • Achieves the same outcome as using the --hint={nomultithread|multithread} option.
  • Indicate whether to utilize additional threads within in-core multi-threading (SMT configuration):

    #SBATCH --hint={nomultithread|multithread}
    REMARK
    • Produces identical results as employing the --threads-per-core={1|2} option.
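
    As a brief illustration, either of the following lines in a job script restricts the job to a single hardware thread per physical core (only one of the two is needed, since they are equivalent):

    #SBATCH --threads-per-core=1
    #SBATCH --hint=nomultithread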

Some useful Slurm environment variables

Variable       | Meaning
---------------|--------
SLURM_JOBID    | Specifies the job ID of the executing job
SLURM_NPROCS   | Specifies the total number of processes in the job
SLURM_NNODES   | Is the actual number of nodes assigned to run your job
SLURM_PROCID   | Specifies the MPI rank (or relative process ID) for the current process. The range is from 0 to (SLURM_NPROCS-1)
SLURM_NODEID   | Specifies the relative node ID of the current job. The range is from 0 to (SLURM_NNODES-1)
SLURM_LOCALID  | Specifies the node-local task ID for the process within a job
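
A minimal sketch showing how some of these variables could be inspected from within a job script (the job parameters are placeholders):

    #!/bin/bash
    #SBATCH --job-name=env_check
    #SBATCH --ntasks=4
    #SBATCH --time=00:05:00

    # Print the per-process Slurm variables from every task launched by srun
    srun bash -c 'echo "job ${SLURM_JOBID}: rank ${SLURM_PROCID} of ${SLURM_NPROCS} on node ${SLURM_NODEID}"'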

Job examples

Sequential job

  • Executing a sequential job:

    #!/bin/bash
    #SBATCH --job-name=seq_job
    #SBATCH --chdir=.
    #SBATCH --output=serial_%j.out
    #SBATCH --error=serial_%j.err
    #SBATCH --ntasks=1
    #SBATCH --time=00:02:00

    ./serial_binary

Parallel jobs

  • Running an OpenMP job on a single node utilizing 112 cores within the 'gp_debug' queue:

    #!/bin/bash
    #SBATCH --job-name=omp_job
    #SBATCH --chdir=.
    #SBATCH --output=omp_%j.out
    #SBATCH --error=omp_%j.err
    #SBATCH --cpus-per-task=112
    #SBATCH --ntasks=1
    #SBATCH --time=00:10:00
    #SBATCH --qos=gp_debug

    export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}

    ./openmp_binary
  • Running a pure MPI job on two complete nodes:

    #!/bin/bash
    #SBATCH --job-name=mpi_job
    #SBATCH --output=mpi_%j.out
    #SBATCH --error=mpi_%j.err
    #SBATCH --ntasks=224

    srun ./mpi_binary
  • Running a hybrid MPI+OpenMP job on two complete nodes with 56 MPI tasks (28 per node), each using 4 cores via OpenMP:

    #!/bin/bash
    #SBATCH --job-name=hybrid_job
    #SBATCH --chdir=.
    #SBATCH --output=mpi_%j.out
    #SBATCH --error=mpi_%j.err
    #SBATCH --ntasks=56
    #SBATCH --cpus-per-task=4
    #SBATCH --tasks-per-node=28
    #SBATCH --time=00:02:00

    export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}

    srun ./hybrid_binary
  • Running with GPUs on a complete ACC node:

    #!/bin/bash
    #SBATCH --job-name=gpu_job
    #SBATCH -D .
    #SBATCH --output=mpi_%j.out
    #SBATCH --error=mpi_%j.err
    #SBATCH --ntasks=80
    #SBATCH --cpus-per-task=2
    #SBATCH --time=00:02:00
    #SBATCH --gres=gpu:4

    export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}

    srun ./gpu_binary

Understanding job status and reason codes

When using the squeue command, Slurm provides information about the status of your submitted jobs. If a job is still waiting before execution, it will be accompanied by a reason. Slurm employs specific codes to show this information, and the following section explains the significance of the most relevant ones.

Job state codes

Common state codes for submitted jobs include:

  • COMPLETED (CD): The job has completed its execution.
  • COMPLETING (CG): The job is finishing, but some processes are still active.
  • FAILED (F): The job terminated with a non-zero exit code.
  • PENDING (PD): The job is waiting for resource allocation. This is the most common state after running sbatch; the job will run eventually.
  • PREEMPTED (PR): The job was terminated because of preemption by another job.
  • RUNNING (R): The job is allocated and running.
  • SUSPENDED (S): A running job has been stopped with its cores released to other jobs.
  • STOPPED (ST): A running job has been stopped with its cores retained.

Job reason codes

The following list outlines the most frequently encountered reason codes for jobs that have been submitted but have not yet entered the running state:

  • Priority: One or more higher-priority jobs are queued ahead of yours. Your job will eventually run.
  • Dependency: This job is waiting for a dependent job to complete and will run afterwards.
  • Resources: The job is waiting for resources to become available and will eventually run.
  • InvalidAccount: The job's account is invalid. Cancel the job and resubmit with the correct account.
  • InvalidQoS: The job's QoS is invalid. Cancel the job and resubmit with the correct QoS.
  • QOSGrpCpuLimit: All CPUs assigned to your job’s specified QoS are in use; job will run eventually.
  • QOSGrpMaxJobsLimit: Maximum number of jobs for your job’s QoS have been met; job will run eventually.
  • QOSGrpNodeLimit: All nodes assigned to your job’s specified QoS are in use; job will run eventually.
  • PartitionCpuLimit: All CPUs assigned to your job’s specified partition are in use; job will run eventually.
  • PartitionMaxJobsLimit: Maximum number of jobs for your job’s partition have been met; job will run eventually.
  • PartitionNodeLimit: All nodes assigned to your job’s specified partition are in use; job will run eventually.
  • AssociationCpuLimit: All CPUs assigned to your job’s specified association are in use; job will run eventually.
  • AssociationMaxJobsLimit: Maximum number of jobs for your job’s association have been met; job will run eventually.
  • AssociationNodeLimit: All nodes assigned to your job’s specified association are in use; job will run eventually.

Resource usage and job priorities

Projects will be allocated a specific amount of compute hours or core hours for utilization. A single core hour represents the computational time of one core over one hour. For instance, if a full GPP node with 112 cores runs a job for one hour, it will consume 112 core hours from the allocated budget.

The accounting relies solely on the utilization of compute hours.

Various factors determine a job's priority and subsequent scheduling, the most significant being job size, queue waiting time, and fair share among groups:

  • MareNostrum 5 is designed to prioritize and favour larger executions, giving higher priority to jobs utilizing more cores.
  • The waiting time in queues is also considered, and jobs progressively gain more priority the longer they wait.
  • Additionally, our queue system incorporates a fair-share policy among groups. Users with fewer executed jobs and consumed compute hours receive higher priority for their jobs compared to groups with increased usage. This ensures a fair distribution of computing time, allowing users to run jobs without favouring one group over another.

You can review your current fair-share score using the command:

sshare -la

Notifications

Currently, receiving email notifications about job status is not supported. To monitor the execution or completion of your jobs, you must connect to the system and manually check their status. Automatic notifications are planned to be enabled in the future.