Running Jobs

Slurm is the utility used for batch processing support, so all jobs must be run through it. This section provides the information needed to get started with job execution on the cluster.

General considerations for MEEP FPGA cluster

caution

Due to the particularities of this cluster, there are some considerations to keep in mind when allocating its resources. This section covers them briefly, with further details in the following sections.

  1. Each FPGA node has 8 FPGAs, and you are required to allocate all of them when requesting a node. You cannot allocate less than a full node due to the underlying infrastructure of the FPGA setup.
  2. If you want to allocate FPGA nodes, the mandatory mechanism for it is the "--constraint=[dmaqdma, dmaxdma, dmanone]" Slurm option, instead of the "--gres" option commonly used on GPU clusters. Please check the job directives section for more information.
  3. If you want to allocate general purpose compute nodes, you are required to specify the "--partition=gpp" Slurm option. Please check the job directives section for more information.

Interactive Sessions

Allocation of an interactive session has to be done through SLURM:

  • Interactive session on a general purpose node for 10 minutes, 1 task, 40 CPUs (cores) per task:
salloc -t 00:10:00 -n 1 -c 40 --partition=gpp [ -J {job_name} ]
  • Interactive session on an FPGA node for 10 minutes, 1 node and using the dmaxdma constraint:
salloc -t 00:10:00 -N 1 --constraint=dmaxdma [ -J {job_name} ]

Also, X11 forwarding can be enabled in an interactive session by adding the "--x11" flag:

salloc <slurm_options> --x11

Submitting jobs

The method for submitting jobs is to use the SLURM sbatch directives directly.

A job is the execution unit for SLURM. A job is defined by a text file containing a set of directives describing the job's requirements, and the commands to execute.

In order to ensure the proper scheduling of jobs, there are limits on the number of nodes and CPUs that can be used at the same time by a group. You may check those limits using the 'bsc_queues' command. If you need to run a job larger than the limits already granted, you may contact us.

SBATCH commands

These are the basic directives to submit jobs with sbatch:

sbatch <job_script>

Submits a “job script” to the queue system (see Job directives).

squeue

Shows all the submitted jobs.

scancel <job_id>

Removes the job from the queue system, cancelling the execution of its processes if they were still running.
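
For example, a typical submission workflow with these commands might look like this (the job script name is only illustrative):

sbatch my_jobscript.sh   # prints "Submitted batch job <job_id>"
squeue                   # check the state of your submitted jobs
scancel <job_id>         # cancel the job if it is no longer needed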

salloc --x11

For an allocating salloc command, if the "--x11" flag is set, the job will be handled as graphical (X11 forwarding is set up on the allocation) and you will be able to execute graphical commands. As long as you do not close the current terminal, the graphical windows will be displayed.
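
For instance, a graphical interactive session on a general purpose node could be requested and used as follows (xclock is only an example of a graphical application and may not be installed on the node):

salloc -t 00:10:00 -n 1 -c 40 --partition=gpp --x11
# inside the allocation, run the graphical application:
xclock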

Job directives

A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script and have to conform to the sbatch syntax.

The sbatch syntax is of the form:

#SBATCH --directive=value

Additionally, the job script may contain a set of commands to execute. If not, an external script may be provided with the 'executable' directive. Here you may find the most common directives:

#SBATCH --qos=debug

This QoS is only intended for small tests.

#SBATCH --time=HH:MM:SS

The limit of wall clock time. This is a mandatory field: you must set it to a value greater than the real execution time of your application and smaller than the time limits granted to the user. Notice that your job will be killed once this time has passed.

#SBATCH -D pathname

The working directory of your job (i.e. where the job will run). If not specified, it is the current working directory at the time the job was submitted.

#SBATCH --error=file

The name of the file to collect the standard error output (stderr) of the job.

#SBATCH --output=file

The name of the file to collect the standard output (stdout) of the job.

#SBATCH --partition=partition_name

The partition to be used for the job. This cluster currently has two partitions, "main" and "gpp". If you need to use general purpose compute nodes, the partition to use is "gpp". The remaining nodes (the FPGA nodes) are allocated in the "main" partition.

#SBATCH --constraint=[dmaqdma, dmaxdma, dmanone]

Constrains the job execution to a subset of nodes with the requested feature available. In this cluster, specifying one of these constraints is mandatory for using an FPGA node. Each constraint has the following effect:

  • dmaqdma: enables the dmaqdma driver for the FPGA.
  • dmaxdma: enables the dmaxdma driver for the FPGA.
  • dmanone: doesn't enable any DMA driver.

#SBATCH --ntasks=number

The number of processes to start.

Optionally, you can specify how many threads each process would open with the directive:

#SBATCH --cpus-per-task=number

The number of CPUs assigned to the job will be the total_tasks number * cpus_per_task number.
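
For example, the following combination requests 4 tasks with 10 CPUs each, so 40 CPUs in total would be assigned to the job:

#SBATCH --ntasks=4
#SBATCH --cpus-per-task=10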

#SBATCH --ntasks-per-node=number

The number of tasks assigned to a node.

#SBATCH --exclusive

To request an exclusive use of a compute node without sharing the resources with other users.

#SBATCH --reservation=reservation_name

The reservation where your jobs will be allocated (assuming that your account has access to that reservation). On some occasions, node reservations can be granted for executions where only a set of accounts can run jobs. This is useful for courses, for example.

Typical Slurm variables

When running a job, Slurm creates environment variables that are accessible from within the job context. These variables can be useful for complex jobs where each task needs to know where it is being run. Here is a brief list of the most basic environment variables:

  • SLURM_JOBID: Specifies the job ID of the executing job.
  • SLURM_NPROCS: Specifies the total number of processes in the job.
  • SLURM_NNODES: Is the actual number of nodes assigned to run your job.
  • SLURM_PROCID: Specifies the MPI rank (or relative process ID) for the current process. The range is from 0 to SLURM_NPROCS-1.
  • SLURM_NODEID: Specifies the relative node ID of the current job. The range is from 0 to SLURM_NNODES-1.
  • SLURM_LOCALID: Specifies the node-local task ID for the process within a job.
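
As a minimal sketch (job name and resource values are illustrative), a job script can print some of these variables for every task to see where each one runs:

#!/bin/bash
#SBATCH --job-name=env_test
#SBATCH --ntasks=4
#SBATCH --partition=gpp
#SBATCH --time=00:01:00

# each task prints its own rank, node ID and local ID
srun bash -c 'echo "Job $SLURM_JOBID: task $SLURM_PROCID of $SLURM_NPROCS on node $SLURM_NODEID (local task $SLURM_LOCALID)"'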

Examples

sbatch examples

The following jobs would be submitted using:

sbatch <your_jobscript>.sh

Example for a job on a single general purpose node using 40 cores:

#!/bin/bash
#SBATCH --job-name="test_gpp"
#SBATCH -D .
#SBATCH --output=gpp_%j.out
#SBATCH --error=gpp_%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40
#SBATCH --partition=gpp
#SBATCH --time=00:02:00

./binary > binary.out

Example for a parallel job on general purpose nodes, using 20 MPI ranks and 10 cores per rank:

#!/bin/bash
#SBATCH --job-name=test_parallel
#SBATCH -D .
#SBATCH --output=mpi_%j.out
#SBATCH --error=mpi_%j.err
#SBATCH --ntasks=20
#SBATCH --cpus-per-task=10
#SBATCH --partition=gpp
#SBATCH --time=00:02:00

mpirun ./parallel_binary > parallel.output

Example for a job on a single FPGA node using the dmaqdma driver:

#!/bin/bash
#SBATCH --job-name="test_fpga"
#SBATCH -D .
#SBATCH --output=fpga_%j.out
#SBATCH --error=fpga_%j.err
#SBATCH --nodes=1
#SBATCH --time=00:02:00
#SBATCH --constraint=dmaqdma

source /nfs/apps/XILINX/xilinx_22_env.sh
# Do FPGA things

Interpreting job status and reason codes

When using squeue, Slurm will report back the status of your launched jobs. If they are still waiting to enter execution, they will be accompanied by the reason they are waiting. Slurm uses codes to display this information, so this section covers the meaning of the most relevant ones.
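
As an illustration (job IDs, names, users, times and node names are made up), the default squeue output looks roughly like this, with the state shown in the ST column and the reason, for pending jobs, in the NODELIST(REASON) column:

 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  1234      main fpga_job   myuser PD       0:00      1 (Priority)
  1235       gpp  gpp_job   myuser  R       5:23      1 gpp01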

Job state codes

This list contains the usual state codes for jobs that have been submitted:

  • COMPLETED (CD): The job has completed the execution.
  • COMPLETING (CG): The job is finishing, but some processes are still active.
  • FAILED (F): The job terminated with a non-zero exit code.
  • PENDING (PD): The job is waiting for resource allocation. This is the most common state after running "sbatch"; the job will run eventually.
  • PREEMPTED (PR): The job was terminated because of preemption by another job.
  • RUNNING (R): The job is allocated and running.
  • SUSPENDED (S): A running job has been stopped with its cores released to other jobs.
  • STOPPED (ST): A running job has been stopped with its cores retained.

Job reason codes

This list contains the most common reason codes of the jobs that have been submitted and are still not in the running state:

  • Priority: One or more higher priority jobs are queued ahead of yours. Your job will eventually run.
  • Dependency: This job is waiting for a dependent job to complete and will run afterwards.
  • Resources: The job is waiting for resources to become available and will eventually run.
  • InvalidAccount: The job’s account is invalid. Cancel the job and resubmit with a correct account.
  • InvalidQoS: The job’s QoS is invalid. Cancel the job and resubmit with a correct QoS.
  • QOSGrpCpuLimit: All CPUs assigned to your job’s specified QoS are in use; job will run eventually.
  • QOSGrpMaxJobsLimit: Maximum number of jobs for your job’s QoS have been met; job will run eventually.
  • QOSGrpNodeLimit: All nodes assigned to your job’s specified QoS are in use; job will run eventually.
  • PartitionCpuLimit: All CPUs assigned to your job’s specified partition are in use; job will run eventually.
  • PartitionMaxJobsLimit: Maximum number of jobs for your job’s partition have been met; job will run eventually.
  • PartitionNodeLimit: All nodes assigned to your job’s specified partition are in use; job will run eventually.
  • AssociationCpuLimit: All CPUs assigned to your job’s specified association are in use; job will run eventually.
  • AssociationMaxJobsLimit: Maximum number of jobs for your job’s association have been met; job will run eventually.
  • AssociationNodeLimit: All nodes assigned to your job’s specified association are in use; job will run eventually.
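
To check the state and reason code of one specific job, you can query it directly (the job ID is illustrative):

squeue -j <job_id>
scontrol show job <job_id>   # detailed view, including the JobState and Reason fields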