Running Jobs
Slurm is the utility used for batch processing support, so all jobs must be run through it. This section provides information for getting started with job execution on the cluster.
General considerations for MEEP FPGA cluster
Due to the particularities of this cluster, there are some considerations to keep in mind when allocating its resources. This section covers them briefly; further details can be found in the following sections.
- Each FPGA node has 8 FPGAs, and you are required to allocate all of them when requesting a node. Due to the particularities of the underlying FPGA infrastructure, you cannot allocate less than a full node.
- If you want to allocate FPGA nodes, the mandatory mechanism for it is the "--constraint=[dmaqdma, dmaxdma, dmanone]" Slurm option, instead of the usual "--gres" option of GPU clusters. Please check the job directives section for more information.
- If you want to allocate general purpose compute nodes, you are required to specify the "--partition=gpp" Slurm option. Please check the job directives section for more information.
Interactive Sessions
Allocation of an interactive session has to be done through SLURM:
- Interactive session on a general purpose node for 10 minutes, 1 task, 40 CPUs (cores) per task:
salloc -t 00:10:00 -n 1 -c 40 --partition=gpp [ -J {job_name} ]
- Interactive session on an FPGA node for 10 minutes, 1 node, using the dmaxdma constraint:
salloc -t 00:10:00 -N 1 --constraint=dmaxdma [ -J {job_name} ]
Also, X11 forwarding can be enabled in interactive sessions by adding the "--x11" flag:
salloc <slurm_options> --x11
Submitting jobs
The method for submitting jobs is to use the SLURM sbatch directives directly.
A job is the execution unit for SLURM. A job is defined by a text file containing a set of directives describing the job's requirements, and the commands to execute.
In order to ensure the proper scheduling of jobs, there are execution limits on the number of nodes and CPUs that can be used at the same time by a group. You may check those limits using the 'bsc_queues' command. If you need to run an execution larger than the limits already granted, you may contact us.
SBATCH commands
These are the basic directives to submit jobs with sbatch:
sbatch <job_script>
Submits a “job script” to the queue system (see Job directives).
squeue
Shows all the submitted jobs.
scancel <job_id>
Remove the job from the queue system, canceling the execution of the processes, if they were still running.
salloc --x11
For an allocating salloc command, if the "--x11" flag is set the job will be handled as graphical (X11 forwarding is set up on the allocation) and you will be able to execute graphical commands. As long as you do not close the current terminal, you will get a graphical window.
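Putting these commands together, a typical submit/inspect/cancel cycle might look like the following session (the script name and job ID are illustrative):

```shell
# Submit a job script; sbatch replies with the assigned job ID
sbatch my_job.sh        # -> "Submitted batch job 123456"

# Inspect the queue; the ST column shows the job state (R, PD, ...)
squeue

# Cancel the job if it is no longer needed
scancel 123456
```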
Job directives
A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script and have to conform to the sbatch syntax.

The sbatch syntax is of the form:
#SBATCH --directive=value
Additionally, the job script may contain a set of commands to execute. If not, an external script may be provided with the 'executable' directive. Here you may find the most common directives:
#SBATCH --qos=debug
This QoS is only intended for small tests.
#SBATCH --time=HH:MM:SS
The limit of wall clock time. This is a mandatory field; you must set it to a value greater than the real execution time of your application and smaller than the time limits granted to the user. Notice that your job will be killed once this time has elapsed.
#SBATCH -D pathname
The working directory of your job (i.e. where the job will run). If not specified, it is the current working directory at the time the job was submitted.
#SBATCH --error=file
The name of the file to collect the standard error output (stderr) of the job.
#SBATCH --output=file
The name of the file to collect the standard output (stdout) of the job.
#SBATCH --partition=partition_name
The partition to be used for the job. This cluster currently has two partitions, "main" and "gpp". If you need to use general purpose compute nodes, the partition to use is "gpp". The remaining nodes are allocated in the "main" partition.
#SBATCH --constraint=[dmaqdma, dmaxdma, dmanone]
Constrains job execution to a subset of nodes with those features available. In this specific case, specifying one of these constraints is mandatory for using an FPGA node. Each constraint has the following effect:
- dmaqdma: enables the dmaqdma driver for the FPGA.
- dmaxdma: enables the dmaxdma driver for the FPGA.
- dmanone: doesn't enable any DMA driver.
#SBATCH --ntasks=number
The number of processes to start.
Optionally, you can specify how many threads each process would open with the directive:
#SBATCH --cpus-per-task=number
The number of cpus assigned to the job will be the total_tasks number * cpus_per_task number.
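As a quick sanity check, the total CPU count a script requests can be computed by hand. A minimal sketch with illustrative values (4 tasks, 10 CPUs each):

```shell
# Illustrative values, matching --ntasks=4 and --cpus-per-task=10
ntasks=4
cpus_per_task=10

# Total CPUs assigned to the job = total_tasks * cpus_per_task
total_cpus=$((ntasks * cpus_per_task))
echo "$total_cpus"   # prints 40
```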
#SBATCH --ntasks-per-node=number
The number of tasks assigned to a node.
#SBATCH --exclusive
To request an exclusive use of a compute node without sharing the resources with other users.
#SBATCH --reservation=reservation_name
The reservation where your jobs will be allocated (assuming that your account has access to that reservation). On some occasions, node reservations can be granted for executions where only a set of accounts can run jobs. This is useful for courses.
Typical Slurm variables
When running a job, Slurm creates environment variables that can be accessible from within the job context. These variables can be useful for complex jobs where each task needs to know where is being run. Here is a brief list of the most basic environment variables:
Variable | Meaning |
---|---|
SLURM_JOBID | Specifies the job ID of the executing job |
SLURM_NPROCS | Specifies the total number of processes in the job |
SLURM_NNODES | Is the actual number of nodes assigned to run your job |
SLURM_PROCID | Specifies the MPI rank (or relative process ID) for the current process. The range is from 0 to (SLURM_NPROCS-1) |
SLURM_NODEID | Specifies the relative node ID of the current job. The range is from 0 to (SLURM_NNODES-1) |
SLURM_LOCALID | Specifies the node-local task ID for the process within a job |
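As a sketch, a job script can read these variables directly. The job name is hypothetical, and the fallback values are only there so the snippet also behaves sensibly outside a Slurm job, where the variables are undefined:

```shell
#!/bin/bash
#SBATCH --job-name=env_probe    # hypothetical job name
#SBATCH --output=env_%j.out
#SBATCH --time=00:01:00

# Inside a job, Slurm defines these variables; outside, the fallbacks apply.
jobid=${SLURM_JOBID:-unset}
rank=${SLURM_PROCID:-0}
nnodes=${SLURM_NNODES:-1}

echo "job=$jobid rank=$rank nodes=$nnodes"
```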
Examples
sbatch examples
The following jobs would be submitted using:
sbatch <your_jobscript>.sh
Example for a job on a single general purpose node using 40 cores:
#!/bin/bash
#SBATCH --job-name="test_gpp"
#SBATCH -D .
#SBATCH --output=gpp_%j.out
#SBATCH --error=gpp_%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40
#SBATCH --partition=gpp
#SBATCH --time=00:02:00
./binary > binary.out
Example for a parallel job on general purpose nodes, using 20 MPI ranks and 10 cores per rank:
#!/bin/bash
#SBATCH --job-name=test_parallel
#SBATCH -D .
#SBATCH --output=mpi_%j.out
#SBATCH --error=mpi_%j.err
#SBATCH --ntasks=20
#SBATCH --cpus-per-task=10
#SBATCH --time=00:02:00
mpirun ./parallel_binary > parallel.output
Example for a job on a single FPGA node using the dmaqdma driver:
#!/bin/bash
#SBATCH --job-name="test_fpga"
#SBATCH -D .
#SBATCH --output=fpga_%j.out
#SBATCH --error=fpga_%j.err
#SBATCH --nodes=1
#SBATCH --time=00:02:00
#SBATCH --constraint=dmaqdma
source /nfs/apps/XILINX/xilinx_22_env.sh
# Do FPGA things
Interpreting job status and reason codes
When using squeue, Slurm will report back the status of your launched jobs. If they are still waiting to enter execution, they will be followed by the reason. Slurm uses codes to display this information; this section covers the meaning of the most relevant ones.
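An illustrative squeue listing is shown below; all job IDs, names, users, and node names are made up, and real column widths vary:

```shell
$ squeue
JOBID PARTITION   NAME     USER ST  TIME NODES NODELIST(REASON)
12345      main  myjob   user01 PD  0:00     1 (Priority)
12346       gpp  other   user02  R  5:23     1 gpp01
```

Here the first job is pending (PD) with reason Priority, while the second is running (R) on one node.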
Job state codes
This list contains the usual state codes for jobs that have been submitted:
- COMPLETED (CD): The job has completed the execution.
- COMPLETING (CG): The job is finishing, but some processes are still active.
- FAILED (F): The job terminated with a non-zero exit code.
- PENDING (PD): The job is waiting for resource allocation. The most common state after running "sbatch", it will run eventually.
- PREEMPTED (PR): The job was terminated because of preemption by another job.
- RUNNING (R): The job is allocated and running.
- SUSPENDED (S): A running job has been stopped with its cores released to other jobs.
- STOPPED (ST): A running job has been stopped with its cores retained.
Job reason codes
This list contains the most common reason codes of the jobs that have been submitted and are still not in the running state:
- Priority: One or more higher priority jobs is in queue for running. Your job will eventually run.
- Dependency: This job is waiting for a dependent job to complete and will run afterwards.
- Resources: The job is waiting for resources to become available and will eventually run.
- InvalidAccount: The job's account is invalid. Cancel the job and resubmit with the correct account.
- InvalidQOS: The job's QoS is invalid. Cancel the job and resubmit with the correct QoS.
- QOSGrpCpuLimit: All CPUs assigned to your job’s specified QoS are in use; job will run eventually.
- QOSGrpMaxJobsLimit: Maximum number of jobs for your job’s QoS have been met; job will run eventually.
- QOSGrpNodeLimit: All nodes assigned to your job’s specified QoS are in use; job will run eventually.
- PartitionCpuLimit: All CPUs assigned to your job’s specified partition are in use; job will run eventually.
- PartitionMaxJobsLimit: Maximum number of jobs for your job’s partition have been met; job will run eventually.
- PartitionNodeLimit: All nodes assigned to your job’s specified partition are in use; job will run eventually.
- AssociationCpuLimit: All CPUs assigned to your job’s specified association are in use; job will run eventually.
- AssociationMaxJobsLimit: Maximum number of jobs for your job’s association have been met; job will run eventually.
- AssociationNodeLimit: All nodes assigned to your job’s specified association are in use; job will run eventually.