Requesting Resources¶
Partitions¶
Each node -- a physically distinct machine within the cluster -- is a member of one or more partitions. A partition consists of a collection of nodes, a policy for job scheduling on that partition, a policy for resolving conflicts when nodes are members of more than one partition (i.e., preemption), and a policy for managing and restricting resources per user or per group, referred to as Quality of Service (QOS). The Slurm documentation has detailed information on how preemption and QOS definitions are handled; our per-cluster Resources sections describe how partitions are organized and how preemption is handled on our clusters.
Accounts¶
Users are granted access to resources via Slurm associations. An association links together a user with an account and a QOS definition. Accounts most commonly correspond to your lab, but sometimes exist for graduate groups, departments, or institutes.
To see your associations, and thus which accounts and partitions you have access to, you can use the sacctmgr command:
$ sacctmgr show assoc user=$USER
Cluster Account User Partition Share ... MaxTRESMins QOS Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ... ------------- -------------------- --------- -------------
franklin hpccfgrp camw mmgdept-g+ 1 ... hpccfgrp-mmgdept-gp+
franklin hpccfgrp camw mmaldogrp+ 1 ... hpccfgrp-mmaldogrp-+
franklin hpccfgrp camw cashjngrp+ 1 ... hpccfgrp-cashjngrp-+
franklin hpccfgrp camw jalettsgr+ 1 ... hpccfgrp-jalettsgrp+
franklin hpccfgrp camw jawdatgrp+ 1 ... hpccfgrp-jawdatgrp-+
franklin hpccfgrp camw low 1 ... hpccfgrp-low-qos
franklin hpccfgrp camw high 1 ... hpccfgrp-high-qos
franklin jawdatgrp camw low 1 ... mcbdept-low-qos
franklin jawdatgrp camw jawdatgrp+ 1 ... jawdatgrp-jawdatgrp+
franklin jalettsgrp camw jalettsgr+ 1 ... jalettsgrp-jalettsg+
franklin jalettsgrp camw low 1 ... mcbdept-low-qos
The output is very wide, so you may want to pipe it through less to make it more readable:
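$ sacctmgr show assoc user=$USER | less -S    # -S chops long lines instead of wrapping them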
Or, perhaps preferably, output it in a more compact format:
$ sacctmgr show assoc user=$USER format="account%20,partition%20,qos%40"
Account Partition QOS
-------------------- -------------------- ----------------------------------------
hpccfgrp mmgdept-gpu hpccfgrp-mmgdept-gpu-qos
hpccfgrp mmaldogrp-gpu hpccfgrp-mmaldogrp-gpu-qos
hpccfgrp cashjngrp-gpu hpccfgrp-cashjngrp-gpu-qos
hpccfgrp jalettsgrp-gpu hpccfgrp-jalettsgrp-gpu-qos
hpccfgrp jawdatgrp-gpu hpccfgrp-jawdatgrp-gpu-qos
hpccfgrp low hpccfgrp-low-qos
hpccfgrp high hpccfgrp-high-qos
jawdatgrp low mcbdept-low-qos
jawdatgrp jawdatgrp-gpu jawdatgrp-jawdatgrp-gpu-qos
jalettsgrp jalettsgrp-gpu jalettsgrp-jalettsgrp-gpu-qos
jalettsgrp low mcbdept-low-qos
In the above example, we can see that user camw has access to the high partition via an association with hpccfgrp, and the jalettsgrp-gpu partition via the jalettsgrp account.
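When submitting jobs, the account and partition from one of these associations can be selected explicitly with the --account/-A and --partition/-p flags. As a sketch using the names from the example above (substitute the account and partition from your own associations):

$ srun --account=hpccfgrp --partition=high nproc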
Resource Types¶
CPUs / cores¶
CPUs are the central compute power behind your jobs. Most scientific software supports multiprocessing (multiple instances of an executable with discrete memory resources, possibly but not necessarily communicating with each other), multithreading (multiple paths of execution, or threads, within a process on a node, sharing the same memory resources but able to execute on different cores), or both. This allows computation to scale with the number of CPUs, so that bigger datasets can be analyzed.
Slurm's CPU management methods are complex and can quickly become confusing. For the purposes of this documentation, we will provide a simplified explanation; those with advanced needs should consult the Slurm documentation.
Slurm draws a distinction between its physical resources -- cluster nodes and the CPUs or cores on a node -- and virtual resources, or tasks, which specify how the requested physical resources will be grouped and distributed. By default, Slurm will minimize the number of nodes allocated to a job and attempt to keep the job's CPU requests localized within a node. Tasks group together CPUs (or other resources): CPUs within a task will be kept together on the same node. Different tasks may end up on different nodes, but Slurm will exhaust the CPUs on a given node before splitting tasks between nodes unless specifically requested otherwise.
A Complication: SMT / Hyperthreading
Slurm understands the distinction between physical and logical cores. Most modern CPUs support Simultaneous Multithreading (SMT), which allows multiple independent processes to run on a single physical core. Although these hardware threads are not full-fledged cores, they have independent hardware for certain operations and can greatly improve scalability for some tasks. However, allocating an individual thread on its own makes little sense, as it shares hardware with the other SMT threads on its core; so, Slurm always keeps a core's threads together. In practice, this means that if you ask for an odd number of CPUs, your request will be rounded up so that the SMT threads of a core are not split between different job allocations.
The primary parameters controlling these are:
--cpus-per-task/-c: How many CPUs to request per task. The CPUs requested here will always be on the same node. By default, 1.
--ntasks/-n: The number of tasks to request. By default, 1.
--nodes/-N: The minimum number of nodes to request. By default, 1.
Let's explore some examples. The simplest request would be to ask for 2 CPUs. We will use srun to request resources and then immediately run the nproc command within the allocation to report how many CPUs are available:
$ srun -c 2 nproc
srun: job 682 queued and waiting for resources
srun: job 682 has been allocated resources
2
We asked for 2 CPUs per task, and Slurm gave us 2 CPUs and 1 task. What happens if we ask for 2 tasks instead of 2 CPUs?
$ srun -n 2 nproc
srun: job 683 queued and waiting for resources
srun: job 683 has been allocated resources
1
1
This time, we were given 2 separate tasks, each of which got 1 CPU. Each task ran its own instance of the nproc command, and so each reported 1.
If we ask for more CPUs per task:
$ srun -n 2 -c 2 nproc
srun: job 684 queued and waiting for resources
srun: job 684 has been allocated resources
2
2
We still asked for 2 tasks, but this time we requested 2 CPUs in each. So, we got 2 instances of nproc, each of which reported the 2 CPUs in its task.
Summary
If you want to run multithreaded jobs, use --cpus-per-task N_THREADS and --ntasks 1. If you want a multiprocess job (or an MPI job), increase --ntasks.
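As a sketch of how this looks in a batch script (my-threaded-program is a hypothetical executable, and OMP_NUM_THREADS applies to OpenMP-style programs), a multithreaded job could be submitted with sbatch like so:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
# Slurm sets SLURM_CPUS_PER_TASK to match --cpus-per-task; use it to size the thread pool.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my-threaded-program    # hypothetical multithreaded executable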
The SMT Edge Case
If we use -c 1 without specifying the number of tasks, we might be taken by surprise:
$ srun -c 1 nproc
srun: job 685 queued and waiting for resources
srun: job 685 has been allocated resources
1
1
We only asked for 1 CPU per task, but we got 2 tasks! This is due to SMT, described in the note above. Because Slurm will not split SMT threads, and there are 2 SMT threads per physical core, the request was rounded up to 2 CPUs total. To keep to the 1-CPU-per-task constraint, Slurm spawned 2 tasks. Similarly, if we specify that we only want 1 task, the CPUs per task will instead be bumped:
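For example (a sketch, with the srun queueing messages omitted), the same request constrained to a single task should report 2 CPUs, since the allocation is still rounded up to a whole core:

$ srun -c 1 -n 1 nproc
2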
Nodes¶
Let's explore multiple nodes a bit further. We have seen previously that the -n/--ntasks parameter will allocate discrete groups of cores. In our prior examples, however, we used small resource requests. What happens when we want to distribute jobs across nodes?
Slurm uses the block distribution method by default to distribute tasks between nodes. It will exhaust all the CPUs on a node with task groups before moving to a new node. For these examples, we're going to create a script that reports both the hostname (i.e., the node) and the number of CPUs:
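A minimal host-nprocs.sh that produces the output format shown below might look like:

#!/bin/bash
# Report the node this task landed on and the number of CPUs available to it.
echo "$(hostname): $(nproc)"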
And make it executable with chmod +x host-nprocs.sh.
Now let's make a multiple-task request:
$ srun -c 2 -n 2 ./host-nprocs.sh
srun: job 691 queued and waiting for resources
srun: job 691 has been allocated resources
c-8-42: 2
c-8-42: 2
As before, we asked for 2 tasks and 2 CPUs per task. Both tasks were assigned to c-8-42, because it had enough CPUs to fulfill the request. What if it did not?
$ srun -c 120 -n 3 ./host-nprocs.sh
srun: job 692 queued and waiting for resources
srun: job 692 has been allocated resources
c-8-42: 120
c-8-42: 120
c-8-50: 120
This time, we asked for 3 tasks, each with 120 CPUs. The first two tasks could be fulfilled by the node c-8-42, but that node did not have enough CPUs to allocate another 120 on top of that. So, the third task was placed on c-8-50, and the job spanned multiple nodes.
Sometimes, we want to make sure each task has its own node. We can achieve this with the --nodes/-N parameter. This specifies the minimum number of nodes the tasks should be allocated across. If we rerun the above example:
$ srun -c 120 -n 3 -N 3 ./host-nprocs.sh
srun: job 693 queued and waiting for resources
srun: job 693 has been allocated resources
c-8-42: 120
c-8-50: 120
c-8-58: 120
We still asked for 3 tasks and 120 CPUs per task, but this time we specified that we wanted a minimum of 3 nodes. As a result, we were allocated portions of c-8-42, c-8-50, and c-8-58.
RAM / Memory¶
Random Access Memory (RAM) is the fast, volatile storage that your programs use to store data during execution. This can be contrasted with disk storage, which is non-volatile and many orders of magnitude slower to access, and is used for long term data -- say, your sequencing runs or cryo-EM images. RAM is a limited resource on each node, so Slurm enforces memory limits for jobs using cgroups. If a job step consumes more RAM than requested, the step will be killed.
Some (mutually exclusive) parameters for requesting RAM are:
--mem: The memory required per node. Usually, you want to use --mem-per-cpu instead.
--mem-per-cpu: The memory required per CPU or core. If you request \(N\) tasks, \(C\) CPUs per task, and \(M\) memory per CPU, your total memory request will be \(N * C * M\). Note that, if \(N \gt 1\), you will have \(N\) discrete \(C * M\)-sized chunks of RAM, possibly on different nodes.
--mem-per-gpu: The memory required per GPU, which scales with the number of GPUs the same way --mem-per-cpu does with CPUs.
For all memory requests, units can be specified explicitly with the suffixes [K|M|G|T] for [kilobytes|megabytes|gigabytes|terabytes], with the default unit being M/megabytes. So, --mem-per-cpu=500 will request 500 megabytes of RAM per CPU, and --mem-per-cpu=32G will request 32 gigabytes of RAM per CPU.
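To make the arithmetic concrete, here is a sketch reusing the host-nprocs.sh script from the previous section: 2 tasks with 4 CPUs each at 2 gigabytes per CPU reserves \(2 * 4 * 2G = 16G\) in total, held as two 8G per-task chunks (possibly on different nodes):

$ srun -n 2 -c 4 --mem-per-cpu=2G ./host-nprocs.sh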
Here is an example of a task overrunning its memory allocation. We will use the stress-ng program to allocate 8 gigabytes of RAM in a job that only requested 400 megabytes (200 megabytes per CPU across 2 CPUs):
$ srun -n 1 --cpus-per-task 2 --mem-per-cpu 200M stress-ng -m 1 --vm-bytes 8G --oomable
srun: job 706 queued and waiting for resources
srun: job 706 has been allocated resources
stress-ng: info: [3037475] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info: [3037475] dispatching hogs: 1 vm
stress-ng: info: [3037475] successful run completed in 2.23s
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=706.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: c-8-42: task 0: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=706.0
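As a sketch of the fix (output omitted), giving the same workload enough memory allows it to finish: 2 CPUs at 5 gigabytes per CPU is 10G in total, comfortably above the 8G being allocated, and -t 10s stops the stressor after 10 seconds rather than the default one-day run:

$ srun -n 1 --cpus-per-task 2 --mem-per-cpu 5G stress-ng -m 1 --vm-bytes 8G -t 10s --oomable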