Sometimes you have to share a machine with others. Ideally you’d use something like Slurm or Kubernetes to coordinate resource usage, but in one-off cases it may not be worth the hassle. If you can agree in advance who gets to use which resources on a node, all you need to do is make sure your processes obey these constraints.
Use nvidia-smi to find out how many GPUs you have.
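If you want to do this programmatically, one option is to parse the output of nvidia-smi -L, which prints one line per GPU. A minimal sketch (assuming nvidia-smi is on the PATH; the function name count_gpus is just for illustration):

import subprocess

def count_gpus() -> int:
    # `nvidia-smi -L` prints one line per GPU, e.g. "GPU 0: NVIDIA ...".
    out = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    ).stdout
    return sum(1 for line in out.splitlines() if line.startswith("GPU "))

print(count_gpus(), "GPUs found")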
To limit the GPUs visible to a script train.py, set the CUDA_VISIBLE_DEVICES environment variable:
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py
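Inside the script, the framework then only sees the listed devices, renumbered from 0. A quick way to check, assuming the script uses PyTorch:

import torch

# With CUDA_VISIBLE_DEVICES=0,1,2,3 this prints 4; the visible GPUs are
# renumbered 0-3 inside the process regardless of their physical indices.
print(torch.cuda.device_count())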
taskset
Check how many CPUs you have using nproc or lscpu. (Note: these tools report the number of logical CPUs, not physical CPUs. Typically there are two logical CPUs per physical core, and these logical CPUs share some resources, such as cache, which can have performance implications. Here we ignore physical CPUs: for taskset, logical CPUs are what matter.)
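From within Python you can check both the machine's logical CPU count and the subset your process is actually allowed to run on (the latter reflects any taskset restriction); a small sketch:

import os

# Total logical CPUs on the machine.
print("logical CPUs:", os.cpu_count())

# Logical CPUs this process may run on (Linux only; reflects taskset).
print("allowed CPUs:", sorted(os.sched_getaffinity(0)))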
To set the CPU affinity of a script train.py, use the Linux utility taskset. You can specify a range of CPUs or a list of specific cores:
taskset -c 0-4 python train.py
taskset -c 0,1,2,3 python train.py
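If you'd rather not wrap the command, you can also pin the process from inside the script itself using the standard library; a sketch equivalent to taskset -c 0-4:

import os

# Restrict this process (pid 0 = the calling process) to logical CPUs 0-4.
os.sched_setaffinity(0, {0, 1, 2, 3, 4})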
It’s also possible to change the affinity of an actively running job:
taskset -cp 0-4 <PID>
where the PID can be found with ps aux | grep train.py.
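The same can be done programmatically: os.sched_setaffinity also accepts another process's PID, given sufficient permissions. A hedged sketch, assuming the running script is still called train.py and that pgrep finds exactly the process you want:

import os
import subprocess

# Find the PID of the running training script (same idea as ps aux | grep train.py).
out = subprocess.run(["pgrep", "-f", "train.py"],
                     capture_output=True, text=True, check=True).stdout
pid = int(out.split()[0])

# Move the running job onto logical CPUs 0-4.
os.sched_setaffinity(pid, {0, 1, 2, 3, 4})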