This is an ugly one-liner that I use to calculate the maximum number of parallel jobs I can use to build flash-attn on a given NVIDIA/CUDA machine. I've had to build on machines that were RAM-constrained and others that were CPU-constrained (ranging from an Orin AGX 32GB to a 128-vCPU AMD machine with 1.5TB RAM and 8xH100).
Each job peaks at around 15GB of RAM and around 4 threads. The build will likely crash if we go over RAM, but will just be slower if we go over threads. So I take the lesser of the two limits as the max parallelization for the build.
YMMV, but this is the closest I've come to making this systematic as opposed to experiential.
export MAX_JOBS=$(mem_jobs=$(expr $(free --giga | grep Mem: | awk '{print $2}') / 15); proc_jobs=$(expr $(nproc) / 4); echo $(($proc_jobs < $mem_jobs ? $proc_jobs : $mem_jobs)))
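For anyone who would rather read it than squint at it, here is the same calculation spelled out over several lines. Treat it as a sketch: the function name `calc_max_jobs` and the two constants are just names I made up for illustration, the 15GB and 4-thread figures are the rough estimates from above, and the floor of 1 is an addition of mine so that a small machine does not end up with MAX_JOBS=0.

```shell
#!/usr/bin/env bash
# Assumed per-job costs (the rough estimates from above); adjust for your build.
GB_PER_JOB=15
THREADS_PER_JOB=4

# calc_max_jobs MEM_GB CPUS  (hypothetical helper name)
# Prints min(MEM_GB / GB_PER_JOB, CPUS / THREADS_PER_JOB), floored at 1.
calc_max_jobs() {
  local mem_gb=$1 cpus=$2
  local mem_jobs=$(( mem_gb / GB_PER_JOB ))        # jobs that fit in RAM
  local proc_jobs=$(( cpus / THREADS_PER_JOB ))    # jobs that fit on the CPUs
  local jobs=$(( proc_jobs < mem_jobs ? proc_jobs : mem_jobs ))
  # Floor at 1 (my addition, not in the one-liner) to avoid MAX_JOBS=0.
  echo $(( jobs < 1 ? 1 : jobs ))
}

# Total RAM in (decimal) GB from the "Mem:" row of free, and the CPU count from nproc.
total_gb=$(free --giga | awk '/^Mem:/ {print $2}')
export MAX_JOBS=$(calc_max_jobs "$total_gb" "$(nproc)")
echo "MAX_JOBS=$MAX_JOBS"
```

Source this (or paste it into the shell) before kicking off the build so that MAX_JOBS is already exported, e.g. before running pip install flash-attn --no-build-isolation.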