Maybe this paper can help you:
Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems
https://arxiv.org/abs/2502.05293