79198797

Date: 2024-11-18 05:48:06
Score: 1

I am having this same issue. For about a week now, I have not been able to figure it out. I have 6 Slurm nodes. 2 are working fine, but the other 4 give the same error:

sudo slurmd -Dvvvv gives me the error below:

    slurmd: error: Error binding slurm stream socket: Address already in use
    slurmd: fatal: Unable to bind listen port (6818): Address already in use

The slurmd service is running.

sudo lsof -i :6818

    COMMAND     PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
    slurmd  1312724 root    5u  IPv4 12836747      0t0  TCP *:6818 (LISTEN)
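The lsof output above shows that the process holding port 6818 is itself a slurmd. That would be consistent with the service-managed daemon already owning the port when a second copy is started by hand with -D, though this is only one possible reading. A minimal sketch for pulling the listener's PID out of that output follows. The helper name listener_pids is made up for illustration, and the sample text is copied from the output quoted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm or lsof): reads `lsof -i :PORT`
# output on stdin and prints the PID of every row that ends in (LISTEN).
# NR > 1 skips the header row; $2 is the PID column.
listener_pids() {
    awk 'NR > 1 && $NF == "(LISTEN)" { print $2 }'
}

# Sample copied from the lsof output quoted in the post.
sample='COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
slurmd 1312724 root 5u IPv4 12836747 0t0 TCP *:6818 (LISTEN)'

printf '%s\n' "$sample" | listener_pids
```

On a live node the same helper could be fed directly, as in `sudo lsof -i :6818 | listener_pids`. Running `ps -o pid,ppid,args -p` on the resulting PID would then show which process started it.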

If I kill the process, the error goes away for a few minutes and the node goes into the idle state. But a few minutes later, the node is down again.

I ran sudo killall slurmd with no success. All 4 nodes behave the same way; only the controller and the first worker node on the cluster are unaffected.

I have since added a port range, thinking that it was a port conflict:

scontrol show config | grep SrunPortRange

    SrunPortRange = 60001-63000
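For what it is worth, SrunPortRange controls the ports that srun uses, while the port slurmd listens on is set separately by SlurmdPort (6818 by default). A rough sketch for checking whether a given port falls inside a range string of the form scontrol prints is shown here. The function name port_in_range is hypothetical, and the values are the ones quoted above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if PORT lies inside RANGE,
# where RANGE is "LOW-HIGH" as printed for SrunPortRange.
port_in_range() {
    port=$1
    low=${2%-*}     # text before the dash
    high=${2#*-}    # text after the dash
    [ "$port" -ge "$low" ] && [ "$port" -le "$high" ]
}

# Values taken from the post: slurmd's listen port and the configured range.
if port_in_range 6818 60001-63000; then
    echo "6818 is inside SrunPortRange"
else
    echo "6818 is outside SrunPortRange"
fi
```

Since 6818 lies outside 60001-63000, changing SrunPortRange would not be expected to affect the bind error on 6818.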

So something seems to be blocking either the port or the process.

I tried all the suggestions above and none of them has helped.

Reasons:
  • Blacklisted phrase (0.5): I cannot
  • Long answer (-1):
  • No code block (0.5):
  • Low reputation (1):
Posted by: Hippolyte Asah