I am having the same issue and, after about a week, I still cannot figure it out. I have 6 Slurm nodes. 2 are working fine, but the other 4 give the same error.
Running sudo slurmd -Dvvvv gives me the error below:
slurmd: error: Error binding slurm stream socket: Address already in use
slurmd: fatal: Unable to bind listen port (6818): Address already in use
The slurmd service is running.
sudo lsof -i :6818
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
slurmd 1312724 root 5u IPv4 12836747 0t0 TCP *:6818 (LISTEN)
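The lsof output shows that the process listening on 6818 is slurmd itself, so the bind error from the foreground run may just be the second instance colliding with the running service. Here is a rough sketch for checking that. The helper name listener_from_lsof is made up for illustration, and the unit name "slurmd" is an assumption about how the service is installed.

```shell
# Sketch only: listener_from_lsof is a hypothetical helper, not part of Slurm.
# It reads `lsof -i :PORT` output on stdin and prints "COMMAND PID" for each
# socket in the LISTEN state, skipping the header line.
listener_from_lsof() {
    awk 'NR > 1 && $NF == "(LISTEN)" { print $1, $2 }'
}

# Usage on a node (assumes slurmd runs as a systemd unit named "slurmd"):
#   sudo lsof -i :6818 | listener_from_lsof
#   systemctl show -p MainPID --value slurmd
# If both report the same PID, the listener is the service itself, and a
# foreground `slurmd -D` cannot bind the port until the service is stopped.
```

If the PIDs match, stopping the service before the foreground debug run should avoid the bind error. It would not by itself explain why the node goes down again later.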
If I kill the process, the error goes away for a few minutes and the node shows as idle. But a few minutes later, the node is down again.
I ran sudo killall slurmd with no success. All 4 nodes behave the same way; only the controller and the first worker node in the cluster are unaffected.
I have since added a port range, thinking that there might be a port conflict:
scontrol show config | grep SrunPortRange
SrunPortRange = 60001-63000
So something is blocking the port or the process.
I tried all the suggestions above and none has helped.