✅ Best Fix:
In many cases, the issue is due to missing essential EKS add-ons. The first thing to try:
💡 Install these EKS add-ons via AWS Console or CLI:
- vpc-cni
- kube-proxy
- CoreDNS
This resolves common causes like `NetworkPluginNotReady` errors or nodes stuck in the `NotReady` state.
🧪 Still facing issues? Here's a structured troubleshooting guide:
Run:
kubectl get nodes
Look for nodes in states like `NotReady`.
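For example, to narrow the output down to the unhealthy nodes (a simple grep filter, nothing EKS-specific):

```bash
# Wide output shows internal IPs, OS image and kubelet version per node
kubectl get nodes -o wide

# Show only the problem nodes
kubectl get nodes | grep NotReady
```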
For detailed info on an unhealthy node:
kubectl describe node <node-name>
Check for messages like CNI plugin failures, disk pressure, or kubelet issues.
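The `Conditions` section is usually the most telling part. As a shortcut, you can print just the conditions and their reasons (plain kubectl JSONPath; `<node-name>` is a placeholder):

```bash
# One line per condition: type, status, reason, message
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
```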
Then list the pods running on that node:
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
This helps pinpoint pods causing resource issues or crashes.
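A variation that sorts those pods by restart count makes crash-looping workloads easy to spot (standard kubectl flags, no extra tooling):

```bash
# Pods on the node, ordered by how often their first container has restarted
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> \
  --sort-by='.status.containerStatuses[0].restartCount'
```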
If needed, SSH into the EC2 instance and check logs:
Kubelet logs:
journalctl -u kubelet
Container runtime (Docker/containerd):
systemctl status docker # or containerd
General system logs:
/var/log/messages or /var/log/syslog
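A rough log-gathering sketch once you're on the node (this assumes an Amazon Linux EKS node with systemd; unit names and paths may differ on other AMIs):

```bash
# Recent kubelet logs, the most useful place for NotReady / registration errors
sudo journalctl -u kubelet --no-pager -n 200

# Container runtime health: newer EKS AMIs run containerd, older ones Docker
sudo systemctl status containerd --no-pager || sudo systemctl status docker --no-pager

# General system messages (Amazon Linux logs to /var/log/messages)
sudo tail -n 100 /var/log/messages
```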
High CPU, memory, or disk usage can make nodes unhealthy:
kubectl top nodes
Scale the node group or move to a larger instance type.
Clean unused images:
docker system prune -a
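To dig a bit deeper into resource pressure (note that `kubectl top` needs the metrics-server add-on; the `df` command runs on the node itself):

```bash
# Node-level CPU and memory usage
kubectl top nodes

# Heaviest pods across the cluster
kubectl top pods --all-namespaces --sort-by=memory | head -n 20
kubectl top pods --all-namespaces --sort-by=cpu | head -n 20

# On the node: root volume usage, a common cause of DiskPressure
df -h /
```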
Check that the vpc-cni add-on is installed.
Verify VPC, subnet, and security group settings.
Ensure nodes can reach the control plane.
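A quick sketch of these checks from kubectl (the vpc-cni plugin runs as the `aws-node` DaemonSet in `kube-system`):

```bash
# Every node should run a healthy aws-node and kube-proxy pod
kubectl get daemonset aws-node kube-proxy -n kube-system
kubectl get pods -n kube-system -o wide | grep -E 'aws-node|kube-proxy|coredns'

# Inspect CNI errors on a specific aws-node pod (pod name is a placeholder)
kubectl logs -n kube-system <aws-node-pod-name> -c aws-node --tail=50
```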
Restart kubelet:
systemctl restart kubelet
Restart Docker/containerd:
systemctl restart docker
Ensure your NodeGroup has:
- A proper IAM role with the required policies.
- The correct instance profile.
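You can verify this from the CLI; a sketch with placeholder names (the three managed policies mentioned below are the ones EKS managed node groups normally need):

```bash
# Find the IAM role used by the node group
aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name my-nodegroup \
  --query 'nodegroup.nodeRole'

# List its attached policies; expect AmazonEKSWorkerNodePolicy,
# AmazonEKS_CNI_Policy and AmazonEC2ContainerRegistryReadOnly
aws iam list-attached-role-policies --role-name <node-role-name>
```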
If the node doesn't recover:
Drain and remove it:
kubectl drain <node-name> --ignore-daemonsets --force
Terminate the EC2 instance; Auto Scaling will replace it.
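The full replacement flow could look like this (instance ID is a placeholder; the node group's Auto Scaling group brings up a fresh node after termination):

```bash
# Move workloads off the node and remove it from the cluster
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
kubectl delete node <node-name>

# Terminate the backing EC2 instance so Auto Scaling replaces it
aws ec2 terminate-instances --instance-ids <instance-id>
```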
If you're using a newer EKS version (e.g., v1.30), use a compatible Amazon Linux 2 AMI; Amazon Linux 2023 often causes issues with the CNI and kubelet.
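You can check which AMI type a managed node group is using (cluster and node group names are placeholders):

```bash
# AL2_x86_64 = Amazon Linux 2, AL2023_x86_64_STANDARD = Amazon Linux 2023
aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name my-nodegroup \
  --query 'nodegroup.amiType'
```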
Enable Node Auto Repair on your managed node group so unhealthy nodes are replaced automatically in the future.
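If your AWS CLI version is recent enough, you can turn it on from the command line; a sketch assuming the `--node-repair-config` option of `update-nodegroup-config` (verify with `aws eks update-nodegroup-config help` on your version first):

```bash
# Turn on automatic repair of unhealthy nodes for an existing managed node group
aws eks update-nodegroup-config --cluster-name my-cluster --nodegroup-name my-nodegroup \
  --node-repair-config enabled=true
```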
📌 Bonus: What helps most is context.
If you're still stuck, please share:
- Output of `kubectl get nodes`
- Any error messages from `kubectl describe node`
- EKS version and NodeGroup AMI
- Recent changes to your cluster setup