✅ Best Fix:
In many cases, the issue is due to missing essential EKS add-ons. The first thing to try:
💡 Install these EKS add-ons via AWS Console or CLI:
- vpc-cni
- kube-proxy
- CoreDNS
This resolves common causes like `NetworkPluginNotReady` errors or nodes stuck in the `NotReady` state.
🧪 Still facing issues? Here's a structured troubleshooting guide:
Run:
kubectl get nodes
Look for nodes in states like `NotReady`.
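For example, to narrow the output down to the unhealthy nodes (a simple grep filter, nothing EKS-specific):

```bash
# Wide output shows internal IPs, OS image and kubelet version per node
kubectl get nodes -o wide

# Show only the problem nodes
kubectl get nodes | grep NotReady
```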
For detailed info on an unhealthy node:
kubectl describe node <node-name>
Check for messages like CNI plugin failures, disk pressure, or kubelet issues.
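The `Conditions` section is usually the most telling part. As a shortcut, you can print just the conditions and their reasons (plain kubectl JSONPath; `<node-name>` is a placeholder):

```bash
# One line per condition: type, status, reason, message
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
```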
Then list the pods running on that node:
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
This helps pinpoint pods causing resource issues or crashes.
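A variation that sorts those pods by restart count makes crash-looping workloads easy to spot (standard kubectl flags, no extra tooling):

```bash
# Pods on the node, ordered by how often their first container has restarted
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> \
  --sort-by='.status.containerStatuses[0].restartCount'
```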
If needed, SSH into the EC2 instance and check logs:
Kubelet logs:
journalctl -u kubelet
Container runtime (Docker/containerd):
systemctl status docker # or containerd
General system logs:
/var/log/messages or /var/log/syslog
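A rough log-gathering sketch once you're on the node (this assumes an Amazon Linux EKS node with systemd; unit names and paths may differ on other AMIs):

```bash
# Recent kubelet logs, the most useful place for NotReady / registration errors
sudo journalctl -u kubelet --no-pager -n 200

# Container runtime health: newer EKS AMIs run containerd, older ones Docker
sudo systemctl status containerd --no-pager || sudo systemctl status docker --no-pager

# General system messages (Amazon Linux logs to /var/log/messages)
sudo tail -n 100 /var/log/messages
```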
High CPU, memory, or disk usage can make nodes unhealthy:
kubectl top nodes
Scale the node group or move to a larger instance type.
Clean unused images:
docker system prune -a
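To dig a bit deeper into resource pressure (note that `kubectl top` needs the metrics-server add-on; the `df` command runs on the node itself):

```bash
# Node-level CPU and memory usage
kubectl top nodes

# Heaviest pods across the cluster
kubectl top pods --all-namespaces --sort-by=memory | head -n 20
kubectl top pods --all-namespaces --sort-by=cpu | head -n 20

# On the node: root volume usage, a common cause of DiskPressure
df -h /
```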
Check that the vpc-cni add-on is installed.
Verify VPC, subnet, and security group settings.
Ensure nodes can reach the control plane.
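A quick sketch of these checks from kubectl (the vpc-cni plugin runs as the `aws-node` DaemonSet in `kube-system`):

```bash
# Every node should run a healthy aws-node and kube-proxy pod
kubectl get daemonset aws-node kube-proxy -n kube-system
kubectl get pods -n kube-system -o wide | grep -E 'aws-node|kube-proxy|coredns'

# Inspect CNI errors on a specific aws-node pod (pod name is a placeholder)
kubectl logs -n kube-system <aws-node-pod-name> -c aws-node --tail=50
```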
Restart kubelet:
systemctl restart kubelet
Restart Docker/containerd:
systemctl restart docker
Ensure your NodeGroup has:
- A proper IAM role with the required policies.
- The correct instance profile.
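You can verify this from the CLI; a sketch with placeholder names (the three managed policies mentioned below are the ones EKS managed node groups normally need):

```bash
# Find the IAM role used by the node group
aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name my-nodegroup \
  --query 'nodegroup.nodeRole'

# List its attached policies; expect AmazonEKSWorkerNodePolicy,
# AmazonEKS_CNI_Policy and AmazonEC2ContainerRegistryReadOnly
aws iam list-attached-role-policies --role-name <node-role-name>
```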
If the node doesn't recover:
Drain and remove it:
kubectl drain <node-name> --ignore-daemonsets --force
Terminate the EC2 instance; Auto Scaling will replace it.
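The full replacement flow could look like this (instance ID is a placeholder; the node group's Auto Scaling group brings up a fresh node after termination):

```bash
# Move workloads off the node and remove it from the cluster
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
kubectl delete node <node-name>

# Terminate the backing EC2 instance so Auto Scaling replaces it
aws ec2 terminate-instances --instance-ids <instance-id>
```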
If you're using a newer EKS version (e.g., v1.30), use a compatible Amazon Linux 2 AMI; Amazon Linux 2023 often causes issues with the CNI and kubelet.
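You can check which AMI type a managed node group is using (cluster and node group names are placeholders):

```bash
# AL2_x86_64 = Amazon Linux 2, AL2023_x86_64_STANDARD = Amazon Linux 2023
aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name my-nodegroup \
  --query 'nodegroup.amiType'
```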
Enable Node Auto Repair on your managed node group so unhealthy nodes are replaced automatically in the future.
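If your AWS CLI version is recent enough, you can turn it on from the command line; a sketch assuming the `--node-repair-config` option of `update-nodegroup-config` (verify with `aws eks update-nodegroup-config help` on your version first):

```bash
# Turn on automatic repair of unhealthy nodes for an existing managed node group
aws eks update-nodegroup-config --cluster-name my-cluster --nodegroup-name my-nodegroup \
  --node-repair-config enabled=true
```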
📌 Bonus: What helps most is context.
If you're still stuck, please share:
- Output of `kubectl get nodes`
- Any error messages from `kubectl describe node`
- EKS version and NodeGroup AMI
- Recent changes to your cluster setup