I've worked with the Gemma model and its quantization in the past. Based on my investigation and experimentation with this error, here are my observations and suggestions.
The following are some of the likely causes of this error:
Memory Requirements:
a) The overhead from CUDA, NCCL, PyTorch, and the TGI runtime, plus model-sharding inefficiencies, can push total usage past the available GPU memory and trigger out-of-memory errors (see the quick check below).
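As a quick sanity check, compare each GPU's free memory against the size of its shard (roughly 2 bytes per parameter in fp16/bf16) plus runtime overhead:

```bash
# Free vs. total memory per GPU, run on the host before starting the container
nvidia-smi --query-gpu=index,name,memory.total,memory.free --format=csv
```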
Multi-GPU Sharding:
a) A proper multi-GPU distributed setup requires NCCL to work flawlessly and enough memory on each GPU to hold its shard plus overhead; a quick check is sketched below.
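A minimal check, run inside the container, that all GPUs are visible and the NCCL backend is usable (assuming `python` and PyTorch are available in the image, which they normally are for TGI):

```bash
# Number of visible GPUs and whether the NCCL backend is available to PyTorch
python -c "import torch, torch.distributed as dist; print(torch.cuda.device_count(), dist.is_nccl_available())"
```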
NCCL Errors in Docker on Windows/WSL2:
a) An NCCL out-of-memory error can arise from driver or environment mismatches, especially on Windows Server with the WSL2 backend.
b) Check that the NCCL and CUDA versions are compatible, and ensure Docker is configured to expose the GPUs and enough shared memory (version-check commands below).
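A quick way to read the relevant versions (driver version from the host, CUDA/NCCL versions from inside the container):

```bash
# Host driver and the CUDA version it supports
nvidia-smi
# CUDA and NCCL versions that the PyTorch build inside the container uses
python -c "import torch; print(torch.version.cuda, torch.cuda.nccl.version())"
```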
My suggestions and possible solutions you can try:
Test on a Single GPU First:
a) Try loading the model on a single GPU to confirm that it loads correctly without sharding. This helps determine whether the issue is with the model files or with sharding (a minimal sketch follows this section).
b) If this works, proceed to the points below.
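A minimal single-GPU sketch, assuming the standard TGI image and the `google/gemma-7b` model id (substitute your own image tag, model id, and paths):

```bash
# Expose only GPU 0 to the container and disable sharding to isolate the problem
# (add -e HF_TOKEN=<your token> if the model repo is gated)
docker run --gpus "device=0" --shm-size=2g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id google/gemma-7b \
  --num-shard 1
```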
Increase Docker Shared Memory:
a) Allocate more shared memory by adding `--shm-size=2g` or higher to the `docker run` command (e.g. `docker run --gpus all --shm-size=2g ...`); a full example is sketched below.
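Putting it together for the multi-GPU case, a sketch of the full command (image tag and model id are assumptions, as above):

```bash
# Multi-GPU sharded launch with a larger shared-memory segment for NCCL
docker run --gpus all --shm-size=2g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id google/gemma-7b \
  --num-shard 2
```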
Do Not Set `CUDA_VISIBLE_DEVICES` Explicitly in Docker:
a) Setting `CUDA_VISIBLE_DEVICES` inside the container can interfere with NCCL's device discovery and cause errors; select GPUs with Docker's `--gpus` flag instead, as shown below.
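For example, prefer Docker-level GPU selection over exporting the variable into the container:

```bash
# Let Docker control which GPUs the container sees
docker run --gpus '"device=0,1"' --shm-size=2g ...
# rather than
docker run --gpus all -e CUDA_VISIBLE_DEVICES=0,1 ...
```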
Verify NCCL Debug Logs:
a) Run the container with the `NCCL_DEBUG=INFO` environment variable to get detailed NCCL logs and identify the exact failure point, for example:
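```bash
# Add verbose NCCL logging to your existing run command, then inspect the container logs
docker run --gpus all --shm-size=2g -e NCCL_DEBUG=INFO ...
docker logs -f <container_name>
```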
Please let me know if this approach works for you.