Try using float16 instead of bfloat16; it seems that bfloat16 uses more VRAM.
torch_dtype=torch.float16
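For example, a minimal sketch of passing the dtype when loading the pipeline (the model ID here is just a placeholder for illustration; use whichever checkpoint you are actually loading):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline with half-precision weights to cut VRAM use
# roughly in half compared to float32.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder model ID
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
```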
Or try using xformers to reduce VRAM usage:
self.pip.enable_xformers_memory_efficient_attention()
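A short sketch of enabling it, assuming the `pipe` object from the snippet above and that the `xformers` package is installed:

```python
# Swap the default attention for xformers' memory-efficient attention.
pipe.enable_xformers_memory_efficient_attention()

# Inference then runs as usual, just with lower attention memory overhead.
image = pipe("a photo of an astronaut riding a horse").images[0]
```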