FYI, I would convert to float16 (2x smaller, almost no quality loss).
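That part is a one-liner with NumPy; a minimal sketch, assuming your embeddings are already a float32 array (the array and file name here are placeholders):

```python
import numpy as np

# placeholder array standing in for your real float32 embeddings
emb = np.random.rand(100_000, 768).astype(np.float32)

# float16 halves the bytes per value; usually fine for similarity search
np.save("embeddings_fp16.npy", emb.astype(np.float16))
```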
But the best practical answer (you still read rows back as float32 at query time, disk shrinks by ~4x, and random row fetches take milliseconds):
quantize to float8
1 byte per value instead of 4 -> 500 GB becomes 125 GB
benchmarks (Naamán Huerga-Pérez et al.) report a quality loss of under 0.3
Store it all as a single memory-mapped binary file, so you can fetch arbitrary rows without loading the whole thing (a sketch of both steps follows below).
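A minimal sketch of that pipeline, assuming the ml_dtypes package (which registers float8 dtypes with NumPy); the corpus size, chunk size, file name, and the load_float32_chunk() stub are all placeholders for however you currently read your data:

```python
import numpy as np
import ml_dtypes  # registers float8 dtypes (e.g. float8_e4m3fn) with NumPy

N, D = 1_000_000, 768          # placeholder corpus size and dimension
F8 = ml_dtypes.float8_e4m3fn   # 1 byte per value; enough range for normalized embeddings

def load_float32_chunk(start, size):
    # stand-in for however you currently read your float32 embeddings
    return np.random.rand(size, D).astype(np.float32)

# write: quantize chunk by chunk into one flat binary file
out = np.memmap("emb_f8.bin", dtype=F8, mode="w+", shape=(N, D))
for start in range(0, N, 100_000):
    chunk = load_float32_chunk(start, 100_000)
    out[start:start + len(chunk)] = chunk.astype(F8)  # float32 -> float8
out.flush()

# read: mmap the file and pull arbitrary rows, dequantized back to float32
emb = np.memmap("emb_f8.bin", dtype=F8, mode="r", shape=(N, D))
rows = emb[[12, 987_654]].astype(np.float32)
```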
If you later want it even smaller, combine a light PCA drop to 384 dims with float8 -> ~60 GB total. That's the simplest, fastest route to both goals.
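A sketch of that combined route, using scikit-learn's PCA; the random sample is a placeholder for a representative slice of your real embeddings, and the rest of the corpus would stream through the same memmap pattern as above:

```python
import numpy as np
import ml_dtypes
from sklearn.decomposition import PCA

F8 = ml_dtypes.float8_e4m3fn

# placeholder sample standing in for a representative slice of your embeddings
sample = np.random.rand(200_000, 768).astype(np.float32)

# fit a light PCA once, then reuse it for every chunk
pca = PCA(n_components=384).fit(sample)

# project to 384 dims and quantize: 768 * 4 bytes -> 384 * 1 byte per vector (~8x smaller)
reduced = pca.transform(sample).astype(F8)
out = np.memmap("emb_pca384_f8.bin", dtype=F8, mode="w+", shape=reduced.shape)
out[:] = reduced
out.flush()
```

Worth checking retrieval quality on a held-out query set after the PCA step before committing the full 500 GB, since the acceptable dimensionality depends on your data.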