79723303

Date: 2025-08-02 10:56:00
Score: 0.5
Natty:
Report link

FYI I would convert to float16 (2x smaller, almost no loss).

But best practical answer (keeps float32 precision at read time, shrinks disk by ~4x, gives millisecond random row fetch):

  1. quantize to float8

    • 1 byte per value instead of 4 bytes -> 500gb -> 125gb

    • benchmarks (Naamán Huerga-Pérez et al.) shows smaller than 0.3 quality loss.

  2. Store as a single memory-mapped binary file

    • one line gives you any subset of rows in ~1-2 ms with no servers no polars.

If you later want even smaller, combine light pca drop to 384 dims + float8 -> 60~GB total. thats the simplest, fastest route to both goals.

Reasons:
  • Long answer (-0.5):
  • No code block (0.5):
  • Low reputation (0.5):
Posted by: Z'K