79809517

Date: 2025-11-05 00:36:42
Score: 1
Natty:
Report link

So after much more investigation, I have determined that this is just the difference in space efficiency between Parquet and Pandas for the kind of data in my files: the dataset includes several Date and Decimal columns, which Parquet and Spark can store in a very memory-efficient way, but Pandas cannot.

If someone else has a similar issue, I suggest moving your implementation to PySpark, which handles this kind of data much better. Unfortunately, that is not an option for me, so I have had to fundamentally alter my approach.
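To illustrate the overhead, here is a minimal standard-library sketch (not my actual pipeline) of why a Decimal column blows up in Pandas: Pandas stores such a column as one Python object per cell, while Parquet's internal representation is a compact fixed-width encoding. The sizes below are approximations via `sys.getsizeof`, and the 8-bytes-per-value figure is an assumed stand-in for a fixed-width column:

```python
import sys
from array import array
from decimal import Decimal

n = 10_000

# What an object-dtype column looks like in memory: a list of pointers,
# each pointing at a separately allocated Decimal object on the heap.
decimals = [Decimal("123.45") for _ in range(n)]
object_bytes = sys.getsizeof(decimals) + sum(sys.getsizeof(d) for d in decimals)

# Stand-in for a fixed-width columnar layout: 8 bytes per value, contiguous.
floats = array("d", (123.45 for _ in range(n)))
fixed_bytes = floats.itemsize * n

print(f"object column:      ~{object_bytes:,} bytes")
print(f"fixed-width column: ~{fixed_bytes:,} bytes")
```

On CPython this shows an order-of-magnitude difference for the same 10,000 values, which matches what I saw when loading my Parquet files into Pandas.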

Reasons:
  • Long answer (-0.5):
  • No code block (0.5):
  • Self-answer (0.5):
  • Low reputation (0.5):
Posted by: AngusB