After much more investigation, I have determined that this is simply the difference in space efficiency between Parquet and Pandas for the kind of data in my files. The dataset includes several Date and Decimal columns, which Parquet and Spark can store in a very memory-efficient way, but which Pandas loads as full Python objects (dtype `object`), greatly inflating memory usage.
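To illustrate the effect (a minimal sketch, not taken from my actual dataset): when Decimal values land in an `object` column, each element is a separate Python object, whereas a native numeric dtype packs values into a fixed-width array.

```python
import decimal

import numpy as np
import pandas as pd

n = 100_000

# Decimal values become Python objects in an object-dtype Series;
# memory_usage(deep=True) counts the per-object overhead.
as_objects = pd.Series([decimal.Decimal("1.23") for _ in range(n)])

# The same values as float64 occupy a fixed 8 bytes each.
as_floats = pd.Series(np.full(n, 1.23))

object_bytes = as_objects.memory_usage(deep=True)
float_bytes = as_floats.memory_usage(deep=True)

print(f"object dtype: {object_bytes:,} bytes")
print(f"float64 dtype: {float_bytes:,} bytes")
```

On my machine the object-dtype Series is many times larger; the same overhead applies to Date columns materialized as `datetime.date` objects rather than `datetime64` values.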
If anyone else runs into a similar issue, I suggest moving your implementation to PySpark, which handles this kind of data much better. Unfortunately that is not an option for me, so I have had to fundamentally alter my approach.