79452101

Date: 2025-02-19 16:33:29
Score: 1
Natty:
Report link

I did this in Python: read the parquet file into a dataframe and then add whatever column you want. However, when I saved the new parquet file, its size changed (it was much smaller) and I was then unable to read the new file with the added columns in AWS Glue. It seems the new file gets compressed even if you specify no compression. Another option I found is to save the data as CSV; the size was only slightly larger than the parquet (since you added more columns), and the CSVs can then be read in Glue. A third option is to re-save as JSON, but if left uncompressed that file ended up 4x larger than the parquet (column names repeated in every row) or 2x larger (column names in the first row only), due to the nature of JSON. I wish there were a better way to deal with parquet files rather than simply reading these types of files and leaving it at that.
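For reference, here is a minimal sketch of the pandas round-trip described above (file names and the added column are placeholders; note that pandas/pyarrow compress parquet with snappy by default unless you pass compression=None explicitly):

    import pandas as pd

    # Read the existing parquet file into a dataframe
    df = pd.read_parquet("input.parquet")

    # Add the extra column (placeholder name and value)
    df["new_column"] = "some_value"

    # Write the parquet back out; the default compression is snappy,
    # so request no compression explicitly if that is what you want
    df.to_parquet("output.parquet", engine="pyarrow", compression=None)

    # Alternatively, write CSV for Glue to read instead of parquet
    df.to_csv("output.csv", index=False)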

Reasons:
  • Long answer (-0.5):
  • No code block (0.5):
  • Single line (0.5):
  • Low reputation (0.5):
Posted by: Dmitri