79482734

Date: 2025-03-04 05:32:26

TL;DR: reduce the bucket count (to 4, 6, 8, etc. per day, depending on the size of the files being created). You have a too-many-small-files issue.

As per my understanding: buckets: 1000; days (let's assume one year): 365. That gives 365 * 1000 = 365,000 partitions (files).
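The file-count arithmetic above can be sanity-checked with a couple of lines (the 8-buckets-per-day figure is just an illustration of the smaller bucket counts suggested in the TL;DR):

```python
# File-count math from the answer: buckets per day * days = total files.
buckets_per_day = 1000
days = 365

total_files = buckets_per_day * days
print(total_files)  # 365000 files with 1000 buckets per day

# With a small bucket count (e.g. 8 per day) the file count drops sharply:
print(8 * days)  # 2920 files
```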

Now, to write out that many files and process them all in parallel, you would need on the order of 1000 CPU cores, which does not make any sense. Rule of thumb for file partitions and parallel execution, say for 25 GB of data:

file size: 25 * 1024 = 25,600 MB
partitions (default 128 MB each): 25,600 / 128 = 200
CPU cores needed: 200 (dialing down to 50 cores)

200 cores would process all partitions in parallel for the fastest execution. If you can give the job some more time, 50 cores with the partitions processed in a queue would be enough; execution time will increase by about 4x (just an approximation, it depends on the transformations being performed on the data).
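That rule of thumb can be sketched as a small helper. The 128 MB figure is Spark's default split size (`spark.sql.files.maxPartitionBytes`); the function name is mine, not from the answer:

```python
import math

def cores_for(data_gb: float, partition_mb: int = 128) -> int:
    """Number of input partitions for `data_gb` of data at `partition_mb`
    per partition; one core per partition gives full parallelism."""
    return math.ceil(data_gb * 1024 / partition_mb)

full = cores_for(25)        # 25 GB at 128 MB per partition
print(full)                 # 200 partitions -> 200 cores for full parallelism
print(math.ceil(full / 4))  # dialing down to 50 cores means ~4 waves of tasks
```

Dialing down to a quarter of the cores queues the 200 tasks in roughly four waves, which is where the ~4x runtime estimate comes from.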

Executors needed (with 4 cores per executor; 2-5 cores per executor is the usual recommendation): 50 / 4 ≈ 13. Memory per executor: 4 GB, so total memory = 13 * 4 = 52 GB.
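The executor sizing works out like this (same numbers as above, written out):

```python
import math

cores_total = 50          # from the partition estimate for 25 GB of data
cores_per_executor = 4    # within the usual 2-5 cores-per-executor range
mem_per_executor_gb = 4

executors = math.ceil(cores_total / cores_per_executor)
total_mem_gb = executors * mem_per_executor_gb
print(executors, total_mem_gb)  # 13 executors, 52 GB total
```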

The driver also needs a resource allocation (around 4 cores + 16 GB). These are rough estimates.
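Put together as a `spark-submit` invocation, the estimates above would look something like this (config fragment only; `app.py` is a placeholder, and `--driver-cores` applies in cluster mode):

```shell
# Rough sizing from the estimates above: 13 executors x 4 cores x 4 GB,
# plus a 4-core / 16 GB driver.
spark-submit \
  --num-executors 13 \
  --executor-cores 4 \
  --executor-memory 4g \
  --driver-cores 4 \
  --driver-memory 16g \
  app.py
```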

Posted by: Nitish Bhardwaj