TL;DR: decrease the number of buckets (to 4, 6, 8, etc. per day) depending on the size of the files being created. The root problem is the classic too-many-small-files issue.
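For illustration, a minimal sketch of the TL;DR in code. The `events` DataFrame and the `event_date` / `user_id` columns are hypothetical; the assumption is a table partitioned by day and bucketed by a key column:

```scala
// Hypothetical example: cut buckets per daily partition from 1000 down to 8.
// `events` is assumed to be an existing DataFrame.
events.write
  .partitionBy("event_date")        // one directory per day
  .bucketBy(8, "user_id")           // 8 buckets per day instead of 1000
  .sortBy("user_id")                // optional, keeps each bucket sorted
  .mode("overwrite")
  .saveAsTable("events_bucketed")   // bucketBy requires saveAsTable, not save()
```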
According to my understanding: buckets: 1000; days (let's assume 1 year): 365; that's 365 * 1000 = 365,000 partitions (files).
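Just to make the file-count arithmetic explicit (a sketch; 8 buckets per day is only an example figure):

```scala
// Total files = days * buckets per day
val days = 365
val filesWith1000Buckets = days * 1000   // 365,000 small files
val filesWith8Buckets    = days * 8      // 2,920 files, far more manageable
```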
Now, to write out that number of files and process them in parallel, you would need on the order of 1000 CPU cores, which does not make any sense.

Rule of thumb for file partitions and parallel execution. Let's say the data size is 25 GB:
25 GB * 1024 => 25,600 MB
partitions (at the default 128 MB each): 25,600 / 128 => 200
CPU cores needed: 200 (dialing down to 50 cores)
// 200 cores means all partitions are processed in parallel for the fastest
// execution. If you can give the job some more time and let partitions be
// processed in a queue, 50 cores would be enough; execution time will
// increase by roughly 4x (just an approximation, it depends on the
// transformations being performed on the data).
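The same back-of-the-envelope calculation as code (assuming 25 GB of input and the default ~128 MB target partition size, i.e. `spark.sql.files.maxPartitionBytes`):

```scala
// Rule-of-thumb partition and core estimate for ~25 GB of input data.
val dataSizeMB    = 25 * 1024                                            // 25,600 MB
val partitionMB   = 128                                                  // default target partition size
val numPartitions = math.ceil(dataSizeMB.toDouble / partitionMB).toInt   // 200

val coresAvailable = 50
val waves = math.ceil(numPartitions.toDouble / coresAvailable).toInt     // 4 waves
// With 50 cores the 200 partitions run in ~4 waves instead of 1,
// which is where the rough 4x runtime increase comes from.
```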
Executors needed (with 4 cores per executor; 2-5 cores per executor is the usual recommendation): 50 / 4 => ~13. Memory per executor: 4 GB, so total executor memory = 13 * 4 = 52 GB.
Then the driver also needs its own resource allocation (roughly 4 cores + 16 GB). These are rough estimates; a sketch of the corresponding configuration is below.
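A sketch of how those estimates map onto standard Spark properties (the numbers are the ones derived above; adjust to your cluster, and note that driver settings normally have to be supplied at launch via spark-submit or cluster config rather than set in application code):

```scala
import org.apache.spark.sql.SparkSession

// Rough translation of the resource estimates above into Spark configuration.
val spark = SparkSession.builder()
  .appName("bucketed-write")
  .config("spark.executor.instances", "13")   // ~50 cores / 4 cores per executor
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "4g")      // 13 * 4 GB = 52 GB total
  .config("spark.driver.cores", "4")          // usually set at submit time
  .config("spark.driver.memory", "16g")       // usually set at submit time
  .getOrCreate()
```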