79558283

Date: 2025-04-06 13:46:14
Score: 1.5
Natty:
Report link

Question 1
collect() always brings all the data in the DataFrame to the driver. Shuffling happens here because the data has to be written to disk first, then read by the driver and sorted again on the driver. show(), by contrast, only displays a limited number of rows (1000 rows) from one or all of the partitions, so the data is not moved to the driver, and since the data is already sorted it does not need a shuffle: it can simply show 1000 rows based on the start and end values of count from each partition it reads from the sorted CSV file. I am assuming the source CSV is already sorted because of the TakeOrderedAndProject step shown in the plan for show(). A sketch comparing the two plans follows below.
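A minimal PySpark sketch of this comparison, assuming a hypothetical file people.csv with a count column (neither is given in the question): an orderBy followed by a limit is planned as TakeOrderedAndProject, while the plain orderBy that collect() executes keeps the full Exchange + Sort.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("show-vs-collect").getOrCreate()

# Hypothetical input; substitute the actual CSV from the question.
df = spark.read.csv("people.csv", header=True, inferSchema=True)
sorted_df = df.orderBy("count")

# show(n) effectively applies a limit, so the plan collapses to
# TakeOrderedAndProject: only the top N rows are needed, no full shuffle
# of all data to the driver.
sorted_df.limit(1000).explain()
sorted_df.show(1000)

# collect() materializes every row on the driver, so the plan keeps the
# full sort (Exchange rangepartitioning + Sort) before rows are returned.
sorted_df.explain()
rows = sorted_df.collect()
print(len(rows))
```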

Question 2
It looks like collect(), since it also needs the metadata to be present on the driver, reads every row together with the header. I observed that if you change the shuffle partitions to 1, the number of output rows is exactly 4. So I am guessing that when there are more shuffle partitions, the tasks either read the data multiple times or read metadata such as headers for every row. I found many threads describing the same behavior but no solid answer, for example:
Why does metric "number of output rows" in Apache Spark UI shows a value higher than the size of the table when the table is used multiple times?
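A minimal sketch of the experiment described above, again assuming a hypothetical 4-row people.csv with a count column; after each collect(), compare the scan node's "number of output rows" metric in the SQL tab of the Spark UI.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("output-rows-metric").getOrCreate()

# Hypothetical 4-row CSV, matching the "output rows is 4" observation.
df = spark.read.csv("people.csv", header=True, inferSchema=True)

# With a single shuffle partition, the scan's "number of output rows"
# metric in the Spark UI should match the actual row count (4 here).
spark.conf.set("spark.sql.shuffle.partitions", "1")
df.orderBy("count").collect()

# With the default 200 shuffle partitions, the same metric can come out
# inflated, presumably because each task re-reads the file (header included).
spark.conf.set("spark.sql.shuffle.partitions", "200")
df.orderBy("count").collect()
```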

Reasons:
  • Long answer (-1)
  • No code block (0.5)
  • Ends in question mark (2)
Posted by: Vindhya G