You can combine repartition and bucketBy to get better performance. Calling repartition first also evens out any skew in your data. Then applying bucketBy to the repartitioned data, on a column with low cardinality, will give the best results imo.
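Something like the following is what I have in mind, as a rough PySpark sketch rather than a tuned solution. The helper name `write_bucketed`, its parameters, and the `"overwrite"` mode are my own assumptions for illustration. Note that bucketBy only works together with saveAsTable, so the output has to be written as a table.

```python
def write_bucketed(df, bucket_col, num_buckets, num_partitions, table_name):
    """Repartition a Spark DataFrame, then write it bucketed by one column.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    (
        # repartition(n) with no column does a round-robin shuffle,
        # which spreads rows evenly and removes skew before the write.
        df.repartition(num_partitions)
        .write
        # Hash rows into a fixed number of buckets on the chosen column.
        .bucketBy(num_buckets, bucket_col)
        # Assumption: replace the table if it already exists.
        .mode("overwrite")
        # bucketBy requires saveAsTable; a plain save() is not supported.
        .saveAsTable(table_name)
    )
```

You would call it as `write_bucketed(df, "country", 8, 16, "events_bucketed")`, where the column, counts, and table name are placeholders. One trade-off to be aware of: with a round-robin repartition every task can write a file for every bucket, so keep `num_partitions` modest. Repartitioning on the bucket column instead, `df.repartition(num_buckets, bucket_col)`, typically gives fewer files, but it will not help if that column itself is skewed.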