If your goal is to optimize your Spark code, you may see better performance by using the Spark SQL module and operating on DataFrames or Datasets instead of RDDs. RDDs are lower-level data structures that lack many of the features and performance optimizations of DataFrames/Datasets.
From the documentation:
Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.
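For example, a rough DataFrame sketch of that kind of pipeline might look like the following. Note that the input path, the `value` column, and the specific transform/filter/dedup steps are placeholders, since your actual logic isn't shown:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-pipeline").getOrCreate()

# Read the CSV directly into a DataFrame instead of an RDD.
df = spark.read.csv("input.csv", header=True, inferSchema=True)

# Hypothetical equivalents of transformedRdd, filteredRdd, and uniqueRows;
# substitute your real transformation, filter, and dedup logic here.
transformed = df.withColumn("value", F.col("value").cast("double"))
filtered = transformed.filter(F.col("value") > 0)
unique_rows = filtered.dropDuplicates()

# Each write is an action; the Catalyst optimizer plans the whole chain.
transformed.write.mode("overwrite").parquet("out/transformed")
filtered.write.mode("overwrite").parquet("out/filtered")
unique_rows.write.mode("overwrite").parquet("out/unique")
```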
As for why your two versions have different runtimes, it is hard to say for certain without seeing the query plan (although caching may introduce some overhead). However, it's possible that Spark is recomputing some of the intermediate RDDs when the writes are moved to the end of the script. Here's how that may be happening (see the caching sketch after the two breakdowns below):
Version 1:
Read csv and calculate/write transformedRdd
Read the written transformedRdd and calculate/write filteredRdd
Read the written filteredRdd and calculate/write uniqueRows
Version 2:
Read csv and calculate/write uniqueRows
Read csv and calculate/write transformedRdd
Read the written transformedRdd and calculate/write filteredRdd
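To see why the placement of the writes can matter, here is a minimal RDD sketch (the path and the lambdas are stand-ins for your actual logic): each save is an action, and without persistence every action walks the lineage all the way back to the CSV, so the file is re-read and the transformations re-run for each write. Caching an intermediate RDD lets later actions reuse it instead.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("input.csv")                      # placeholder input path
transformed = lines.map(lambda l: l.strip().lower())  # stand-in transformation
filtered = transformed.filter(lambda l: len(l) > 0)   # stand-in filter
unique_rows = filtered.distinct()

# Keep the intermediate result in memory so later actions reuse it
# instead of recomputing the whole lineage from the CSV.
transformed.cache()

# Each saveAsTextFile is an action. Without the cache() above, every one of
# these would re-read input.csv and re-run the map/filter chain from scratch.
transformed.saveAsTextFile("out/transformed")
filtered.saveAsTextFile("out/filtered")
unique_rows.saveAsTextFile("out/unique")
```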
Either way, whether you use DataFrames or RDDs, the most useful way to debug performance is to look at the query plan (or, for RDDs, the lineage/DAG) to see where the bottlenecks are. You should be able to view this in the Spark web UI.
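If you prefer to inspect this from code rather than the web UI, `explain()` prints the Catalyst plan for a DataFrame and `toDebugString()` prints an RDD's lineage. A small self-contained example (the data here is synthetic, just to produce a plan):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DataFrame side: explain(True) prints the logical and physical plans,
# showing what the Catalyst optimizer will actually execute.
df = spark.range(100).filter("id % 2 = 0").distinct()
df.explain(True)

# RDD side: there is no Catalyst plan, but toDebugString() dumps the lineage,
# i.e. how far back Spark has to recompute if nothing is cached.
rdd = spark.sparkContext.parallelize(range(100)).map(lambda x: x * 2).distinct()
print(rdd.toDebugString().decode("utf-8"))
```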