When you call df2.cache(), Spark only marks df2 for caching; because evaluation is lazy, nothing is computed or stored until an action runs.
Until an action populates the cache, every action you call (such as df2.show()) re-executes the full transformation pipeline, which is why you see it run more than once.
The correct order of execution is shown below; change your code to:
df2.cache()   # marks df2 for caching (lazy, nothing happens yet)
df2.count()   # action: materializes the transformations and fills the cache
df2.show()    # now served from the cached data
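For completeness, here is a minimal, self-contained sketch of the same pattern. The SparkSession setup, the spark.range source, and the "squared" column are assumptions for illustration only, not part of the original code:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()

# A lazy transformation pipeline (hypothetical example data)
df2 = spark.range(1_000_000).withColumn("squared", F.col("id") * F.col("id"))

df2.cache()            # only marks df2 for caching; nothing is computed yet
print(df2.is_cached)   # True: the plan is marked, but no data is materialized

df2.count()            # action: runs the pipeline once and fills the cache
df2.show(5)            # served from the cached data, no recomputation

spark.stop()

Using df2.count() rather than df2.show() to populate the cache matters because show() only evaluates enough partitions to print a handful of rows, so it can leave the cache only partially filled.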