After a few tries, I found the sizeInBytes property in Scala:
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder().master("local[4]").getOrCreate()
import spark.implicits._

val values = List(1)
val df: DataFrame = values.toDF()
df.cache()
// force the DataFrame to be materialized
println(df.count())
// print the plan with cost-based statistics
df.explain("cost")
println(df.queryExecution.optimizedPlan.stats.sizeInBytes)
df.unpersist()
spark.stop()
It works. (The source code for the sizeInBytes property is here; just trace the Scala code starting from df.explain("cost"): https://github.com/apache/spark/blob/v3.5.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala#L55)
So I tried the same in Python (you have to call methods instead of accessing properties as in Scala, probably because the calls go through the Java gateway ¯\_(ツ)_/¯):
from pyspark.sql import SparkSession

spark: SparkSession = SparkSession.builder.master("local[4]").getOrCreate()

df = spark.range(1)
df.cache()
# force the DataFrame to be persisted
print(df.count())
# force the DataFrame to collect statistics so we can get the data size
df.explain("cost")
print(df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes())
df.unpersist()
spark.stop()
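If you need this in more than one place, you can wrap the chain in a small helper. This is just a sketch: the helper name and the MiB formatting are my own, and it relies on the private _jdf attribute, which is not a stable public API. Going through toString() avoids depending on how Py4J renders the scala.math.BigInt it returns:

from pyspark.sql import DataFrame, SparkSession

def estimated_size_in_bytes(df: DataFrame) -> int:
    # sizeInBytes comes back as a scala.math.BigInt through Py4J,
    # so go through its string form to get a plain Python int
    return int(df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes().toString())

spark = SparkSession.builder.master("local[4]").getOrCreate()
df = spark.range(1000)
df.cache()
df.count()  # materialize the cache before reading the statistics
size = estimated_size_in_bytes(df)
print(f"~{size} bytes (~{size / 1024 / 1024:.2f} MiB)")
df.unpersist()
spark.stop()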
When I tested with dummy data, the df.explain("cost") call was not needed, but I think it will be in a real case, because when I traced the code I saw it call the function collectWithSubqueries: https://github.com/apache/spark/blob/v3.5.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L545
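For reference, this is the minimal version without the explain call. It is only a sketch based on my reading of the code (stats() looks like it is computed lazily on first access), so I would keep df.explain("cost") around until this is verified on a real workload:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()
df = spark.range(1)
df.cache()
df.count()  # materialize the cache
# accessing stats() directly appears to trigger the same computation
# that df.explain("cost") would, so the explain call may be redundant
print(df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes().toString())
df.unpersist()
spark.stop()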