After a few tries, I found the sizeInBytes property in Scala:
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder().master("local[4]").getOrCreate()
import spark.implicits._

val values = List(1)
val df: DataFrame = values.toDF()
df.cache()
// force the DataFrame to be materialized
println(df.count())
// print the plan with cost-based statistics
df.explain("cost")
println(df.queryExecution.optimizedPlan.stats.sizeInBytes)
df.unpersist()
spark.stop()
It works. (The source code for the sizeInBytes property is here; just trace the Scala code starting from df.explain("cost"): https://github.com/apache/spark/blob/v3.5.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala#L55)
So I tried the same in Python (you have to call methods instead of accessing properties as in Scala, probably because the calls go through the Java gateway ¯\_(ツ)_/¯):
from pyspark.sql import SparkSession

spark: SparkSession = SparkSession.builder.master("local[4]").getOrCreate()

df = spark.range(1)
df.cache()
# force the DataFrame to be persisted
print(df.count())
# force the DataFrame to collect statistics so we can get the data size
df.explain("cost")
print(df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes())
df.unpersist()
spark.stop()
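If you need this in more than one place, you can wrap the chain in a small helper. This is just a sketch: the helper name and the MiB formatting are my own, and it relies on the private _jdf attribute, which is not a stable public API. Going through toString() avoids depending on how Py4J renders the scala.math.BigInt it returns:

from pyspark.sql import DataFrame, SparkSession

def estimated_size_in_bytes(df: DataFrame) -> int:
    # sizeInBytes comes back as a scala.math.BigInt through Py4J,
    # so go through its string form to get a plain Python int
    return int(df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes().toString())

spark = SparkSession.builder.master("local[4]").getOrCreate()
df = spark.range(1000)
df.cache()
df.count()  # materialize the cache before reading the statistics
size = estimated_size_in_bytes(df)
print(f"~{size} bytes (~{size / 1024 / 1024:.2f} MiB)")
df.unpersist()
spark.stop()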
When I tested with dummy data, the df.explain("cost") call was not needed, but I think it will be in a real case, because when I traced the code I saw it call the function collectWithSubqueries: https://github.com/apache/spark/blob/v3.5.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L545
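For reference, this is the minimal version without the explain call. It is only a sketch based on my reading of the code (stats() looks like it is computed lazily on first access), so I would keep df.explain("cost") around until this is verified on a real workload:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()
df = spark.range(1)
df.cache()
df.count()  # materialize the cache
# accessing stats() directly appears to trigger the same computation
# that df.explain("cost") would, so the explain call may be redundant
print(df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes().toString())
df.unpersist()
spark.stop()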