The issue was due to missing Spark configurations. Spark executed the SELECT query, but it didn't leverage partitioning until I explicitly set:
sparkConf.set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension");
sparkConf.set("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog");
spark.sql.extensions: Enables Delta Lake optimizations.
spark.sql.catalog.spark_catalog: Ensures Spark treats the table as a Delta table, allowing partition pruning.

Without these, Spark treated the table as a generic Parquet table, ignoring partitioning. After adding them, partition filters appeared in the physical plan, improving query efficiency. Zeppelin worked because it had these configs set.
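For anyone checking the same thing, here is a minimal sketch of how the pruning can be verified from Java; the table name "events" and partition column "event_date" are placeholders, not my actual schema:

import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeltaPruningCheck {
    public static void main(String[] args) {
        // The two Delta configs from the fix above.
        SparkConf sparkConf = new SparkConf()
                .set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
                .set("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog");

        SparkSession spark = SparkSession.builder()
                .appName("delta-pruning-check")
                .config(sparkConf)
                .getOrCreate();

        // Query restricted to a single partition ("events" / "event_date" are placeholder names).
        Dataset<Row> df = spark.sql(
                "SELECT * FROM events WHERE event_date = '2024-01-01'");

        // With the configs set, PartitionFilters should appear in the physical plan;
        // without them, the same query scans the whole table.
        df.explain(true);

        spark.stop();
    }
}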
However, DELETE queries still don’t apply partition pruning, even with these configs. The physical plan remains the same, causing a full table scan. Is this expected behavior in Delta Lake, or is there a way to optimize DELETE operations?
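For reference, the DELETE looks roughly like this (reusing the spark session and placeholder names from the sketch above); despite the filter on the partition column, its physical plan still shows a full table scan:

// Delete one partition's worth of rows; "events" / "event_date" are placeholders.
// Unlike the SELECT, the resulting plan does not show partition pruning.
spark.sql("DELETE FROM events WHERE event_date = '2024-01-01'");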
Would appreciate any insights!