79476676

Date: 2025-02-28 21:01:24
Score: 1.5
Natty:
Report link

I did some testing (results below) to evaluate the behavior of the dynamic partitionOverwriteMode, inspired by this blog, and confirmed that the documentation refers to Hive-style partition directories only (as created by partitionBy) when selectively overwriting partitions. It does not selectively overwrite individual Parquet files within a partition.
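
For reference, the same behavior can also be enabled session-wide rather than per write, via the spark.sql.sources.partitionOverwriteMode setting (a minimal sketch; the per-write .option() used below is equivalent):

# Session-level equivalent of the per-write option("partitionOverwriteMode", "dynamic").
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")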

from pyspark.sql import Row

# Placeholder target path (substitute your own bucket/prefix).
s3_path = "s3://<your-bucket>/orders/"

data = [
    Row(order_id=1, order_amount=100, year=2025, month=1, day=2),
    Row(order_id=2, order_amount=150, year=2025, month=1, day=5),
    Row(order_id=3, order_amount=200, year=2025, month=1, day=6),
    Row(order_id=4, order_amount=250, year=2025, month=2, day=6),
    Row(order_id=5, order_amount=300, year=2025, month=2, day=23),
    Row(order_id=6, order_amount=350, year=2024, month=12, day=11)
]

# `spark` and `display` are assumed to be provided by the notebook
# environment (e.g., Databricks).
df_write_1 = spark.createDataFrame(data)
display(df_write_1)

Initial DataFrame written.

# First write: Hive-style partitioning on year/month. maxRecordsPerFile=1
# forces one Parquet file per record so file-level behavior is observable.
df_write_1.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .option("partitionOverwriteMode", "dynamic") \
    .option("maxRecordsPerFile", 1) \
    .parquet(s3_path)
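
To confirm which Hive-style partition directories the first write produced, the partition columns can be read back (a quick check against the placeholder path defined above):

# Partition discovery recovers year/month from the directory layout.
spark.read.parquet(s3_path) \
    .select("year", "month").distinct() \
    .orderBy("year", "month") \
    .show()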

data = [
    Row(order_id=1, order_amount=100, year=2025, month=1, day=2),
    Row(order_id=2, order_amount=150, year=2025, month=1, day=5),
    Row(order_id=3, order_amount=200, year=2025, month=1, day=6),
    Row(order_id=5, order_amount=300, year=2025, month=2, day=23),
    Row(order_id=7, order_amount=250, year=2025, month=2, day=23),
    Row(order_id=8, order_amount=500, year=2025, month=2, day=6),
    Row(order_id=9, order_amount=250, year=2025, month=3, day=23),
]

df_write_2 = spark.createDataFrame(data)
display(df_write_2)

New incremental DataFrame written.

# Second write: with dynamic mode, only the partitions present in
# df_write_2 (2025/1, 2025/2, 2025/3) are overwritten; 2024/12 is untouched.
df_write_2.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .option("partitionOverwriteMode", "dynamic") \
    .option("maxRecordsPerFile", 1) \
    .parquet(s3_path)

# Read everything back; ordering by order_id makes comparison with the inputs easy.
df_read = spark.read.parquet(s3_path)
display(df_read.orderBy("order_id"))

DataFrame read back after the incremental dynamic overwrite was performed.
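
Given dynamic-overwrite semantics, only the partitions present in df_write_2 (2025/1, 2025/2, 2025/3) are rewritten: order_id=6 in the untouched 2024/12 partition should survive, while order_id=4 should disappear along with the rest of the old 2025/2 partition. A small sanity check (illustrative, not part of the original test):

# Sanity check: the 2024/12 partition was untouched, so order_id=6 remains;
# order_id=4 was dropped when its 2025/02 partition was fully rewritten.
surviving_ids = {row.order_id for row in df_read.select("order_id").collect()}
assert 6 in surviving_ids and 4 not in surviving_ids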

Reasons:
  • Blacklisted phrase (1): this blog
  • Probably link only (1):
  • Long answer (-1):
  • Has code block (-0.5):
  • Self-answer (0.5):
  • Low reputation (0.5):
Posted by: Matthew Thomas