I did some testing (results below) to evaluate the behavior of dynamic `partitionOverwriteMode`, as inspired by this blog, and confirmed that the documentation is referring to Hive-style partitions only (i.e. the directories created by `partitionBy`) when it talks about selectively overwriting partitions. It does not selectively overwrite individual parquet files.
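For reference, the same behavior can also be enabled session-wide instead of via the per-write option used in the tests below; a minimal sketch (the write-level option takes precedence when both are set):

    # Session-wide alternative to the per-write "partitionOverwriteMode" option
    # (available since Spark 2.3)
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")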
    from pyspark.sql import Row
    from pyspark.sql.functions import col

    # First write: six orders spread across three year/month partitions.
    data = [
        Row(order_id=1, order_amount=100, year=2025, month=1, day=2),
        Row(order_id=2, order_amount=150, year=2025, month=1, day=5),
        Row(order_id=3, order_amount=200, year=2025, month=1, day=6),
        Row(order_id=4, order_amount=250, year=2025, month=2, day=6),
        Row(order_id=5, order_amount=300, year=2025, month=2, day=23),
        Row(order_id=6, order_amount=350, year=2024, month=12, day=11),
    ]
    df_write_1 = spark.createDataFrame(data)
    display(df_write_1)

    # s3_path is the target directory (defined elsewhere in the notebook).
    # maxRecordsPerFile=1 forces one parquet file per row so the file layout
    # is easy to inspect.
    df_write_1.write.mode("overwrite") \
        .partitionBy("year", "month") \
        .option("partitionOverwriteMode", "dynamic") \
        .option("maxRecordsPerFile", 1) \
        .parquet(s3_path)
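After this first write, the target should contain Hive-style directories such as `year=2025/month=1`, `year=2025/month=2`, and `year=2024/month=12`, each holding one parquet file per row because of `maxRecordsPerFile=1`. A quick way to eyeball that (assuming a Databricks notebook, since `display()` is used above; a plain S3 listing works just as well):

    # List the partition directories created by partitionBy("year", "month")
    for entry in dbutils.fs.ls(s3_path):
        print(entry.path)   # expect .../year=2024/ and .../year=2025/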
    # Second write: month=1 keeps the same three rows, month=2 drops order 4
    # and adds orders 7/8, month=3 is brand new, and year=2024/month=12 is
    # absent from this batch.
    data = [
        Row(order_id=1, order_amount=100, year=2025, month=1, day=2),
        Row(order_id=2, order_amount=150, year=2025, month=1, day=5),
        Row(order_id=3, order_amount=200, year=2025, month=1, day=6),
        Row(order_id=5, order_amount=300, year=2025, month=2, day=23),
        Row(order_id=7, order_amount=250, year=2025, month=2, day=23),
        Row(order_id=8, order_amount=500, year=2025, month=2, day=6),
        Row(order_id=9, order_amount=250, year=2025, month=3, day=23),
    ]
    df_write_2 = spark.createDataFrame(data)
    display(df_write_2)

    # With dynamic overwrite, only the partitions present in df_write_2
    # should be replaced; year=2024/month=12 should be left untouched.
    df_write_2.write.mode("overwrite") \
        .partitionBy("year", "month") \
        .option("partitionOverwriteMode", "dynamic") \
        .option("maxRecordsPerFile", 1) \
        .parquet(s3_path)
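One way to see that whole partitions, not individual parquet files, get rewritten is to tag each row with the file it came from (this uses `input_file_name`, which was not part of my original test): rows in the untouched `year=2024/month=12` partition keep their original file, while rows in the rewritten partitions come from newly named files even where the data is identical.

    from pyspark.sql.functions import input_file_name

    # Show which parquet file each row currently lives in
    display(
        spark.read.parquet(s3_path)
             .withColumn("source_file", input_file_name())
             .orderBy("order_id")
    )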
    # Read everything back to see which partitions survived the second write
    df_read = spark.read.parquet(s3_path)
    display(df_read.orderBy("order_id"))
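If dynamic overwrite behaves as described, the read-back should contain every order except order_id 4 (its `year=2025/month=2` partition was replaced wholesale) while order_id 6 survives in the untouched `year=2024/month=12` partition. A quick sanity check along those lines:

    # Expected survivors after the second (dynamic) overwrite
    expected_ids = {1, 2, 3, 5, 6, 7, 8, 9}
    actual_ids = {row.order_id for row in df_read.select("order_id").collect()}
    assert actual_ids == expected_ids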
Verifying the full contents of `df_read` afterwards was not originally in question, but I added it to the test for good measure.