You are building a Medallion Architecture (Bronze > Silver > Gold) on Databricks with Unity Catalog, on top of an Azure Data Lake Storage Gen2 layout with partitioned data. You can follow the approach below to build a robust system.
Suppose this is your source file path in your ADLS Gen2 container:
abfss://bronze@<your_storage_account>.dfs.core.windows.net/adventureworks/year=2025/month=5/day=25/customer.csv
How should I create the bronze_customer table in Databricks to efficiently handle these daily files?
You can use Auto Loader with a Unity Catalog table. Auto Loader is designed for streaming ingestion scenarios where data is continuously landing in a directory.
The bronze path is defined as:
bronze_path = "abfss://bronze@<your_storage_account>.dfs.core.windows.net/adventureworks/"
Now, use Auto Loader to automatically ingest new CSV files as they arrive and store the data in the bronze_customer table for initial processing.
from pyspark.sql.functions import input_file_name

df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.inferColumnTypes", "true")
        # Required when letting Auto Loader infer the schema; the path below is an example location
        .option("cloudFiles.schemaLocation", "abfss://bronze@<your_storage_account>.dfs.core.windows.net/schemas/bronze_customer")
        .load(bronze_path)
        # Keep the originating file path for lineage and debugging
        .withColumn("source_file", input_file_name())
)
How do I create the table in Unity Catalog to include all daily partitions?
Now, write the stream as a Delta table in Unity Catalog.
(
    df.writeStream
        .format("delta")
        .option("checkpointLocation", "abfss://bronze@<your_storage_account>.dfs.core.windows.net/checkpoints/bronze_customer")
        # Partition the Bronze table to match the daily folder structure
        .partitionBy("year", "month", "day")
        # Process all available files, then stop; trigger(availableNow=True) is the newer recommended equivalent
        .trigger(once=True)
        .toTable("dev.adventureworks.bronze_customer")
)
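Note that toTable creates the table if it does not exist, but the target catalog and schema must already exist. A minimal sketch (assuming the dev catalog and adventureworks schema names used above, and that you have the required privileges):

# Create the catalog and schema once, before the first streaming write
spark.sql("CREATE CATALOG IF NOT EXISTS dev")
spark.sql("CREATE SCHEMA IF NOT EXISTS dev.adventureworks")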
The year, month, and day fields must either exist in the files or be derived from the directory path; with Hive-style folders such as year=2025/month=5/day=25, Auto Loader can infer them as partition columns.
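If the partition columns are not inferred automatically, one option (a sketch, assuming the source_file column added above and the path layout shown earlier) is to derive them from the file path:

from pyspark.sql import functions as F

# Derive partition columns from the file path captured in source_file
df_with_partitions = (
    df
        .withColumn("year",  F.regexp_extract("source_file", r"year=(\d+)", 1).cast("int"))
        .withColumn("month", F.regexp_extract("source_file", r"month=(\d+)", 1).cast("int"))
        .withColumn("day",   F.regexp_extract("source_file", r"day=(\d+)", 1).cast("int"))
)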
The data will then be loaded into dev.adventureworks.bronze_customer.
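To confirm that all daily partitions are present, you can run a quick check (a sketch against the table created above):

# Count rows per daily partition to verify the load
spark.sql("""
    SELECT year, month, day, COUNT(*) AS row_count
    FROM dev.adventureworks.bronze_customer
    GROUP BY year, month, day
    ORDER BY year, month, day
""").show()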
What is the recommended approach for managing full loads (replacing all data daily) versus incremental loads (appending only new or changed data) in this setup?
For the Bronze layer, Auto Loader ingests only new files into a partitioned, append-only Delta table, without reprocessing files it has already seen.
For the Silver layer, it depends on the source: if the source delivers full files every day, a full load is recommended; if the source delivers only changed records, an incremental load is recommended.
Full Refresh Load:
# Overwrite only the specified day's slice; drop replaceWhere to overwrite the whole table
cleaned_df.write.format("delta") \
    .mode("overwrite") \
    .option("replaceWhere", "year=2025 AND month=5 AND day=25") \
    .saveAsTable("dev.adventureworks.silver_customer")
Incremental Load:
from delta.tables import DeltaTable

silver = DeltaTable.forName(spark, "dev.adventureworks.silver_customer")

# Upsert: update existing customers, insert new ones
(
    silver.alias("target")
        .merge(new_df.alias("source"), "target.customer_id = source.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
)
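If you want to drive this merge incrementally from the streaming Bronze table, one common pattern (a sketch, not the only way; it assumes the column names above, that silver_customer already exists, and a hypothetical checkpoint location in a silver container) is to read bronze_customer as a stream and apply the merge inside foreachBatch:

from delta.tables import DeltaTable

def upsert_to_silver(batch_df, batch_id):
    # Keep one row per key within the micro-batch so MERGE does not hit duplicate matches
    deduped = batch_df.dropDuplicates(["customer_id"])
    silver = DeltaTable.forName(spark, "dev.adventureworks.silver_customer")
    (
        silver.alias("target")
            .merge(deduped.alias("source"), "target.customer_id = source.customer_id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
    )

(
    spark.readStream
        .table("dev.adventureworks.bronze_customer")
        .writeStream
        .foreachBatch(upsert_to_silver)
        .option("checkpointLocation", "abfss://silver@<your_storage_account>.dfs.core.windows.net/checkpoints/silver_customer")
        .trigger(availableNow=True)
        .start()
)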
For the Gold layer, the choice depends on the types of aggregation applied, but an incremental load is generally preferred.
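For example, instead of rebuilding the whole table, you could recompute only the current day's aggregate (a sketch, assuming silver_customer keeps the year, month, and day columns and a hypothetical gold_customer_daily target table):

from pyspark.sql import functions as F

# Recompute the aggregate for one day only
daily_agg = (
    spark.table("dev.adventureworks.silver_customer")
        .where("year = 2025 AND month = 5 AND day = 25")
        .groupBy("year", "month", "day")
        .agg(F.countDistinct("customer_id").alias("customer_count"))
)

# Replace just that day's slice in the Gold table
(
    daily_agg.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", "year = 2025 AND month = 5 AND day = 25")
        .saveAsTable("dev.adventureworks.gold_customer_daily")
)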
This is an architectural suggestion based on your inputs and question, not an absolute solution.
Resources you can refer to for more details:
Auto Loader in Databricks
MS document for Auto Loader
Upsert and Merge