79638791

Date: 2025-05-26 10:59:01
Score: 1.5
Natty:
Report link

You are developing a Medallion Architecture (Bronze > Silver > Gold) on Databricks with Unity Catalog, with partitioned data in an Azure Data Lake Gen2 structure.

You can follow this approach to build a robust system. Suppose this is your source file path in your ADLS Gen2 container: abfss://bronze@<your_storage_account>.dfs.core.windows.net/adventureworks/year=2025/month=5/day=25/customer.csv

How should I create the bronze_customer table in Databricks to efficiently handle these daily files?

We can use Auto Loader with a Unity Catalog external table. Auto Loader is designed for streaming and incremental ingestion scenarios where data is continuously landing in a directory.

The bronze path is defined as:

bronze_path = "abfss://bronze@<your_storage_account>.dfs.core.windows.net/adventureworks/"

Now, use Auto Loader to automatically ingest new CSV files as they arrive and store the data in the bronze_customer table for initial processing.

from pyspark.sql.functions import input_file_name

# Schema inference with Auto Loader requires a schema location; adjust this path to your environment.
schema_path = "abfss://bronze@<your_storage_account>.dfs.core.windows.net/schemas/bronze_customer"

df = (
    spark.readStream
    .format("cloudFiles")                              # Auto Loader
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaLocation", schema_path)
    .load(bronze_path)
    # Record the source file for lineage; on Unity Catalog shared clusters,
    # the _metadata.file_path column is the recommended alternative.
    .withColumn("source_file", input_file_name())
)

How do I create the table in Unity Catalog to include all daily partitions?

Now, write the stream as a Delta table in Unity Catalog:

(
    df.writeStream
    .format("delta")
    # The checkpoint tracks which files have already been ingested.
    .option("checkpointLocation", "abfss://bronze@<your_storage_account>.dfs.core.windows.net/checkpoints/bronze_customer")
    .partitionBy("year", "month", "day")
    # Process all files currently available, then stop (trigger(availableNow=True) is the newer equivalent).
    .trigger(once=True)
    .toTable("dev.adventureworks.bronze_customer")
)
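
Note that the target catalog and schema must exist before the stream starts. A minimal sketch, assuming a catalog named dev and a schema named adventureworks as in the table name above, and that you have the privileges to create them:

# Assumed one-time setup: create the catalog and schema referenced by the table name.
# Requires a configured Unity Catalog metastore and the appropriate privileges.
spark.sql("CREATE CATALOG IF NOT EXISTS dev")
spark.sql("CREATE SCHEMA IF NOT EXISTS dev.adventureworks")

By default, toTable registers a managed table; if you want the external table mentioned earlier, you would instead register the table against an explicit storage location.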

The year, month, and day columns must either exist in the file or be derived from the directory path. With Hive-style paths such as year=2025/month=5/day=25, Auto Loader infers them as partition columns during schema inference.

The data will then be loaded into dev.adventureworks.bronze_customer.
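
If the partition columns are not inferred automatically in your setup, one option is to derive them from the captured source_file path; a hedged sketch (the regular expressions assume the year=/month=/day= folder layout shown above):

from pyspark.sql.functions import col, regexp_extract

# Sketch: extract year/month/day from paths like .../year=2025/month=5/day=25/customer.csv
df_with_partitions = (
    df
    .withColumn("year", regexp_extract(col("source_file"), r"year=(\d+)", 1).cast("int"))
    .withColumn("month", regexp_extract(col("source_file"), r"month=(\d+)", 1).cast("int"))
    .withColumn("day", regexp_extract(col("source_file"), r"day=(\d+)", 1).cast("int"))
)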

What is the recommended approach for managing full loads (replacing all data daily) versus incremental loads (appending only new or changed data) in this setup?

For the Bronze layer, Auto Loader ingests only new files into a partitioned, append-only Delta table, without reprocessing files it has already seen.

For the Silver layer, if the source delivers a full snapshot file every day, a full (refresh) load is appropriate; if the source delivers only the changes, an incremental load is recommended.

Full Refresh Load:

# replaceWhere overwrites only the matching day's partition, leaving the rest of the table untouched.
cleaned_df.write.format("delta") \
  .mode("overwrite") \
  .option("replaceWhere", "year=2025 AND month=5 AND day=25") \
  .saveAsTable("dev.adventureworks.silver_customer")

Incremental Load:

from delta.tables import DeltaTable

silver = DeltaTable.forName(spark, "dev.adventureworks.silver_customer")

# Upsert on the business key: new_df holds the new or changed customer rows.
(silver.alias("target")
 .merge(new_df.alias("source"), "target.customer_id = source.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

For the Gold layer, the choice depends on the type of aggregation applied, but an incremental load is generally preferred.
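
As a minimal sketch of an incrementally refreshed Gold aggregate, recomputing only the processed day (the gold table name, the customer_count column, and the hard-coded date are assumptions for illustration; it also assumes the gold table already exists with the same year/month/day partitioning):

from pyspark.sql.functions import count

# Hypothetical daily aggregate: number of customers per day, rebuilt only for the processed day.
daily_counts = (
    spark.table("dev.adventureworks.silver_customer")
    .where("year = 2025 AND month = 5 AND day = 25")
    .groupBy("year", "month", "day")
    .agg(count("customer_id").alias("customer_count"))
)

# Overwrite only that day's partition in the gold table.
(
    daily_counts.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", "year=2025 AND month=5 AND day=25")
    .saveAsTable("dev.adventureworks.gold_customer_daily_counts")
)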

This is an architectural suggestion based on your inputs and question, not an absolute solution.

Resources you can refer to for more details:

Auto Loader in Databricks
MS document for Auto Loader
Upsert and Merge

Reasons:
  • Blacklisted phrase (1): How do I
  • Blacklisted phrase (1): How should I
  • Long answer (-1):
  • Has code block (-0.5):
  • Contains question mark (0.5):
  • Low reputation (0.5):
Posted by: Pritam