This is the way we currently organise it. We've discussed our architecture with the Databricks consultants assigned to us, and so far no strong objections have been raised. That said, I believe in leaving room for flexibility in the architecture depending on the individual use case.

In the raw layer, we have raw files coming in from various sources. For example, an API call could produce JSON output that we store as a JSON file. We could also copy stored database backups here. The idea is that the files are stored as-is, so that if something goes wrong downstream we still have the original data. This is especially important when the source offers no way to retrieve historical data, as with snapshot data.

In bronze, we convert everything to Delta tables. You can think of it as a copy of raw where everything is in Delta format. We build our foundation on Delta tables so that we have a common way to query and analyse the raw data should we want to.
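As a rough illustration of the raw-to-bronze step, here is a minimal sketch. The folder layout, the function names and the `_ingested_at` column are my own assumptions for the example, not something our setup prescribes; the Spark calls only run where a Spark session with Delta support is available (for example on Databricks).

```python
from datetime import date


def raw_landing_path(base: str, source: str, day: date) -> str:
    """Build a date-partitioned folder for files stored as-is in raw.

    The <base>/<source>/<yyyy>/<mm>/<dd> layout is an assumption for this sketch.
    """
    return f"{base.rstrip('/')}/{source}/{day:%Y/%m/%d}"


def load_raw_json_to_bronze(spark, raw_path: str, bronze_table: str) -> None:
    """Append the JSON files under raw_path to a Delta table without reshaping them."""
    # Imported lazily so the path helper above works without Spark installed.
    from pyspark.sql import functions as F

    df = (
        spark.read.json(raw_path)
        # Hypothetical audit column: records when each row reached bronze.
        .withColumn("_ingested_at", F.current_timestamp())
    )
    df.write.format("delta").mode("append").saveAsTable(bronze_table)


# Example usage (hypothetical names):
# path = raw_landing_path("/Volumes/main/raw/landing", "orders_api", date(2024, 1, 5))
# load_raw_json_to_bronze(spark, path, "main.bronze.orders_api")
```

The point is that bronze does no cleaning at all: it only changes the storage format so that everything downstream can be queried the same way.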

In silver, we clean and transform the data. As far as possible, we do any processing that can be done incrementally here. PySpark lends itself to more readable implementations of complex transformations than SQL does, and we want to keep our SQL queries in gold as simple as possible.
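One way an incremental silver step could look is sketched below. It assumes the bronze table carries an ingestion timestamp column (called `_ingested_at` here, a name I made up), that the silver table already exists, and that you track the last processed timestamp yourself. The table, column and function names are hypothetical; `DeltaTable.merge` comes from the Delta Lake Python API.

```python
def merge_condition(key_columns, target="t", source="s"):
    """Build the join condition for a Delta MERGE from a list of key columns."""
    return " AND ".join(f"{target}.{col} = {source}.{col}" for col in key_columns)


def upsert_increment_to_silver(spark, bronze_df, silver_table, key_columns, last_processed_at):
    """Clean only the rows that arrived since the last run and upsert them into silver."""
    # Imported lazily; both need a Spark runtime with Delta Lake available.
    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    increment = (
        bronze_df
        # Only rows ingested after the previous run (assumed audit column).
        .filter(F.col("_ingested_at") > F.lit(last_processed_at))
        # Basic cleaning: drop rows without a key, then deduplicate on the key.
        .dropna(subset=key_columns)
        .dropDuplicates(key_columns)
    )

    (
        DeltaTable.forName(spark, silver_table).alias("t")
        .merge(increment.alias("s"), merge_condition(key_columns))
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )


# Example usage (hypothetical names):
# upsert_increment_to_silver(
#     spark, spark.table("main.bronze.orders_api"), "main.silver.orders",
#     ["order_id"], last_processed_at,
# )
```

Chaining the steps as DataFrame methods is what I mean by readability: each transformation is one line that can be read, reordered or tested on its own.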

In gold, we run queries that form the fact and dimension tables that make up our star schemas. Here, we run some aggregations and rename columns so that they are readable to our business users.
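To give an idea of how simple a gold query can stay, here is a sketch that builds a small fact table from a silver table. The table names and the columns (`order_ts`, `customer_id`, `amount`) are invented for the example, so adapt them to your own schema.

```python
def daily_sales_fact_sql(silver_orders: str, gold_fact: str) -> str:
    """Return a SQL statement that aggregates silver orders into a daily fact table."""
    return f"""
CREATE OR REPLACE TABLE {gold_fact} AS
SELECT
    CAST(order_ts AS DATE) AS order_date,
    customer_id,
    COUNT(*)               AS number_of_orders,
    SUM(amount)            AS total_revenue
FROM {silver_orders}
GROUP BY CAST(order_ts AS DATE), customer_id
""".strip()


def build_daily_sales_fact(spark) -> None:
    """Run the gold query; requires a Spark session (e.g. on Databricks)."""
    spark.sql(daily_sales_fact_sql("main.silver.orders", "main.gold.fact_daily_sales"))
```

All the heavy lifting happened in silver, so the gold query is only an aggregation plus column names that business users can read.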

From there, you could set up a SQL warehouse or use Delta Sharing to connect to a BI tool. Or you could use the silver or gold tables for ML purposes.

P.S. Generally, I recommend using Unity Catalog, as querying tables through its three-level namespace (catalog.schema.table) makes the code far more readable. It also makes it easier to control access to specific catalogs, schemas and tables. Raw data can be stored in volumes, and from the bronze layer onwards, once the data is in Delta format, you can store it as tables.
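A small sketch of what that looks like in practice. The catalog, schema, volume and group names are made up, and the helper functions are only there for illustration; the `/Volumes/<catalog>/<schema>/<volume>/...` path format and the `GRANT SELECT ON TABLE` statement are standard Unity Catalog.

```python
def table_name(catalog: str, schema: str, table: str) -> str:
    """Three-level Unity Catalog name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Path to files inside a Unity Catalog volume (where raw files can live)."""
    return "/".join(["/Volumes", catalog, schema, volume, *parts])


def grant_select_sql(table: str, principal: str) -> str:
    """SQL statement granting read access on one table to a user or group."""
    return f"GRANT SELECT ON TABLE {table} TO `{principal}`"


# Example usage with a Spark session (hypothetical names):
# raw_df = spark.read.json(volume_path("main", "raw", "landing", "orders_api"))
# gold_df = spark.table(table_name("main", "gold", "fact_daily_sales"))
# spark.sql(grant_select_sql("main.gold.fact_daily_sales", "analysts"))
```

With one schema per layer, the layer a table belongs to is visible in every query, and access can be granted per layer instead of per table.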

P.P.S. That said, I don't think you always need all the layers. In fact, we are considering dropping the raw and bronze layers for data that reaches the silver layer very quickly and can easily be retrieved again at a later date: rerunning the raw-to-silver steps on failure costs very little, while storing, reading and writing the intermediate layers costs relatively more.
