ADF + Databricks → Medallion Architecture (Bronze/Silver/Gold)
- Difficulty: Intermediate
- Tech stack: Azure Data Factory, ADLS Gen2, Databricks (PySpark), Delta Lake (OPTIMIZE/VACUUM)
- Estimated time: 2 hrs
Overview
Files land in an ADLS Gen2 raw zone and are ingested by an ADF pipeline into Bronze as Delta tables. A Databricks notebook reads Bronze and applies cleaning and conformance to produce Silver (row-level quality, proper types, SCD-ready joins); a second notebook aggregates and enriches Silver into Gold for BI/ML. Tables are partitioned (e.g., by date) and Z-Ordered on common filter columns, and periodic `OPTIMIZE`/`VACUUM` keeps storage layout and query performance healthy. ADF triggers orchestrate the Bronze→Silver→Gold sequence with clear run logs.
Outcome
- Lakehouse-ready data on Azure using the Medallion pattern.
- Reliable Delta tables with ACID transactions, schema enforcement, and time travel.
- Faster queries via partitioning, Z-Order, and periodic OPTIMIZE/VACUUM.
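The periodic maintenance above boils down to two Spark SQL statements, typically run from a small scheduled notebook. A sketch assuming a hypothetical `silver.orders` table that is most often filtered by `customer_id`:

```sql
-- Compact small files and co-locate rows sharing common filter values
OPTIMIZE silver.orders ZORDER BY (customer_id);

-- Delete data files no longer referenced by the table; RETAIN 168 HOURS
-- (7 days, the Delta default) preserves a week of time travel
VACUUM silver.orders RETAIN 168 HOURS;
```

Shortening the `VACUUM` retention window reclaims storage sooner but trims the time-travel history available for debugging and rollback.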
What you’ll build
- ADLS Gen2 layout: raw (landing), bronze, silver, and gold zones with clear folder/table naming conventions.
- ADF pipeline to ingest CSV/JSON into Bronze (landing → Bronze Delta).
- Databricks notebooks to transform Bronze → Silver (dedupe, schema, joins) and Silver → Gold (aggregations/KPIs).
- Delta performance ops: partitioning, Z-Order, `OPTIMIZE` + `VACUUM` schedule.
- (Optional) Lightweight DQ checks (row counts, null %, simple constraints).
- (Optional) ADF triggers for end-to-end orchestration + dependencies.
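The optional DQ checks can be as simple as thresholds on row counts and null percentages. A Spark-free sketch of that logic over plain records (in the notebooks you would compute the same numbers with PySpark aggregations; column names and thresholds are illustrative):

```python
def null_percent(rows, column):
    """Percentage of records where `column` is missing or None."""
    if not rows:
        return 0.0
    nulls = sum(1 for r in rows if r.get(column) is None)
    return 100.0 * nulls / len(rows)

def check_batch(rows, min_rows, max_null_pct, required_columns):
    """Return a list of failed-check messages (empty list = batch passes)."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} < {min_rows}")
    for col in required_columns:
        pct = null_percent(rows, col)
        if pct > max_null_pct:
            failures.append(f"{col} null% {pct:.1f} > {max_null_pct}")
    return failures

batch = [
    {"order_id": 1, "amount": 9.99},
    {"order_id": 2, "amount": None},
    {"order_id": None, "amount": 5.00},
]
print(check_batch(batch, min_rows=2, max_null_pct=10.0,
                  required_columns=["order_id", "amount"]))
# → ['order_id null% 33.3 > 10.0', 'amount null% 33.3 > 10.0']
```

A non-empty failure list would fail the notebook run, which in turn surfaces as a failed activity in the ADF pipeline's run logs.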