Event Hubs → Databricks Streaming → Synapse

Stream JSON from Event Hubs with Spark Structured Streaming, aggregate in Databricks, and upsert into Synapse

Difficulty: Intermediate
Tech stack: Event Hubs, Databricks (Spark Structured Streaming), Synapse (JDBC/SQL DW connector), ADLS (DLQ), Power BI (optional)
Estimated time: 2-3 hrs

Overview

Events land in Event Hubs and are consumed by a Databricks Structured Streaming job. The job parses JSON, enforces schema, assigns event-time timestamps, and uses watermarks to handle late/duplicate events. Each micro-batch lands in Synapse (staging) and then merges into modeled tables for low-latency analytics. Checkpointing ensures fault tolerance; invalid messages are written to ADLS as a DLQ. Power BI can read Synapse for live dashboards.

Outcome

Real-time ingestion from Event Hubs into Synapse analytics tables.
Trustworthy streams with schema enforcement, watermarks, and dedupe.
Resilient ops via checkpoints and idempotent upserts.

What you’ll build

Event Hubs namespace + event hub (topic) and a simple producer (Python).
Databricks notebook (Spark Structured Streaming) that:

1. Reads Event Hubs, parses JSON → columns, applies watermarks.
2. Computes real-time metrics (active sessions, order totals, etc.).
3. Writes micro-batches to Synapse staging (JDBC/SQL DW connector).
4. Runs MERGE/UPSERT into target fact/dim tables.
5. Sends bad records to ADLS (DLQ) with error context.
6. Checkpointing for exactly-once semantics across restarts.
7. (Optional) Power BI dashboard over Synapse tables.