
HDFS → PySpark → MySQL (Write-Back)

Aggregate HDFS data in PySpark and upsert clean MySQL tables via JDBC

hdfs-pyspark-mysql

Overview

You’ll read raw/refined data from HDFS into Spark, compute daily aggregates (sales totals, active users, top products), and write them to MySQL using Spark’s JDBC sink. Loads are made idempotent by either overwriting targets or using key-based upserts. The job is scheduled with cron or orchestrated via Airflow so downstream apps and reports see fresh, consistent tables.
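A minimal sketch of that job, assuming illustrative HDFS paths, column names (`order_ts`, `amount`, `user_id`), table names, and credentials that are not fixed by this project:

```python
# Sketch of the daily-aggregation job. Paths, columns, table names, and
# credentials below are illustrative assumptions, not project specifics.

def jdbc_options(url, table, user, password):
    """Options dict for Spark's JDBC writer pointed at a MySQL target."""
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.mysql.cj.jdbc.Driver",  # MySQL Connector/J driver class
    }

def run_daily_sales(spark, source_path, opts):
    """Read raw orders from HDFS, aggregate per day, write to MySQL via JDBC."""
    from pyspark.sql import functions as F

    orders = spark.read.parquet(source_path)
    daily = (orders
             .groupBy(F.to_date("order_ts").alias("sale_date"))
             .agg(F.sum("amount").alias("total_sales"),
                  F.countDistinct("user_id").alias("active_users")))

    # mode="overwrite" keeps reruns idempotent for full-refresh targets.
    writer = daily.write.format("jdbc").mode("overwrite")
    for k, v in opts.items():
        writer = writer.option(k, v)
    writer.save()

# Usage (inside a spark-submit script), with an assumed cluster layout:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("hdfs-to-mysql").getOrCreate()
#   run_daily_sales(spark,
#                   "hdfs:///data/refined/orders/",
#                   jdbc_options("jdbc:mysql://db-host:3306/analytics",
#                                "daily_sales", "etl", "<password>"))
```

Keeping the JDBC options in one helper makes it easy to switch credentials between environments without touching the transformation logic.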

Outcome

  • Curated MySQL tables generated from lake data (daily metrics, top-N, etc.).
  • Idempotent loads via overwrite or upsert/merge patterns that keep targets in sync.
  • Automated runs with cron or Airflow for hands-off delivery.
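Spark's JDBC sink has no native upsert mode, so one common stage-and-merge pattern is to write to a staging table and then run `INSERT … SELECT … ON DUPLICATE KEY UPDATE` against the target. A sketch of the statement builder, with illustrative table and column names:

```python
# Stage-and-merge upsert for MySQL: Spark writes the batch into a staging
# table, then this statement folds it into the target. Names are examples.

def upsert_sql(target, staging, key_cols, value_cols):
    """Build the MySQL merge statement from staging into the target table.

    Assumes the target has a PRIMARY KEY or UNIQUE index on key_cols,
    which is what triggers the ON DUPLICATE KEY UPDATE branch.
    """
    cols = ", ".join(key_cols + value_cols)
    updates = ", ".join(f"{c} = VALUES({c})" for c in value_cols)
    return (f"INSERT INTO {target} ({cols}) "
            f"SELECT {cols} FROM {staging} "
            f"ON DUPLICATE KEY UPDATE {updates}")

# Example: merge staged daily metrics keyed by sale_date:
#   upsert_sql("daily_sales", "daily_sales_stage",
#              ["sale_date"], ["total_sales", "active_users"])
```

Because the merge is keyed, replaying the same staging batch leaves the target unchanged, which is what makes the load idempotent.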

What you’ll build

  • PySpark job/notebook to read HDFS → aggregate → write via JDBC.
  • JDBC config + connector setup for MySQL.
  • MySQL DDL + sample schema for target tables.
  • Upsert strategies (e.g., `ON DUPLICATE KEY UPDATE`, stage-and-swap).
  • Airflow DAG using `SparkSubmitOperator`.
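For the orchestration piece, a minimal DAG sketch: the DAG id, schedule, script path, and driver coordinates below are assumptions to adapt to your deployment, not project specifics.

```python
# Minimal Airflow DAG that submits the PySpark aggregation job daily.
# dag_id, schedule, paths, and the connector version are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import (
    SparkSubmitOperator,
)

with DAG(
    dag_id="hdfs_pyspark_mysql",
    schedule="0 2 * * *",            # daily at 02:00, after upstream data lands
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    load_daily_metrics = SparkSubmitOperator(
        task_id="load_daily_metrics",
        application="/opt/jobs/daily_sales.py",          # the aggregation script
        conn_id="spark_default",                         # Spark connection in Airflow
        packages="mysql:mysql-connector-java:8.0.33",    # MySQL JDBC driver
    )
```

The equivalent cron entry would call `spark-submit` directly; Airflow adds retries, alerting, and visibility over that baseline.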