Glue → Data Lakehouse on S3 (Parquet)

Build bronze/silver/gold zones on S3 with Glue – Parquet, partitions, and cataloged tables ready for Athena/Spectrum.

Difficulty: Intermediate
Tech stack: AWS Glue (ETL/Catalog), S3, Parquet, Athena / Redshift Spectrum
Estimated time: 2 hrs

Overview

Files land in bronze exactly as received. Glue jobs standardize schema, clean/transform, and write columnar Parquet into silver with a consistent partitioned path. A second job aggregates/denormalizes into gold for analytics. The Glue Catalog registers tables so Athena/Spectrum can query instantly. Triggers/Workflows schedule bronze→silver→gold; a periodic compaction step merges small files to improve scan efficiency.

Outcome

Lakehouse-style layers (raw → curated → analytics) on plain S3.
Faster/cheaper queries via Parquet + partitioning and table metadata.
Repeatable ETL with Glue triggers/workflows.

What you’ll build

S3 layout: bronze/ (raw) → silver/ (cleaned) → gold/ (BI-ready).
Glue ETL jobs (PySpark) to read JSON/CSV → standardize → write Parquet.
Partitioning strategy (e.g., `dt=YYYY-MM-DD`, `product=`) and small-file compaction.
Glue Data Catalog databases/tables for each zone.
Query paths via Athena or Redshift Spectrum.
(Optional) lightweight DQ checks (row counts, nulls) and a manifest.

Join the waitlist