External RDBMS → Glue → S3 → Redshift (Batch Ingestion + ELT)
- Difficulty: Intermediate
- Tech stack: Glue (Connection/Crawler/ETL), S3, Redshift, Secrets Manager, Step Functions (optional)
- Estimated time: 1-2 hrs
Overview
AWS Glue connects to the external RDBMS over JDBC with SSL, with credentials stored in Secrets Manager; a Crawler registers the source schemas in the Data Catalog, and an ETL job extracts tables to S3 as partitioned Parquet. Redshift loads from S3 via `COPY` into staging tables, then merges into modeled fact/dimension tables with appropriate distribution and sort keys. A daily workflow (Glue Triggers or Step Functions) drives extract → transform → load, supporting incremental upserts and simple row-count checks.
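The COPY-into-staging-then-merge step can be sketched as a pair of SQL generators. This is a minimal sketch, not the project's actual code: the table names (`stg_orders`, `orders`), key column, and IAM role ARN shown in the test usage are illustrative assumptions.

```python
# Sketch of the Redshift load step: COPY partitioned Parquet from S3 into
# a staging table, then MERGE staged rows into the modeled target table.
# All identifiers passed in are assumed to be trusted (no user input).

def copy_sql(staging_table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a COPY statement that loads Parquet from S3 into staging."""
    return (
        f"COPY {staging_table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS PARQUET;"
    )

def merge_sql(target: str, staging: str, key: str, cols: list[str]) -> str:
    """Build a MERGE (upsert) from staging into the target table."""
    set_clause = ", ".join(f"{c} = {staging}.{c}" for c in cols)
    insert_cols = ", ".join([key] + cols)
    insert_vals = ", ".join(f"{staging}.{c}" for c in [key] + cols)
    return (
        f"MERGE INTO {target}\n"
        f"USING {staging} ON {target}.{key} = {staging}.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) "
        f"VALUES ({insert_vals});"
    )
```

In practice these statements would be executed against the cluster (e.g. via the Redshift Data API) after the Glue job lands the day's Parquet partitions; truncating the staging table between runs keeps the MERGE idempotent.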
Outcome
- Real-world external DB → AWS ingestion with secure JDBC/SSL.
- Cheaper/faster queries via Parquet + tuned dist/sort keys in Redshift.
- Repeatable daily upserts orchestrated by Glue Triggers/Step Functions.
What you’ll build
- Glue Connection to external DB (JDBC, SSL certs, Secrets Manager).
- Glue Crawler to register external schemas in the Data Catalog.
- Glue ETL job: extract → S3 (staging) → transform to Parquet (partitioned).
- Redshift COPY from S3 into staging, then MERGE/UPSERT to facts/dims.
- Incremental load pattern (watermark/updated_at) + simple validations.
- (Optional) Workflow with Glue Triggers / Step Functions; basic monitoring.
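The incremental-load and validation bullets above can be sketched as two small helpers: one builds a pushdown predicate from the last successful watermark (so the JDBC extract only pulls changed rows), and one compares extracted vs. loaded row counts. The `updated_at` column name comes from the list above; the tolerance parameter is an illustrative assumption.

```python
# Sketch of the watermark-based incremental pattern plus a simple
# row-count validation. The watermark would typically be persisted
# per-table (e.g. in DynamoDB or a control table) after each run.

from datetime import datetime

def incremental_predicate(watermark: datetime, column: str = "updated_at") -> str:
    """WHERE-clause fragment pushed down to the JDBC source so only rows
    changed since the last successful run are extracted."""
    return f"{column} > '{watermark.isoformat(sep=' ')}'"

def row_counts_match(extracted: int, loaded: int, tolerance: float = 0.0) -> bool:
    """Validation gate: loaded count must be within `tolerance` (as a
    fraction) of the extracted count; fail the workflow otherwise."""
    if extracted == 0:
        return loaded == 0
    return abs(extracted - loaded) / extracted <= tolerance
```

A Glue job would pass the predicate to the JDBC read (e.g. as a pushdown filter) and record the max `updated_at` it saw as the next run's watermark; the count check then guards the Redshift MERGE step in the workflow.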