Kinesis → Glue Streaming → Redshift

Architecture Diagram

[Figure: kinesis-glue-streaming-redshift architecture diagram]

Overview

Events flow into a Kinesis data stream and are read by a Glue Streaming job built on Spark Structured Streaming. The job parses the JSON payloads, applies light transforms, and lands rows in a Redshift staging table in micro-batches (e.g., every 1–5 minutes). A post-write step runs a MERGE that upserts the staged rows into the analytics tables, keeping them current without duplicates. Invalid records are parked in an S3 dead-letter queue (DLQ), and CloudWatch and Glue job metrics report throughput, latency, and errors.
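As a rough illustration of the first hop (events flowing into Kinesis), the sketch below builds JSON events and sends them with a single PutRecords call. The event fields (`event_id`, `user_id`, `event_type`, `event_ts`) and the helper names are assumptions made for this example, not part of the project spec. The `client` argument is expected to behave like a boto3 Kinesis client.

```python
import json
import time
import uuid


def build_event(user_id, event_type):
    """Build one illustrative event; the field names are assumptions."""
    return {
        "event_id": str(uuid.uuid4()),
        "user_id": user_id,
        "event_type": event_type,
        "event_ts": int(time.time() * 1000),  # epoch milliseconds
    }


def to_kinesis_records(events):
    """Convert events into the Records shape that PutRecords expects."""
    return [
        {
            "Data": json.dumps(event).encode("utf-8"),
            # Records with the same partition key hash to the same shard.
            "PartitionKey": event["user_id"],
        }
        for event in events
    ]


def send_batch(client, stream_name, events):
    """Send one batch via PutRecords and return the number of failed records."""
    if len(events) > 500:
        raise ValueError("PutRecords accepts at most 500 records per call")
    response = client.put_records(
        StreamName=stream_name,
        Records=to_kinesis_records(events),
    )
    return response.get("FailedRecordCount", 0)
```

With boto3 installed, a call such as `send_batch(boto3.client("kinesis"), "events-stream", batch)` would push one batch; the stream name here is hypothetical. A non-zero return value means some records were rejected and should be retried.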

What You Will Build

A Kinesis Data Stream + simple event producer (Python).
A Glue Streaming job (Spark Structured Streaming) that:

  1. Consumes JSON, parses and validates it against a schema, and enriches it with lookups.
  2. Writes micro-batches into Redshift staging (JDBC or the Redshift Data API).
  3. Invokes a MERGE (upsert) into the target fact and dimension tables.
  4. Sends bad records to a DLQ (S3) with error context.

Basic monitoring and alerts (CloudWatch metrics, failure notifications).
(Optional) Orchestration via Step Functions.
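The parse/validate and DLQ steps can be sketched in plain Python, independent of Spark. This is a minimal illustration, assuming a hypothetical event schema (`REQUIRED_FIELDS`) and hypothetical helper names. In a real Glue job, comparable logic would run inside the per-batch function, and the dead-letter entries would be written to an S3 prefix.

```python
import json
from datetime import datetime, timezone

# Assumed event schema: field name -> required Python type.
REQUIRED_FIELDS = {
    "event_id": str,
    "user_id": str,
    "event_type": str,
    "event_ts": int,
}


def validate(raw):
    """Parse one JSON payload and check required fields; raise ValueError if invalid."""
    try:
        event = json.loads(raw)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"malformed JSON: {exc}") from exc
    if not isinstance(event, dict):
        raise ValueError("payload is not a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in event:
            raise ValueError(f"missing field: {field}")
        if not isinstance(event[field], expected_type):
            raise ValueError(f"wrong type for {field}")
    return event


def split_batch(raw_records):
    """Split a micro-batch into rows for staging and DLQ entries with error context."""
    valid, dead = [], []
    for raw in raw_records:
        try:
            valid.append(validate(raw))
        except ValueError as exc:
            dead.append({
                "raw": raw,  # original payload, kept for replay
                "error": str(exc),  # why the record was rejected
                "failed_at": datetime.now(timezone.utc).isoformat(),
            })
    return valid, dead
```

Keeping the raw payload next to the error message makes it possible to replay DLQ records after a schema fix, which is the main reason to store error context at all.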

Tech Stack

Kinesis, Glue Streaming (Spark), Redshift (MERGE), CloudWatch, S3 (DLQ)

Learning Outcomes

  • Near real-time ingestion from Kinesis into Redshift.
  • Trustworthy tables via schema normalization and upsert/merge patterns.
  • Operational visibility with Glue job metrics and Kinesis throughput.
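The upsert/merge pattern can be illustrated with a small helper that generates a Redshift MERGE statement from staging into a target table. The table and column names used here are hypothetical, and the helper functions are a sketch rather than a prescribed implementation. The batch is deduplicated by key first, on the assumption that several staging rows matching one target row can make a MERGE fail or produce ambiguous results.

```python
import re

# Only plain identifiers are accepted, so names cannot smuggle in SQL.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check_identifiers(names):
    for name in names:
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")


def build_merge_sql(target, staging, key_cols, value_cols):
    """Build a MERGE that updates matched rows and inserts unmatched ones."""
    _check_identifiers([target, staging, *key_cols, *value_cols])
    on = " AND ".join(f"{target}.{c} = {staging}.{c}" for c in key_cols)
    sets = ", ".join(f"{c} = {staging}.{c}" for c in value_cols)
    all_cols = [*key_cols, *value_cols]
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"{staging}.{c}" for c in all_cols)
    return (
        f"MERGE INTO {target} USING {staging} ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {sets} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


def dedupe_latest(rows, key="event_id", order_by="event_ts"):
    """Keep one row per key, preferring the highest order_by value."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_by] >= latest[k][order_by]:
            latest[k] = row
    return list(latest.values())
```

For example, `build_merge_sql("events", "events_staging", ["event_id"], ["user_id", "event_ts"])` yields a statement that could be run after each micro-batch lands, for instance over JDBC or through the Redshift Data API. Truncating the staging table afterwards is a common follow-up, though the right cleanup depends on the job design.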
