Raw Data → Silver Lakehouse on Amazon S3

AWS • Lakehouse • Intermediate • Healthcare

Architecture Diagram

Overview

In the first two pipelines, healthcare data was loaded into Amazon S3 using batch and CDC ingestion patterns.

But raw data is usually not ready for direct analytics. It may come from different sources, arrive at different times, contain duplicates, or hold both batch and CDC records for the same business entity, so the same record can appear in more than one version.

In this pipeline, you will build the next layer of the data platform.

You will take raw healthcare data from Amazon S3, process it using AWS Glue, and create clean Silver tables using Apache Iceberg. These Silver tables act as trusted lakehouse tables that can be used by downstream Gold, analytics, and reporting pipelines.

What You Will Build

• Set up Silver Iceberg tables on Amazon S3
• Read raw batch data and raw CDC data from Amazon S3
• Combine batch and CDC records based on the table type
• Keep only the latest version of each record
• Handle deleted records from CDC data
• Build trusted Silver tables for downstream processing
• Run Glue jobs to process healthcare datasets
• Orchestrate the Silver pipeline using Airflow on MWAA
• Validate Silver tables using Athena
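
The core Silver logic in the list above (combine batch and CDC records, keep only the latest version of each record, and drop records deleted through CDC) can be sketched in plain Python. This is a minimal illustration of the idea, not the Glue job itself: the column names `patient_id`, `updated_at`, and `op`, the operation codes `I`/`U`/`D`, and the function name `build_silver_rows` are all assumptions made for this example. A real Glue job would more likely express the same steps in Spark against Iceberg tables.

```python
def build_silver_rows(batch_rows, cdc_rows, key="patient_id", ts="updated_at", op="op"):
    """Combine batch and CDC rows, keep the latest row per key, drop deleted keys.

    Assumed (hypothetical) conventions: every row has a business key and a
    sortable timestamp; CDC rows carry an `op` flag where "D" means delete;
    batch rows carry no `op` flag and are treated as plain inserts.
    """
    latest = {}
    # Batch rows are listed first so that, on a timestamp tie, the CDC row wins.
    for row in [*batch_rows, *cdc_rows]:
        current = latest.get(row[key])
        if current is None or row[ts] >= current[ts]:
            latest[row[key]] = row

    # A key whose most recent change is a delete does not reach Silver.
    # The `op` flag is a CDC detail, so it is stripped from the output.
    return [
        {column: value for column, value in row.items() if column != op}
        for row in latest.values()
        if row.get(op) != "D"
    ]


# Illustrative sample data (made up for this sketch).
batch = [
    {"patient_id": 1, "name": "Ana", "updated_at": "2024-01-01"},
    {"patient_id": 2, "name": "Raj", "updated_at": "2024-01-01"},
]
cdc = [
    {"patient_id": 1, "name": "Ana M.", "updated_at": "2024-01-05", "op": "U"},
    {"patient_id": 2, "name": "Raj", "updated_at": "2024-01-03", "op": "D"},
    {"patient_id": 3, "name": "Li", "updated_at": "2024-01-04", "op": "I"},
]

silver = build_silver_rows(batch, cdc)
for row in silver:
    print(row)
```

In this sample, patient 1 keeps only the updated CDC version, patient 2 is removed because its latest change is a delete, and patient 3 arrives through CDC alone. ISO-8601 date strings are used so that string comparison matches chronological order.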

Tech Stack

Amazon S3 • AWS Glue • Apache Iceberg • Apache Parquet • Amazon Athena • Amazon MWAA (Managed Workflows for Apache Airflow) • AWS Glue Data Catalog

Learning Outcomes

After completing this pipeline, you will be able to:

  1. Build Silver lakehouse tables on Amazon S3
  2. Process raw batch and CDC data into trusted tables
  3. Handle latest-record selection and CDC delete records
  4. Use Apache Iceberg tables for lakehouse storage
  5. Run Glue jobs to build Silver lakehouse tables
  6. Orchestrate lakehouse pipelines using Airflow on MWAA
  7. Validate Silver tables using Athena