Relational Database → Amazon S3 (Batch Data Ingestion)

AWS • Ingestion • Beginner • Healthcare

Architecture Diagram

Overview

Healthcare applications store day-to-day operational data, such as patients, providers, encounters, diagnoses, medications, vitals, allergies, and discharge information, in relational databases.

Running reporting or analytics directly against the application database is often impractical, because heavy analytical queries compete with the application's own transactional workload. In real projects, data is usually copied from the source system into a data lake first, so other teams and pipelines can use it without affecting the application.

In this pipeline, you will build that first step.

You will take healthcare data from PostgreSQL, load it into Amazon S3, make the loaded data queryable, and schedule the pipeline so it can run repeatedly. Instead of loading the full database every time, the pipeline focuses on loading only new or changed records.
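One common way to load only new or changed records is a "high-water mark": remember the largest change timestamp already loaded, and on the next run select only rows with a later timestamp. The sketch below illustrates that idea only. It uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs anywhere, and the `patients` table, the `updated_at` column, and the function names are assumptions made for illustration, not part of this pipeline's actual schema. The pipeline itself may rely on a different mechanism, such as AWS Glue job bookmarks.

```python
import sqlite3


def fetch_changed_rows(conn, table, watermark_column, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark."""
    # Table and column names are interpolated into the SQL text, so they
    # must come from trusted configuration, never from user input.
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_column} > ? "
        f"ORDER BY {watermark_column}"
    )
    cursor = conn.execute(query, (last_watermark,))
    columns = [description[0] for description in cursor.description]
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    # If nothing changed, keep the previous watermark unchanged.
    new_watermark = rows[-1][watermark_column] if rows else last_watermark
    return rows, new_watermark


def build_demo_db():
    """Create a small in-memory table (hypothetical schema and sample rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patients ("
        "patient_id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO patients VALUES (?, ?, ?)",
        [
            (1, "Patient A", "2024-01-01T08:00:00"),
            (2, "Patient B", "2024-01-02T09:30:00"),
            (3, "Patient C", "2024-01-03T10:15:00"),
        ],
    )
    return conn


demo_conn = build_demo_db()
# Pretend the previous run already loaded everything up to this timestamp.
changed, watermark = fetch_changed_rows(
    demo_conn, "patients", "updated_at", "2024-01-01T08:00:00"
)
print(len(changed), "changed rows; next watermark:", watermark)
```

The timestamps are stored as ISO 8601 strings, which sort in chronological order, so a plain `>` comparison is enough here. A real job would also need to persist the watermark between runs, and a timestamp-based filter does not detect deleted rows.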

What You Will Build

  • Set up a healthcare source database in PostgreSQL
  • Connect AWS Glue to the source database
  • Load healthcare data into Amazon S3
  • Load only new or changed records during each pipeline run
  • Store the data in Parquet format inside the data lake
  • Catalog the loaded data with a Glue Crawler so it is available for querying
  • Query the data using Amazon Athena
  • Run the pipeline automatically on a schedule using Glue Workflows and Triggers
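To make the scheduling item above more concrete, the sketch below builds the request for a scheduled Glue trigger attached to a workflow. The workflow name, job name, and run time are hypothetical placeholders. The function only assembles the parameters, so it runs without AWS credentials; the commented lines show how the request could be passed to the `create_trigger` call of the boto3 Glue client.

```python
def scheduled_trigger_request(workflow_name, job_name, cron_expression):
    """Build keyword arguments for a scheduled Glue trigger in a workflow."""
    return {
        "Name": f"{workflow_name}-schedule",
        "WorkflowName": workflow_name,
        "Type": "SCHEDULED",
        # Glue schedules use a six-field cron expression, evaluated in UTC:
        # minutes, hours, day-of-month, month, day-of-week, year.
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


# Hypothetical names; "0 2 * * ? *" means every day at 02:00 UTC.
request = scheduled_trigger_request(
    "healthcare-ingestion", "postgres-to-s3", "0 2 * * ? *"
)
print(request)

# With AWS credentials and permissions in place, the request could be sent as:
#   import boto3
#   boto3.client("glue").create_trigger(**request)
```

Keeping the request as plain data makes it easy to review or unit-test the schedule before anything is created in an AWS account.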

Tech Stack

PostgreSQL • AWS Glue • Amazon S3 • AWS Glue Crawler • Amazon Athena • AWS IAM • Apache Parquet

Learning Outcomes

  1. Extract data from relational databases into AWS
  2. Build repeatable batch ingestion pipelines
  3. Load only new or changed records on each run (incremental loading)
  4. Organize source data within a cloud data lake
  5. Validate ingested data using query tools such as Athena
  6. Schedule and automate data pipelines
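
As an illustration of outcome 4, one possible way to organize ingested data in S3 is a prefix per source and table, with a Hive-style `key=value` partition folder per load date, a layout that Glue crawlers can register as table partitions. The bucket name, the `raw/` zone, and the `ingest_date` partition key below are assumptions chosen for this sketch, not the layout this pipeline prescribes.

```python
from datetime import date


def partition_prefix(bucket, source, table, run_date):
    """Build an S3 prefix with a Hive-style date partition for one load."""
    return (
        f"s3://{bucket}/raw/{source}/{table}/"
        f"ingest_date={run_date.isoformat()}/"
    )


# Hypothetical bucket and table names.
prefix = partition_prefix(
    "example-data-lake", "healthcare", "encounters", date(2024, 1, 3)
)
print(prefix)
```

Writing each run's Parquet files under its own date folder keeps runs separate and lets Athena queries that filter on the partition column skip folders they do not need.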