Relational Database → Amazon S3 (Batch Data Ingestion)

AWS • Ingestion • Beginner • Healthcare

Foundations

Architecture Diagram

Overview

Healthcare applications store day-to-day operational data, such as patients, providers, encounters, diagnoses, medications, vitals, allergies, and discharge information, in relational databases.

Running reporting or analytics queries directly against that application database is often impractical, because heavy analytical queries compete with the application's own workload. In real projects, the data is usually copied from the source system into a data lake first, so other teams and pipelines can use it without touching the source.

In this pipeline, you will build that first step.

You will take healthcare data from PostgreSQL, load it into Amazon S3, make the loaded data queryable, and schedule the pipeline so it can run repeatedly. Rather than reloading the full database on every run, the pipeline loads only the records that are new or have changed since the previous run (an incremental load).
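One common way to load only new or changed records is to keep a "watermark" (the latest change timestamp already ingested) and select only rows newer than it. The sketch below illustrates that idea. It is not the pipeline's actual implementation: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL so it runs anywhere, and the patients table, the updated_at column, and the function names are assumptions made for illustration.

```python
import sqlite3


def fetch_changed_patients(conn, watermark):
    """Return patient rows whose updated_at is strictly later than the watermark."""
    query = (
        "SELECT id, name, updated_at FROM patients "
        "WHERE updated_at > ? ORDER BY updated_at"
    )
    return conn.execute(query, (watermark,)).fetchall()


def next_watermark(rows, current):
    """Advance the watermark to the newest updated_at seen, or keep it if no rows."""
    return max((row[2] for row in rows), default=current)


# Hypothetical source table; sqlite3 stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [
        (1, "Patient A", "2024-01-01T08:00:00"),
        (2, "Patient B", "2024-01-02T09:30:00"),
        (3, "Patient C", "2024-01-03T10:15:00"),
    ],
)

# ISO 8601 timestamps stored as text sort in chronological order.
last_run = "2024-01-01T23:59:59"
changed = fetch_changed_patients(conn, last_run)
new_watermark = next_watermark(changed, last_run)

print([row[0] for row in changed])
print(new_watermark)
```

Storing the new watermark after each successful run means the next run starts where this one stopped. This approach only works if the source table records a reliable change timestamp on every insert and update.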

What You Will Build

Set up a healthcare source database in PostgreSQL
Connect AWS Glue to the source database
Load healthcare data into Amazon S3
Load only new or changed records during each pipeline run
Store the data in Parquet format inside the data lake
Make the loaded data available for querying by cataloging it with a Glue Crawler
Query the data using Amazon Athena
Run the pipeline automatically on a schedule using Glue Workflows and Triggers
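To make the "store in the data lake" and "make it queryable" steps more concrete, here is a minimal sketch of one possible S3 layout, where each run writes its Parquet files under a Hive-style key=value partition folder. Glue Crawlers and Athena can recognize this folder style as a partition. The bucket name, the raw/ prefix, the ingest_date partition key, and the helper function are all hypothetical and are not taken from the pipeline itself.

```python
from datetime import date


def build_s3_prefix(bucket, table, run_date):
    """Build the S3 prefix for one table and one ingestion date.

    The layout (raw/<table>/ingest_date=YYYY-MM-DD/) is an assumed convention.
    """
    return f"s3://{bucket}/raw/{table}/ingest_date={run_date.isoformat()}/"


# Hypothetical bucket and table names, used only for illustration.
prefix = build_s3_prefix("example-healthcare-lake", "patients", date(2024, 1, 3))
print(prefix)
```

Writing each run to its own dated folder keeps incremental loads separate, so a failed run can be re-executed without overwriting data from earlier runs.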

Tech Stack

PostgreSQL • AWS Glue • Amazon S3 • Glue Crawler • Athena • AWS IAM • Apache Parquet

Learning Outcomes

Extract data from relational databases into AWS
Build repeatable batch ingestion pipelines
Load only new or changed records during pipeline execution
Organize source data within a cloud data lake
Validate ingested data using query tools such as Athena
Schedule and automate data pipelines
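As a rough illustration of the scheduling outcome, the sketch below builds the settings for a Glue trigger that starts a job once a day. It only assembles a dictionary and does not call AWS. The key names follow the parameters of the boto3 Glue client's create_trigger call as I understand them, and the trigger, workflow, and job names are hypothetical placeholders.

```python
def build_schedule_trigger(trigger_name, workflow_name, job_name, hour_utc):
    """Return settings for a daily scheduled Glue trigger (no AWS call is made)."""
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be between 0 and 23")
    return {
        "Name": trigger_name,
        "WorkflowName": workflow_name,
        "Type": "SCHEDULED",
        # AWS cron format: minutes hours day-of-month month day-of-week year (UTC).
        "Schedule": f"cron(0 {hour_utc} * * ? *)",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


# Hypothetical names, used only for illustration.
trigger_request = build_schedule_trigger(
    trigger_name="daily-ingest-trigger",
    workflow_name="healthcare-ingest-workflow",
    job_name="postgres-to-s3-incremental",
    hour_utc=2,
)
print(trigger_request["Schedule"])
```

In a real setup, a dictionary like this could be passed as keyword arguments to the Glue client's create_trigger method, or the same values could be entered in the Glue console when adding a trigger to a workflow.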