S3 Landing → Lambda → Athena (Serverless Analytics)

Validate files automatically as they land in S3, register their schemas in the Glue Data Catalog, and query them with Athena for lightweight, low-cost analytics.

Difficulty: Beginner
Tech stack: S3, Lambda (S3 trigger), Glue (Crawler/Catalog), Athena, (optional) Transfer Family
Estimated time: 1-2 hrs

Overview

Files land under the raw/ prefix in S3 (from vendor SFTP or app exports). An S3 ObjectCreated event invokes a Lambda function that validates file naming and runs basic CSV/JSON sanity checks, then optionally converts the file to Parquet and writes it under processed/ with a partitioned path (e.g., `dt=YYYY-MM-DD/`). Lambda then starts a Glue Crawler (or updates tables directly via the Glue APIs). Once the catalog is updated, the data is queryable in Athena. Athena writes query results to a dedicated S3 results location configured on the workgroup; partitioning and compression reduce the amount of data each query scans, which keeps cost low. An EventBridge rule can run lightweight housekeeping and compaction on a schedule.
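The validation and path-mapping step described above can be sketched as follows. This is a minimal illustration, not the project's actual handler: the naming convention (`raw/<vendor>/<name>_<YYYY-MM-DD>.<csv|json>`), the `vendor=` partition key, and the `CRAWLER_NAME` environment variable are all assumptions made for the example. The Parquet conversion itself is left out.

```python
import os
import re
from datetime import date
from urllib.parse import unquote_plus

# Hypothetical naming convention: raw/<vendor>/<name>_<YYYY-MM-DD>.<csv|json>
RAW_KEY_PATTERN = re.compile(
    r"^raw/(?P<vendor>[a-z0-9_-]+)/(?P<name>[A-Za-z0-9_-]+)"
    r"_(?P<dt>\d{4}-\d{2}-\d{2})\.(?P<ext>csv|json)$"
)


def processed_key(raw_key):
    """Map a raw key to a partitioned Parquet key, or None if the name is invalid."""
    match = RAW_KEY_PATTERN.match(raw_key)
    if match is None:
        return None
    try:
        date.fromisoformat(match["dt"])  # reject impossible dates such as 2024-13-01
    except ValueError:
        return None
    return f"processed/vendor={match['vendor']}/dt={match['dt']}/{match['name']}.parquet"


def start_crawler(name):
    """Start the Glue Crawler; ignore the error raised if it is already running."""
    import boto3  # imported lazily so the pure helpers above work without AWS access

    glue = boto3.client("glue")
    try:
        glue.start_crawler(Name=name)
    except glue.exceptions.CrawlerRunningException:
        pass


def handler(event, context):
    """S3-triggered entry point: validate each object key and plan its destination."""
    accepted, rejected = [], []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys in S3 event notifications are URL-encoded (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        target = processed_key(key)
        if target is None:
            rejected.append(key)
        else:
            accepted.append({"bucket": bucket, "source": key, "target": target})
    crawler = os.environ.get("CRAWLER_NAME")  # assumed configuration, not from the source
    if accepted and crawler:
        start_crawler(crawler)
    return {"accepted": accepted, "rejected": rejected}
```

Keeping `processed_key` free of AWS calls makes the naming rules easy to unit-test locally; only `start_crawler` touches Glue.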

Outcome

Serverless pipeline with S3 + Lambda + Athena (no warehouse to manage).
Faster insights and lower cost via partitioned tables and schemas registered in the Glue Data Catalog.
Hands-off ingestion from vendor feeds into query-ready tables.

What you’ll build

S3 layout: raw/ and processed/ buckets or prefixes, organized by date and vendor.
Lambda (S3 trigger) that validates filenames/format, optionally normalizes to Parquet, and starts a Glue Crawler (or calls the Glue API to create/update tables).
Glue Data Catalog databases/tables for Athena.
Athena config: workgroup + query-results bucket, sample queries, and cost guardrails (partitions/pruning).
(Optional) SFTP → S3 via AWS Transfer Family (or existing vendor drop).
(Optional) EventBridge rule for periodic housekeeping (expire old data, compact small files).
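As an illustration of the Athena sample queries and partition-pruning guardrail listed above, the sketch below builds a query that filters on the `dt` partition column and submits it through a workgroup. The table, database, and workgroup names are hypothetical, and it assumes `dt` is a string partition holding ISO dates (so a lexicographic `BETWEEN` matches date order).

```python
from datetime import date


def build_daily_count_query(table, start, end):
    """Return SQL that filters on the dt partition so Athena can prune partitions."""
    # Only allow simple identifiers, since the table name is interpolated into SQL.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    if start > end:
        raise ValueError("start must not be after end")
    return (
        f"SELECT dt, COUNT(*) AS row_count FROM {table} "
        f"WHERE dt BETWEEN '{start.isoformat()}' AND '{end.isoformat()}' "
        "GROUP BY dt ORDER BY dt"
    )


def run_query(sql, database, workgroup):
    """Submit the query to Athena and return its execution ID."""
    import boto3  # imported lazily so the query builder works without AWS access

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        WorkGroup=workgroup,  # results go to the location configured on the workgroup
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    # Hypothetical table name; prints the SQL without contacting AWS.
    print(build_daily_count_query("vendor_orders", date(2024, 1, 1), date(2024, 1, 7)))
```

Because the `WHERE` clause references only the partition column, Athena should read just the matching `dt=` prefixes rather than the whole table. A per-query data-scanned limit on the workgroup is a further guardrail worth considering.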