RDBMS → Sqoop → HDFS/Hive (Batch Ingestion)

Architecture Diagram

[Diagram: rdbms-sqoop-hdfs-hive]

Overview

You’ll configure Sqoop to pull tables from an OLTP database (MySQL) into HDFS: first a full load, then incremental loads driven by a watermark column such as a last-modified timestamp. The data is stored in a columnar format (Parquet or ORC) and exposed through partitioned Hive tables, either external or managed. Shell scripts scheduled with cron run the jobs so the batch ingestion is repeatable.
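
The full-then-incremental pattern described above can be sketched as two thin wrappers around `sqoop import`. This is a minimal sketch, not a definitive setup: the JDBC URL, user, password file, table name (`orders`), key column (`id`), watermark column (`updated_at`), and HDFS path are all hypothetical placeholders, and the import is assumed to land as Sqoop's default delimited text, with conversion to a columnar format left to Hive. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: full load and watermark-based incremental load with Sqoop.
# Every default below is a hypothetical placeholder; override via environment.
set -eu

JDBC_URL="${JDBC_URL:-jdbc:mysql://mysql-host:3306/shop}"
DB_USER="${DB_USER:-etl_user}"
PASSWORD_FILE="${PASSWORD_FILE:-/user/etl/.mysql.password}"
TABLE="${TABLE:-orders}"
TARGET_DIR="${TARGET_DIR:-/data/raw/orders}"
KEY_COLUMN="${KEY_COLUMN:-id}"
WATERMARK_COLUMN="${WATERMARK_COLUMN:-updated_at}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command, 0 = execute it

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Options shared by both load types; extra options are appended via "$@".
sqoop_import() {
  run sqoop import \
    --connect "$JDBC_URL" \
    --username "$DB_USER" \
    --password-file "$PASSWORD_FILE" \
    --table "$TABLE" \
    --target-dir "$TARGET_DIR" \
    --split-by "$KEY_COLUMN" \
    --num-mappers 4 \
    "$@"
}

# Full load: replace whatever is already in the target directory.
full_load() {
  sqoop_import --delete-target-dir
}

# Incremental load: fetch rows modified since the given watermark ($1)
# and merge them into the existing data on the key column.
incremental_load() {
  sqoop_import \
    --incremental lastmodified \
    --check-column "$WATERMARK_COLUMN" \
    --last-value "$1" \
    --merge-key "$KEY_COLUMN"
}
```

Calling `full_load` once and then `incremental_load '2024-01-15 00:00:00'` on later runs prints the corresponding commands; setting `DRY_RUN=0` executes them on a host where Sqoop is installed.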

What You Will Build

  • A Docker or EMR lab environment running Hadoop and Hive.
  • Sqoop jobs: a full import plus incremental imports keyed on a last-modified column.
  • Hive DDL for external and managed tables, with partitions.
  • Shell scripts and cron to orchestrate the jobs.
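
As a rough illustration of the Hive DDL item above, the following sketch generates HiveQL for one external table over the Sqoop output and one managed, partitioned Parquet table. The table names (`orders_raw`, `orders`), the columns, the `load_date` partition key, and the HDFS path are assumptions made for the example, and the raw files are assumed to be comma-delimited text. Overwriting one partition per load date is only one possible layout; it treats each partition as a snapshot of the raw directory.

```shell
#!/bin/sh
# Sketch: emit HiveQL for an external raw table and a managed Parquet table.
# Table names, columns, and paths are hypothetical placeholders.
set -eu

RAW_DIR="${RAW_DIR:-/data/raw/orders}"

# Print the HiveQL for one load date ($1, formatted YYYY-MM-DD).
orders_hql() {
  cat <<EOF
-- External table over the files Sqoop wrote; dropping it leaves the data in HDFS.
CREATE EXTERNAL TABLE IF NOT EXISTS orders_raw (
  id BIGINT,
  customer_id BIGINT,
  amount DECIMAL(12,2),
  updated_at TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '${RAW_DIR}';

-- Managed, partitioned Parquet table; Hive owns both metadata and data.
CREATE TABLE IF NOT EXISTS orders (
  id BIGINT,
  customer_id BIGINT,
  amount DECIMAL(12,2),
  updated_at TIMESTAMP
)
PARTITIONED BY (load_date STRING)
STORED AS PARQUET;

-- Rewrite one partition from the current contents of the raw directory.
INSERT OVERWRITE TABLE orders PARTITION (load_date = '$1')
SELECT id, customer_id, amount, updated_at FROM orders_raw;
EOF
}
```

One way to use it is to write the output to a file and pass that to the Hive CLI, for example `orders_hql "$(date +%Y-%m-%d)" > /tmp/orders.hql` followed by `hive -f /tmp/orders.hql`.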

Tech Stack

Hadoop (HDFS), Hive, Sqoop, MySQL, shell

Learning Outcomes

  • Ingest tables from MySQL into HDFS with Sqoop, as full and incremental loads.
  • Expose the data through partitioned Hive tables stored as Parquet or ORC.
  • Automate daily runs with shell scripts and cron.
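
The daily automation can be sketched as a small driver that remembers the watermark between runs. This is one possible approach under stated assumptions: the state file path, the install path in the crontab line, and the ingest command passed to `daily_run` are hypothetical, and the script host and the database are assumed to share a time zone. Sqoop's saved jobs (`sqoop job --create` and `sqoop job --exec`) can track the last value themselves, so a state file like this is an alternative rather than a requirement.

```shell
#!/bin/sh
# Sketch: daily driver that persists the watermark in a local state file.
# Paths and the ingest command are hypothetical placeholders.
set -eu

STATE_FILE="${STATE_FILE:-/var/lib/etl/orders.watermark}"
FIRST_RUN_VALUE='1970-01-01 00:00:00'

# Print the stored watermark, or a very old timestamp on the first run.
read_watermark() {
  if [ -s "$STATE_FILE" ]; then
    cat "$STATE_FILE"
  else
    printf '%s\n' "$FIRST_RUN_VALUE"
  fi
}

# Replace the stored watermark via a temp file and rename, so a crash
# mid-write cannot leave a truncated value behind.
write_watermark() {
  printf '%s\n' "$1" > "$STATE_FILE.tmp"
  mv "$STATE_FILE.tmp" "$STATE_FILE"
}

# Run one incremental cycle. "$@" is the ingest command; the previous
# watermark is appended as its last argument.
daily_run() {
  started_at=$(date '+%Y-%m-%d %H:%M:%S')  # captured before the import starts
  last_value=$(read_watermark)
  "$@" "$last_value"                        # with set -e, a failure stops here
  write_watermark "$started_at"             # reached only if the import succeeded
}

# Example crontab entry (02:00 every day), assuming this file is installed as
# /opt/etl/orders_ingest.sh and ends with a daily_run call:
#   0 2 * * * /opt/etl/orders_ingest.sh >> /var/log/etl/orders.log 2>&1
```

Storing the run's start time rather than its end time means rows modified while the import was running are picked up again on the next run; with a merge on the key column, re-importing a row is harmless.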