Coming up · I teach this course · Authored · Fall 2026

DS-551 · Data Engineering at Scale

Build and operate the systems that move data at production scale. A project-based course organized around the Epidemic Engine — pipelines, streams, containers, and the trade-offs that come with real infrastructure.

DS-551 (formerly DS-593) is about what happens to data before the data scientists get it — and after. Collection, movement, transformation, storage, publishing, monitoring: the unglamorous machinery that every dashboard, model, and “insight” quietly depends on.

The Epidemic Engine

The whole semester is organized around one system: the Epidemic Engine, a hypothetical-but-realistic application that gathers intelligence about potential health events, aggregates heterogeneous data, publishes it in multiple forms, and supports forecasting. It’s fictional, but its problems are not — imperfect data, changing requirements, trade-offs among latency, throughput, and cost, and the constant need for observability.

Almost everything you build, from your first API collector to your final streaming pipeline on OpenShift, plugs into it. By the end of the term you’ve assembled a real distributed system, piece by piece, and you can explain every decision in it.

How it runs

This is a fast-paced, project-based course. Most class meetings open with a short, closed-book check that you did the prep; lecture time then goes to the hardest concepts, trade-offs, and live demos. Three cumulative projects add capabilities to the Epidemic Engine, and teams document their design decisions and runbooks the way working engineers do.

The goal is explicitly not to memorize tools. Kafka, Spark, and NiFi will be replaced by something else eventually; the documentation-driven engineering habits you build here won’t be.

Who it’s for

The course is aimed at upper-level undergraduates and master’s students who are comfortable in Python, have seen SQL, and want to understand how data systems actually work at enterprise scale. DS 310 (or equivalent) is the expected background.

Course materials

Reference documents for the course — read online or download. Assignments and weekly schedules live in the LMS, not here.

  • DS-551 Syllabus · Spring 2026

    The Spring 2026 course contract for Data Engineering at Scale — the Epidemic Engine, GAIEs, homeworks, the project, and the GenAI policy.

    PDF

Datasets we use

Materials here are from Spring 2026, the most recently taught term.

  • Epidemic Engine — Health Events Self-hosted

    A synthetic stream of health-event records for the Epidemic Engine — the raw material for DS-551's ingestion and streaming pipelines.

Planned terms

  • Spring 2027 Taught

Past terms