DS-551 · Data Engineering at Scale
Build and operate the systems that move data at production scale. A project-based course organized around the Epidemic Engine — pipelines, streams, containers, and the trade-offs that come with real infrastructure.
DS-551 (formerly DS-593) is about what happens to data before the data scientists get it — and after. Collection, movement, transformation, storage, publishing, monitoring: the unglamorous machinery that every dashboard, model, and “insight” quietly depends on.
The Epidemic Engine
The whole semester is organized around one system: the Epidemic Engine, a hypothetical-but-realistic application that gathers intelligence about potential health events, aggregates heterogeneous data, publishes it in multiple forms, and supports forecasting. It’s fictional, but its problems are not — imperfect data, changing requirements, trade-offs among latency, throughput, and cost, and the constant need for observability.
Almost everything you build, from your first API collector to your final streaming pipeline on OpenShift, plugs into it. By the end of the term you’ve assembled a real distributed system, piece by piece, and you can explain every decision in it.
How it runs
This is a fast-paced, project-based course. Most class meetings open with a short, closed-book check that you did the prep; lecture time then goes to the hardest concepts, trade-offs, and live demos. Three cumulative projects add capabilities to the Epidemic Engine, and teams document their design decisions and runbooks the way working engineers do.
The goal is explicitly not to memorize tools. Kafka, Spark, and NiFi will be replaced by something else eventually; the documentation-driven engineering habits you build here won’t be.
Who it’s for
Upper-level undergraduates and master’s students who are comfortable in Python, have seen SQL, and want to understand how data systems actually work at enterprise scale. DS 310 (or equivalent) is the expected background.
Course materials
- DS-551 Syllabus
The Spring 2026 course contract for Data Engineering at Scale — the Epidemic Engine, GAIEs, homeworks, the project, and the GenAI policy.
Datasets we use
- Epidemic Engine: Health Events (self-hosted)
A synthetic stream of health-event records for the Epidemic Engine — the raw material for DS-551's ingestion and streaming pipelines.
Planned terms
- Spring 2027
Past terms
- Spring 2026 Syllabus PDF
- Fall 2025 Syllabus PDF
- Fall 2024 Syllabus PDF
- Spring 2024 Syllabus PDF