Coming up · I teach this course · Authored · Fall 2026

DS-551 · Data Engineering at Scale

Build and operate the systems that move data at production scale. A project-based course organized around the Epidemic Engine — pipelines, streams, containers, and the trade-offs that come with real infrastructure.

DS-551 (formerly DS-593) is about what happens to data before the data scientists get it — and after. Collection, movement, transformation, storage, publishing, monitoring: the unglamorous machinery that every dashboard, model, and “insight” quietly depends on.

The Epidemic Engine

The whole semester is organized around one system: the Epidemic Engine, a hypothetical-but-realistic application that gathers intelligence about potential health events, aggregates heterogeneous data, publishes it in multiple forms, and supports forecasting. It’s fictional, but its problems are not — imperfect data, changing requirements, trade-offs among latency, throughput, and cost, and the constant need for observability.

Almost everything you build, from your first API collector to your final streaming pipeline on OpenShift, plugs into it. By the end of the term you’ve assembled a real distributed system, piece by piece, and you can explain every decision in it.

How it runs

This is a fast-paced, project-based course. Most class meetings open with a short, closed-book check that you did the prep; lecture time then goes to the hardest concepts, trade-offs, and live demos. Three cumulative projects add capabilities to the Epidemic Engine, and teams document their design decisions and runbooks the way working engineers do.

The goal is explicitly not to memorize tools. Kafka, Spark, and NiFi will be replaced by something else eventually; the documentation-driven engineering habits you build here won’t be.

Who it’s for

Upper-level undergraduates and master’s students who are comfortable in Python, have seen SQL, and want to understand how data systems actually work at enterprise scale. DS 310 (or equivalent) is the expected background.

Course materials

Reference documents for the course — read online or download. Assignments and weekly schedules live in the LMS, not here.

DS-551 Syllabus · Spring 2026

The Spring 2026 course contract for Data Engineering at Scale — the Epidemic Engine, GAIEs, homeworks, the project, and the GenAI policy.

PDF

Datasets we use

Materials here are from Spring 2026, the most recent taught term. Browse the full catalog →

Epidemic Engine — Health Events Self-hosted

A synthetic stream of health-event records for the Epidemic Engine — the raw material for DS-551's ingestion and streaming pipelines.

Planned terms

Spring 2027 Taught

Past terms

Spring 2026 · Taught · Syllabus PDF
Fall 2025 · Taught · Syllabus PDF
Fall 2024 · Taught · Syllabus PDF
Spring 2024 · Taught · Syllabus PDF

Fall 2026 at a glance

Term-level dates from the BU academic calendar.

Sep 2 Classes begin
Sep 7 Labor Day — no classes
Oct 12 Indigenous Peoples' Day — no classes
Nov 25–29 Thanksgiving recess
Dec 10 Last day of classes
Dec 14–18 Final exams

Assignment deadlines → Blackboard

Due dates are never posted on this site; Blackboard is always correct.

Next up: Fall 2026

Planned return offering.

Tools we use

Blackboard: Weekly schedule, deadlines, announcements, and grades. When in doubt, Blackboard is correct.
Gradescope: Submit work, get feedback, and file regrade requests.
Piazza: Course Q&A and urgent notices. Post here instead of emailing the teaching team.
GitHub: Hosts course repositories for assignments and team projects.
OpenShift: The container platform where pipelines and services get deployed.
Apache Kafka: Streaming backbone for event-driven pipelines.
Apache Spark: Distributed compute for large-scale batch and ML workloads.
Apache NiFi: Visual dataflow tool for building ingestion pipelines.

After the course

Did this course help?

If something from this course shaped your work, research, another class, or how you think, I would like to hear about it.