Post

Designing a Netflix-Style Streaming + Recommendation Platform

A complete two-part guide to a Netflix-like video streaming and recommendation system — From First Principles to Production Checklist — Part I builds the mental model (the why and what), Part II is a phase-by-phase implementation checklist (the how), each item reasoned and gated. Written for rookies, engineers, architects, PMs, and stakeholders alike.

Designing a Netflix-Style Streaming + Recommendation Platform

How to read this guide. It comes in two halves. Part I — Understand builds the mental model: how to reason about the system, the data, and the two-stage funnel. Each step there carries three lenses — 🟢 Plain-English (novice/stakeholder), 🔵 Engineering (builders), 🟣 Strategic (PMs/architects/business). Part II — Build turns that understanding into an ordered, checkable implementation plan, where every checkbox carries a one-line why and every phase closes with a Definition-of-Done gate. Read Part I for the story; reach for Part II when you’re shipping.

Jargon is kept and explained, never dropped — two-tower, ANN, NDCG, feature store, ABR, DRM, canary, drift — all defined in place.


Part I — Understand: The Mental Model

The One-Line Problem

Design a video streaming platform like Netflix — a subscription service where members watch TV shows and movies on the Web, iOS, Android, and TV. We focus on the high level, covering four capabilities: content upload, search, view (stream), and recommendations.

This is a hybrid problem: part classic distributed-systems design (storing and serving huge video files reliably) and part ML-systems design (predicting what each user wants to watch next). A good answer respects both halves.


0. The Mental Model — How to Think Before You Build

Most rookies jump straight to “I’ll use a neural network.” Strong designers slow down and walk a funnel: clarify → quantify → design data → design model → design serving → evaluate → operate. The diagram below is the skeleton every later section hangs on.

flowchart TD
    A["1 · Clarify Requirements<br/>functional + non-functional"] --> B["2 · Frame as ML Problem<br/>pick the objective + metric"]
    B --> C["3 · Estimate Scale<br/>users, QPS, storage, latency budget"]
    C --> D["4 · Design the Data<br/>sources, labels, features"]
    D --> E["5 · Pick the Architecture<br/>two-stage funnel"]
    E --> F["6 · Train the Models<br/>retrieval + ranking"]
    F --> G["7 · Serve at Scale<br/>online + offline, CDN"]
    G --> H["8 · Evaluate<br/>offline + online / A-B"]
    H --> I["9 · Operate<br/>monitoring, drift, retraining"]
    I -.feedback loop.-> D

    style A fill:#16a34a,color:#fff
    style B fill:#16a34a,color:#fff
    style E fill:#2563eb,color:#fff
    style F fill:#2563eb,color:#fff
    style G fill:#2563eb,color:#fff
    style H fill:#7c3aed,color:#fff
    style I fill:#7c3aed,color:#fff

Checklist instinct: never skip a box. Interviewers (and real launches) fail designs that have a brilliant model but no data plan, or great accuracy but a 5-second latency. The candidates who fail ML system design interviews can explain attention mechanisms but cannot answer “how do you serve this at 10,000 predictions per second with p99 latency under 50ms?”


1. Clarify the Requirements

🟢 Plain-English. Before designing anything, agree on what the system must do (functional) and how well it must do it (non-functional). It’s like agreeing on the destination and the speed limit before driving.

🔵 Engineering / 🟣 Strategic.

Functional requirements (from the problem statement):

#RequirementPrimary owner
F1The content team uploads new videos (movies, episodes).Ingestion pipeline
F2Users stream and interact (like, dislike, share).Playback + interaction services
F3Users get personalized recommendations.ML recommendation system
F4Users search the catalog (title, actor, genre).Search service

Non-functional requirements (NFRs):

#NFRConcrete target (state assumptions aloud)
N1High availability, minimal latency99.99% availability; video startup < ~2s; recommendation render < ~200ms
N2Scalability & efficiencyStream to millions concurrently without failure
N3Global, multi-deviceWeb, iOS, Android, TV across regions
N4DurabilityNever lose uploaded master content

Pro move: turn vague NFRs into numbers and say them out loud as assumptions. “Minimal latency” is not designable; “p99 < 200ms for the recommendation row” is.

Scope-narrowing questions to ask (and answer yourself):

  • Read-heavy or write-heavy? → Massively read-heavy (millions watch; a small content team uploads).
  • Do we build the streaming infra and the recommender? → Yes, both, at high level.
  • Real-time personalization or daily-batch? → Both (batch candidates + real-time signals).

2. Frame It as a Machine Learning Problem

🟢 Plain-English. A recommender’s job is to guess what you’ll enjoy watching. We turn “enjoy” into a number the computer can optimize — usually the probability you’ll watch a title for a meaningful length of time, not just click it.

🔵 Engineering. This is a ranking problem: given a user and candidate videos, predict and sort by P(meaningful engagement). For a video recommendation system, the goal is to maximize long-term user watch time, not just immediate clicks; the ML task is ranking — given a user and candidate videos, predict and sort by the probability the user watches each for a meaningful duration.

🟣 Strategic. Choosing the objective metric is the most important decision in the whole design — it silently shapes everything downstream.

Optimize for…Risk
Clicks (CTR)Clickbait thumbnails, regret
Watch-timeLong boring binges, autoplay traps
Long-term retention / satisfactionHard to measure, but what the business actually wants

Always start by clarifying the goal metric — engagement, revenue, or fairness — before drawing the pipeline. A practical compromise is multi-objective: predict several signals (watch, completion, like, return-next-week) and combine them.


3. Estimate the Scale (Back-of-Envelope)

🟢 Plain-English. We do quick “napkin math” to know whether we’re building a bicycle or a rocket. The numbers decide the technology.

🔵 Engineering. Sample assumptions for a Netflix-scale system (state them, then compute):

  • ~260M+ subscribers; assume ~100M daily active. Netflix streams content to over 260 million subscribers globally.
  • Each home-screen load needs personalized rows → millions of recommendation QPS at peak.
  • Catalog: ~100K+ titles, but the candidate space is millions of (title × context) combinations.
  • A single title is encoded into ~1,200 files. Netflix encodes each title into approximately 120+ streams across 10+ resolutions (240p to 4K) × multiple bitrates × audio tracks × subtitle tracks, then segments each into 2–10s chunks.

The key consequence — why we can’t score everything. Running a heavy model over millions of items for every user, every request, is computationally impossible. Multi-stage architectures exist because evaluating every content object for every user becomes computationally impossible. This single fact forces the two-stage funnel in Step 5.

Latency budget is a hard constraint, not a wish. The final ranking stage typically runs within a latency budget of roughly 100 milliseconds. Every design choice spends from this budget.


4. Design the Data — The Real Engine

🟢 Plain-English. Models are only as good as the examples they learn from. We need to collect what users did and turn it into labelled lessons.

🔵 Engineering.

(a) Data sources

  • Interaction / event data: plays, pauses, completions, likes, dislikes, shares, search queries, scroll/skip. Streamed in real time.
  • Content metadata: genre, cast, language, runtime, release year, artwork, and learned content embeddings (from video frames, text, audio).
  • Context: device, time of day, region, session history.

Massive data is ingested in real time from user interactions and content metadata; stored in distributed systems like Cassandra and S3; Kafka handles real-time streaming while Spark and Flink handle batch and stream processing.

(b) Labels — the subtle part

  • Implicit positives: a completed watch, a like, a re-watch.
  • Implicit negatives: impressions shown but not acted on are treated as negatives, while clicks, watch-time, and completions are positives.
  • Beware the feedback loop: you only get labels for items you showed. Items never recommended never get a chance — handle with exploration (below).

(c) Feature engineering

flowchart LR
    subgraph User["User features"]
        U1[Demographics] 
        U2[Watch history embedding]
        U3[Recent session signals]
    end
    subgraph Item["Item features"]
        I1[Genre / cast / language]
        I2[Popularity + age]
        I3[Content embedding]
    end
    subgraph Cross["Cross features"]
        C1["User x Genre affinity"]
        C2["User x Item past interaction"]
    end
    User --> FS[(Feature Store)]
    Item --> FS
    Cross --> FS
    FS --> RET[Retrieval model]
    FS --> RANK[Ranking model]

    style FS fill:#2563eb,color:#fff

Features include user features (demographics, engagement history), item features (category, popularity, age), and cross features (user-item interaction history).

🟣 Strategic. A Feature Store keeps offline-training features and online-serving features consistent — this “train/serve skew” is one of the most common silent killers of ML systems.


5. Pick the Architecture — The Two-Stage Funnel

This is the heart of the design and the single most reused pattern in industry.

🟢 Plain-English. Imagine a talent show with a million applicants. You can’t interview all of them. So first a fast, rough filter picks the few hundred most promising (you’d rather over-include than miss a star). Then a slow, careful judge ranks just those few hundred precisely. Cheap-and-broad first, expensive-and-precise second.

🔵 Engineering. The multi-stage funnel — candidate generation through two-tower retrieval, progressive ranking, and re-ranking with business guardrails — is the framework used by Netflix, YouTube, and Amazon serving billions of recommendations daily.

flowchart TD
    U[User opens app] --> CG

    subgraph Stage1["STAGE 1 · Candidate Generation (Retrieval)"]
        CG["Millions of items ->  ~1,000 candidates<br/>GOAL: recall · FAST · cheap"]
        CG --> M1[Collaborative filtering]
        CG --> M2[Content-based]
        CG --> M3["Two-tower + ANN search (FAISS / ScaNN)"]
    end

    M1 --> MERGE[Merge + dedupe]
    M2 --> MERGE
    M3 --> MERGE

    MERGE --> Stage2

    subgraph Stage2["STAGE 2 · Ranking"]
        RANK["Score each candidate precisely<br/>GOAL: precision · heavy DNN / GBT"]
    end

    RANK --> Stage3

    subgraph Stage3["STAGE 3 · Re-ranking / Guardrails"]
        RR["Diversity · freshness · business rules<br/>de-dup genres · cold titles · policy"]
    end

    RR --> OUT[Final personalized rows]

    style Stage1 fill:#dcfce7,stroke:#16a34a
    style Stage2 fill:#dbeafe,stroke:#2563eb
    style Stage3 fill:#f3e8ff,stroke:#7c3aed
    style OUT fill:#16a34a,color:#fff

5.1 Stage 1 — Candidate Generation (Retrieval)

The first stage retrieves a broad set from the entire catalog — thousands of candidates from millions — must be extremely fast, and prioritizes recall over precision. Multiple strategies run in parallel, each retrieving a few hundred, then merged. Each strategy retrieves 200–300 candidates; merge and dedupe to get 500–1,000.

The dominant technique is the two-tower neural network: one tower encodes user features into a dense embedding vector, the other encodes item features into the same embedding space; during offline processing, item embeddings are precomputed and indexed. At request time you embed the user and do an Approximate Nearest Neighbor (ANN) lookup — finding the closest item vectors fast without comparing against every item.

5.2 Stage 2 — Ranking

Score each candidate with a heavier ranking model — a deep neural network or gradient-boosted tree — using user, item, and cross features. Because it runs on only ~1,000 items, it can afford to be expensive and precise.

A modern refinement is Multi-Task Learning (MTL) — one shared model predicting several signals (watch, like, completion) at once. Netflix’s “Hydra”: consolidating diverse ranking signals and models into a single shared model that performs multiple tasks — ranking different rows, videos, and games. Generating recommendations for multiple use cases from one multi-task model improves performance and simplifies the architecture, improving maintainability.

5.3 Stage 3 — Re-ranking & Guardrails

🟣 Strategic. The model’s “best” list is not always the business’s best list. Re-ranking applies filters for diversity, freshness, or ads. Examples: don’t show five thrillers in a row, surface new releases, respect content licensing per region, inject some exploration.

Netflix-specific touches worth naming: each home-page row (Because you watched X, Trending Now) is its own algorithm, and even the artwork/thumbnail is personalized — a romance fan and an action fan see different images for the same title. Each home-page row is generated by a different algorithm, and Netflix selects which thumbnail to show based on user preferences.


6. Train the Models — Offline vs Online

🟢 Plain-English. Some learning happens slowly overnight on huge data (“batch”); some adjusts within your live session (“online”). Netflix does both.

🔵 Engineering. The recommendation pipeline runs offline (batch processing with Spark) to generate candidate sets and online (real-time model serving) for live personalization. Models train on distributed frameworks like TensorFlow and PyTorch on GPU clusters, deploy in containers (Docker, Kubernetes), and continuously learn from new data.

  • Cold start (new user / new title): fall back to popularity, content-based similarity, and onboarding signals until enough interaction data accrues. Always state your cold-start and fallback plan.
  • Exploration vs exploitation: use contextual bandits to occasionally show uncertain items and learn from the result. A contextual bandit optimizes decision-making by balancing exploration and exploitation using user-specific or situational features.
  • Frontier (2025–26): Large Foundation Models (LFMs) are being folded in — Netflix detailed how Large Foundation Models can enhance recommenders by ingesting world knowledge.

7. Serve at Scale — Where ML Meets the Streaming Backbone

This step joins the recommender to the video delivery machinery. A novice often forgets: recommending the title is useless if the video won’t play instantly.

7.1 The Streaming Backbone (the non-ML half)

🟢 Plain-English. When you press play, the video doesn’t travel from one faraway server. A copy already sits in a box inside your internet provider, so it arrives instantly and adjusts quality to your connection.

🔵 Engineering.

  • CDN — Open Connect: Netflix operates its own CDN, Open Connect; Open Connect Appliances (OCAs) are custom servers placed directly inside ISP networks, each with 100+ TB of storage pre-loaded with popular content. Popular content is pushed to the edge during off-peak hours.
  • Encoding pipeline: ingest the master → per-title / per-shot analysis → encode many renditions → segment → package (HLS/DASH) → distribute. Per-title encoding analyzes each shot’s complexity to allocate bitrate efficiently, producing ~1,200 versions across resolutions, codecs (H.264, H.265, VP9, AV1) and audio formats, then DRM-packaged and pushed to OCAs worldwide. Note this stage is itself ML-assisted via VMAF quality scoring.
  • Adaptive Bitrate (ABR): ABR dynamically adjusts video quality based on the viewer’s network conditions, downloading small 2–10s segments and switching quality smoothly to avoid interruptions.
  • Microservices + resilience: Netflix uses circuit breakers, fallbacks, bulkheads, and timeouts, and pioneered chaos engineering with Chaos Monkey, which randomly disables instances to ensure recovery from failures.

7.2 End-to-End Sequence — “User Opens the App and Presses Play”

sequenceDiagram
    actor User
    participant Client as Client (TV/iOS/Web)
    participant GW as API Gateway (Zuul)
    participant Rec as Recommendation Service
    participant FS as Feature Store
    participant ANN as Retrieval (Two-Tower + ANN)
    participant Rank as Ranking Service
    participant Play as Playback Service
    participant CDN as Open Connect (OCA)

    User->>Client: Open app
    Client->>GW: GET /home (userId, context)
    GW->>Rec: Request personalized rows
    Rec->>FS: Fetch user + context features
    Rec->>ANN: Retrieve ~1,000 candidates (recall)
    ANN-->>Rec: Candidate set
    Rec->>Rank: Score candidates (precision)
    Rank-->>Rec: Ranked list
    Rec-->>GW: Re-ranked rows (diversity, freshness)
    GW-->>Client: Personalized home screen
    Note over Client,CDN: --- User presses Play ---
    User->>Client: Click title
    Client->>GW: Request playback
    GW->>Play: Authorize + get manifest
    Play-->>Client: HLS/DASH manifest + signed CDN URLs + DRM license
    Client->>CDN: Fetch nearest OCA segments
    CDN-->>Client: Stream segments (ABR adapts to bandwidth)
    Client-->>Rec: Emit interaction events (play, pause, complete)
    Note over Rec: Events feed back into training data (Step 4)

On playback the system returns an HLS/DASH manifest with signed, tokenized CDN URLs and a parallel DRM license request, picks the nearest healthy Open Connect node by latency and load, targets startup under ~2s, and continuously upshifts/downshifts bitrate without oscillation.


8. Evaluate — Prove It Works (Twice)

🟢 Plain-English. We check the model two ways: on past data in the lab, and on real users in a controlled live test. Lab-good can still be live-bad.

🔵 Engineering.

Offline metrics (before launch): Precision@K, Recall@K, NDCG (rank quality), AUC. Use the right one — under heavy class imbalance, use PR-AUC rather than ROC-AUC.

Online evaluation (the real verdict): A/B testing. Cover online evaluation, guardrail metrics, and failure modes. Watch guardrail metrics (latency, error rate, complaint rate) alongside the primary metric so a “win” doesn’t secretly harm the experience.

🟣 Strategic. Tie every experiment back to the Step 2 objective (long-term retention), not just the easy proxy (clicks). The gap between proxy and true goal is where products quietly degrade.


9. Operate — Monitoring, Drift, Retraining

🟢 Plain-English. Tastes change, new shows arrive — a model left alone slowly goes stale. We watch it and refresh it.

🔵 Engineering. Cover drift detection, guardrail metrics, retraining cadence, and failure modes. Concept drift is constant; retrain frequently. Online learning and incremental training keep models updated.

This is MLOps: versioned data + models, automated pipelines, canary deploys, rollback, and observability. Netflix uses Atlas for dimensional time-series telemetry with streaming alerting, and Kayenta for automated canary analysis in the deployment workflow.

flowchart LR
    Live[Live serving] --> Mon[Monitor: latency, drift, business KPIs]
    Mon -->|drift detected| Retrain[Retrain / fine-tune]
    Retrain --> Canary[Canary + A-B]
    Canary -->|win| Live
    Canary -->|regression| Rollback[Rollback]
    Rollback --> Live
    style Mon fill:#7c3aed,color:#fff
    style Canary fill:#2563eb,color:#fff

10. Trade-offs to Say Out Loud

Strong designers narrate tensions rather than pretending there’s one right answer.

TensionLever ALever BTypical resolution
Accuracy vs latencyHeavier ranking modelTighter latency budgetTwo-stage funnel splits the difference
Recall vs precisionStage 1 (recall)Stage 2 (precision)Each stage owns one goal
Freshness vs stabilityOnline learningNightly batchHybrid batch + real-time
Personalization vs diversityPure model scoreRe-rank guardrailsStage 3 enforces diversity
Build vs buy (CDN)Own CDN (Open Connect)Third-party CDNOwn at extreme scale; buy early on
Exploitation vs explorationServe best-knownContextual banditsSmall exploration budget

Talk trade-offs explicitly: accuracy vs compute cost; two-tower for retrieval, deep networks for ranking, multi-task learning for consolidation.


Topic / Dependency Tree (Index)

Read top-down: each node depends on the ones above it. This is the “what must I understand before what” map.

flowchart TD
    R["Video Streaming + Recommender System"]

    R --> A["A · Requirements (Functional + NFR)"]
    R --> B["B · ML Problem Framing"]

    A --> A1[Functional: upload, stream, search, recommend]
    A --> A2["NFR: availability, latency, scale, durability"]

    B --> B1["Objective metric: long-term watch / retention"]
    B --> B2["Task type: ranking"]

    B1 --> C["C · Data Foundation"]
    C --> C1[Sources: events, metadata, context]
    C --> C2[Labels: implicit pos/neg, feedback loop]
    C --> C3["Features + Feature Store"]
    C3 --> C3a[Embeddings: user / item / content]

    C --> D["D · Two-Stage Architecture"]
    D --> D1["D1 · Candidate Generation (recall)"]
    D1 --> D1a["Two-tower + ANN (FAISS/ScaNN)"]
    D1 --> D1b[Collaborative + content-based]
    D --> D2["D2 · Ranking (precision)"]
    D2 --> D2a["DNN / GBT + cross features"]
    D2 --> D2b["Multi-Task Learning (Hydra)"]
    D --> D3["D3 · Re-ranking (diversity/business)"]

    D --> E["E · Serving"]
    E --> E1[Offline batch: Spark]
    E --> E2[Online real-time serving]
    E --> E3["Cold start + bandits"]
    E --> E4["Streaming backbone"]
    E4 --> E4a["CDN: Open Connect / OCA"]
    E4 --> E4b["Encoding + per-title + VMAF"]
    E4 --> E4c["ABR + HLS/DASH + DRM"]
    E4 --> E4d["Microservices + resilience"]

    E --> F["F · Evaluation"]
    F --> F1["Offline: NDCG, PR-AUC, Recall@K"]
    F --> F2["Online: A/B + guardrails"]

    F --> G["G · Operations (MLOps)"]
    G --> G1[Monitoring + drift]
    G --> G2[Retraining cadence]
    G --> G3[Canary + rollback + telemetry]
    G3 -.feedback.-> C

    style R fill:#16a34a,color:#fff
    style D fill:#2563eb,color:#fff
    style E fill:#2563eb,color:#fff
    style G fill:#7c3aed,color:#fff


Part II — Build: The Implementation Checklist

From thinking to shipping. Part I explained why the system looks the way it does. Part II turns it into an ordered build plan. The phases map onto the nine steps above — Phase 1 implements Step 4 (data), Phases 2–4 implement Step 5 (the funnel), Phase 6 implements Step 7’s streaming backbone, and so on — but now each item is a checkbox with a reason and each phase has a ✅ Definition-of-Done gate. Work top-to-bottom: the phases are in dependency order.

How to use it. Work top-to-bottom: the phases are in dependency order — you cannot rank candidates you haven’t retrieved, and you can’t retrieve from a feature store you haven’t built. Each phase ends with a ✅ Definition of Done (DoD) gate. Don’t advance until the gate is green. 🟢 = anyone can grasp it; 🔵 = engineering depth; 🟣 = architecture/business judgement.


The Build Order at a Glance

flowchart TD
    P0["Phase 0 · Foundations<br/>scope, metrics, infra-as-code"] --> P1["Phase 1 · Data & Event Backbone<br/>ingestion, schema, labels"]
    P1 --> P2["Phase 2 · Feature Store<br/>offline + online, no skew"]
    P2 --> P3["Phase 3 · Candidate Generation<br/>two-tower + ANN index"]
    P3 --> P4["Phase 4 · Ranking + Re-rank<br/>scoring, filters, guardrails"]
    P4 --> P5["Phase 5 · Serving Layer<br/>API, caching, degradation"]
    P1 --> P6["Phase 6 · Video Backbone<br/>encode, DRM, CDN, ABR"]
    P5 --> P7["Phase 7 · Evaluation<br/>offline metrics + A/B"]
    P6 --> P7
    P7 --> P8["Phase 8 · MLOps & Ops<br/>CI/CD, monitoring, drift"]
    P8 -.continuous feedback.-> P1

    style P0 fill:#16a34a,color:#fff
    style P2 fill:#2563eb,color:#fff
    style P3 fill:#2563eb,color:#fff
    style P4 fill:#2563eb,color:#fff
    style P6 fill:#0891b2,color:#fff
    style P8 fill:#7c3aed,color:#fff

Reasoning for this order. Phases 1–5 are the recommendation spine and must be serial (each depends on the last). Phase 6 (video delivery) runs in parallel off the same event backbone — a separate team can build it while the ML spine matures. Everything converges at evaluation (Phase 7), then operations (Phase 8) closes the loop back to data.


Phase 0 · Foundations — Decide Before You Code

🟢 The goal here is to make the invisible decisions explicit so nobody silently assumes a different system than their teammate.

  • Write down the single objective metric. Why: every later trade-off resolves by appealing to it; “long-term watch / retention,” not raw clicks, prevents clickbait optimization.
  • Convert NFRs into hard numbers. Why: “minimal latency” is undesignable. Pin targets: recommendation p99 < 200ms, video startup < ~2s, availability 99.99%. Latency budgets are constraints you spend, not wishes.
  • Set per-stage latency sub-budgets. Why: the total budget must be partitioned so no stage can blow it: retrieval < 10ms, ranking < 20ms, business rules < 5ms is a common split — tune to your numbers.
  • Choose cloud regions + multi-region strategy. Why: global users need low latency and failover; deciding this late forces painful rewrites.
  • Stand up Infrastructure-as-Code (Terraform) + a Git monorepo/polyrepo policy. Why: reproducible dev/staging/prod from one definition prevents “works on my cluster” drift.
  • Define data-privacy posture (GDPR/CCPA, India DPDP). Why: IP addresses and watch history are PII; retrofitting consent and redaction is far costlier than designing it in.
  • Pick the experimentation platform early. Why: if you can’t A/B test, you can’t prove anything works — this is infrastructure, not an afterthought.

DoD: a one-page “design contract” exists — objective metric, numeric SLOs, latency budget split, region plan, privacy posture — signed off by eng + product.


Phase 1 · Data & Event Backbone — The Real Engine

🟢 Models learn from examples. This phase guarantees we capture good examples and can trust them.

1.1 Ingestion & streaming

  • Stream interaction events through Kafka (plays, pauses, completes, likes, dislikes, shares, searches, skips). Why: these become both real-time signals and training labels; lose them and you’re blind.
  • Define a versioned event schema (with a schema registry). Why: an un-versioned schema silently corrupts downstream features when a field changes meaning.
  • Persist raw events to a data lake (S3 + Parquet/Delta) and a serving DB (Cassandra/Dynamo). Why: the lake keeps full history for training; the operational store serves reads at scale. Different jobs, different stores.
  • Set up batch + stream processing (Spark for batch, Flink for streaming). Why: daily aggregates need batch; “events in the last 5 minutes” need streaming. You need both engines.

1.2 Labels — the subtle part

  • Define positive labels (completed watch, like, re-watch) and negative labels (impression shown, not acted on). Why: implicit feedback is your label source; getting the negative definition wrong biases the whole model.
  • Log impressions, not just actions. Why: you can only learn “what was rejected” if you recorded “what was shown.” Missing impressions = no true negatives.
  • Plan for the feedback loop / exposure bias. Why: you only get labels for items you recommended; without exploration, the model calcifies around what it already shows.

DoD: an event flows end-to-end (client → Kafka → lake + serving store), is queryable, schema-validated, and you can reconstruct “user U was shown items X, acted on Y” for any timestamp.


Phase 2 · Feature Store — One Source of Truth, Zero Skew

🟢 A feature is a number describing the user or item (e.g. “movies watched this month”). The feature store guarantees the same numbers in training and in live serving.

🔵 This is where most ML projects quietly fail. Most ML projects fail not because of a bad model, but because of bad data in production; the feature store ensures the model in production gets the same quality features as during training.

2.1 The offline/online split

  • Build the offline store on the data lake (Delta/Parquet/BigQuery) with point-in-time-correct joins. Why: training on January data must use January feature values, not today’s — point-in-time joins prevent the model from seeing the future, which would inflate offline metrics and lie to you.
  • Build the online store on a low-latency KV store (Redis/DynamoDB/Cassandra). Why: serving needs feature lookups by user ID in milliseconds; latency under 10ms is standard, and for real-time recommendations this is critical.
  • Materialize features from offline → online on a schedule (e.g. Spark→Redis). Why: the online store keeps only the freshest slice for speed, synced from the historical store.

2.2 Guarding consistency

  • Verify train/serve parity for every feature. Why: the classic bug — a feature computed pre-tax in offline training but post-tax in online inference yields silently inaccurate predictions. Diff offline vs online values for the same entity at the same instant.
  • Classify features as batch / streaming / on-demand. Why: “avg watch over 30 days” is batch; “items in this session” is streaming; “distance to nearest CDN” is on-demand. Each needs a different pipeline.
  • Treat feature definitions as code in Git, with unit tests + data-quality checks. Why: a feature is a transformation; untested transformations rot.
  • Measure the real freshness gap (change a source value, time the served update). Why: the vendor’s marketing latency is not your latency; measuring the actual freshness gap is what tells you whether the feature store will meet your SLA.
  • Cache item features aggressively (multi-level: in-process + Redis). Why: item features have low cardinality and loose freshness needs; a feature service can see 100–1000x the QPS of the recommender, so caching item features cuts cost and tail latency dramatically.

DoD: the same feature returns identical values offline and online for a fixed entity + timestamp; online p99 lookup is within budget; feature definitions live in Git with tests.


Phase 3 · Candidate Generation — Retrieve Broadly, Cheaply

🟢 From millions of titles, instantly grab the ~1,000 most promising. Better to over-include than miss a gem (this stage chases recall).

flowchart LR
    U[User embedding] --> ANN
    subgraph Retrieval
        ANN["ANN search (FAISS/ScaNN/Milvus)"]
        CF[Collaborative filtering]
        CB[Content-based]
    end
    ANN --> M[Union + dedupe]
    CF --> M
    CB --> M
    M --> OUT["~1,000 candidates"]
    IDX[(Precomputed item<br/>embedding index)] -.serves.-> ANN
    style ANN fill:#16a34a,color:#fff
    style OUT fill:#16a34a,color:#fff
  • Train a two-tower model; precompute + index all item embeddings offline. Why: item embeddings rarely change minute-to-minute, so computing them offline removes the cost from the request path — running expensive deep two-tower or LLM-embedding models offline mitigates latency, and at serving time you just query the precomputed embeddings rapidly.
  • Stand up an ANN index (FAISS / ScaNN / Milvus / Redis vector). Why: exact nearest-neighbor over millions of items is too slow; ANN algorithms make candidate retrieval feasible at < 10ms.
  • Tune the recall/latency trade-off explicitly and benchmark on your data. Why: ANN trades precision for speed — e.g. ~90% precision at ~200ms median vs ~95% at ~1.3s on a billion vectors; you must benchmark with your own data and budget.
  • Run multiple retrieval strategies in parallel and union the results. Why: collaborative filtering gives high-quality candidates for known users but suffers cold-start; combining and unioning approaches covers each other’s blind spots.
  • Pick the embedding dimension deliberately (512–1024 is a common sweet spot). Why: moderate dimensions deliver the best balance of storage, query speed, and quality.

DoD: a user ID returns ~1,000 deduped candidates within the retrieval sub-budget; recall@K is measured against a held-out set; cold-start users get a sensible fallback set.


Phase 4 · Filtering, Ranking & Re-ranking — Be Precise, Then Be Safe

🟢 Now score just those ~1,000 candidates carefully (chasing precision), remove anything ineligible, and arrange the final list to serve the business, not only the model.

🔵 A clean way to structure this is the four-stage pattern: retrieval → filtering → scoring → ordering. NVIDIA Merlin and others use exactly this: retrieve a relevant set, filter unwanted items, score interest, then order to align with business constraints.

4.1 Filtering (guardrails first)

  • Apply hard filters: regional licensing, age/maturity, already-watched, removed titles. Why: these are correctness/compliance rules a model should never be trusted to learn — filters act as guardrails against bad user experience.
  • Add a “never empty results” safeguard. Why: a single misbehaving filter (or an upstream data outage behind one) can filter out all candidates and cause an outage; cap and monitor filter removal rates.
  • Split filters into offline (rarely change) vs online (frequently change). Why: static rules like geo-eligibility run offline; dynamic ones like moderation flags run online — matching cost to volatility.

4.2 Ranking

  • Serve a heavier ranking model (DNN / gradient-boosted tree) on the surviving candidates. Why: because it only scores a few hundred items, it can afford to be expensive and precise — the funnel earned this budget.
  • Prefer multi-task learning (predict watch, completion, like together). Why: consolidating signals into one shared multi-task model improves performance and simplifies the architecture.
  • Deploy on a real inference server (Triton / TF-Serving / TorchServe), not a hand-rolled loop. Why: purpose-built servers give batching, GPU sharing, and heterogeneous-model support out of the box.

4.3 Re-ranking / ordering

  • Inject diversity (don’t show five thrillers in a row), freshness, and exploration. Why: diversification ensures variety; pure model-score ordering collapses into monotony and starves new content.
  • Personalize per-row and per-artwork. Why: Netflix generates each home-row by a different algorithm and even personalizes thumbnails — same title, different image per taste — which lifts engagement well beyond list order alone.

DoD: end-to-end retrieval→filter→rank→order returns an ordered, policy-compliant list within total budget; a chaos test (kill the feature service) degrades gracefully instead of returning empty.


Phase 5 · Serving Layer — Fast, Cached, Unbreakable

🟢 The plumbing that takes a request and returns rows quickly — and never shows a blank screen.

sequenceDiagram
    actor User
    participant API as Recommendation API
    participant Cache as Result/Feature Cache (Redis)
    participant Ret as Retrieval (ANN)
    participant Rank as Ranking (Triton)
    participant FB as Fallback (popular/cached)

    User->>API: GET /home (userId, context)
    API->>Cache: lookup cached rows / features
    alt cache hit (anonymous or warm user)
        Cache-->>API: cached rows
    else cache miss
        API->>Ret: retrieve candidates
        Ret-->>API: ~1000 candidates
        API->>Rank: score + re-rank
        Rank-->>API: ranked rows
    end
    alt any component times out / errors
        API->>FB: fetch popular / last-known-good
        FB-->>API: safe fallback rows
    end
    API-->>User: personalized (or safe) home screen
  • Expose a stateless recommendation API behind a load balancer; auto-scale on latency + CPU. Why: stateless pods scale horizontally for traffic spikes (think a major launch) without sticky-session pain.
  • Cache at multiple levels: result cache for anonymous/cold sessions, feature cache for hot items. Why: caching recommendation results for anonymous users and popular embeddings, with TTLs tied to update frequency, slashes load.
  • Implement graceful degradation with explicit fallbacks. Why: when real-time systems fail, fall back to cached recommendations, popular items, or rules — never show empty results.
  • Add circuit breakers, timeouts, and bulkheads between services. Why: these isolate a failing dependency so one slow service can’t exhaust resources and cascade into a full outage.
  • Build observability in from day one (metrics, logs, traces). Why: you can’t improve what you can’t measure — observability is foundational, not a later add-on.

DoD: p99 within budget under load test; killing any single downstream still returns non-empty rows; dashboards show latency, error rate, cache hit ratio live.


Phase 6 · Video Backbone — Encode, Protect, Deliver (Parallel Track)

🟢 Recommending a title is useless if it won’t play instantly. This track (built in parallel by a separate team) ingests masters, encodes many versions, protects them, and serves from servers near the user.

flowchart LR
    ING["Ingest master (Mezzanine)"] --> QA["Per-shot complexity analysis (VMAF)"]
    QA --> ENC["Encode ladder (240p->4K, H.264/265/VP9/AV1)"]
    ENC --> SEG["Segment 2-10s chunks"]
    SEG --> PKG["Package HLS/DASH + DRM"]
    PKG --> DIST["Distribute to CDN edge (OCAs)"]
    DIST --> PLAY["Client ABR playback"]
    style ENC fill:#0891b2,color:#fff
    style DIST fill:#0891b2,color:#fff
  • Build the ingest → per-title/per-shot analysis → encode → segment → package pipeline. Why: one master becomes ~1,200 renditions; per-title analysis allocates bitrate by content complexity (animation needs less than action) for the same visual quality.
  • Use VMAF as the quality gate in encoding. Why: it’s an objective, ML-based quality score that lets you set quality targets rather than guessing bitrates.
  • Package to HLS/DASH and wrap with DRM (Widevine/FairPlay/PlayReady). Why: adaptive protocols enable quality switching; DRM is contractually required by studios.
  • Stand up CDN edge delivery (own OCAs or a commercial CDN to start). Why: caching popular content inside ISP networks minimizes transit, latency, and rebuffering. Build-vs-buy: buy a CDN early; build your own only at extreme scale.
  • Implement Adaptive Bitrate (ABR) on the client. Why: it measures bandwidth + buffer and up/down-shifts quality smoothly to avoid stalls when the network dips.
  • Serve tokenized, signed CDN URLs + select nearest healthy edge with fallback. Why: signed URLs prevent hotlinking; nearest-healthy-node selection with origin-shield fallback keeps playback resilient.
  • Rotate TLS certs and DRM keys without interrupting playback. Why: certs, DRM keys, and tokens must rotate via automated vault-backed workflows so a key roll never breaks a live stream.
  • Monitor QoE metrics: rebuffer ratio, startup time, CDN hit/miss, edge 4xx/5xx. Why: target hit ratio > 85% for VOD and edge error rate < 0.2%; if QoE regresses during a rollout, auto-rollback.

DoD: a new title ingests, encodes, packages with DRM, and plays on web/iOS/Android/TV with startup < 2s and smooth ABR switching; QoE dashboards are live.


Phase 7 · Evaluation — Prove It Twice

🟢 Check the model in the lab on past data, then on real users in a controlled live test. Lab-good can still be live-bad.

  • Compute offline ranking metrics: Recall@K, Precision@K, NDCG, AUC. Why: they’re cheap and catch gross regressions before users ever see them.
  • Use PR-AUC, not ROC-AUC, under heavy class imbalance. Why: with rare positives (most items aren’t watched), ROC-AUC looks deceptively good; PR-AUC tells the truth.
  • Run online A/B tests as the real verdict, with guardrail metrics alongside the primary. Why: production deployment of a new model goes through A/B testing before serving all traffic; guardrails (latency, error rate, complaints) ensure a “win” isn’t secretly harming experience.
  • Tie every experiment back to the Phase-0 objective, not the easy proxy. Why: optimizing clicks while claiming to optimize retention is how products quietly degrade.
  • Consider shadow deployment for risky models. Why: shadow deployments run the new model on real production requests without affecting responses — free real-world validation at zero user risk.

DoD: offline gate thresholds defined and met; at least one A/B test shows a statistically significant lift on the true objective with no guardrail regression.


Phase 8 · MLOps & Operations — Close the Loop

🟢 Tastes change and new shows arrive; a model left alone goes stale. This phase keeps it fresh, safe to update, and watched.

🔵 Traditional CI/CD only verifies deterministic code — models behave differently each retrain — so pipelines must add automated quality guardrails that reject candidates failing accuracy, bias, or drift checks.

8.1 CI/CD for ML

  • Pipeline: lint → validate data schema → retrain → eval-gate → register artifact. Why: the winning artifact lands in a registry only after passing evaluation, replacing manual packaging and cutting release cycles from weeks to hours.
  • Add data-validation gates before expensive retraining. Why: validation gates prevent corrupted datasets from triggering costly retraining runs.
  • Version everything: data, features, models, configs. Why: reproducibility and lineage are what let you debug “why did quality drop on Tuesday.”

8.2 Safe rollout

  • Roll out via canary (1% → 10% → 100%) with automatic SLO checks and auto-rollback. Why: canary serves a small slice first, auto-rolls-back on degraded metrics; Amazon’s canaries prevented dozens of customer-impacting model failures.
  • Keep blue-green capability for instant reverts. Why: blue-green flips traffic atomically and reverts by just switching the load balancer back — near-zero-downtime rollback.

8.3 Monitoring & retraining

  • Monitor data drift (input distribution shift) and concept drift (prediction quality decay). Why: a model flawless offline can drift silently once exposed to real traffic, and CPU/memory dashboards won’t catch it — you must watch feature distributions.
  • Trigger retraining on drift or on a schedule. Why: trigger retraining when new data lands so the model tracks the changing environment.
  • Monitor filter behavior over time, not just models. Why: filters drift with data too and need re-tuning; an unmonitored filter can quietly start removing too much.
  • Wire telemetry + automated canary analysis (e.g. Atlas-style metrics + Kayenta-style analysis). Why: automated canary analysis removes the human bottleneck of staring at dashboards to decide promotion.
flowchart LR
    Commit[Code/data commit] --> CI["CI: lint + data-validate + retrain + eval-gate"]
    CI -->|pass| Reg[(Model registry)]
    Reg --> Canary["Canary 1% -> 10% -> 100%"]
    Canary -->|SLO ok| Prod[Full production]
    Canary -->|regression| RB[Auto-rollback]
    Prod --> Mon["Monitor: drift, QoE, business KPIs"]
    Mon -->|drift| Commit
    style CI fill:#7c3aed,color:#fff
    style Canary fill:#2563eb,color:#fff
    style Mon fill:#16a34a,color:#fff

DoD: a model change flows commit → eval-gate → canary → prod automatically; a deliberately bad model is auto-rejected or auto-rolled-back; drift alerts fire on injected distribution shifts.


Cross-Cutting Checklist (Applies to Every Phase)

These don’t belong to one phase — neglecting them anywhere causes failures everywhere.

ConcernItemWhy it matters
SecurityAuthn/authz on every service; signed CDN URLs; secret rotationA streaming platform is a high-value target; one leaked DRM key or open endpoint is catastrophic
PrivacyConsent capture, PII redaction in logs, regional data residencyIP + watch history are PII under GDPR/CCPA/DPDP; redaction must be in the pipeline, not bolted on
CostTrack GPU-hours, CDN egress, vector-index RAMEgress is often the #1 cost line for streaming; idle GPU clusters drain budgets fast
ReliabilityMulti-region failover, chaos testingChaos engineering (randomly killing instances) proves recovery before a real outage tests it for you
DocumentationFeature ownership, runbooks, decision log“What does this feature mean and who owns it” prevents silent technical debt
AccessibilityCaptions, audio descriptions, WCAG-compliant UIBoth a legal requirement and a large audience; bake into the encode pipeline and client


TL;DR — The Whole Guide on One Page

Part I — the thinking (9 steps):

  1. Clarify functional + non-functional requirements; turn vague NFRs into numbers.
  2. Frame as a ranking problem; optimize long-term watch/retention, not raw clicks.
  3. Estimate scale — the “can’t-score-everything” math forces the funnel.
  4. Design data — sources, implicit labels, features, a feature store; mind the feedback loop.
  5. Two-stage funnel — Candidate Generation (recall, two-tower + ANN) → Ranking (precision, DNN/GBT/multi-task) → Re-rank (diversity + business rules).
  6. Train offline + online; plan cold-start fallbacks and exploration.
  7. Serve the recommender and the video — CDN, per-title encoding, ABR, microservices, resilience.
  8. Evaluate offline metrics then A/B test online with guardrails.
  9. Operate — monitor, detect drift, retrain, canary, rollback.

Part II — the building (9 phases, dependency-ordered):

  • P0 Foundations → signed design contract: objective metric + numeric SLOs + latency budget + privacy posture.
  • P1 Data backbone → Kafka → lake + serving store; implicit labels; log impressions; plan the feedback loop.
  • P2 Feature store → offline (point-in-time) + online (<10ms); prove zero train/serve skew; cache item features.
  • P3 Candidate generation → two-tower + offline-indexed embeddings + ANN; union strategies; chase recall.
  • P4 Filter→Rank→Re-rank → guardrails (never-empty) → multi-task DNN/GBT on a real inference server → diversity/freshness/per-artwork.
  • P5 Serving → stateless cached API + graceful degradation + circuit breakers + observability.
  • P6 Video backbone (parallel) → per-title encode + VMAF + HLS/DASH + DRM + CDN + ABR + QoE.
  • P7 Evaluation → offline (NDCG, PR-AUC) → online A/B with guardrails → tie to the true objective.
  • P8 MLOps → eval-gated CI/CD → canary + blue-green + auto-rollback → drift monitoring → retraining loop.
  • Cross-cutting → security, privacy, cost, reliability, docs, accessibility — everywhere, always.

The two sentences to keep: (Design) You can’t run a heavy model over millions of items per request — so retrieve broadly and cheaply, rank narrowly and precisely, then re-rank for the business. (Build) Build in dependency order, prove zero train/serve skew before anything else, never return an empty screen, and let no model reach 100% of traffic without an eval gate and a one-click rollback.


References

  1. System Design Handbook — Recommendation System Design (Step-by-Step)
  2. Interview Kickstart — ML System Design Interview Guide (2026)
  3. Design Gurus — AI/ML System Design Interview Roadmap (2026)
  4. Shaped.ai — Netflix Personalization, Recommendations & Search Workshop 2025
  5. Netflix Tech Blog — Lessons from Consolidating ML Models
  6. techinterview.org — System Design: Video Streaming (Netflix)
  7. SWE Helper — Design Netflix: A Video Streaming Architecture
  8. Talent500 — Netflix Architecture Deep Dive
  9. VdoCipher — Netflix Tech Stack: Open Connect & Microservices
  10. GeeksforGeeks — System Design Netflix
  11. Figures (subscriber counts, encoding variants, latency targets) are industry-reported and approximate — treat them as order-of-magnitude design inputs, not exact specs.

  12. Redis — Real-Time AI Recommendation Systems (2026)
  13. Databricks — What is a Feature Store?
  14. Databricks — How to Build an Online Recommendation System
  15. CORE Systems — Feature Store: Key Infrastructure for ML in Production (2026)
  16. Tacnode — How to Evaluate a Feature Store (2026)
  17. The ML Architect — Recommendation Systems: An Architect’s Playbook
  18. NVIDIA — Offline-to-Online Feature Storage with Merlin
  19. Appit Software — Building Real-Time Recommendation Engines
  20. Introl — MLOps Infrastructure: CI/CD Pipelines
  21. Galileo — The MLOps Guide to Production Success
  22. Google Cloud — MLOps: Continuous Delivery & Automation Pipelines
  23. BlazingCDN — CDN Integration with CI/CD for Media Apps
  24. Latency numbers, hit-ratio targets, and dimension ranges are industry-reported guidance — benchmark against your own data and SLAs before committing.

Figures (subscriber counts, encoding variants, latency targets, hit-ratio and dimension ranges) are industry-reported and approximate — treat them as order-of-magnitude design inputs and benchmark against your own data and SLAs.

This post is licensed under CC BY 4.0 by the author.