As of February 27, 2026, Stanford’s CS336 site labels the current term as Spring 2026, but that page is marked as under active construction and lists a tentative schedule.
This post summarizes the most recent complete offering, which is Spring 2025 (archived).
What This Course Is Actually About
CS336 is less “learn transformer equations” and more “learn to engineer language models under hard resource constraints.”
The organizing principle across the quarter is:
Maximize model quality under limited compute, memory, communication bandwidth, and data quality.
In Spring 2025, that objective is developed lecture by lecture, then forced into practice through assignments.
Lecture-by-Lecture Long Summary (Spring 2025)
The complete offering runs from April 1, 2025 to June 6, 2025.
Below is a long, technical summary of each scheduled lecture.
Lecture 1: Overview and Tokenization
The opening lecture frames the core motivation for CS336: frontier models are increasingly “industrialized” and opaque, while researchers still need mechanistic understanding to do serious work.
The lecture distinguishes three kinds of knowledge:
- Mechanics (what components do)
- Mindset (optimize efficiency under scale)
- Intuition (which design choices actually work)
Key framing arguments:
- “Scale alone” is a bad interpretation of progress; scalable algorithms matter.
- Smaller (<1B) course models are imperfect proxies for frontier systems, but they still teach transferable mechanics.
- The right mental model is to optimize accuracy = efficiency x resources.
The lecture also surveys LM history (n-grams -> neural LM -> seq2seq -> attention -> transformer -> scaling era -> open models), then maps out the full course stack: tokenization, architecture, training, kernels, distributed systems, inference, scaling laws, data curation, evaluation, and alignment.
Lecture 2: PyTorch Primitives and Resource Accounting
This lecture is a bottom-up systems tutorial for training loops:
- Tensor basics, storage, views, strides, and contiguous layouts
- Float formats (
fp32,fp16,bf16,fp8) and stability/range trade-offs - FLOPs accounting for linear layers and backprop
- Model FLOPs utilization (MFU) and throughput interpretation
- Parameter initialization and training stability intuition
- Data loading/memmap/pinned memory
- Optimizer mechanics (Adam-family lineage) and loop structure
- Mixed precision and practical checkpointing
The core habit it tries to build: do rough accounting before coding expensive runs.
Instead of hand-wavy “this model is big,” students are trained to ask:
- How much memory does each state tensor need?
- Where do forward/backward FLOPs concentrate?
- Which dtype decisions change stability vs speed?
Lecture 3: Architectures and Hyperparameters
Lecture 3 moves from canonical transformers to modern LM design conventions.
Major architecture topics:
- Pre-norm vs post-norm: why pre-norm became dominant
- LayerNorm vs RMSNorm: runtime/memory movement considerations
- Bias removal in many modern stacks
- FFN variants: ReLU/GeLU vs gated variants (SwiGLU/GeGLU)
- Serial vs parallel block variants
- Positional schemes: absolute/relative/RoPE
One major takeaway is methodological: architecture choices are often empirical and ecosystem-driven, but there are still recurring patterns (for example, pre-norm + RMSNorm + SwiGLU-like stacks in many recent open models).
Lecture 4: Mixture of Experts (MoE)
This lecture explains why sparse MoEs became mainstream:
- Higher parameter capacity at similar active FLOPs
- Strong open-model results at large scale
- Good fit for multi-device parallelism
Detailed topics include:
- Routing families (top-k token choice, hashing, assignment variants)
- Expert granularity and shared-expert patterns
- Load balancing heuristics and auxiliary losses
- Training stability issues (router behavior, precision choices)
- Systems concerns (sparse kernels, communication patterns)
- Upcycling dense checkpoints into MoE models
The lecture uses contemporary model families (including DeepSeek/Qwen-era designs) to show how “paper MoE” differs from production MoE.
Lecture 5: GPUs
Lecture 5 is the hardware foundations lecture:
- GPU vs CPU execution model (throughput-oriented SIMT)
- SMs, warps, memory hierarchy, and bandwidth constraints
- Why matmuls dominate and why memory still bottlenecks end-to-end speed
Optimization concepts covered:
- Roofline intuition
- Control divergence
- Precision effects on arithmetic intensity
- Operator fusion
- Recomputation trade-offs
- Coalesced memory access
- Tiling and wave quantization effects
The lecture closes by connecting these ideas to attention acceleration (FlashAttention-style reasoning), preparing students for kernel-level optimization work.
Lecture 6: Kernels and Triton
This is the practical kernel engineering lecture.
Workflow emphasized:
- Benchmark end-to-end behavior.
- Profile to find bottlenecks/kernels.
- Inspect lower-level behavior (including PTX-level clues).
- Re-implement critical paths (CUDA/Triton/compiled variants).
Key technical segments:
- Profiling matrix ops and MLP traces
- Kernel fusion demonstrations (e.g., GeLU variants)
- Writing CUDA extensions and validating correctness
- Writing Triton kernels and interpreting generated code
- Softmax and matmul optimization patterns
- Shared-memory tiling and L2-aware traversal behavior
This lecture is the bridge between “I know the algorithm” and “I can make it fast on real hardware.”
Lecture 7: Parallelism Basics
Lecture 7 is the conceptual map of large-scale training.
It starts with communication primitives (all-reduce, reduce-scatter, all-gather, etc.) and then compares parallelization strategies:
- Naive data parallelism
- ZeRO stages 1-3 / FSDP-style sharding
- Pipeline parallelism
- Tensor parallelism
- Sequence/activation parallel approaches
Important engineering points:
- Why naive DP is memory-inefficient
- Where ZeRO gives “almost free” memory wins vs added comm complexity
- Why pipeline bubbles depend heavily on microbatching
- Why tensor parallelism is typically intra-node (fast interconnect)
- Why activation memory remains a separate scaling challenge
Lecture 8: Distributed Training in PyTorch
Lecture 8 operationalizes distributed concepts in code:
- Collective primitives in
torch.distributed - NCCL/backend behavior
- Hardware topology constraints (NVLink/NVSwitch context)
- Communication benchmarking
- Bare-bones implementations of data/tensor/pipeline parallel workflows
This lecture emphasizes that distributed strategy is fundamentally communication scheduling plus sharding decisions.
It also reinforces a recurring course theme: simple abstractions hide real costs, so you must measure on your hardware.
Lecture 9: Scaling Laws Basics
Lecture 9 introduces scaling laws as practical planning tools, not just curves in papers.
Core topics:
- Historical context from sample-complexity ideas to neural scaling observations
- Power-law behavior for loss vs data/model/compute
- Why exponent estimation matters
- Hyperparameter scaling questions (batch, LR, depth/width tradeoffs)
- Joint data-model scaling and compute-optimal decision-making
- Chinchilla-style methods (lower envelope, IsoFLOPs, joint fits)
Practical lesson: use small-to-medium runs to predict large-run choices instead of tuning directly at full scale.
Lecture 10: Inference
Lecture 10 is a strong systems lecture on serving and decoding.
Main split:
- Prefill: parallelizable and usually more compute-friendly
- Decode/generation: sequential and typically memory-limited
Topics covered:
- Latency/throughput/TTFT trade-offs
- KV cache accounting and why decode becomes memory-bound
- Architecture-level cache-reduction methods:
- GQA
- MLA
- cross-layer KV sharing
- local/hybrid attention
- Quantization and pruning/distillation paths
- Speculative decoding (draft/target verification framing)
- Continuous/selective batching under dynamic request shapes
The key systems insight is that inference optimization is not one trick; it is a layered stack of architecture, caching, precision, and scheduling choices.
Lecture 11: Scaling Details and Case Studies
Lecture 11 moves from scaling-law theory to real model recipes.
Case studies discussed include public scaling disclosures from modern model families (e.g., CerebrasGPT, MiniCPM, DeepSeek-style analyses), with emphasis on:
- Practical hyperparameter scaling procedures
- Batch/LR scaling estimation
- Data-to-model ratio selection
- Compute-saving schedule choices (e.g., warmup-stable-decay variants)
- Comparing scaling-fit methods in practice
It also dives into muP (maximum update parameterization) concepts and why scale-invariant hyperparameter transfer is attractive (and where it can fail or need adaptation in modern stacks).
Lecture 12: Evaluation
Lecture 12 is a comprehensive “how to evaluate responsibly” framework.
It argues that evaluation is task-dependent and must specify:
- Inputs and coverage
- Model-calling protocol (prompting/tools/system setup)
- Output scoring methodology
- Interpretation assumptions
Benchmark families discussed include:
- Knowledge-style tests (MMLU, MMLU-Pro, GPQA, etc.)
- Instruction-following and preference-style evaluations (Arena-like setups, IFEval, AlpacaEval, WildBench)
- Agentic benchmarks (SWE-bench/CyBench/MLE-bench category)
- Safety/red-team evaluation styles
Additional focus areas:
- Realism vs benchmark convenience
- Train-test overlap and contamination concerns
- “Method vs system” evaluation ambiguity
Lecture 13: Data (Sources and Governance)
Lecture 13 broadens from “how to train” to “what to train on.”
Major content:
- Data lifecycle: live service -> dump/crawl -> processed corpus -> aggregated dataset
- Training stages:
- pretraining
- mid-training
- post-training
- Historical dataset tour (BERT/GPT-era through newer open-data pipelines)
- Common Crawl ecosystem and conversion/filtering concerns
- Synthetic data and distillation roles
It also includes legal/compliance framing:
- Copyright scope and constraints
- Licensing realities
- Fair use factors and ambiguity
- Terms-of-service constraints beyond copyright
The lecture’s central point is that data quality, provenance, and policy constraints are core model-quality variables.
Lecture 14: Data Processing (Filtering and Deduplication Mechanics)
Lecture 14 is algorithmic and implementation-focused.
Filtering toolkit:
- n-gram/KenLM-style scoring
- fastText-style classifiers
- distributional/importance-resampling ideas (DSIR framing)
Applications:
- Language identification
- Quality filtering
- Toxicity/harm filters
- Task/domain targeting (e.g., math-heavy corpora)
Deduplication mechanics:
- Exact hashing workflows
- Bloom filters for approximate membership
- MinHash + LSH for near-duplicate detection under Jaccard similarity
This lecture makes explicit that filtering is an optimization problem over quality, diversity, coverage, and compute budget.
Lecture 15: Alignment via SFT and RLHF
Lecture 15 transitions from pretraining behavior to instruction-following/aligned behavior.
Part 1 (SFT emphasis):
- What instruction datasets contain in practice
- Style effects (length/tone/bulleting) and how they influence model preference metrics
- Safety tuning with relatively small targeted data
- Why factual injection via finetuning can have non-trivial side effects
- Mid-training/instruction-mix strategies to scale alignment without catastrophic forgetting
Part 2 (RLHF setup):
- Why preference optimization is used after SFT
- Annotation pipeline realities (crowdsourcing quality, demographic effects, label consistency)
- Human-vs-AI feedback trade-offs in cost/scale
- PPO-era RLHF pipeline vs DPO simplifications
- Known failure modes (reward overoptimization, calibration/mode-collapse issues)
Lecture 16: RL from Verifiable Rewards (RLVR)
Lecture 16 continues DPO/PPO framing, then focuses on reasoning-era RL:
- PPO recap and complexity
- GRPO motivation (remove value function, use group-relative normalization)
- Objective behavior and known caveats (including normalization-induced quirks)
- Length-control concerns in reasoning training
The lecture then surveys successful open reasoning pipelines (R1/Kimi/Qwen-era trends) and highlights what repeatedly appears in working recipes:
- Strong SFT bootstrap
- Difficulty-aware data filtering
- Verifiable reward design
- RL loops with careful systems handling (rollout efficiency, uneven sequence lengths)
- Distillation from strong reasoning traces
Lecture 17: Policy Gradient Mechanics
Lecture 17 gives the mathematical and practical mechanics behind reasoning RL updates:
- Policy-gradient derivation for sequence generation
- Sparse reward variance issues
- Baselines and advantage intuition
- Group-relative normalization logic
- KL-regularized update intuition
- Toy experimental walkthrough showing how reward shaping choices alter training behavior
It is the “close the loop” lecture that makes A5-style algorithms legible from first principles rather than just library calls.
Lecture 18: Guest Lecture (Junyang Lin)
The Spring 2025 schedule lists this as a guest lecture; publicly linked detailed slides/materials are not attached on the archive page.
So, for this post, the concrete lecture-level summary is limited to the schedule metadata.
Lecture 19: Guest Lecture (Mike Lewis)
Likewise, the final guest lecture is listed in the schedule without a public lecture artifact linked directly from the archive table.
Summary here is therefore constrained to official schedule-level information.
Assignment Arc (How the Lectures Land in Code)
The lecture progression maps directly onto the five assignments:
- A1 Basics: tokenizer + transformer + optimizer + loop/checkpointing
- A2 Systems: profiling, FlashAttention-style optimization, DDP, optimizer sharding
- A3 Scaling: fit scaling laws and choose compute-optimal model setup under budget
- A4 Data: crawl-processing filters, PII/harm filtering, dedup, train/evaluate filtered corpus
- A5 Alignment/Reasoning RL: SFT + expert iteration + GRPO-style methods on reasoning tasks
Optional supplement: safety/instruction tuning + DPO-like preference optimization
Together they force students to integrate theory, infra, and evaluation in one continuous engineering pipeline.
Why This Offering Is Valuable
Spring 2025 CS336 is unusually strong because it does not separate “ML ideas” from “systems reality.”
- Architecture choices are taught alongside memory/throughput implications.
- Parallelism is taught with concrete communication primitives and failure modes.
- Scaling laws are treated as planning tools tied to actual budgets.
- Data quality and legal constraints are treated as first-class technical constraints.
- Alignment methods are taught with implementation details and known failure patterns.
That combination is exactly what most LLM work in practice demands.
Sources (Official Materials)
- Archived Spring 2025 course website (schedule, lecture links, assignments):
https://cs336.stanford.edu/spring2025/ - Current CS336 site (Spring 2026 page under active construction):
https://cs336.stanford.edu/ - Spring 2025 lecture materials repository:
https://github.com/stanford-cs336/spring2025-lectures - Assignment 1 (Basics):
https://github.com/stanford-cs336/assignment1-basics - Assignment 2 (Systems):
https://github.com/stanford-cs336/assignment2-systems - Assignment 3 (Scaling):
https://github.com/stanford-cs336/assignment3-scaling - Assignment 4 (Data):
https://github.com/stanford-cs336/assignment4-data - Assignment 5 (Alignment and Reasoning RL, including optional supplement):
https://github.com/stanford-cs336/assignment5-alignment