Module 2 · Model Mechanism

Attention & the Transformer

How the model actually uses the context you give it — selectively, at a price. Conceptual throughout; the linear algebra is linked, never required.

App dev · must know | Enthusiast · nice to know | Math · linked, not required

Gold = load-bearing for building on top of LLMs. Purple = deepens intuition but you can ship without it. Blue = the underlying mathematics, linked for the curious — you are never expected to derive it.

2.1 Self-attention (intuition)

O(n²) cost | what shares context | in-context learning | the QKV equation

The mechanism. For each token being produced, attention does a relevance-weighted lookup over every token already in the context. Three steps:

1. SCORE     for the current position, compute one relevance number
             per previous token ("how much should I look at it?")
2. NORMALIZE push those scores through softmax → weights summing to 1
3. BLEND     build this token's context as a weighted sum of all the
             other tokens' content — relevant ones dominate, the rest ~0

While generating "dose", the token "5mg" two thousand positions back can score high and reach forward; the thousands of filler tokens score ~0 and are effectively ignored. The mechanism itself is indifferent to distance; relevance is what counts. This is the capability that makes a long prompt usable, and it is the heart of the transformer.
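
The three steps can be sketched in a few lines of plain Python. This is a toy illustration, not how a production model is implemented: the vectors are made up by hand (real ones are learned), there is a single attention head, and the names `attend`, `keys`, and `values` are illustrative only.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """One attention lookup for a single position.

    query:  vector for the current position
    keys:   one vector per visible token (what each token "advertises")
    values: one vector per visible token (the content each token carries)
    """
    scale = math.sqrt(len(query))
    # 1. SCORE: one relevance number per visible token
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # 2. NORMALIZE: softmax turns the scores into weights summing to 1
    weights = softmax(scores)
    # 3. BLEND: weighted sum of the value vectors
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Toy data (invented): token 0 is the relevant one, tokens 1-3 are filler.
keys = [[4.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.0, 0.0]]
values = [[5.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
query = [4.0, 0.0]

weights, blended = attend(query, keys, values)
print(weights)   # almost all of the weight lands on token 0
print(blended)   # so the blend is almost exactly token 0's content
```

Note that nothing in `attend` looks at where a token sits in the sequence; only the match between query and key decides the weight.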

What an application developer must know

Attention is all-pairs, so cost is quadratic — O(n²). Every token scores against every other, for every token. Double the prompt → ~4× the attention work. 10× the prompt → ~100×. This is why long context is expensive and slow, and why context windows are bounded.
Your bill is linear; your attention compute is quadratic. Don't confuse them. APIs charge per token, so prepending 20k tokens of context costs roughly 200× a 100-token call on the invoice (linear in tokens). But the attention compute is n², so that same call is on the order of 40,000× the pairwise work, felt as prefill latency and GPU memory rather than on the invoice. Per-token pricing is a deliberate linear abstraction the provider lays over their quadratic compute. So: model cost by counting tokens; model latency and capacity by remembering the n².
Every design decision reduces to one question: what must share a single context, and what can be split? Tokens that need to attend to each other must be in the same call; unrelated tokens bundled together just pay n² for nothing. (Your "single responsibility per prompt" instinct is exactly right — refined to "single set of things that must interact.")
RAG is the engineered answer to the n² bill. Instead of stuffing 50 pages into one quadratic context, retrieve only the few chunks that must share context with the question — short, cheap, and the relevant info lands near the top.
In-context learning (ICL) is a free superpower. Because attention can pull from anything in the prompt, the model can follow a definition, mimic 3 examples, or adopt a format you give it with no training. Few-shot prompting works precisely because attention reads your examples at generation time.
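
The linear-bill versus quadratic-work arithmetic above can be checked directly. This is a back-of-envelope sketch under stated assumptions: it counts n² pairwise scores only, ignoring the constant factor from causal masking and every non-attention cost, and the function names are made up for illustration.

```python
def token_ratio(long_tokens, short_tokens):
    """Invoice-style comparison: per-token pricing scales linearly."""
    return long_tokens / short_tokens

def pairwise_ratio(long_tokens, short_tokens):
    """Attention-style comparison: all-pairs scoring scales with n squared."""
    return (long_tokens ** 2) / (short_tokens ** 2)

def pairwise_scores(context_lengths):
    """Total pairwise scores across independent calls (no cross-call attention)."""
    return sum(n * n for n in context_lengths)

bill = token_ratio(20_000, 100)          # 200.0
work = pairwise_ratio(20_000, 100)       # 40000.0

one_call = pairwise_scores([1_000])      # 1_000_000
ten_calls = pairwise_scores([100] * 10)  # 100_000

print(bill, work, one_call / ten_calls)
```

Splitting one context into ten cuts the pairwise work tenfold, but only tokens inside the same call can attend to each other, which is the "what must share a context?" trade-off.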

The math, linked (for curiosity — not required)

The three steps together are softmax(QKᵀ/√d)·V, built from Query, Key, and Value projections of each token: QKᵀ/√d is the score, softmax is the normalize, and multiplying by V is the blend. You will never need to derive this to debug behavior; treat Q/K/V as "learned ways of asking, advertising, and carrying content." If curious: Vaswani et al., 2017, "Attention Is All You Need" (the founding paper), and the gentlest visual unpack, Jay Alammar's "The Illustrated Transformer".
3Blue1Brown — "Attention in transformers, visually explained" builds the QKV intuition without drowning you. The Annotated Transformer (Harvard NLP) is the paper as runnable code.

For the enthusiast

Faster attention is a whole research frontier: FlashAttention (exact and still O(n²) compute, but IO-aware and much faster in practice), sliding-window / sparse attention, and linear-attention variants (the genuinely sub-quadratic ones). Most production speedups you hear about live here.
Why ICL emerges at all is genuinely deep — see mechanistic-interpretability work on "induction heads" (Olsson et al., Anthropic, 2022).

Exercises

1. A 1,000-token prompt vs ten separate 100-token prompts — same total text. Which is cheaper on attention, by roughly how much?

Ten short ones, ~10× cheaper. One context: 1000² = 1,000,000 pairwise scores. Ten contexts: 10 × 100² = 100,000. The catch: the ten can't attend to each other, so you lose any cross-chunk relationships.

2. You have 50 docs; the user's question relates to 2. Naively you paste all 50. Diagnose and fix.

Problem: you pay n² over a huge context, most of it irrelevant, and risk lost-in-the-middle (§2.3). Fix: retrieve the ~2 relevant docs and put only those in context (RAG). Short, cheap, relevant.

3. Why can a model use a glossary you pasted earlier in the same prompt, with zero fine-tuning?

In-context learning: when generating, attention can score-and-pull the glossary tokens as relevant. The "knowledge" lives in the context, not the weights — so it works immediately and disappears when the prompt ends.

2.2 Layers & the residual stream

depth = composition | residual stream | interpretability

The mechanism. A transformer stacks the attention operation dozens of times, each time paired with a per-token feed-forward block; one such pair is a layer. Each layer reads from and writes back to a shared residual stream: think of it as a communication bus running the length of the model that every layer can add information to. Early layers tend to capture local/syntactic patterns; later layers compose them into higher-level meaning. Depth buys composition: the ability to build abstractions on top of abstractions.
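
A toy sketch of the "read, then add back" pattern may make the bus picture concrete. Everything here is hypothetical: the two "layers" are hand-written functions rather than learned attention or feed-forward blocks, and the three-slot stream is far smaller than any real model's.

```python
def run_layers(stream, layers):
    """Pass one token's vector through a stack of layers that each ADD to it.

    stream: list of floats, the shared residual stream for one token
    layers: functions that read the stream and return an update of the same size
    """
    for layer in layers:
        update = layer(stream)                            # read from the stream
        stream = [s + u for s, u in zip(stream, update)]  # write back by adding
    return stream

# Hypothetical toy layers, hand-written instead of learned.
def detect_feature(stream):
    # Writes a flag into slot 1 when slot 0 is positive.
    return [0.0, 1.0 if stream[0] > 0 else 0.0, 0.0]

def compose_feature(stream):
    # Reads the flag an earlier layer wrote and builds on it in slot 2.
    return [0.0, 0.0, 2.0 * stream[1]]

stacked = run_layers([1.0, 0.0, 0.0], [detect_feature, compose_feature])
reversed_order = run_layers([1.0, 0.0, 0.0], [compose_feature, detect_feature])

print(stacked)         # [1.0, 1.0, 2.0]
print(reversed_order)  # [1.0, 1.0, 0.0]
```

The second layer can only build on the flag if the first layer has already written it, which is the sense in which depth buys composition.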

What an application developer must know

Mostly this is "why bigger/deeper models reason better" — you rarely act on it directly. The one practical takeaway: a single forward pass has fixed depth, so there's a bounded amount of computation per token. Problems needing more steps than the depth allows must be unrolled across output tokens (chain-of-thought) — the M1 "output stream is the only workspace" point, seen from the architecture side.

For the enthusiast

The residual stream is the central object in modern mechanistic interpretability: "A Mathematical Framework for Transformer Circuits" (Elhage et al., Anthropic, 2021) treats it as a shared bandwidth that heads read/write — the basis for finding actual circuits inside models.
3Blue1Brown — "But what is a GPT?" visualizes the full stacked-layer flow.

Exercise

1. Why doesn't "just one giant attention layer" work as well as many stacked layers?

A single layer can only relate tokens once; it can't build on its own output. Stacking lets layer N operate on the abstractions layer N-1 produced — composition. Depth, not just width, is what lets the model form multi-step structure within a single forward pass (up to its fixed budget).

2.3 Positional encoding / RoPE

lost in the middle | put key info at edges | RoPE rotations

The mechanism. Attention as described treats the prior tokens as an unordered bag: relevance scoring doesn't inherently care whether "5mg" came before or after "dose". But order obviously matters ("dog bites man" ≠ "man bites dog"). So position information is injected into every token's representation before relevance is scored: each token effectively carries a stamp for where it sits, letting the model factor order and distance into relevance. The modern method, RoPE, encodes position as a rotation (math linked below; the idea is all you need).

What an application developer must know

"Lost in the middle" is real and counterintuitive. In a long context, positions are not used evenly: models tend to use information at the beginning and end best, and information stranded in the middle gets systematically under-used (Liu et al., 2023). Bury the one clause that matters on page 15 of 30 and the model may act like it isn't there.
So: put critical instructions/facts at the start or the end of a long prompt. Retrieve-and-place the few relevant chunks near the top rather than dumping everything. Re-state crucial constraints at the end if the prompt is huge.
Sectioning a single concatenated prompt with "File 1 / File 2" headers does not fix lost-in-the-middle — the model still sees one flat token stream; a mid-stream clause is still mid-stream. Headers are cosmetic. Only separate calls change the position story (at the cost of cross-context attention).
Advertised context length ≠ usable context length. A "1M-token" model may use the middle far worse than the ends — verify on your own retrieval task, don't trust the number on the box.
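
The placement advice above can be turned into a small prompt-assembly helper. This is one possible layout, not a prescribed template: the function name `build_prompt`, the "Reminder:" wording, and the sample strings are all invented for illustration.

```python
def build_prompt(critical, chunks, question):
    """Assemble a long prompt with the critical instruction at both edges.

    critical: the instruction or fact that must not be missed
    chunks:   retrieved context, already filtered down to what is relevant
    question: the user's question
    """
    parts = [critical]                     # start of the prompt
    parts.extend(chunks)                   # bulk context sits in the middle
    parts.append(question)
    parts.append("Reminder: " + critical)  # restated at the very end
    return "\n\n".join(parts)

prompt = build_prompt(
    "Answer only from the policy text provided.",
    ["Policy section 4 (placeholder text)", "Policy section 9 (placeholder text)"],
    "Is remote work allowed on Fridays?",
)
print(prompt)
```

Whether the end-of-prompt restatement helps on a given model is something to verify on your own task, in the same spirit as checking usable context length.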

The math, linked (for curiosity — not required)

RoPE rotates each token's query/key vectors by an angle proportional to its position, so the dot product between a query and a key depends only on their relative position. Paper: Su et al., 2021, "RoFormer: Enhanced Transformer with Rotary Position Embedding". You do not need the trig to use any of the must-knows above.
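
For the curious, the "relative position falls out of the dot product" claim can be checked numerically. This is a deliberately simplified sketch: it uses a single 2-D vector pair and a made-up angle step `theta=0.1`, whereas real RoPE applies many such rotations at different frequencies across the vector's dimensions.

```python
import math

def rotate(vec, position, theta=0.1):
    """Rotate a 2-D vector by an angle proportional to its position."""
    angle = position * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def score(query, q_pos, key, k_pos):
    """Dot product of the position-rotated query and key."""
    qx, qy = rotate(query, q_pos)
    kx, ky = rotate(key, k_pos)
    return qx * kx + qy * ky

query, key = (1.0, 0.5), (0.3, 2.0)  # arbitrary example vectors

near_start = score(query, 5, key, 2)        # offset 3, early in the context
far_along = score(query, 1005, key, 1002)   # offset 3, much later
other_gap = score(query, 5, key, 0)         # offset 5

print(near_start, far_along, other_gap)
```

The first two scores agree up to floating-point error because only the offset between the two positions survives the dot product; changing the offset changes the score.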

For the enthusiast

Context-length extension (running a model past its trained length) is largely RoPE-interpolation tricks: "Position Interpolation," NTK/YaRN scaling. This is why models can sometimes be stretched to longer contexts after training.
Rigorous long-context evaluation: RULER (Hsieh et al., 2024) measures the real usable context size — usually far below the advertised one.

Exercises

1. A critical instruction buried in the middle of a 100k-token context is ignored. Diagnose and fix.

Lost in the middle — middle positions are under-attended. Fix: move the instruction to the start or end; or retrieve only the relevant context so it isn't buried; or repeat the constraint at the very end.

2. Does adding "File 1:", "File 2:" headers to a concatenated prompt fix the problem? Why or why not?

No. The model has no concept of "files" — it sees one token stream, and the buried content is still positionally in the middle. Headers are cosmetic. Only splitting into separate calls shortens contexts (trading away cross-file attention).

2.4 Context window / KV cache

context ≠ memory | cost scales with context | prompt caching | KV cache · GQA

The mechanism. The context window is simply the token sequence the model can attend to right now — your prompt plus everything generated so far, up to a fixed cap. There is no storage beyond it. To avoid recomputing attention for earlier tokens every step, the per-token key/value vectors are cached — the KV cache — which is what makes generation tractable and what consumes memory as context grows.

What an application developer must know

Context ≠ memory. The model remembers nothing between separate API calls. A "conversation" only works because your client resends the whole history every turn. Stateless by default — persistence (chat history, summaries, vector stores) is your job.
Cost and latency grow with context length, every turn. Turn 10 of a chat is more expensive than turn 1 because you're re-sending and re-processing 10 turns of history. Manage it: truncate, summarize, or window old turns.
Prompt caching cuts this. If a large prefix (system prompt, big document) is reused across calls, providers can cache its KV state so you don't pay full price to re-process it — large cost/latency wins. See Anthropic prompt caching / equivalent OpenAI caching. Structure prompts with the stable part first to maximize cache hits.
"Context rot": as a context fills toward its limit, output quality and instruction-following degrade — not a hard cliff but a gradual decline, compounded by lost-in-the-middle. More context is not free and not always better; relevant-and-short beats huge-and-noisy.

For the enthusiast

The KV cache is often the real memory bottleneck for long context. Mitigations you'll see named: Grouped-Query Attention (GQA) and Multi-Query Attention (shrink the cache by sharing keys/values across heads), plus paged KV caches (vLLM's PagedAttention).
FlashAttention again: it doesn't reduce the n² work but makes it memory-efficient and fast, which is why long contexts became practical.
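
A rough size estimate shows why the KV cache becomes the bottleneck and why sharing keys/values across heads helps. The formula is the standard count (keys plus values, per layer, per KV head, per token), but the model shape below is hypothetical, picked only for round numbers, and 2 bytes per value assumes 16-bit storage.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    Two tensors (keys and values) are stored per layer, each holding
    kv_heads * head_dim numbers for every token in the context.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape, not taken from any specific model.
LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 32_768

full_mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN)                 # 8 shared KV heads
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN)                 # a single shared KV head

GIB = 1024 ** 3
print(full_mha / GIB, gqa / GIB, mqa / GIB)  # 16.0 4.0 0.5
```

The cache grows linearly with context length and with the number of KV heads, so cutting 32 KV heads to 8 cuts the cache by 4× for the same context.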

Exercises

1. Why doesn't the model remember what you told it in a previous, separate API call?

Nothing persists outside the context window, and a new call is a fresh window. Multi-turn chat works only because the client resends prior turns. Memory is the application's responsibility.

2. Why does turn 10 of a chat cost more than turn 1?

Each turn re-sends and re-processes the accumulated history, so the input grows every turn — more tokens in, more attention work. Prompt caching and history summarization/truncation are the levers.
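
The growth is easy to tabulate. This is a simplified model under stated assumptions: every turn adds the same number of tokens, the full history is resent each time, and the helper names and the 700-token budget are invented for illustration.

```python
def input_tokens_per_turn(turn_sizes):
    """Input tokens sent on each turn when the full history is resent.

    turn_sizes: tokens added by each turn (user message plus reply).
    """
    sent, running = [], 0
    for size in turn_sizes:
        running += size
        sent.append(running)
    return sent

def window_history(turn_sizes, budget):
    """Keep the most recent turns whose total fits within a token budget."""
    kept, total = [], 0
    for size in reversed(turn_sizes):
        if total + size > budget:
            break
        kept.append(size)
        total += size
    return list(reversed(kept))

per_turn = input_tokens_per_turn([200] * 10)
total_sent = sum(per_turn)
recent = window_history([200] * 10, budget=700)

print(per_turn[0], per_turn[-1])  # 200 2000
print(total_sent)                 # 11000
print(recent)                     # [200, 200, 200]
```

Ten turns of 200 tokens each add up to 11,000 input tokens sent in total, not 2,000, which is why windowing or summarizing old turns pays off quickly.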

3. You reuse a 20-page policy document across thousands of requests. What's the optimization?

Prompt caching: place the stable document as a fixed prefix so its KV state is cached and not re-processed each call — big cost and latency savings. Put the variable part (the user's question) after the cached prefix.
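
The ordering advice can be illustrated by measuring how much of two requests is identical from the first character onward. This is only a sketch of the prefix principle: real provider caches work on tokens rather than characters and have their own rules (minimum lengths, explicit cache markers, expiry), and `POLICY` here is a stand-in string, not a real document.

```python
import os

# Stand-in for the 20-page policy document.
POLICY = "POLICY DOCUMENT\n" + "Clause text. " * 200

def prefix_first(question):
    """Stable document first, variable question last."""
    return POLICY + "\n\nQuestion: " + question

def question_first(question):
    """Variable question first, so the prompts diverge almost immediately."""
    return "Question: " + question + "\n\n" + POLICY

def shared_prefix_len(a, b):
    """Length of the identical leading span of two prompts."""
    return len(os.path.commonprefix([a, b]))

good = shared_prefix_len(prefix_first("Can I carry over leave?"),
                         prefix_first("Who approves travel?"))
bad = shared_prefix_len(question_first("Can I carry over leave?"),
                        question_first("Who approves travel?"))

print(good, bad)
```

With the document first, the whole document is a shared prefix that a cache could reuse; with the question first, the shared span ends at the first differing character of the question.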

Exit-test connections

M2 directly feeds these

"Which of two prompts costs more?" — §2.1 quadratic attention + §2.4 cost-scales-with-context, on top of M1's token counting. This is the cost-prediction question, end to end.
"Why isn't last week's fact in the model, but works once pasted in?" — §2.1 in-context learning + §2.4 context ≠ memory: the fact isn't in the weights (cutoff), but in the context the model attends to it directly. (Parametric vs. retrieved knowledge is finished in M4.)
Debugging long-context behavior — §2.3 lost-in-the-middle and §2.4 context rot are the two failure modes you'll hit most when prompts get big.

Full reference list

Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS. arXiv:1706.03762
Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864
Liu, N. F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. TACL. arXiv:2307.03172
Dao, T., et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS. arXiv:2205.14135
Ainslie, J., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP. arXiv:2305.13245
Hsieh, C.-P., et al. (2024). RULER: What's the Real Context Size of Your Long-Context LMs? arXiv:2404.06654
Elhage, N., et al. (2021). A Mathematical Framework for Transformer Circuits. Anthropic. transformer-circuits.pub
Olsson, C., et al. (2022). In-context Learning and Induction Heads. Anthropic. transformer-circuits.pub
Alammar, J. The Illustrated Transformer. jalammar.github.io
Rush, A., et al. The Annotated Transformer. Harvard NLP. nlp.seas.harvard.edu
3Blue1Brown. "But what is a GPT?" and "Attention in transformers, visually explained" (videos).
Anthropic. Prompt caching. docs.anthropic.com
