Module 1 · Model Mechanism

The Inference Loop

The spine of how a language model turns your prompt into text — one token at a time. Everything else in the course hangs off this.

App dev · must know | Enthusiast · nice to know | Math · linked, not required

Gold (App dev) = load-bearing for building on top of LLMs. Purple (Enthusiast) = deepens intuition, but you can ship without it. Blue (Math) = the underlying mathematics, linked for the curious; you are never expected to derive it.

1.1 Tokenization (BPE)

cost & latency unit | letter-level failures | BPE algorithm

The mechanism. A model never sees characters. Before anything happens, your text is chopped into tokens — subword chunks — and each chunk is replaced by a single integer ID into a fixed vocabulary (~100k–200k entries for modern models). The word strawberry might arrive as ["str","aw","berry"] → [1338, 707, 19772]. Each ID is an opaque atom: there is no "look inside token 19772 and see b‑e‑r‑r‑y" unless the model separately learned what that token expands to.

This single fact — letters are fused inside atoms — explains a family of otherwise-baffling failures.
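A minimal sketch of the text-to-IDs step. It assumes a tiny hypothetical vocabulary and a greedy longest-match rule; real tokenizers apply learned BPE merges, and the pieces and IDs here are only the illustrative ones from the paragraph above.

```python
# Toy vocabulary: pieces and IDs are illustrative, not taken from any real tokenizer.
TOY_VOCAB = {"str": 1338, "aw": 707, "berry": 19772}

def toy_encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match split of text into vocabulary pieces, returned as IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink until a piece matches.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {pos}")
    return ids

ids = toy_encode("strawberry")
print(ids)  # [1338, 707, 19772]
```

Everything downstream of this step receives only the integer list; the letters live in the lookup table, not in the input.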

What an application developer must know

Tokens are the unit of billing and latency, not words or characters. Every API charges per token; context limits are in tokens. Use the OpenAI tokenizer or tiktoken to count before you ship.
Letter-level tasks are unreliable — counting letters, reversing strings, "words starting with X," strict character limits. The letters aren't visible to the model as separate things. If you need this, do it in code, or have the model spell the word out first (see §1.2).
Non-English text costs more. English is heavily represented in tokenizer training, so it packs efficiently (~1.3 tokens/word). Hindi, Japanese, Thai, etc. can take 2–5× more tokens for the same meaning — directly more expensive and slower, and it eats your context budget faster (Petrov et al., 2023, "Language Model Tokenizers Introduce Unfairness Between Languages").
Formatting affects tokenization. Whitespace, leading spaces ("hello" vs " hello" are different tokens), and number grouping all shift the token stream — a real source of subtle prompt and JSON-emission bugs.
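"Do it in code" is usually a one-liner. A small sketch of the letter-level tasks above handled outside the model; the helper names are hypothetical.

```python
def count_letter(word, letter):
    """Count case-insensitive occurrences of a single letter."""
    return word.lower().count(letter.lower())

def fits_char_limit(text, limit):
    """Check a strict character limit (counts Unicode code points, not tokens)."""
    return len(text) <= limit

print(count_letter("strawberry", "r"))   # 3
print("strawberry"[::-1])                # yrrebwarts
print(fits_char_limit("hello", 5))       # True
```

Let the model produce the text and let ordinary string code enforce anything that depends on individual characters.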

For the enthusiast

Byte-Pair Encoding (BPE) builds the vocabulary greedily: start from bytes/characters, then repeatedly merge the most frequent adjacent pair into a new token, until you hit the target vocab size. Frequent words become single tokens; rare words fragment. Brought to NLP by Sennrich, Haddow & Birch, 2016, who adapted an older data-compression algorithm. GPT-2 introduced byte-level BPE so any Unicode string is representable.
Andrej Karpathy builds a working BPE tokenizer from scratch in "Let's build the GPT Tokenizer" (and minbpe). The single best 2 hours you can spend here.
Hugging Face's tokenizers library and its tokenizer course chapter cover BPE, WordPiece, and Unigram side by side.
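The greedy merge loop described above can be sketched in a few lines. This is a toy under stated assumptions: it works on characters rather than bytes, skips the pre-tokenization real tokenizers do, and the function name `train_bpe` is made up for illustration.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Learn up to num_merges BPE merges from text, starting from single characters."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; on a tie, the pair seen first wins.
        left, right = max(pair_counts, key=pair_counts.get)
        if pair_counts[(left, right)] < 2:
            break  # nothing repeats, so another merge would not compress anything
        merges.append((left, right))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # fuse the pair into one new token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = train_bpe("aaabdaaabac", num_merges=10)
print(tokens)  # ['aaab', 'd', 'aaab', 'a', 'c']
```

The frequent chunk "aaab" ends up as a single token while the rare letters stay fragmented, which is the same effect that makes common words cheap and rare words expensive.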

Worked examples

Why it can't count the r's in "strawberry"

The word is ~3 opaque IDs. To count r's the model must recall what letters each ID expands to (a learned, approximate association) and sum across chunks — it can't read characters that aren't present. Spelling the word out is easy because that letter sequence is a memorized association seen constantly in training; counting requires computing over a sequence that isn't laid out in front of it.

Why emitted JSON sometimes corrupts

Token boundaries don't respect your delimiters. A model emitting structured output is predicting tokens, not characters, so a brace or quote can land inside a merged token in a way that breaks strict parsers. Mitigations: constrained/structured decoding, JSON mode, schema validation + retry — never assume hand-rolled string concatenation is safe.
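A minimal sketch of the "schema validation + retry" mitigation. It assumes a hypothetical `generate(prompt)` callable that returns the model's raw text; the key check stands in for a real schema validator, and `fake_generate` only simulates a model that fails once.

```python
import json

def parse_with_retry(generate, prompt, required_keys, max_attempts=3):
    """Call generate(prompt) until it returns a JSON object with the required keys."""
    last_error = None
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            continue  # malformed output: ask again
        if isinstance(data, dict) and all(key in data for key in required_keys):
            return data
        last_error = ValueError(f"missing keys or wrong shape: {raw!r}")
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts") from last_error

# Stand-in for a model call: a truncated object first, then a valid one.
_replies = iter(['{"name": "Ada", "age": 36', '{"name": "Ada", "age": 36}'])

def fake_generate(prompt):
    return next(_replies)

result = parse_with_retry(fake_generate, "Return name and age as JSON.", ["name", "age"])
print(result)  # {'name': 'Ada', 'age': 36}
```

Where the provider offers constrained decoding or a JSON mode, prefer that and keep a check like this as the backstop.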

Exercises

1. Will a model reliably count the l's in "parallel"? Predict, then explain.

No — same mechanism as strawberry. The letters are fused inside subword tokens, so counting requires recalling each token's spelling and summing, which is unreliable. Fix: ask it to spell p-a-r-a-l-l-e-l first (externalize the letters into the token stream), then count.

2. You're billing customers per request and most are Japanese-speaking. What's the trap?

Japanese tokenizes to far more tokens than the equivalent English, so identical-feeling requests cost noticeably more and hit context limits sooner. Action: measure token counts on real localized samples, not English; price/limit accordingly.

3. Hands-on: paste a sentence and the same sentence with weird spacing into the OpenAI tokenizer. What changes?

Token counts and boundaries shift with leading/trailing/double spaces and newlines. Takeaway: prompt formatting isn't free or invisible — it changes the actual input the model receives.

1.2 The autoregressive loop

latency · streaming | can't revise · commit-and-justify | KV cache · speculative decoding

The mechanism. Generation is one loop, repeated:

run the whole token sequence through the fixed weights
   → out comes ONE thing: a probability distribution over the next token
   → pick one token  →  APPEND it
   → feed the whole, now-longer sequence back in
   → repeat, left to right, until a stop token
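The same loop as runnable Python, as a sketch only. `TOY_MODEL` is a hypothetical stand-in for the weights: it looks at just the last token, whereas a real model conditions on the whole sequence.

```python
import random

# Hypothetical "model": maps the last token to a next-token distribution.
TOY_MODEL = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"<stop>": 1.0},
}

def next_token_distribution(sequence):
    """Stand-in for a forward pass: one distribution over the next token."""
    return TOY_MODEL[sequence[-1]]

def generate(prompt, rng, max_new_tokens=10):
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(sequence)       # run the sequence through the "weights"
        tokens, probs = zip(*dist.items())
        token = rng.choices(tokens, weights=probs)[0]  # pick one token
        if token == "<stop>":
            break
        sequence.append(token)                         # append; earlier tokens are never edited
    return sequence

output = generate(["<start>"], random.Random(0))
print(output)
```

Note that `sequence` only ever grows: there is no operation in the loop that removes or rewrites an earlier token.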

Three consequences fall straight out of this shape, and they are not quirks — they are structural.

What an application developer must know

Latency scales with output length, sequentially. Each token is a full forward pass and they happen one after another. A 500-token answer takes roughly 5× as long to generate as a 100-token answer. The input is processed in parallel (the "prefill"); the output is the slow, serial part. Shorter outputs = faster responses.
Streaming exists because of this. Tokens are produced one at a time, so you can stream them to the UI as they appear — that's why ChatGPT "types." Use streaming to cut perceived latency.
The model cannot un-say a token. Once emitted, a token is frozen input for everything after it. If it starts down a wrong path ("The capital is Sydney…"), the most coherent continuation is often to keep going and justify it — a structural source of confident wrong answers.
"Self-correction" is just more tokens. When a model writes "…oh wait, that's wrong," it isn't editing — it can only append a retraction. There is no edit buffer. This is why letting a model "think out loud" (chain-of-thought) works: its output stream is its only workspace. No tokens spent reasoning = no place to reason.

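A back-of-envelope latency model for the points above: parallel prefill plus serial decode. The throughput defaults are made-up round numbers, not measurements of any real system, and the function name is hypothetical.

```python
def estimate_latency_seconds(prompt_tokens, output_tokens,
                             prefill_tokens_per_s=5000.0, decode_tokens_per_s=50.0):
    """Rough estimate; both rates are illustrative assumptions."""
    prefill = prompt_tokens / prefill_tokens_per_s    # input, processed in parallel
    decode = output_tokens / decode_tokens_per_s      # output, one token after another
    first_token = prefill + 1 / decode_tokens_per_s   # what a streaming UI waits for
    return {"first_token": first_token, "total": prefill + decode}

short = estimate_latency_seconds(prompt_tokens=1000, output_tokens=100)
long = estimate_latency_seconds(prompt_tokens=1000, output_tokens=500)
print(f"{short['total']:.1f}s vs {long['total']:.1f}s")   # 2.2s vs 10.2s
print(f"first token after {long['first_token']:.2f}s")    # first token after 0.22s
```

Under these assumptions the total is dominated by output length, while streaming lets the user see something after the first-token time regardless of how long the full answer is.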
For the enthusiast

KV cache: re-running the entire sequence every token sounds wasteful, and it would be, except the per-token key/value vectors from earlier positions are cached and reused. This is why each decode step only has to process the newest token, and why memory, not just compute, bounds context length. (Mechanics in §2.4.)
Speculative decoding uses a small "draft" model to propose several tokens that the big model verifies in one pass — same output distribution, fewer slow steps. Good mental model for "why is inference suddenly faster" in modern stacks.
See the loop built end to end in Karpathy's "Let's build GPT from scratch" and nanoGPT; the visual version is Jay Alammar's Illustrated GPT-2.
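To get a feel for what the KV cache saves, here is a counting sketch. It tallies only how many per-token key/value computations happen with and without a cache; it ignores the attention read over the cache and everything else in a real forward pass. The function names are hypothetical.

```python
def kv_computations_without_cache(prompt_len, new_tokens):
    """Each step recomputes keys/values for every position in the sequence so far."""
    total = 0
    for step in range(new_tokens):
        total += prompt_len + step   # sequence length at this step
    return total

def kv_computations_with_cache(prompt_len, new_tokens):
    """Prefill fills the cache once; each later step adds one entry for the newest token."""
    cache_len = prompt_len           # keys/values computed during prefill
    total = prompt_len
    for _ in range(new_tokens - 1):
        cache_len += 1               # only the just-appended token is new
        total += 1
    return total, cache_len

print(kv_computations_without_cache(100, 50))  # 6225
print(kv_computations_with_cache(100, 50))     # (149, 149)
```

The cached version does far less recomputation, but the cache itself grows by one entry per token, which is the memory cost the bullet above refers to.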

Exercises

1. A model answers "The CEO is Jane Doe," but the real CEO is someone else. What does it tend to do next, and why?

It tends to continue as if Jane Doe were correct — inventing plausible supporting detail — because the wrong token is now part of the input it conditions on, and the most coherent continuation builds on it. It can't delete the claim; it can only append. This is commit-and-justify, a root of confident hallucination.

2. Why does "think step by step" reliably help on multi-step arithmetic or logic?

A single forward pass does a fixed, bounded amount of computation. Hard problems need more steps than fit in one pass, so the model must spread the work across tokens, using its own output as scratch memory. "Think step by step" elicits those intermediate tokens; without them, there's nowhere to do the work.

3. Your product needs sub-second responses. Two knobs in this section affect that — name them.

(a) Output length — generation is serial, so cap max_tokens / ask for concise answers. (b) Streaming — stream tokens so the user sees output immediately even if the full answer takes longer. (Prompt/prefill length matters too, but that's the parallel part.)

1.3 Logits → softmax → sampling

temperature · top-p | nondeterminism | variance ≠ correctness | softmax

The mechanism. A forward pass does not output probabilities. It outputs logits — one raw, unbounded score per vocabulary token. Those are turned into a probability distribution by softmax (exponentiate each, divide by the sum so they total 1). Then the next token is sampled: drawn at random, weighted by those probabilities. A token at 8% is chosen ~8 times in 100 — not ignored for losing to the top token.

That random draw, sitting between "compute the distribution" and "emit a token," is the main source of nondeterminism. Same frozen weights + same prompt ⇒ same distribution, but a fresh weighted draw each run, so you can get different answers.
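A minimal sketch of the two steps, softmax then a weighted draw, in plain Python. The logits are arbitrary example values, and the seed is fixed only so the run is repeatable.

```python
import math
import random

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """Draw one token index at random, weighted by probs."""
    return rng.choices(range(len(probs)), weights=probs)[0]

probs = softmax([4.0, 1.0])
print([round(p, 3) for p in probs])           # [0.953, 0.047]

rng = random.Random(0)
draws = [sample(probs, rng) for _ in range(10_000)]
share_of_second = draws.count(1) / len(draws)
print(share_of_second)                        # close to 0.047; exact value depends on the seed
```

The second token is picked roughly in proportion to its probability, not discarded for losing to the first.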

Temperature: a knob on the distribution's shape

Temperature divides the logits before softmax (logit / T). It rescales how peaked vs. flat the distribution is — but it never re-ranks, because dividing every logit by the same number preserves their order. Two tokens with raw logits 4 and 1:

T	logits ÷ T	probabilities	behaviour
→ 0	→ ∞ gap	top → 1, rest → 0	greedy / argmax — deterministic
0.5 (low)	8, 2	~99.75% / ~0.25%	sharper, "safe", can get repetitive
1.0	4, 1	~95% / ~5%	the model's native distribution
2.0 (high)	2, 0.5	~82% / ~18%	flatter, more varied
→ ∞	→ 0 gap	→ uniform (1/V each)	max chaos → word salad

Note the ceiling: the second-place token's probability rises with temperature but only ever reaches parity (1/V) as T→∞ — it can never overtake the top token. top_p (nucleus) and top_k are complementary knobs that truncate the tail (only sample from the smallest set covering p% of probability, or the top k tokens) before the draw — usually a better default than high temperature for "creative but not insane."
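The table rows and the top_p truncation can be reproduced with a short sketch. The function names are illustrative, and `top_p_filter` is one straightforward reading of "smallest set covering probability p", not any particular library's implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in kept}

for t in (0.5, 1.0, 2.0):
    top, second = softmax_with_temperature([4.0, 1.0], t)
    print(t, round(top, 4), round(second, 4))
# 0.5 0.9975 0.0025
# 1.0 0.9526 0.0474
# 2.0 0.8176 0.1824

nucleus = top_p_filter(softmax_with_temperature([3.0, 1.0, 0.0], 1.0), p=0.9)
print(sorted(nucleus))  # [0, 1]
```

At every temperature the first token stays ahead of the second; top_p instead removes the lowest-probability token from the draw entirely.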

What an application developer must know

Pick temperature by goal, not by reading the probabilities. Low (0–0.3) for extraction, classification, code, anything with one right answer. Higher (0.7–1.0) + top_p for brainstorming, copy, variety. The distribution is the model's confidence; temperature is your external dial.
Temperature controls variance, not correctness. If the model is confidently wrong (high probability on a wrong token), no temperature fixes it — low temp makes it reliably wrong; high temp makes it occasionally-something-else, no more truthful, and you can't tell which sample is right. The only fixes live upstream: change the weights (fine-tune) or change the context (put the right info in the prompt — RAG). Context is the lever you actually control.
Temperature 0 is "deterministic" — with an asterisk. It selects greedy/argmax, so in principle the same input gives the same output. In practice you can still see variation from floating-point non-associativity, server-side batching, and MoE routing (covered in M5). Don't promise byte-identical reproducibility.
Prefer top_p over cranking temperature when you want controlled creativity without garbage.
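The floating-point asterisk on temperature 0 is easy to see in miniature. This is a contrived sketch: the three numbers and the near-tied "logits" are chosen to make the effect visible, and real systems hit it through much larger sums whose order depends on batching and hardware.

```python
def argmax(values):
    """Index of the largest value; on a tie, the first index wins."""
    return max(range(len(values)), key=lambda i: values[i])

a, b, c = 0.1, 0.2, 0.3
left_first = (a + b) + c     # same three terms...
right_first = a + (b + c)    # ...added in a different order

print(left_first == right_first)     # False
print(left_first, right_first)       # 0.6000000000000001 0.6

# Two near-tied logits: the summation order decides which token greedy decoding picks.
print(argmax([0.6, left_first]))     # 1
print(argmax([0.6, right_first]))    # 0
```

When two candidate tokens are nearly tied, a difference in the last decimal place is enough to flip the argmax, and every token after that point follows from a different prefix.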

The math, linked (for curiosity — not required)

Softmax with temperature: p_i = exp(z_i / T) / Σ_j exp(z_j / T). One-page reference: Wikipedia — Softmax function. That's the whole thing; you do not need its derivative.

For the enthusiast

Why sampling at all? Pure greedy decoding produces flat, repetitive, degenerate text and gets stuck in loops. The classic study is Holtzman et al., 2020 — "The Curious Case of Neural Text Degeneration", which introduced nucleus (top-p) sampling. Read this if you read one thing on decoding.
Hugging Face's "How to generate text" walks through greedy, beam, top-k, and top-p with runnable code.

Exercises

1. Logits are [3, 1, 0]. Compute softmax at T=1. Which token wins? Now T=0.5 — does the winner change?

T=1: exp([3, 1, 0]) = [20.1, 2.72, 1.0], sum 23.8 → ~84% / 11% / 4%. T=0.5: logits become [6, 2, 0] → exp = [403, 7.4, 1], sum 411.8 → ~98% / 1.8% / 0.2%. The winner never changes (token 0); only the gaps sharpen. Temperature reshapes, never re-ranks.

2. The same prompt at temperature 0.8 returns two different answers. Source, in one sentence?

The next token is drawn at random weighted by the distribution; same distribution, different dice rolls, and one early differing token cascades into a different continuation.

3. A colleague sets temperature 0 "to make the model accurate." Critique it.

Temperature 0 makes it deterministic and conservative, not accurate. It locks onto the model's single most-likely token — which can be confidently wrong. Reproducible ≠ correct. To improve accuracy, change the inputs (better context/RAG) or the model, not the sampler.

4. Model is 97% on a wrong drug dose. Is there a temperature that gets the right answer? What does?

No reliable one — temperature only changes how often you stray from the top token (which stays wrong), and the correct value may carry near-zero mass and be unidentifiable even if sampled. The fix is upstream: put the authoritative dosing reference in the context (RAG), or use a model whose weights are right, or call a tool/database.

Exit-test connections

M1 directly feeds these

"Why two different answers at temperature 0?" — §1.3: temp 0 selects greedy/argmax (should be deterministic); the residual variation comes from float non-associativity, batching, and routing (M5). The sampling story is here; the deep wrinkle is M5.
"Why a confident, wrong citation?" — §1.2 commit-and-justify + §1.3 confidence ≠ correctness. The full structural story lands in M3.
"Which of two prompts costs more?" — §1.1: tokens are the unit; non-English and verbose formatting inflate the count. (Attention's quadratic term is in M2.)

Full reference list

Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL. arXiv:1508.07909
Petrov, A., et al. (2023). Language Model Tokenizers Introduce Unfairness Between Languages. NeurIPS. arXiv:2305.15425
Holtzman, A., et al. (2020). The Curious Case of Neural Text Degeneration. ICLR. arXiv:1904.09751
Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. ICML. arXiv:2211.17192
Karpathy, A. Let's build the GPT Tokenizer (video) · minbpe
Karpathy, A. Let's build GPT: from scratch (video) · nanoGPT
Alammar, J. The Illustrated GPT-2. jalammar.github.io
OpenAI. Tokenizer. platform.openai.com/tokenizer · tiktoken
Hugging Face. How to generate text (blog) · Tokenizers course (ch.6)
