An interactive explanation

How a GPT Learns

Predict the next token. Measure how wrong you were. Adjust. That's the whole algorithm. This page lets you feel each step.

01

The Task: Predict What Comes Next

Before architecture, before math: the learning signal is just one question repeated billions of times.

A GPT sees a sequence of tokens and tries to predict the next one. That's the entire training objective. Below is a name from the training set. Try to guess each next character yourself; you're doing exactly what the model does.

Interactive: You are the model. Guess the next letter.
You just did one forward pass: you looked at context and made a prediction. You probably got better as more letters appeared, because context reduces uncertainty. A GPT does the same thing: more context, sharper predictions. The difference is that it does this across millions of names and adjusts its weights after every mistake.
02

Not One Guess โ€” A Distribution

The model doesn't pick one letter. It assigns probability to every possible letter.

When you guessed above, you committed to a single answer. A GPT is more nuanced: it outputs a probability for each character in its vocabulary. Click "Predict" to see what an untrained model thinks comes after each prefix, then watch how a trained model's distribution differs.

Interactive: probability distributions, before vs. after training (untrained = random, trained = learned).

The untrained model spreads probability almost uniformly: it has no idea what follows "ch". The trained model concentrates mass on letters like a, r, e, and i, because it has seen thousands of names and learned which continuations actually occur. Training is the process of reshaping these flat distributions into peaked, accurate ones.
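In code, both distributions come from a softmax over the model's raw scores (logits). A minimal sketch, with made-up logits for the prefix "ch" over four letters; the specific numbers are illustrative, not taken from the model:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, then normalize exponentials
    # so the outputs are positive and sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the letters [a, r, e, i] after the prefix "ch".
untrained = softmax([0.1, -0.05, 0.02, 0.0])  # near-uniform: no preference yet
trained = softmax([3.2, 1.1, 0.8, 0.4])       # mass concentrated on 'a'
```

Untrained logits hover near zero, so the softmax output is nearly flat; training pushes the logit of likely continuations up, which the exponential turns into a sharply peaked distribution.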
03

Measuring Mistakes: The Loss

If the correct letter got 2% probability, that's a big mistake. If it got 90%, barely a mistake at all.

The loss is the negative log of the probability the model assigned to the correct next token. Drag the slider to change how much probability the model placed on the right answer, and watch the loss respond.

Interactive: cross-entropy loss, −log(p), plotted against P(correct). Drag to set P(correct), starting at 10%.
Notice the shape: loss drops steeply as probability rises from near-zero, then flattens as it approaches 100%. This means the model gets heavily penalized for being confidently wrong, but gets diminishing returns for going from "pretty sure" to "very sure." This asymmetry drives the model to eliminate its worst mistakes first.
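The loss function itself is one line. A sketch, with the three regimes from the curve above:

```python
import math

def cross_entropy(p_correct):
    # Loss = negative log of the probability assigned to the true next token.
    return -math.log(p_correct)

cross_entropy(0.02)  # ~3.91: confidently wrong, heavy penalty
cross_entropy(0.90)  # ~0.11: mostly right, small loss
cross_entropy(0.99)  # ~0.01: diminishing returns near certainty
```

Going from 2% to 90% removes about 3.8 units of loss; going from 90% to 99% removes only about 0.1, which is the asymmetry that makes the model fix its worst mistakes first.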
04

Learning by Blame: Backpropagation

The loss says how wrong the prediction was. Backprop says who's responsible.

Every prediction flows through layers of computation: embeddings, attention, MLPs. Backpropagation walks backward through this chain, computing a gradient for every weight: "if I nudge you up, does the loss go up or down, and by how much?"

Interactive: gradient flow through a simplified 4-layer network. Click to backpropagate.
Wait โ€” every single weight gets its own gradient? In a real model that's billions of numbers!
Yes. In microgpt, every scalar is a Value object that tracks its parents in the computation graph. The backward pass walks this graph in reverse topological order. In production, PyTorch does the same thing but with GPU-accelerated tensor operations instead of one scalar at a time. The math is identical; only the scale differs.
Each weight now has a direction to move that would reduce the loss. The optimizer (Adam) uses these gradients, plus momentum and adaptive scaling, to make the actual update. Then the whole cycle repeats: predict, measure loss, backpropagate, update.
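The blame-passing idea can be written down in a few dozen lines. This is a minimal sketch in the spirit of micrograd-style scalar autodiff, not microgpt's actual implementation: each Value remembers its parents and the local derivative with respect to each, and backward() walks the graph in reverse topological order applying the chain rule:

```python
class Value:
    """A scalar that remembers its parents so gradients can flow back."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Build reverse topological order: every node after its parents.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0  # d(loss)/d(loss) = 1
        for v in reversed(topo):
            for p, g in zip(v._parents, v._local_grads):
                p.grad += g * v.grad  # chain rule: accumulate blame

w = Value(2.0)
x = Value(3.0)
loss = w * x + w      # loss = 2*3 + 2 = 8
loss.backward()
# w.grad == 4.0 (x + 1, since w appears in both terms); x.grad == 2.0 (w)
```

Note that w's gradient accumulates from two paths through the graph, which is exactly why grad is summed rather than assigned.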
05

Attention: Which Context Matters?

Not every previous token is equally useful. Attention learns to look at the right ones.

When predicting the last letter of a name, should the model focus on the first letter? The previous letter? Every position has a different relevance. Below is a heatmap of attention weights: each row shows where that position is "looking." Click different tokens to see the pattern shift.

Interactive: attention heatmap for "charlotte". Brighter = more attention.
Early tokens attend mostly to themselves and the start-of-sequence marker (they have little context). Later tokens spread attention across the sequence, especially to letters that help resolve ambiguity. The "e" at the end attends heavily to "tt" because in English names, "tte" is a strong signal that the name is ending. None of this is programmed; it emerges from gradient updates.
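A stripped-down sketch of the causal attention pattern behind the heatmap, using one scalar per position instead of real query/key/value vectors (real models also scale scores by the key dimension; both simplifications are mine, not microgpt's):

```python
import math

def causal_attention(queries, keys, values):
    """One attention step over a sequence, one float per position.
    Each position may only attend to itself and earlier positions."""
    out = []
    for i, q in enumerate(queries):
        # Q.K scores against every allowed (non-future) position.
        scores = [q * k for k in keys[:i + 1]]
        # Softmax turns scores into one row of the attention heatmap.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted blend of values.
        out.append(sum(w * v for w, v in zip(weights, values[:i + 1])))
    return out
```

The masking is implicit in the `[:i + 1]` slices: position 0 can only look at itself, which is why early rows of the heatmap are concentrated while later rows can spread out.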
06

Training: Watching the Model Improve

Thousands of predict-measure-update cycles, compressed into one chart.

Press "Train" to simulate a training run. Watch the loss curve fall โ€” fast at first (eliminating obvious mistakes like predicting 'z' after 'q'), then slowly (fine-tuning subtle patterns). The generated names below go from gibberish to plausible.

Interactive: training simulation, charting loss over training steps. Press "Train" to begin; sample outputs at the current step appear below the chart.
Early on, the model can barely string two consonants together. By the end, it produces names that feel right ("Marlina," "Jorah," "Elissa") even though they might not exist. It learned the statistical texture of names: common openings, vowel-consonant rhythms, typical lengths. That texture is encoded in the weights, shaped entirely by the predict-and-correct loop.
07

Scaling Up: Same Algorithm, More Everything

The gap between microgpt and ChatGPT is scale, not mechanism.

Adjust the model's capacity below and see how generation quality changes. More embedding dimensions let the model represent subtler patterns. More layers let it compose those patterns. The core loop never changes.

Interactive: parameter variation. Adjust the capacity slider (1 to 100, starting at 4) and regenerate to see the names produced at that scale.
If it's the same algorithm, why does GPT-4 seem qualitatively different from a tiny model?
Because scale enables emergence. A model with 4 dimensions can learn "vowels follow consonants." One with 12,288 dimensions can represent concepts like sarcasm, logic, and code structure, not because the algorithm told it to, but because the data contained those patterns and the model had enough capacity to capture them. More data, more parameters, more computation: the same gradient descent loop.
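To get a feel for the numbers, here is a rough back-of-the-envelope parameter count for a decoder-only transformer. It ignores biases, layer norms, and positional embeddings, and assumes the common 4x MLP expansion; the two configurations plugged in are illustrative (the large one uses GPT-3's published 96 layers and 12,288-dimensional embeddings):

```python
def approx_transformer_params(n_layer, d_model, vocab_size, d_ff_mult=4):
    """Rough count: embedding table plus per-layer attention and MLP weights."""
    embed = vocab_size * d_model             # token embedding table
    attn = 4 * d_model * d_model             # Q, K, V, and output projections
    mlp = 2 * d_ff_mult * d_model * d_model  # expand, then project back down
    return embed + n_layer * (attn + mlp)

# A character-level toy with 4-dimensional embeddings and a 27-symbol vocab.
tiny = approx_transformer_params(n_layer=1, d_model=4, vocab_size=27)
# A GPT-3-scale configuration, for comparison.
big = approx_transformer_params(n_layer=96, d_model=12288, vocab_size=50257)
```

The toy lands in the hundreds of parameters while the large configuration lands near 175 billion, yet both feed the same predict-measure-backpropagate-update loop.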
08

The Full Picture, Built Up

Every piece you've seen, assembled into the complete training loop.

Step through the components below. Each one is something you've already interacted with; now see how they compose into the single cycle that runs trillions of times to produce a language model.

Interactive: the training loop. Click through each step.

1 · Tokenize

Convert a document into a sequence of integer token IDs. In microgpt: one character = one token.

2 · Embed

Look up learned vectors for each token and its position. These are the model's raw inputs.

3 · Attend

Each position queries the others: "which context is relevant to me?" Attention weights are computed via Q·K dot products, masked so the model can't see the future.

4 · Transform

The MLP processes each position's blended representation through an expand → nonlinearity → project-back-down pathway. Residual connections preserve earlier information.

5 · Predict

Project to vocabulary size. Apply softmax to get a probability distribution over next tokens.

6 · Measure loss

Cross-entropy: −log(probability of the correct token). High when wrong, low when right.

7 · Backpropagate

Walk the computation graph backward. Every weight gets a gradient: its share of the blame.

8 · Update weights

Adam optimizer nudges each weight in the direction that reduces loss. Then repeat from step 1 with the next document. Forever.
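The eight steps above can be collapsed into a runnable toy by shrinking the model to a bigram table: predict the next character from the current one, with the softmax + cross-entropy gradient written in closed form (probs minus one-hot) and plain SGD standing in for Adam. Everything here, including the helper name and hyperparameters, is my own sketch rather than microgpt's code:

```python
import math
import random

def train_bigram(names, vocab, steps=200, lr=0.5):
    """The full predict-measure-backpropagate-update loop on a bigram model."""
    V = len(vocab)
    idx = {c: i for i, c in enumerate(vocab)}      # step 1: tokenize
    W = [[0.0] * V for _ in range(V)]              # logits for next char = W[current]
    for _ in range(steps):
        seq = ['.'] + list(random.choice(names)) + ['.']  # '.' marks start/end
        grad = [[0.0] * V for _ in range(V)]
        n = 0
        for a, b in zip(seq, seq[1:]):
            i, j = idx[a], idx[b]
            # Steps 2-5: "forward pass" is just a softmax over one row of W.
            m = max(W[i])
            exps = [math.exp(x - m) for x in W[i]]
            Z = sum(exps)
            probs = [e / Z for e in exps]
            # Step 6: cross-entropy on the true next character (tracked implicitly).
            # Step 7: closed-form gradient, d(loss)/d(logit_k) = prob_k - 1[k == j].
            for k in range(V):
                grad[i][k] += probs[k] - (1.0 if k == j else 0.0)
            n += 1
        # Step 8: SGD update nudges weights against the averaged gradient.
        for r in range(V):
            for k in range(V):
                W[r][k] -= lr * grad[r][k] / n
    return W

random.seed(0)                     # reproducible toy run
vocab = sorted(set('.emmaava'))    # '.' plus the characters of the names
W = train_bigram(['emma', 'ava'], vocab)
```

After training, the row of W for 'e' should put its largest logit on 'm', since 'm' always follows 'e' in this tiny dataset; that is the flat-to-peaked reshaping from section 02, produced by the loop from section 08.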