Predict the next token. Measure how wrong you were. Adjust. That's the whole algorithm; this page lets you feel each step.
Before architecture, before math: the learning signal is just one question repeated billions of times.
A GPT sees a sequence of tokens and tries to predict the next one. That's the entire training objective. Below is a name from the training set. Try to guess each next character yourself; you're doing exactly what the model does.
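The guessing game above can be written down as data: every prefix of a name, paired with the character that follows it, is one training example. A minimal sketch (the name "emma" and the start/end markers are hypothetical choices for illustration, not the page's actual dataset):

```python
# One name becomes several (prefix, next-char) training examples.
# "<s>" and "</s>" mark the start and end of the name (an assumed convention).
name = "emma"
tokens = ["<s>"] + list(name) + ["</s>"]
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for prefix, target in examples:
    print("".join(prefix), "->", target)
```

Every position in every name yields one such question, which is how a short list of names turns into a large pile of prediction problems.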
The model doesn't pick one letter. It assigns probability to every possible letter.
When you guessed above, you committed to a single answer. A GPT is more nuanced: it outputs a probability for each character in its vocabulary. Click "Predict" to see what an untrained model thinks comes after each prefix, then watch how a trained model's distribution differs.
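The distribution the model outputs comes from a softmax over raw scores (logits). A self-contained sketch with a made-up four-character vocabulary and made-up logits:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability, then normalize.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a tiny vocabulary: higher logit -> more probability,
# but every character gets some probability mass.
vocab = ["a", "b", "c", "d"]
probs = softmax([2.0, 0.5, 0.5, -1.0])
print(dict(zip(vocab, [round(p, 3) for p in probs])))
```

No character ever gets exactly zero, which matters for the loss in the next step: the model is always graded on how much probability it gave the right answer.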
If the correct letter got 2% probability, that's a big mistake. If it got 90%, barely a mistake at all.
The loss is the negative log of the probability the model assigned to the correct next token. Drag the slider to change how much probability the model placed on the right answer, and watch the loss respond.
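The slider's behavior is just the shape of the negative log. A quick sketch tabulating how -log(p) punishes low probability on the correct token:

```python
import math

# Loss = -log(p), where p is the probability assigned to the correct token.
# Low p -> large loss; p near 1 -> loss near 0.
for p in [0.02, 0.5, 0.9, 0.99]:
    print(f"p(correct) = {p:.2f}   loss = {-math.log(p):.3f}")
```

The curve is steep near zero, so confident wrong answers are punished far more than lukewarm right ones.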
The loss says how wrong the prediction was. Backprop says who's responsible.
Every prediction flows through layers of computation: embeddings, attention, MLPs. Backpropagation walks backward through this chain, computing a gradient for every weight: "if I nudge you up, does the loss go up or down, and by how much?"
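The graph walk can be sketched with a minimal micrograd-style `Value` class. This is a simplified illustration, not the page's actual implementation: only `+` and `*` are supported, and each node remembers its parents so `backward()` can apply the chain rule in reverse topological order.

```python
class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            # d(out)/d(self) = 1, d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            # d(out)/d(self) = other, d(out)/d(other) = self
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward_fn
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

a, b = Value(2.0), Value(3.0)
c = a * b + a          # c = 2*3 + 2 = 8
c.backward()
print(a.grad, b.grad)  # dc/da = b + 1 = 4.0, dc/db = a = 2.0
```

Each `grad` answers exactly the nudge question: increasing `a` by a tiny amount raises `c` four times as much.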
A `Value` object tracks its parents in the computation graph, and the backward pass walks that graph in reverse topological order. In production, PyTorch does the same thing with GPU-accelerated tensor operations instead of one scalar at a time. The math is identical; only the scale is different.

Not every previous token is equally useful. Attention learns to look at the right ones.
When predicting the last letter of a name, should the model focus on the first letter? The previous letter? Every position has a different relevance. Below is a heatmap of attention weights; each row shows where that position is "looking." Click different tokens to see the pattern shift.
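Under the heatmap, each row is just a softmax over dot products between the current position's query and earlier positions' keys. A sketch with hypothetical 2-dimensional vectors and a causal mask (real models learn these vectors; the numbers here are made up):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical learned query/key vectors for 3 positions, dimension d = 2.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys    = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d = 2

# One heatmap row per position: score earlier positions (causal mask),
# scale by sqrt(d), softmax into weights that sum to 1.
for i, q in enumerate(queries):
    scores = [dot(q, keys[j]) / math.sqrt(d) for j in range(i + 1)]
    weights = softmax(scores)
    print(f"position {i} looks at: {[round(w, 2) for w in weights]}")
```

The causal mask is why each row only extends up to its own position: a token may attend to its past, never its future.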
Thousands of predict-measure-update cycles, compressed into one chart.
Press "Train" to simulate a training run. Watch the loss curve fall: fast at first (eliminating obvious mistakes like predicting 'z' after 'q'), then slowly (fine-tuning subtle patterns). The generated names below go from gibberish to plausible.
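The shape of that curve falls out of plain gradient descent. Here is the predict-measure-update cycle on the smallest possible model: one weight, hypothetical data chosen so the ideal weight is 2.

```python
# Gradient descent on a one-weight model: pred = w * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up data; ideal w = 2
w, lr = 0.0, 0.05
for step in range(50):
    loss = grad = 0.0
    for x, y in data:
        pred = w * x                # predict
        loss += (pred - y) ** 2     # measure
        grad += 2 * (pred - y) * x  # who's responsible, and by how much
    w -= lr * grad                  # adjust
    if step % 10 == 0:
        print(f"step {step:2d}  loss {loss:10.6f}  w {w:.4f}")
```

The big early errors are cheap to fix, so the loss plummets at first; as `w` nears 2, each step has less and less to correct, and the curve flattens.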
The gap between microgpt and ChatGPT is scale, not mechanism.
Adjust the model's capacity below and see how generation quality changes. More embedding dimensions let the model represent subtler patterns. More layers let it compose those patterns. The core loop never changes.
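One way to see what "capacity" means concretely is a back-of-envelope parameter count. The formula below is an approximation under assumed choices (standard attention at roughly 4d² weights per layer, a 4x-expansion MLP at roughly 8d², biases, layernorms, and positional embeddings ignored); it is meant only to show how the count scales with the knobs.

```python
# Rough parameter count as a function of embedding dimension d and layer count.
def approx_params(d, layers, vocab=27):
    per_layer = 4 * d * d + 8 * d * d  # ~attention + ~MLP (approximation)
    return vocab * d + layers * per_layer

for d, layers in [(16, 1), (64, 4), (256, 8)]:
    print(f"d = {d:3d}, layers = {layers}: ~{approx_params(d, layers):,} params")
```

Doubling the embedding dimension roughly quadruples the per-layer cost, which is why capacity knobs interact so strongly with generation quality.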
Every piece you've seen, assembled into the complete training loop.
Step through the components below. Each one is something you've already interacted with โ now see how they compose into the single cycle that runs trillions of times to produce a language model.
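The assembled cycle can also be run end to end in a few dozen lines. As a stand-in for the full GPT, this sketch trains a bigram character model (one logit per previous-char/next-char pair) by gradient descent; the training names are hypothetical, but the loop is the real thing: predict, measure, assign responsibility, adjust.

```python
import math

names = ["emma", "olivia", "ava"]  # hypothetical training data
chars = sorted(set("".join(names) + "."))  # "." marks start and end
stoi = {c: i for i, c in enumerate(chars)}
V = len(chars)

# The whole model: one logit per (previous char, next char) pair.
W = [[0.0] * V for _ in range(V)]
lr = 5.0

def pairs():
    for name in names:
        s = "." + name + "."
        for a, b in zip(s, s[1:]):
            yield stoi[a], stoi[b]

for step in range(100):
    grad = [[0.0] * V for _ in range(V)]
    loss, n = 0.0, 0
    for i, j in pairs():
        # Predict: softmax over row i gives a distribution over next chars.
        m = max(W[i])
        exps = [math.exp(w - m) for w in W[i]]
        Z = sum(exps)
        probs = [e / Z for e in exps]
        # Measure: negative log of the probability on the correct next char.
        loss += -math.log(probs[j])
        n += 1
        # Responsibility: d(loss)/d(logit_k) = probs[k] - one_hot(j)[k].
        for k in range(V):
            grad[i][k] += probs[k] - (1.0 if k == j else 0.0)
    # Adjust: step every weight against its averaged gradient.
    for i in range(V):
        for k in range(V):
            W[i][k] -= lr * grad[i][k] / n
    if step % 25 == 0:
        print(f"step {step:3d}: avg loss {loss / n:.3f}")
```

Swap the bigram table for a transformer and the scalar loops for tensors, and this is the loop that runs trillions of times.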