How LLMs Actually Work: A Mental Model in 4 Steps
LLMs don't understand your text. They predict tokens. Here's the 4-step mental model that explains hallucinations, context costs, and why prompts work.
TL;DR
- LLMs split your text into tokens: not words, not characters. This is why models fail at letter-counting.
- Attention lets every token 'see' every other token simultaneously. Longer context costs more because of this.
- The model predicts the next token using probability. It doesn't check facts against a database. Ever.
- Sampling (temperature, top-K, top-P) decides which probable token gets used. Set temperature to 0 for reproducible output.
- Once you see LLMs as prediction machines, hallucinations make sense, and so do the ways to reduce them.
You’ve probably sent a few hundred prompts to ChatGPT, Gemini, or Claude by now. But if someone asked you in an interview how the model actually produces its response, you’d probably say something like “it reads your question and figures out the best answer.”
That’s not wrong, exactly. But it skips the mechanism. And the mechanism matters, because it explains why models hallucinate, why the same prompt gives different answers each time, why longer prompts cost more, and why context windows have hard limits.
Here’s how LLMs work: a mental model in four steps. It won’t make you an ML researcher. But in my experience, it changes how you debug every prompt you write.
Try this hands-on: TinkerLLM Lesson 9 builds this exact mental model through exercises with a live model. You’ll see how tokenization, attention, and sampling interact in real time.
Step 1: Your Text Becomes Tokens (Not Words)
Before an LLM sees a single character of your prompt, a tokenizer splits your text into chunks called tokens. Tokens are not words. They’re not characters. They’re subword units that follow frequency patterns in the training corpus.
A few examples:
- “ChatGPT” splits into ["Chat", "G", "PT"] (3 tokens)
- “strawberry” splits into ["straw", "berry"] (2 tokens)
- “Bangalore” splits into ["Bang", "alore"] (2 tokens)
This is why models answer “how many r’s are in strawberry?” incorrectly. The model never saw individual letters. It processed ["straw", "berry"]. When you ask it to count characters, it’s working backward from token chunks, not inspecting the string. It guesses, and guesses wrong.
I’ve seen this pattern trip up developers who’ve built production LLM apps. They assume the model “sees” text the way a human reads it. It doesn’t. You can verify this with the OpenAI tokenizer tool. Paste any word and watch it split. Gemini uses a similar SentencePiece tokenizer that you can probe through the Gemini API’s countTokens method.
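If you'd rather check from code, OpenAI's open-source tiktoken library does the same thing locally. Here's a minimal sketch; the exact splits depend on which encoding your model uses, so the pieces you see may not match the 3/2/2 counts above.

```python
# pip install tiktoken
import tiktoken

# o200k_base is the encoding GPT-4o uses; older models use cl100k_base.
enc = tiktoken.get_encoding("o200k_base")

for word in ["ChatGPT", "strawberry", "Bangalore"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(f"{word!r} -> {pieces} ({len(token_ids)} tokens)")

# Every tokenizer carves words differently, so other models will show
# different splits for the same strings.
```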
Token count is also what drives cost. Most APIs charge per thousand tokens, not per word. Your system instructions, the full conversation history, and the model’s response all count toward the context window on every single call. Gemini 2.5 Flash has a 1M token context window. GPT-4o has 128K. Neither is unlimited.
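Because billing is per token, it's worth counting before you send anything large. A minimal sketch using the google-generativeai package's count_tokens call (assumes a GOOGLE_API_KEY in the environment; the prompt is a stand-in):

```python
# pip install google-generativeai
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

prompt = (
    "You are a helpful assistant.\n"
    "Summarize the following meeting notes in three bullet points:\n"
    "...full notes pasted here..."
)

# countTokens runs server-side and returns the billable input-token count
# before you commit to a full generate call.
result = model.count_tokens(prompt)
print(f"Input tokens: {result.total_tokens}")
```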
For a deeper look at tokenization and how it affects cost and context limits, see Tokens Explained →.
Step 2: Every Token Reads Every Other Token (Attention)
Once your text is tokens, the transformer architecture processes them through something called attention. The key insight is that attention is parallel: every token can “look at” every other token at the same time.
This is different from how you’d intuitively expect a model to work. You might imagine it reads left to right like a person. It doesn’t. When the model processes “the bank approved the loan after reviewing the deposit history,” the word “bank” attends to “approved,” “loan,” “deposit,” and “history” all simultaneously. The attention mechanism calculates how much each token should influence every other token’s representation.
This is what lets transformers handle long-range dependencies. Consider “The key was left in the car but the locksmith couldn’t reach it.” The word “it” refers to the car, not the key. Attention can resolve this because it weighs the full context in one pass, not sequentially.
But it comes at a cost. Attention computation scales roughly as O(n²) with sequence length. Doubling your context length roughly quadruples the computation. A 100K-token context isn’t 10 times as expensive as a 10K-token context; it’s closer to 100 times. This is why longer conversations cost more, and why LLM providers have been pushing architectural innovations to extend context windows affordably.
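To make that concrete, here is a toy NumPy sketch of the core calculation, scaled dot-product attention, on made-up vectors. Real models run this across many attention heads and layers with learned projection matrices, so treat this as the shape of the computation, not a production implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy setup: 5 tokens, each represented by a 4-dimensional vector.
# In a real transformer, Q, K, V come from learned projections of token embeddings.
n_tokens, d = 5, 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_tokens, d))  # what each token is "looking for"
K = rng.normal(size=(n_tokens, d))  # what each token "offers"
V = rng.normal(size=(n_tokens, d))  # the content each token passes along

# Every token scores every other token: an n x n matrix.
# This n x n product is where the O(n^2) cost comes from.
scores = Q @ K.T / np.sqrt(d)
weights = softmax(scores, axis=-1)   # each row sums to 1
output = weights @ V                 # each token's new representation is a weighted mix

print(weights.round(2))  # row i: how much token i attends to every token
print(output.shape)      # (5, 4)
```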
For you as a prompter, the main takeaway here is that position matters less than clarity. The model can attend to anything in the context, but dense, clear instructions still outperform vague, scattered ones regardless of where they appear.
Step 3: The Model Predicts. It Doesn’t Retrieve.
This step is the one that, in my view, changes how you interpret everything that comes out of an LLM.
The model doesn’t have a fact database it looks things up in. There’s no search index, no structured knowledge store, no truth-checking layer. What it has is a probability distribution over the next possible token.
Given “The capital of France is,” the model calculates: what token is most likely to come next? “Paris” has very high probability. “Lyon” has lower probability. “Penguin” has near-zero probability. The model picks from that distribution.
Most of the time, the most probable token is also the factually correct one. That’s because training data contained enough sentences about Paris being the capital of France to push “Paris” to the top of the distribution. Correct predictions dominate because training data is mostly correct.
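You can inspect that distribution yourself. The OpenAI chat completions API will return log-probabilities for the top candidate tokens at each step; here's a minimal sketch (assumes an OPENAI_API_KEY in the environment, and the exact candidates and numbers will differ by model and run):

```python
# pip install openai
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Complete this: The capital of France is"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,  # show the 5 most probable candidates for the next token
)

# Convert log-probabilities back to probabilities for readability.
for cand in resp.choices[0].logprobs.content[0].top_logprobs:
    print(f"{cand.token!r}: {math.exp(cand.logprob):.3f}")
# Expect something like ' Paris' near the top, holding most of the probability mass.
```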
But when the model hits a gap (a name it barely encountered, a fact published after its training cutoff, a question where multiple answers would be equally plausible in training data), it still predicts. It picks whatever the distribution says is most likely. And that prediction might be wrong, delivered with the same confidence as a correct answer.
This is the root cause of hallucinations. Not a bug. The core mechanism, applied to a case where training data didn’t provide a reliable signal. There’s no honesty layer that says “I’m not sure here.” There’s only probability, all the way down.
If you want to understand this in depth, AI Hallucinations: When Models Lie Confidently → covers the four specific failure modes in detail.
Step 4: Sampling Picks the Token
Once the model has a probability distribution over the next token, it still has to actually choose one. That’s what the sampling parameters control, and understanding the options here is where a lot of prompt engineering clicks into place.
Temperature scales the distribution before sampling. At temperature 0, the model always picks the highest-probability token and output is deterministic. At temperature 1.0, it samples proportionally from the raw distribution. Above 1.0, lower-probability tokens get boosted and output becomes more unpredictable.
Top-K constrains the pool. Instead of sampling from the entire vocabulary (tens of thousands to a couple hundred thousand tokens, depending on the model), the model only picks from the K most probable ones. Top-K 40 means only 40 tokens are in contention at each step, regardless of how the probabilities are distributed.
Top-P (nucleus sampling) is similar but dynamic. It picks the smallest set of tokens whose combined probability exceeds the threshold P. Set top-P to 0.95, and the model samples from whatever set of top tokens collectively covers 95% of probability mass. If that set is 3 tokens, only 3 are eligible. If it’s 40, all 40 are eligible.
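Here's how the three dials compose, as a toy sketch over a made-up six-token distribution. Real vocabularies are vastly larger and providers may apply the filters in slightly different orders, so this illustrates the logic rather than any particular API's implementation.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Pick one token index from raw logits using temperature, top-K, and top-P."""
    rng = rng or np.random.default_rng()

    if temperature == 0:                            # greedy: always the single most likely token
        return int(np.argmax(logits))

    probs = np.exp(np.array(logits) / temperature)  # temperature rescales the distribution
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]                 # candidate indices, most probable first
    if top_k is not None:
        order = order[:top_k]                       # keep only the K most probable tokens
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # smallest prefix whose combined probability reaches top_p
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        order = order[:cutoff]

    pool = probs[order] / probs[order].sum()        # renormalize over the surviving pool
    return int(rng.choice(order, p=pool))

# Toy distribution over six candidate "tokens".
vocab = ["Paris", "Lyon", "France", "the", "Marseille", "penguin"]
logits = [4.0, 1.5, 1.0, 0.5, 0.3, -3.0]

for t in (0.0, 0.7, 1.5):
    picks = [vocab[sample_next_token(logits, temperature=t, top_k=4, top_p=0.95)]
             for _ in range(8)]
    print(f"temperature={t}: {picks}")
```

Run it a few times: at temperature 0 you get "Paris" every time, and as temperature rises the lower-probability candidates start showing up.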
The Gemini API exposes all three. Claude’s API also exposes temperature, top-K, and top-P. OpenAI exposes temperature, top-P, and frequency/presence penalties, but not top-K.
This is why the same prompt gives different outputs on different runs. Sampling is stochastic by default. Set temperature to 0 and you’ll get the same output every time for the same input (with minor caveats about infrastructure-level non-determinism in some providers).
Knowing this changes how you configure prompts. Code generation works best near temperature 0. Brainstorming benefits from 0.8 or higher. If your outputs feel randomly off, temperature is the first dial to check.
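In practice, that just means setting the dial on the request. A sketch with the Gemini API (the parameter values are illustrative starting points, not provider recommendations):

```python
# pip install google-generativeai
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

# Near-deterministic settings for code generation.
code_config = genai.GenerationConfig(temperature=0.0)

# Looser settings for brainstorming.
brainstorm_config = genai.GenerationConfig(temperature=0.9, top_p=0.95, top_k=40)

response = model.generate_content(
    "Write a Python function that deduplicates a list while preserving order.",
    generation_config=code_config,
)
print(response.text)
```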
Why This Mental Model Changes How You Prompt
Once you see LLMs as prediction machines, a lot of prompt failures start making sense.
You’re not asking the model to “think” or “understand.” You’re giving it a context that shapes a probability distribution over the next token. Clearer prompts work better because they narrow the prediction space. “Write a Python function that takes a list of dicts and returns them sorted by a given key” gives the model much more pattern to match against than “write some Python code.”
System prompts bias every token prediction for the whole conversation. A well-designed system prompt doesn’t just instruct the model; it shifts the distribution toward a target tone, format, and vocabulary for every single response.
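With the OpenAI chat format, for example, the system message rides along with every request in the conversation. A minimal sketch (the system prompt wording here is invented for illustration):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    # The system message biases every token the model predicts in this conversation.
    {"role": "system", "content": "You are a terse release-notes writer. "
                                  "Plain language, bullet points, no marketing adjectives."},
    {"role": "user", "content": "Summarize this commit log for the changelog: ..."},
]

resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(resp.choices[0].message.content)
```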
And when you’re debugging something that went wrong, you’re really asking: what in this context made that wrong token sequence more probable than the right one? Usually the answer is a vague instruction, a conflicting constraint, or a concept the model’s training data didn’t anchor reliably.
This four-step model is a diagnostic tool as much as an explanation. Tokenization, attention, prediction, sampling. When something breaks, one of those four steps is usually where to look.
Try It Yourself
The best way to build this mental model is to run exercises that make each step visible. TinkerLLM Lesson 9 (“What Is an LLM? Building Your Mental Model”) walks through the prediction loop with a live model. You’ll see probability distributions, tweak sampling parameters, and watch outputs shift as you adjust temperature.
Open Lesson 9: What Is an LLM? Building Your Mental Model →
Module 1 (50 exercises, lessons 1-4) is free. No card needed. Bring your own free Gemini API key from Google AI Studio.
FAQ
How is an LLM different from a search engine?
A search engine retrieves documents that already exist on the web and match your query. An LLM generates text on the fly by predicting the next token based on patterns learned during training. When you search on Google, you get pages written by humans. When you prompt an LLM, you get text the model constructed word-by-word from probability distributions. That’s why search is better for “what’s today’s exchange rate” and LLMs are better for “help me rewrite this email.”
Do I need to understand how LLMs work to build with them effectively?
You don’t need it to use LLMs casually. But you do need it to build reliably. If you don’t know the model predicts tokens rather than retrieves facts, you’ll trust hallucinated output you shouldn’t. If you don’t understand attention costs, you’ll write bloated prompts and be surprised by the bill. The mental model in this post is essentially a debugging toolkit for when things go wrong.
Why does the same prompt give different answers each time?
Because sampling is stochastic by default. Unless temperature is set to 0, the model draws from a probability distribution at each token step, so there’s randomness baked into every response. Two runs of the same prompt aren’t guaranteed to produce the same output. Set temperature to 0 if you need reproducible results.
Why do context windows cost more the longer they get?
Because attention computation scales roughly as O(n²) with sequence length. Every token attends to every other token in the context, so adding tokens multiplies the work. It’s not linear: doubling the context is closer to quadrupling the compute. Providers optimize this in various ways (sliding windows, sparse attention, KV caching), but the underlying cost structure doesn’t disappear.
If the model only predicts tokens, how does it seem to reason?
Chain-of-thought prompting works by generating intermediate tokens that look like reasoning steps. When the model writes “Step 1: convert Fahrenheit to Celsius… Step 2: add 5…” those intermediate tokens shift the probability distribution for the final answer. The model doesn’t reason in a separate system and then report results; the reasoning tokens are part of the prediction process itself. This is why “think step by step” actually helps, and why the reasoning is sometimes wrong even when it looks structured.
Stop reading about how LLMs work. Try it. The first 50 exercises on TinkerLLM are free, no card needed.
Delivery lead at Kalvium Labs with a background in instructional design. Writes concept explainers and process posts. Thinks about how people actually learn before jumping to solutions.
Want to try this yourself?
Open the TinkerLLM playground and experiment with real models. 50 exercises free.
Start Tinkering