Chain of Thought Prompting Explained

You gave an LLM a word problem: “A store sells apples for ₹12 each and oranges for ₹18 each. Priya buys 5 apples and 3 oranges. How much does she pay in total?” The model answered ₹108. The correct answer is ₹114. You tried again with the same prompt. Same wrong answer, same confidence.

Then you added four words to the end: “Let’s think step by step.”

The model worked through it: 5 apples × ₹12 = ₹60, then 3 oranges × ₹18 = ₹54, total = ₹114. Correct.

That’s chain of thought prompting. And it’s one of the most useful techniques you’ll add to your prompting toolkit, not because it’s clever, but because it changes how the model processes the problem.

What Chain of Thought Prompting Is

Chain of thought (CoT) prompting asks the model to generate intermediate reasoning steps before giving the final answer. Instead of jumping from question to output, the model thinks out loud. You get to see each step before the conclusion.

The technique came from Wei et al. (2022) at Google Research, who found that large models perform significantly better on reasoning tasks when given worked examples that show the full reasoning trace, not just input-output pairs. Around the same time, Kojima et al. (2022) showed you don’t even need examples: appending “Let’s think step by step.” causes models to generate reasoning chains on their own.

I’ve been testing CoT prompts across different task types for a while now, and the thing that still surprises me is how much one phrase changes the output, not just in correctness, but in how auditable the answer becomes.

The reason it works is how LLMs generate text. Every token the model writes is a prediction conditioned on everything before it. When you ask for a direct answer, the model has to bridge from the question to the answer in one step. When you give it space to reason first, each intermediate step becomes context for the next step. The model is solving a series of smaller problems, each one easier than the whole.

You can see this clearly in the apple-orange example. Jumping to ₹108 is a pattern-completion error: the model recombined the numbers plausibly but wrongly. Working through the steps forces it to commit to each sub-answer before moving on. And that commitment is where the errors get caught.

Zero-Shot CoT: The One-Phrase Fix

Zero-shot CoT is the simplest form. You don’t change the question. You don’t add examples. You just append a reasoning trigger to the end.

The phrase that works consistently: Let's think step by step.

Kojima et al. found this single phrase improved accuracy on reasoning benchmarks by 18-44 percentage points across model families. It changes what the model interprets as the expected shape of the response. The model has learned from training data that this phrase typically precedes a detailed reasoning trace, so it generates one.

Here’s a concrete before and after:

Without CoT:

Prompt: If you have 3 boxes with 4 apples each, then remove 5 apples, how many remain?
Response: 7 apples

With zero-shot CoT:

Prompt: If you have 3 boxes with 4 apples each, then remove 5 apples, how many remain? Let's think step by step.
Response:
Step 1: Total apples = 3 boxes × 4 apples = 12 apples
Step 2: After removing 5: 12 - 5 = 7 apples
Final answer: 7 apples

Both return the same answer here. But on harder problems, the version without CoT drifts, and you can’t tell where it went wrong. The version with CoT produces a reasoning chain you can audit. If step 1 had said “3 × 4 = 14,” you’d catch it and know exactly what broke.

Other trigger phrases that produce similar behavior:

“Work through this step by step.”
“Show your reasoning before giving the final answer.”
“Think carefully before answering.”

The exact phrasing matters less than the signal: you want the model to commit to a reasoning process before committing to the answer.

Few-Shot CoT: Showing the Model How to Reason

Few-shot CoT goes further. You give the model 2-4 worked examples, each showing the full reasoning trace, before the real question. This is more reliable than zero-shot CoT on complex tasks, and it lets you control the format of the reasoning.

Here’s what a few-shot CoT setup looks like for customer feedback classification:

Question: A customer says "I waited 30 minutes and the food was cold." What's the emotion category?
Reasoning: The customer describes two specific service failures: long wait time and poor food temperature. Both are negative signals. The tone is factual, not angry. No positive offset is mentioned.
Answer: Frustrated

Question: A customer says "The delivery was late but the meal was amazing." What's the emotion category?
Reasoning: "Late delivery" is a clear negative on service. "Amazing meal" is a strong positive on product quality. The sentence structure puts positive last, suggesting the customer is leaning positive overall but flagging a service issue.
Answer: Mixed (positive product, negative service)

Question: A customer says "I'll never order from here again." What's the emotion category?
Reasoning:

The model sees the pattern: question, then reasoning trace, then labeled answer. When it hits the third question with an incomplete entry, it completes it in the same structure, generating a reasoning trace before committing to the answer.

Two things make the examples work:

The reasoning traces should show actual reasoning, not just restate the answer. “The customer is unhappy” is not a reasoning trace. “The phrase ‘never again’ signals strong negative intent even without a stated reason” is.
Pick examples that cover the variation you care about. If all your examples are simple cases, the model won’t generalize to the edge cases that actually matter.

Try It Yourself

Chain of thought prompting is something you need to run, not just read about. The difference between CoT and a direct prompt is easiest to see on a problem that actually breaks without it.

💡 Try this hands-on: Lesson 21 on TinkerLLM covers chain of thought, meta-prompting, and prompt chaining with exercises you run against real models. Open Lesson 21: Advanced Prompt Engineering (CoT, Meta-Prompting, Prompt Chaining) → The first 50 exercises are free, no card needed.

You can also test this in any LLM interface. Run the same logic puzzle twice: once with a direct prompt, once with “Let’s think step by step.” appended. Notice not just whether the answer changes, but whether you can audit the reasoning when it does. In my experience, that auditability matters as much as the accuracy lift.

When CoT Actually Helps

CoT is not a universal improvement. It adds meaningful value in specific situations and adds unnecessary cost in others.

CoT works well for:

Multi-step arithmetic and algebra. Any problem where the answer depends on chaining multiple calculations. Tax math, unit conversions, percentage problems are all in this category.
Logic and commonsense reasoning. “Maria leaves before the library opens at 9am. Carlos arrives at 9:30am. Could they have arrived together?” CoT forces the model to set up the timeline explicitly before answering.
Multi-signal classification. Tasks where the category depends on combining multiple signals. The reasoning trace makes it clear which signals were weighted and how.
Planning and sequencing. “What order should I complete these tasks to minimize wait time?” The model needs to reason about dependencies before it can answer correctly.

CoT often doesn’t help for:

Simple factual retrieval. “What is the capital of France?” Adding CoT just produces a longer correct answer at extra cost.
Creative tasks. Writing a short story or brainstorming product names doesn’t benefit from step-by-step reasoning. The creativity comes from variety in generation, not from careful logic.
Standard translation. Models translate at the phrase level during generation. Asking them to reason about translation before doing it doesn’t improve accuracy on standard text.
High-volume, low-stakes tasks. CoT adds latency because it generates more tokens. For tasks where being slightly wrong is tolerable and throughput matters, direct prompting is usually right.

A quick test: if you can draw the intermediate steps on a whiteboard before reaching the answer, CoT will probably help. If the answer is just recalled or generated without a reasoning chain, it probably won’t.

The Cost Trade-Off

Chain of thought prompting costs more. This is real and you need to plan for it.

A direct answer to a math problem might be 15-25 tokens. A CoT answer to the same problem might be 80-250 tokens, all output tokens. On Gemini Flash, output tokens cost roughly 4x more than input tokens per unit. The multiplier on GPT-4o is similar. So for a CoT answer that’s 10x longer, the effective cost per query increases substantially.

For a high-volume use case, say 50,000 classification requests per day, switching from direct prompting to CoT might multiply your daily API cost by 3-8x. That’s not a reason to avoid CoT. It’s a reason to use it selectively.

Practical strategies that keep costs manageable:

Route by complexity. Keep a simple complexity signal in your pipeline. If the request clearly requires multi-step reasoning, use CoT. If it’s a straightforward lookup, don’t.
Use a smaller model for CoT. A fast model with CoT often beats a powerful model without it on reasoning tasks. Gemini Flash with CoT frequently outperforms a direct prompt to a heavier model, at lower cost.
Cache CoT outputs for repeated queries. If you’re classifying similar customer feedback messages, cached reasoning traces are free.

Combining CoT with Other Techniques

CoT works well alongside other prompting approaches. You don’t have to pick one.

Few-shot + CoT. Show 2-3 examples with full reasoning traces, then give the real question. You get format control from few-shot and reasoning quality from CoT. The Zero-Shot vs Few-Shot vs Chain of Thought post covers combining them in detail.

System instructions + CoT. Put the reasoning trigger in your system instruction rather than the user prompt: “Before answering any question that requires calculation, show your work step by step.” Every request to that model instance uses CoT automatically, without you adding it to each prompt.

CoT + self-consistency. Run the same CoT prompt 3-5 times and take the most common final answer. More expensive (3-5x the output tokens), but the Wei et al. paper showed 10-20% accuracy improvements over single CoT on math benchmarks. I find this worth it when errors have real consequences.

If you’re newer to prompting techniques overall, What is Prompt Engineering? covers the broader landscape. CoT sits in the reasoning-scaffold section and makes more sense once you understand the foundations.

FAQ

Does chain of thought prompting work on all LLMs?

Mostly yes, but effectiveness scales with model size. The Wei et al. (2022) paper found that CoT provides minimal benefit on models below roughly 7B parameters, because smaller models don’t have enough capacity to generate coherent intermediate reasoning. On modern models like Gemini Flash, GPT-4o mini, or Claude 3 Haiku, you’ll see meaningful improvements on reasoning tasks. The effect is strongest on the largest available model in a family.

Can I use chain of thought prompting for coding tasks?

Yes, and it helps in specific ways. If you’re asking a model to debug code, prompting it to “explain what each line does before identifying the bug” often surfaces the error that a direct “find the bug” prompt misses. For algorithm design, asking it to “describe the approach before writing the code” catches logical errors before they’re encoded in syntax. It’s less useful for straightforward code generation where the model has seen thousands of similar examples.

How long should the reasoning trace be?

As long as the problem requires. For two-step arithmetic, two or three sentences. For complex logic, as many steps as needed. Don’t try to constrain the reasoning trace length in your prompt: the model should generate as many steps as the problem takes. If you’re concerned about output token cost, route the task through a cheaper model rather than truncating the reasoning.

Is chain of thought the same as asking the model to explain its answer?

Not quite. “Explain your answer” asks the model to justify a conclusion after it reaches it. This can produce post-hoc rationalization: the model generates a plausible-sounding explanation for a conclusion it reached without actually reasoning through it. Chain of thought prompting asks for the reasoning before the conclusion, so the intermediate steps are part of the path to the answer. The difference matters most on edge cases and ambiguous problems where the reasoning process itself determines whether the answer is correct.

Does adding “Let’s think step by step” always improve accuracy?

No. On tasks that don’t require multi-step reasoning, it often produces the same answer with more words. And on some tasks, it can actually introduce errors: if the model’s step-by-step reasoning goes wrong at an early step, it commits to that error and builds on it. The technique is most reliable on problems that clearly require chaining multiple steps together. Use it selectively based on task type, not as a blanket upgrade.

Stop reading about chain of thought prompting. Try it. The first 50 exercises on TinkerLLM are free, no card needed.

Open the playground →

Chain of Thought Prompting: Make LLMs Show Their Work

TL;DR