Tree of Thought Prompting Explained

You asked an LLM to plan the most efficient order for completing five tasks, each with its own dependencies and deadlines. You used chain of thought prompting. The model committed to starting with Task 2, reasoned through what that unlocked, and built a full sequence from there. By the time it finished, the schedule looked coherent.

It also never considered starting with Task 4, which would have resolved three dependencies at once and freed up the afternoon. Chain of thought doesn’t backtrack. It can’t step back and ask: what if I’d started differently?

That’s the problem tree of thought prompting was designed to fix.

What Tree of Thought Prompting Is

Tree of thought (ToT) prompting is a technique where the model generates multiple candidate reasoning steps at each decision point, evaluates which ones look most promising, and explores those paths before committing to an answer.

The idea comes from Yao et al. (2023) at Princeton University and Google DeepMind. The paper frames it as deliberate problem solving: instead of following the first path that looks reasonable, the model generates options, thinks about which are most promising, explores those branches, and discards the dead ends. The same way you’d solve a puzzle by trying a few different starting moves before committing.

Regular chain of thought generates one path: step 1 leads to step 2 leads to step 3 leads to an answer. Tree of thought generates a tree: three candidate step 1s, evaluated against each other. The strongest two branch into further steps. Those branch again. At the end, you commit to the path with the highest cumulative score.
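The difference between one path and a tree can be sketched on a toy problem. The states and scores in `TREE` below are invented for illustration, and `search` is a hypothetical helper, not code from the paper. With `beam_width=1` it behaves like a single chain; with `beam_width=2` it keeps the strongest two branches at each level and commits to the best cumulative score at the end.

```python
# Toy tree: each state maps to (next_state, step_score) candidates.
# All states and scores are made up for this illustration.
TREE = {
    "start": [("A", 5), ("B", 4), ("C", 1)],
    "A": [("A1", 1), ("A2", 2)],
    "B": [("B1", 6), ("B2", 3)],
    "C": [("C1", 2)],
}


def search(beam_width: int, depth: int = 2) -> tuple[list[str], int]:
    """Keep the `beam_width` best partial paths at each level."""
    frontier = [(["start"], 0)]  # (path, cumulative score)
    for _ in range(depth):
        candidates = []
        for path, total in frontier:
            for nxt, score in TREE.get(path[-1], []):
                candidates.append((path + [nxt], total + score))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[1], reverse=True)
        frontier = candidates[:beam_width]  # discard the weaker branches
    return frontier[0]


chain_path, chain_score = search(beam_width=1)  # one path, like CoT
tree_path, tree_score = search(beam_width=2)    # keep the strongest two
print(chain_path, chain_score)
print(tree_path, tree_score)
```

The single chain commits to A because it looks best at step 1 and finishes with a score of 7. The two-branch search keeps B alive and finds the path through B1, which scores 10.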

This changes what kinds of problems the model can actually handle well.

How It Differs from Chain of Thought

Chain of thought gives the model a way to reason step by step. But it doesn’t give the model a way to reconsider a step once taken. Once step 1 is written, everything after conditions on it. If step 1 was suboptimal, the whole chain drifts from there.

Think of it as the difference between walking one path through a maze and sending three people through simultaneously. The single walker takes each turn and keeps moving. If they hit a dead end, they don’t know there was a better route two turns back. The three people each take a different first turn, you compare where they are after a few steps, drop the ones who are clearly stuck, and let the remaining ones keep branching.

Here’s a concrete comparison on a product decision:

Chain of thought:

“I’ll start by locking the target user segment, then figure out pricing around that audience, then work backward to the feature set…”

Tree of thought:

“Three ways to approach this: (A) start with user segment, (B) start with pricing constraints, (C) start with the core technical capability. Let me evaluate each. Approach A risks building for a segment that can’t pay the margin we need. Approach B anchors the product to the wrong constraint. Approach C is closest to where our unfair advantage is…”

The CoT model is reasoning forward. The ToT model is reasoning across options before picking a direction. Both produce a structured response. But the ToT version catches the flaw in Approach A that a single forward chain would build on rather than question.

If you want the full foundation on chain of thought before going deeper into ToT, “Chain of Thought Prompting: Make LLMs Show Their Work” covers how CoT works, when to use zero-shot CoT vs few-shot CoT, and the cost trade-offs.

The Three Components of ToT

The original paper describes three parts. Knowing them helps you figure out which ones you can simulate in a prompt and which ones need real engineering.

1. Thought generation. At each step, the model produces multiple candidate next steps instead of one. For a planning task, that’s three different orderings to consider. For a writing task, it’s three candidate opening paragraphs. You can prompt for this directly: “Give me three different approaches to this next step. Make each one structurally distinct.”

2. State evaluation. After generating candidates, the model evaluates them. This can be relative (“which of these three approaches looks most promising and why?”) or scored (“rate each from 1-5 on feasibility”). The evaluation step is what stops the model from just defaulting to whichever option it generated first. Without it, ToT collapses back into CoT with extra words.

3. Search strategy. A controller (the code, or the prompt, that drives the process) decides how to expand the tree: breadth-first (explore all options at each level before going deeper) or depth-first (follow the most promising branch until it succeeds or hits a dead end, then backtrack). This part is genuinely hard to simulate in a single prompt because it requires managing state across multiple LLM calls.
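The three components can be sketched as three small functions. This is a toy under stated assumptions, not the paper’s implementation: there is no LLM here, `generate` and `evaluate` are hand-written stand-ins for model calls, and the puzzle (reach a target number from 1 using only “add 3” and “double”) is invented so the depth-first backtracking is easy to follow.

```python
from typing import Optional

State = tuple[int, ...]  # the sequence of values visited so far


def generate(state: State) -> list[State]:
    """Thought generation: propose every candidate next step."""
    last = state[-1]
    return [state + (last + 3,), state + (last * 2,)]


def evaluate(state: State, target: int) -> float:
    """State evaluation: higher is better; negative marks a dead end."""
    last = state[-1]
    return -1.0 if last > target else 1.0 / (1 + target - last)


def dfs(state: State, target: int, max_depth: int) -> Optional[State]:
    """Search strategy: depth-first, most promising branch first."""
    if state[-1] == target:
        return state
    if len(state) > max_depth:  # already took max_depth steps
        return None
    ranked = sorted(generate(state), key=lambda s: evaluate(s, target), reverse=True)
    for candidate in ranked:
        if evaluate(candidate, target) < 0:
            continue  # prune: an overshoot can never recover
        found = dfs(candidate, target, max_depth)
        if found is not None:
            return found
    return None  # every branch failed, so the caller backtracks


print(dfs((1,), target=10, max_depth=4))
```

On this input the search first follows 1, 4, 8 because 8 looks closest to the target, finds that both continuations overshoot, backs up, and succeeds through 1, 4, 7, 10.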

In practice, most people use what you could call prompted ToT: ask the model to generate, evaluate, and commit in one response. You get most of the benefit without the engineering overhead. Full ToT with backtracking is a different thing, covered in the “Prompted ToT vs Full ToT” section below.

A Working Example

Here’s tree of thought prompting on a content planning task with constraints:

I'm writing a blog post about a technical topic for a developer audience.
I need an opening hook.

Step 1: Generate 3 candidate opening sentences. Each should use a
different approach: one scenario-based, one with a surprising claim,
one that opens with a question.

Step 2: Evaluate each candidate. For each one, say: would a developer
find this interesting enough to keep reading, and does it set up the
rest of the post clearly?

Step 3: Based on your evaluation, which opening is strongest?
Rewrite it to improve it slightly if needed.

Run this and then run the same task as “Write an opening sentence for a technical blog post for developers.” The difference in output quality is usually visible immediately. The evaluation step in Step 2 is where the work happens: it forces the model to reason about quality criteria before choosing, rather than picking whatever it generated first.
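If you reuse this structure often, the three steps can be templated. The helper below is a minimal sketch: `build_tot_prompt` and its parameters are hypothetical names, and the wording is adapted from the prompt above rather than copied from any library.

```python
def build_tot_prompt(task: str, approaches: list[str], criteria: list[str]) -> str:
    """Assemble a generate-evaluate-commit prompt for a single LLM call."""
    n = len(approaches)
    return "\n\n".join([
        task,
        f"Step 1: Generate {n} candidates. Each should use a different "
        f"approach: {', '.join(approaches)}.",
        "Step 2: Evaluate each candidate against these criteria: "
        + "; ".join(criteria) + ".",
        "Step 3: Based on your evaluation, pick the strongest candidate "
        "and rewrite it to improve it slightly if needed.",
    ])


prompt = build_tot_prompt(
    task="I need an opening hook for a technical blog post aimed at developers.",
    approaches=["scenario-based", "surprising claim", "opening question"],
    criteria=["would a developer keep reading", "does it set up the rest of the post"],
)
# The direct prompt to compare against.
baseline = "Write an opening sentence for a technical blog post for developers."
print(prompt)
```

Keeping the criteria as an explicit argument makes it harder to forget the evaluation step, which is the part that does the work.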

Try It Yourself

Tree of thought prompting is covered in Lesson 21 at TinkerLLM, alongside chain of thought and meta-prompting. You can run the same task through all three techniques in sequence, which is the fastest way to see when ToT earns its cost and when CoT is genuinely enough.

💡 Try this hands-on: Lesson 21 covers advanced reasoning scaffolds with exercises you run against real Gemini models. Open Lesson 21: Advanced Prompt Engineering (CoT, Meta-Prompting, Prompt Chaining) →

TinkerLLM uses a BYOK model (your own free Gemini API key from Google AI Studio). Your key stays in your browser, never on our servers.

You can also test ToT right now in any LLM interface. Take a decision you’re actually facing, run it once with a direct prompt, once with “Let’s think step by step,” and once with the generate-evaluate-commit structure above. Compare not just the answers but how the model handles the parts where it could go either way.

When ToT Actually Helps

Tree of thought adds the most value when a problem has these properties:

Branching decision structure. If an early wrong choice blocks good solutions downstream, ToT helps. Route optimization, task scheduling with dependencies, multi-step design decisions all fit. The task-scheduling example at the top of this post is the clearest version of this: if you commit to the wrong starting task, no amount of good reasoning from that point recovers the optimal schedule.

No single obviously correct first step. If there’s one clearly right starting move, CoT will find it. ToT adds overhead without adding value. The technique earns its cost when multiple reasonable-looking starting points lead to very different outcomes.

Creative tasks with hard constraints. Writing a tagline, naming a product, designing a system architecture under specific constraints: all of these benefit from generating and comparing genuinely different approaches rather than iterating on the first one that comes out.

Cases where CoT keeps picking the same wrong path. If you’ve tried CoT on a task repeatedly and the model always commits to the same suboptimal approach, ToT breaks the pattern by forcing diversification at the first step.

When It’s Overkill

Most prompting tasks don’t need it.

If you’re summarizing a document, classifying customer feedback, translating text, or generating a cover letter, chain of thought is sufficient. The structure of those tasks is sequential, not branching. ToT typically costs roughly 3-10x more in tokens or API calls and, on these tasks, doesn’t improve the outcome.
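To get a feel for where the extra cost comes from, here is a rough call count for a controller-driven, breadth-first ToT run. The accounting is an assumption made for illustration: one generation call per kept state (returning all its candidates at once) and one evaluation call per candidate. Real setups batch differently, so treat the numbers as order-of-magnitude only.

```python
def tot_call_count(depth: int, candidates_per_state: int, beam_width: int) -> int:
    """Count LLM calls for a breadth-first ToT run under simple assumptions."""
    calls = 0
    states = 1  # start from the single root state
    for _ in range(depth):
        generation_calls = states                         # one call proposes all candidates
        evaluation_calls = states * candidates_per_state  # one call scores one candidate
        calls += generation_calls + evaluation_calls
        states = min(beam_width, states * candidates_per_state)
    return calls


cot_calls = 1  # a single chain of thought response
shallow = tot_call_count(depth=1, candidates_per_state=3, beam_width=1)
deeper = tot_call_count(depth=2, candidates_per_state=3, beam_width=2)
print(shallow, deeper)
```

Under these assumptions a one-level tree with three candidates takes 4 calls and a two-level tree with a beam of two takes 12, against a single call for CoT. The multiplier grows quickly with depth and branching.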

The Yao et al. paper tested ToT on three specific benchmarks: Game of 24 (a math puzzle where you combine four numbers to reach 24), creative writing with specified constraints, and a mini-crossword. All three have explicit backtracking structure where early choices constrain later ones and wrong paths need to be abandoned. That’s the scope of the technique.
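Game of 24 makes the backtracking structure easy to see in code. The solver below is a plain brute-force search with no LLM involved, so it is not the paper’s method; it only illustrates how each early choice (which two numbers to combine, and how) constrains what remains, and how a failed branch gets abandoned. The function name `solve_24` is made up for this sketch.

```python
from fractions import Fraction
from itertools import combinations


def solve_24(numbers: list[Fraction], target: int = 24) -> bool:
    """Depth-first search: combine two numbers, recurse, backtrack on failure."""
    if len(numbers) == 1:
        return numbers[0] == target
    for i, j in combinations(range(len(numbers)), 2):
        a, b = numbers[i], numbers[j]
        rest = [n for k, n in enumerate(numbers) if k not in (i, j)]
        results = {a + b, a - b, b - a, a * b}
        if b != 0:
            results.add(a / b)
        if a != 0:
            results.add(b / a)
        for value in results:
            if solve_24(rest + [value], target):
                return True
            # Otherwise this early choice was a dead end; try the next one.
    return False


print(solve_24([Fraction(n) for n in (4, 9, 10, 13)]))
print(solve_24([Fraction(n) for n in (1, 1, 1, 1)]))
```

Exhaustive search is cheap here because the puzzle is tiny. ToT targets the same structure in cases where a model has to propose and judge the intermediate steps because enumerating them all isn’t practical.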

A practical rule: try CoT first. If CoT is consistently wrong and you can identify specific branching decision points in the problem structure, try prompted ToT. Don’t default to ToT because it sounds more sophisticated.

Prompted ToT vs Full ToT

There’s a distinction most explanations skip over.

Prompted ToT is what most people mean when they say “tree of thought prompting.” You write a single prompt that asks the model to generate candidates, evaluate them, and commit to the best one, all in one response. This works for many tasks and requires no external infrastructure.

Full ToT requires an external controller: a program that calls the LLM multiple times, stores the intermediate states, runs the evaluation step as a separate call, decides which branches to expand (BFS or DFS), and manages the backtracking. This is what the Yao et al. paper actually implemented. LangGraph is one library that can handle this kind of stateful orchestration, but it’s real engineering, not just prompting.
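A stripped-down controller might look like the sketch below. Everything in it is an assumption for illustration: `llm` is a hypothetical callable that takes a prompt string and returns the reply text, the prompt wording is invented, and the score parsing is deliberately naive. It is plain Python, not LangGraph, and it implements only the breadth-first variant.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns reply text


def full_tot(problem: str, llm: LLM, depth: int = 2, k: int = 3, beam: int = 2) -> str:
    """Breadth-first ToT controller: the program, not the model, holds the tree."""
    frontier = [""]  # each state is the reasoning text accumulated so far
    for _ in range(depth):
        scored = []
        for state in frontier:
            # Call 1: ask for k candidate next steps, one per line.
            reply = llm(f"Problem: {problem}\nSo far: {state}\n"
                        f"Propose {k} distinct next steps, one per line.")
            steps = [line.strip() for line in reply.splitlines() if line.strip()][:k]
            for step in steps:
                candidate = f"{state}\n{step}".strip()
                # Call 2: evaluate each candidate in a separate request.
                verdict = llm(f"Problem: {problem}\nPartial solution:\n{candidate}\n"
                              "Rate how promising this is from 1 to 5. "
                              "Reply with a number.")
                try:
                    score = float(verdict.strip())
                except ValueError:
                    score = 0.0  # unparseable rating: treat as unpromising
                scored.append((score, candidate))
        if not scored:
            break
        scored.sort(key=lambda item: item[0], reverse=True)
        frontier = [candidate for _, candidate in scored[:beam]]  # drop weak branches
    return frontier[0]
```

Pass any function that maps a prompt string to a reply string as `llm`. A depth-first version would replace the frontier list with a stack and return to earlier states when a branch scores poorly.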

For planning, content creation, and structured decision-making, prompted ToT captures most of the benefit. Full ToT matters for hard combinatorial problems where the solution space is large enough that a single context window can’t hold the whole tree.

FAQ

What’s the difference between tree of thought and self-consistency?

Self-consistency runs the same CoT prompt several times and takes the most common final answer. Tree of thought generates structurally different reasoning paths and evaluates them at each step. Self-consistency is about reducing variance in a single reasoning approach. ToT is about exploring genuinely different approaches. Both require multiple LLM calls, but for different reasons. Self-consistency works well on math benchmarks where one correct answer exists and voting across samples averages out the noise. ToT works better when the right answer requires exploring structurally different starting moves.
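For contrast, self-consistency fits in a few lines. In this sketch `sample` is a hypothetical callable that runs the CoT prompt once (with sampling enabled) and returns only the final answer; in practice you would first extract that answer from the full reasoning text.

```python
from collections import Counter
from typing import Callable


def self_consistency(prompt: str, sample: Callable[[str], str], runs: int = 5) -> str:
    """Sample the same prompt several times; return the most common final answer."""
    answers = [sample(prompt).strip() for _ in range(runs)]
    return Counter(answers).most_common(1)[0][0]
```

There is no per-step evaluation and no branching here: every run is an independent chain, and only the final answers are compared.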

Do I need to build anything to try tree of thought prompting?

Not for prompted ToT. You can run the generate-evaluate-commit structure in any LLM interface using a single prompt. Full ToT, as in the original paper, requires external state management and multiple API calls. If you’re using TinkerLLM, you can run prompted ToT exercises in the playground without writing any code. Lesson 21 has exercises built around this.

Is tree of thought prompting the same as prompt chaining?

No. Prompt chaining passes the output of one prompt as the input to the next in a linear sequence. Tree of thought generates multiple branches at each step and evaluates them before deciding which ones to follow. Prompt chaining is linear. ToT is branching. You can combine them: a prompt chain can implement the ToT structure by passing intermediate states between calls.
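A linear chain is short enough to sketch directly. Here `llm` is again a hypothetical prompt-in, text-out callable, and each template is assumed to contain an `{input}` placeholder.

```python
from typing import Callable


def run_chain(templates: list[str], first_input: str, llm: Callable[[str], str]) -> str:
    """Linear chain: each prompt's output becomes the next prompt's input."""
    text = first_input
    for template in templates:
        text = llm(template.format(input=text))
    return text
```

There is exactly one state at every step and nothing to compare it against. Turning this into ToT means producing several outputs per step and scoring them before choosing which to pass along.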

When was tree of thought prompting published?

The original paper is “Tree of Thoughts: Deliberate Problem Solving with Large Language Models” by Yao et al. (2023), from Princeton and Google DeepMind. It appeared in May 2023 and built on the chain of thought literature, particularly Wei et al. (2022) on CoT and the self-consistency work that followed.

Does tree of thought work better on larger models?

Yes, meaningfully so. Thought generation and state evaluation are the two steps where model quality matters most. On smaller models (roughly under 7B parameters), the generated candidate thoughts often aren’t diverse enough to make ToT worth the extra cost. You’re generating variations of the same idea. On models like Gemini Pro or GPT-4o, the evaluation step is noticeably more reliable, and the generated candidates are more genuinely distinct. That said, even on mid-sized models, ToT adds value for tasks with clear branching structure where the problem itself forces diversity rather than relying on the model to generate it.

Stop reading about tree of thought prompting. Try it. The first 50 exercises on TinkerLLM are free, no card needed.

Open the playground →
