Self-Consistency Prompting: When One Answer Isn't Enough
Self-consistency samples multiple reasoning chains and votes on the answer. Here's when it helps, when it doesn't, and how to implement it.
TL;DR
- Self-consistency runs the same CoT prompt multiple times at high temperature and returns the most frequent final answer: majority vote over reasoning paths.
- It works best on tasks with one correct answer: math, logic, factual Q&A. It does nothing useful for creative or open-ended tasks.
- You pay N times the token cost for N sampled paths. That tradeoff is worth it for high-stakes reasoning, not everyday prompts.
- Temperature 0.7-1.0 gives you genuine path diversity without producing incoherent reasoning chains.
- The implementation is simple: run the prompt N times, extract the final answer from each, return the majority.
You pasted a math problem into an LLM. The answer came back: ₹15,000. You weren’t sure, so you ran it again. ₹14,800. Third run: ₹15,000. You went with ₹15,000 because it showed up twice.
That instinct is self-consistency prompting: running the same prompt multiple times at a higher temperature, collecting the outputs, and picking the most frequent answer. It’s a formalized version of exactly what you just did. I’ve been using this technique on client reasoning tasks for months and the accuracy gains are genuine. Google Research published it in 2022, and my experience matches their benchmarks closely. Here’s how it works and when it’s worth the extra cost.
What Self-Consistency Prompting Is
Self-consistency is an extension of chain-of-thought prompting, not a replacement. Your prompt looks identical to a normal CoT prompt. You still tell the model to reason step by step. What changes is what you do after you get the output.
Standard chain of thought: send the prompt once at low temperature, get one reasoning chain, get one answer. Self-consistency: send the same prompt N times at higher temperature, collect N diverse reasoning paths, and return the answer that appears most often.
The logic: when you sample reasoning paths with some randomness, wrong answers should diverge. Different mistakes lead to different wrong conclusions. The right answer, if the model can reach it reliably, should cluster. Majority vote filters the noise.
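A toy simulation can make the clustering argument concrete. This is a sketch, not anything from the paper: it assumes each sampled path is independent, reaches the correct answer with probability `p_correct`, and otherwise lands on one of `n_wrong` equally likely wrong answers. Real samples from one model are correlated, so treat the numbers as illustrative only. All function and parameter names are made up for this example.

```python
import random
from collections import Counter

def simulate_vote(p_correct: float, n_paths: int, n_wrong: int, rng: random.Random) -> bool:
    """Return True if majority vote over n_paths simulated paths picks the right answer."""
    answers = []
    for _ in range(n_paths):
        if rng.random() < p_correct:
            answers.append("correct")
        else:
            # Assumption: wrong paths scatter across several distinct wrong answers.
            answers.append(f"wrong_{rng.randrange(n_wrong)}")
    winner, _ = Counter(answers).most_common(1)[0]
    return winner == "correct"

def vote_accuracy(p_correct: float, n_paths: int, n_wrong: int = 5,
                  trials: int = 2000, seed: int = 0) -> float:
    """Fraction of simulated trials in which the vote lands on the correct answer."""
    rng = random.Random(seed)
    hits = sum(simulate_vote(p_correct, n_paths, n_wrong, rng) for _ in range(trials))
    return hits / trials

single = vote_accuracy(0.5, n_paths=1)
voted = vote_accuracy(0.5, n_paths=15)
print(f"1 path: {single:.2f}, 15 paths: {voted:.2f}")
```

Under these assumptions the voted accuracy comes out well above the single-path accuracy, because the correct answer only has to beat each wrong answer individually, not all of them combined.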
Wang et al. tested this in their 2022 paper “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” On the GSM8K math benchmark, a single chain-of-thought pass on PaLM-540B scored around 56% accuracy. With self-consistency across 40 sampled paths, the same model reached 74%. That’s 18 percentage points gained without changing the model, the prompt format, or the training data.
How It Differs From a Single CoT Pass
Chain-of-thought prompting (covered in depth in Chain of Thought Prompting: Make LLMs Show Their Work) already fixes many LLM reasoning failures. You ask the model to show its work and it reasons through sub-problems before committing to an answer. This works well for most tasks.
But single CoT has one vulnerability. If the model starts down the wrong reasoning path, it will follow that path confidently to the wrong answer. There’s no backtracking. There’s no cross-checking. You get a polished, convincing-looking derivation of an incorrect conclusion.
Self-consistency adds a check. You’re not trusting one path. You’re running the reasoning process multiple times and asking: which answer does the majority converge on? If 7 of your 10 sampled paths reach ₹15,000 and 3 reach different numbers, those 3 likely made different errors, which is why they didn’t cluster. The convergent answer is more likely to be right.
This doesn’t mean single CoT is broken. For simple reasoning, one pass is fine. Self-consistency adds value when the task is hard enough that a single path has a meaningful failure probability.
When Self-Consistency Actually Helps
I’ve tested self-consistency on a range of tasks and the performance gains are real. But they’re not universal. The technique has a clear domain: tasks with one objectively correct answer.
It works well on:
- Math word problems and arithmetic
- Logical deduction and constraint satisfaction
- Factual questions with verifiable answers
- Code tracing and debugging (does this function return X or Y?)
- Commonsense reasoning benchmarks like StrategyQA
It doesn’t help on:
- Creative writing (there’s no “most correct” ending to a story)
- Open-ended advice or recommendations
- Summarization (free-form outputs don’t aggregate cleanly by majority vote)
- Tasks where you want diversity, not convergence
If you can’t define what “the right answer” looks like, majority vote has nothing to work with. Self-consistency is not a general accuracy booster. It works specifically because some reasoning paths reach the correct answer and other paths make different errors that scatter in different directions.
The Majority Vote in Practice
The implementation is simpler than it sounds. You don’t parse or compare reasoning chains. You only look at the final answer each path produces.
My prompt format instructs the model to put its final answer on the last line in a predictable structure. Something like: “Work through this step by step. State your final answer on the last line as: Final answer: [value].” I’ve found this phrasing more reliable than asking for a final answer anywhere in the response.
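As a rough sketch of that format in code: the helper names below (`build_prompt`, `extract_final_answer`) and the regular expression are my own illustration, not part of any library. The pattern assumes the model writes the marker as plain text at the start of a line; markdown decoration such as bold markers around "Final answer" would need extra handling.

```python
import re
from typing import Optional

# Instruction appended to every question so the answer lands on a predictable line.
ANSWER_SUFFIX = (
    "\n\nWork through this step by step. "
    "State your final answer on the last line as: Final answer: [value]"
)

# Matches "Final answer: <value>" at the start of a line, case-insensitively.
FINAL_RE = re.compile(r"^\s*final answer\s*:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)

def build_prompt(question: str) -> str:
    """Attach the fixed answer-format instruction to a question."""
    return question.strip() + ANSWER_SUFFIX

def extract_final_answer(response_text: str) -> Optional[str]:
    """Return the value from the last 'Final answer:' line, or None if there is none."""
    matches = FINAL_RE.findall(response_text)
    return matches[-1] if matches else None

sample = "Step 1: ...\nStep 2: ...\nFinal answer: ₹15,000"
print(extract_final_answer(sample))
```

Returning `None` instead of guessing lets the caller drop unparsable responses from the vote rather than count a garbage string.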
Then you run that prompt N times at temperature 0.8, extract the final answer line from each response, and count. The most frequent answer wins.
Here’s the Python structure using the Gemini API (which you can set up for free at Google AI Studio):
import google.generativeai as genai
from collections import Counter

genai.configure(api_key='YOUR_GEMINI_KEY')
model = genai.GenerativeModel('gemini-2.0-flash')

def self_consistent(prompt: str, n: int = 10, temp: float = 0.8) -> str:
    """Sample n responses and return the most frequent final answer."""
    config = genai.types.GenerationConfig(temperature=temp)
    answers = []
    for _ in range(n):
        response = model.generate_content(prompt, generation_config=config)
        lines = [l.strip() for l in response.text.strip().split('\n') if l.strip()]
        # Scan from the bottom for the "Final answer: ..." line.
        for line in reversed(lines):
            if 'final answer' in line.lower():
                answers.append(line.split(':', 1)[-1].strip())
                break
    if not answers:
        # No response contained a final answer line, so there is nothing to vote on.
        raise ValueError('no final answer found in any response')
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common
The prompt structure matters more than the code. If the model formats its final answer inconsistently across runs, the vote fragments across equivalent answers that look different as strings. In my testing, locking down the answer format in the prompt is the single most important implementation detail. Get that right and the aggregation becomes reliable.
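Even with a strict format, it can help to normalize answers before counting them. The sketch below is one possible approach, not a complete solution: `normalize_answer` is a hypothetical helper, the currency symbols it strips are an arbitrary sample, and it assumes commas are thousands separators (which is wrong for locales that use a comma as the decimal mark).

```python
import re
from collections import Counter
from decimal import Decimal, InvalidOperation

def normalize_answer(raw: str) -> str:
    """Map equivalent numeric answers to one canonical string; lowercase other text."""
    text = raw.strip().rstrip(".").lower()
    # Drop currency symbols, thousands separators, and whitespace (illustrative, not exhaustive).
    candidate = re.sub(r"[₹$€£,\s]", "", text)
    try:
        value = Decimal(candidate)
    except InvalidOperation:
        # Not a number: fall back to the cleaned-up text.
        return text
    # normalize() strips trailing zeros so 15000.00 and 15000 compare equal.
    return format(value.normalize(), "f")

raw_answers = ["₹15,000", "15000", "15,000.00", "₹14,800"]
votes = Counter(normalize_answer(a) for a in raw_answers)
print(votes.most_common(1)[0])
```

Without normalization, the first three strings would each count as a separate answer and the vote would be a four-way tie.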
Try It Yourself
You can test self-consistency without writing code. Open the TinkerLLM playground, paste in a math word problem or logic puzzle, and run it 5 times with temperature set to 0.8. Look at the last line of each response. Watch for the clustering effect.
💡 Try this hands-on: Lesson 21 on TinkerLLM covers advanced prompting including chain of thought, self-consistency, and prompt chaining. You run exercises against real Gemini models, with your own API key from Google AI Studio.
TinkerLLM uses a BYOK model. Your Gemini API key stays in your browser, never on our servers. The free tier from Google AI Studio is plenty for all the exercises.
How Many Samples Do You Need?
The original paper tested up to 40 samples. You don’t need 40.
The accuracy curve flattens fast. Most of the gain comes from the first 10 samples. After that, each additional sample adds less than 0.5 percentage points on typical benchmarks. Practical guidance:
- 5 samples: gets you most of the benefit at 5× the cost. Good for quick checks.
- 10 samples: the standard recommendation. Meaningful accuracy gain, manageable overhead.
- 20-40 samples: only if the cost of a wrong answer is very high and latency isn’t a constraint.
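The diminishing returns can be illustrated with a simple binomial model. This is a back-of-the-envelope sketch, not the paper's data: it assumes every path is independent, is correct with the same probability `p`, and that all wrong paths agree on a single wrong answer, so the vote is right only when a strict majority of paths is right. The value `p = 0.7` is an arbitrary choice for illustration.

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """P(a strict majority of n independent paths is correct), each correct with probability p."""
    need = n // 2 + 1  # smallest count that is a strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

# Odd sample counts so a two-way tie cannot occur in this model.
curve = {n: majority_accuracy(0.7, n) for n in (1, 5, 11, 21, 41)}
for n, acc in curve.items():
    print(f"n={n:>2}  modeled accuracy={acc:.3f}")
```

In this model the gain per added sample shrinks as N grows, which matches the shape of the guidance above even though the exact values depend entirely on the assumptions.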
For temperature, stay between 0.7 and 1.0. Below 0.7, your samples are too similar: you’re generating nearly the same reasoning path every time, which defeats the purpose. Above 1.0, the reasoning chains start becoming incoherent, which inflates wrong-answer clusters.
The Cost Tradeoff Is Real
Self-consistency is N times as expensive as a single pass. There’s no way around this, and it’s worth being honest about.
If your task is already correct 90% of the time with single CoT, self-consistency with 10 samples might push you to 93% at 10× the cost. That’s a bad trade for most use cases.
The technique makes sense when:
- You need high accuracy on a task with a definite answer
- The cost of a wrong answer (in time, money, or trust) exceeds the cost of N inference calls
- You have latency headroom, because responses will take longer
My rule of thumb: don’t use it as a default. Use it when the accuracy delta justifies the multiplier. For most everyday prompting, single CoT works fine and you’re just spending tokens you don’t need to spend. Save self-consistency for the hard problems.
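To put numbers on that rule of thumb, a small calculation helps. Everything in this sketch is hypothetical: the token counts and per-1k-token prices are placeholders, not any provider's real pricing, and the 90% to 93% accuracy figures are the illustrative ones from the example above. It also assumes no prompt caching or batching discounts.

```python
def run_cost(prompt_tokens: int, output_tokens: int, n: int,
             price_per_1k_in: float, price_per_1k_out: float) -> float:
    """Cost of n independent calls with the same prompt (no caching assumed)."""
    per_call = (prompt_tokens / 1000 * price_per_1k_in
                + output_tokens / 1000 * price_per_1k_out)
    return n * per_call

# Placeholder workload: 300 prompt tokens, 500 output tokens per reasoning chain.
single = run_cost(300, 500, n=1, price_per_1k_in=0.001, price_per_1k_out=0.002)
voted = run_cost(300, 500, n=10, price_per_1k_in=0.001, price_per_1k_out=0.002)

# Extra spend per accuracy point gained, using the 90% -> 93% example.
extra_per_point = (voted - single) / (93 - 90)
print(f"single={single:.4f} voted={voted:.4f} extra per point={extra_per_point:.4f}")
```

Comparing `extra_per_point` against what a wrong answer costs you is one way to decide whether the multiplier is justified for a given task.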
FAQ
How is self-consistency different from just running the same prompt again?
Running it again is the same action done informally. Self-consistency adds two things: you do it intentionally across N samples with a defined temperature, and you aggregate by majority vote rather than picking one response arbitrarily. The structure forces you to decide what “the answer” looks like before you run it, so you can actually compare outputs across runs. In my experience, without that structure you’re just manually picking whichever output looks right, which introduces your own confirmation bias.
Does temperature matter a lot here?
Yes, and the technique breaks down at the extremes. Too low (below roughly 0.5) and your samples will be nearly identical. You’re running the same reasoning path multiple times without real variation, which means you’re paying for diversity you’re not getting. Too high (above roughly 1.2) and the reasoning chains start producing incoherent steps, which inflates wrong-answer clusters and corrupts the vote. The 0.7-1.0 range is the safe middle: it gives you genuine variation in how the model approaches the problem while keeping each individual chain internally consistent.
Does this work with any LLM?
Yes. Self-consistency is a prompting strategy that runs on top of any model that supports temperature control. It works with Gemini, GPT-4o, Claude, Mistral, and open-source models. My experience is that the accuracy gains are larger on models that are already capable at reasoning. The technique amplifies existing reasoning ability. It doesn’t create reasoning capacity where there isn’t any.
How does self-consistency compare to Tree of Thought prompting?
They address overlapping problems with different mechanisms. Self-consistency samples multiple complete reasoning paths from start to finish and aggregates at the end. Tree of Thought explores multiple partial paths at each reasoning step, branching and pruning mid-chain before finishing. Self-consistency is simpler to implement and works well for linear reasoning tasks where each path goes start-to-finish without needing to backtrack. Tree of Thought is better for problems that genuinely require exploring dead ends mid-reasoning. More on how the two techniques compare in Tree of Thought Prompting: Beyond Chain of Thought.
What if the vote ties?
Use an odd number of samples (5, 9, 11) to reduce ties. An odd count rules out a tie only when exactly two answers are in play; with three or more distinct answers, ties can still happen. When a tie happens anyway, it’s usually a signal that the problem is genuinely hard for the model at this temperature. The paths can’t converge. In that case, you can look at which reasoning chain is more internally consistent, try a different prompt formulation, or acknowledge that the model is uncertain and verify the answer with a different source.
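One way to surface ties explicitly instead of letting the counter pick a winner silently is sketched below. The function names are illustrative, and returning `None` on a tie is a design choice for this example, not a standard convention.

```python
from collections import Counter
from typing import Optional

def leaders(answers: list[str]) -> list[str]:
    """Return every answer that shares the top vote count, sorted for stable output."""
    counts = Counter(answers)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(a for a, c in counts.items() if c == top)

def resolve(answers: list[str]) -> Optional[str]:
    """Return the unique winner, or None when the vote is tied or empty."""
    tied = leaders(answers)
    return tied[0] if len(tied) == 1 else None

# Five samples, yet still a tie, because three distinct answers are in play.
print(leaders(["15000", "15000", "14800", "14800", "15200"]))
print(resolve(["15000", "15000", "14800"]))
```

A `None` result is the cue to fall back on one of the options above: inspect the chains, rephrase the prompt, or verify the answer elsewhere.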
Stop reading about self-consistency prompting. Try it. The first 50 exercises on TinkerLLM are free, no card needed.
Delivery lead at Kalvium Labs with a background in instructional design. Writes concept explainers and process posts. Thinks about how people actually learn before jumping to solutions.