How to Count Tokens Before Sending a Prompt
Count tokens in Python with tiktoken, the Gemini API, and quick estimates before your prompt hits a context limit.
TL;DR
- tiktoken is OpenAI's Python library for counting GPT tokens locally. Install with pip, pick an encoding, call encode().
- The Gemini API has a native countTokens endpoint that returns the exact count before generation.
- Quick estimate without a library: 1 token ≈ 4 characters for English, fewer for code.
- Token counting prevents context-limit errors in batch jobs and helps estimate API costs before they happen.
- TinkerLLM Lesson 11 covers how tokenization affects your actual prompt budget.
You’re processing a 100-page PDF in a Python loop. Halfway through, the API throws a context-limit error. You had no idea how many tokens it was before you sent it. That’s what token counting solves, and it takes about 10 lines of code.
This post covers three approaches: tiktoken for OpenAI and GPT models, the Gemini API’s native counting endpoint, and quick character-based estimates when you don’t want to make an API call. Each takes about five minutes to wire up.
If you’re fuzzy on what tokens actually are (they’re not characters, and they’re not words), Tokens Explained: How LLMs Read and Write covers the mechanics. This post assumes you already know what a token is and want to count them before sending.
Why count tokens before you send?
Three situations where this matters in real work.
Context window limits. Every model has a maximum input length. Gemini 2.5 Flash supports 1,048,576 tokens; GPT-4o supports 128,000. Exceed the limit and you get an error or silent truncation, depending on the provider. Counting first means you know before the API tells you.
Cost estimation. Most providers charge per input token on paid tiers. If you’re processing hundreds of documents in a loop, off-by-5× token estimates produce off-by-5× bills. Counting lets you project costs before running a job, not after.
Rate limit planning. Tokens-per-minute (TPM) limits are separate from requests-per-minute (RPM). If you’re batching calls, knowing each call’s token count tells you how many you can send per minute without triggering a 429.
For a one-off test prompt, none of this matters. For anything that loops over user-supplied content, count first.
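The cost and rate-limit arithmetic above can be sketched in a few lines. This is a minimal illustration, not a billing tool: the function names are hypothetical, and the price and limits in the example are placeholders rather than any provider’s real numbers. Take the real values from your provider’s pricing and rate-limit pages.

```python
def project_input_cost(token_counts, usd_per_million_tokens):
    """Return (total_tokens, projected_usd) for a batch of prompts.

    usd_per_million_tokens is an assumed input price; look up the real rate.
    """
    total_tokens = sum(token_counts)
    projected_usd = total_tokens / 1_000_000 * usd_per_million_tokens
    return total_tokens, projected_usd


def max_calls_per_minute(tokens_per_call, tpm_limit, rpm_limit):
    """Return how many calls fit in one minute under both TPM and RPM limits."""
    if tokens_per_call <= 0:
        raise ValueError("tokens_per_call must be positive")
    calls_by_tokens = tpm_limit // tokens_per_call
    return min(calls_by_tokens, rpm_limit)


# Hypothetical batch: 300 documents at about 4,000 input tokens each,
# priced at a placeholder rate of $0.50 per million input tokens.
total, cost = project_input_cost([4_000] * 300, usd_per_million_tokens=0.50)
print(f"{total:,} input tokens, about ${cost:.2f}")

# Placeholder limits: 250,000 TPM and 60 RPM, with 8,000-token calls.
calls = max_calls_per_minute(tokens_per_call=8_000, tpm_limit=250_000, rpm_limit=60)
print(f"Up to {calls} calls per minute")
```

With large prompts the token limit binds first; with small prompts the request limit does. Some providers also count output tokens toward TPM, so treat the result as an upper bound.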
tiktoken: the tool for GPT and OpenAI models
tiktoken is OpenAI’s official Python library for their BPE tokenizer. It runs entirely locally, no API calls needed. Install it once:
pip install tiktoken
Basic usage to count tokens for a string:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
text = "The quick brown fox jumps over the lazy dog."
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")
# Token count: 10
For model-specific encoding that handles version changes automatically:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
text = "What is retrieval-augmented generation and how does it work with vector databases?"
print(f"Token count: {len(enc.encode(text))}")
# Token count: 15
In production code, use encoding_for_model() instead of hardcoding an encoding name. Different models use different vocabularies, and the wrong encoding gives you numbers that are off by 10-20%. You don’t want to discover that during a production outage.
Three encodings you’ll encounter:
| Encoding | Models |
|---|---|
| cl100k_base | GPT-4, GPT-4 Turbo, GPT-3.5-turbo, text-embedding-ada-002 |
| o200k_base | GPT-4o, GPT-4o mini, o1, o3 series |
| r50k_base | GPT-2-style vocabulary: GPT-3 base models such as davinci, text-davinci-001 |
When you’re unsure which encoding to pick, encoding_for_model(model_name) figures it out.
Counting messages, not just strings
In a real OpenAI API call, you send an array of messages, not a raw string. That format adds token overhead beyond the text itself:
import tiktoken
def count_message_tokens(messages, model="gpt-4o"):
    enc = tiktoken.encoding_for_model(model)
    # Each message adds 3 overhead tokens (role, separator, end marker)
    num_tokens = 0
    for message in messages:
        num_tokens += 3
        for key, value in message.items():
            num_tokens += len(enc.encode(value))
    num_tokens += 3  # every reply is primed with 3 tokens
    return num_tokens
messages = [
    {"role": "system", "content": "You are a document summarizer. Be concise."},
    {"role": "user", "content": "Summarize the following text in three sentences."},
]
print(count_message_tokens(messages))
# 28
The overhead per message is small (about 3 tokens), but it adds up when you’re maintaining a long conversation history. If you’re trying to fit a multi-turn chat within a context limit, count the full message array, not just the latest user message.
The OpenAI cookbook maintains an updated version of this function as the API format evolves. That’s the reference to bookmark and recheck when you upgrade model versions.
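Fitting a multi-turn chat within a context limit usually means dropping old turns until the full array fits. Here is one possible sketch. The helper names `trim_history` and `rough_count` are hypothetical, and `rough_count` is a character-based stand-in so the example runs without any library; in practice you would pass a tiktoken-based counter that takes the whole message list.

```python
def trim_history(messages, max_tokens, count_tokens):
    """Drop the oldest non-system messages until the array fits the budget.

    count_tokens: callable that takes the full message list and returns
    its token count.
    """
    trimmed = list(messages)
    while count_tokens(trimmed) > max_tokens:
        # Remove the oldest message that is not a system prompt.
        for i, message in enumerate(trimmed):
            if message["role"] != "system":
                del trimmed[i]
                break
        else:
            raise ValueError("Cannot fit the conversation within the token budget")
    return trimmed


def rough_count(messages):
    # Stand-in counter: about 4 characters per token, 3 overhead tokens
    # per message, and 3 tokens for the reply primer.
    return sum(len(m["content"]) // 4 + 3 for m in messages) + 3


history = [
    {"role": "system", "content": "You are a document summarizer."},
    {"role": "user", "content": "x" * 400},
    {"role": "assistant", "content": "y" * 400},
    {"role": "user", "content": "Summarize the last reply."},
]
trimmed = trim_history(history, max_tokens=150, count_tokens=rough_count)
print([m["role"] for m in trimmed])
```

Dropping the oldest turns first is only one policy. Summarizing old turns or keeping pinned messages are alternatives, and the counting step stays the same either way.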
Counting tokens with the Gemini API
Gemini doesn’t use tiktoken. Google has its own tokenizer and exposes it as a native API method. No extra library needed beyond the SDK you’re already using:
from google import genai
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
response = client.models.count_tokens(
    model="gemini-2.5-flash",
    contents="What is retrieval-augmented generation and how does it work with vector databases?"
)
print(f"Token count: {response.total_tokens}")
# Token count: 12
This makes a real API call, but it returns a count rather than a generated response, so no output tokens are produced. As far as Google’s pricing documentation indicates, token counting is not billed like a generation call, but it is still a network request that counts against your rate limits. Confirm the current terms on the pricing page before building it into a high-volume loop.
For a system prompt plus a user message:
from google import genai
from google.genai import types
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
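# The Gemini API accepts only "user" and "model" roles inside contents;
# system instructions are passed separately at generation time. Counting the
# system text as an extra user-role entry gives a close approximation.
response = client.models.count_tokens(
    model="gemini-2.5-flash",
    contents=[
        types.Content(
            parts=[types.Part(text="You are a document summarizer.")],
            role="user"
        ),
        types.Content(
            parts=[types.Part(text="Summarize this 5,000-word document.")],
            role="user"
        ),
    ]
)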
print(response.total_tokens)
If you’re setting up the Gemini SDK for the first time, How to Use the Gemini API in Python (Step by Step) covers installation and authentication. The count_tokens call fits into the exact same pattern.
Use count_tokens as a pre-flight check before sending large documents. A failed generation call wastes your quota and your time. A count call generates no output and tells you upfront whether you’re within limits.
Quick estimates without a library
Sometimes you don’t have tiktoken installed and you don’t want to make an API call just to get a rough number. These approximations work well enough for planning:
| Content type | Rough estimate |
|---|---|
| English prose | 1 token ≈ 4 characters |
| English prose | 1 token ≈ 0.75 words |
| Python / JavaScript code | 1 token ≈ 3 characters |
| Non-English (most languages) | 2-5× more tokens than equivalent English |
| JSON | Similar to code, roughly 3-4 characters per token |
For a 10,000-character English document, expect roughly 2,500 tokens. That’s comfortable inside a 128K context window. A 200,000-character document comes to roughly 50,000 tokens, which still fits but is large enough that you should verify it with a real count first.
Non-English text is where estimates break down fast. A paragraph in Hindi or Arabic can use 2-5× as many tokens as the equivalent English text. If you’re building anything multilingual, count rather than estimate.
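The ratios in the table can be wrapped in a small helper for planning. This is a rough sketch: the name `estimate_tokens` is hypothetical, and the 3.5 figure for JSON is simply the midpoint of the 3-4 range above. It deliberately leaves out non-English text, where a fixed ratio is not reliable.

```python
import math

# Rough characters-per-token ratios taken from the table above.
# The JSON value is an assumed midpoint of the 3-4 range.
CHARS_PER_TOKEN = {"english": 4.0, "code": 3.0, "json": 3.5}


def estimate_tokens(text, content_type="english"):
    """Return a rough token estimate from character count alone."""
    ratio = CHARS_PER_TOKEN[content_type]
    return math.ceil(len(text) / ratio)


document = "a" * 10_000  # stand-in for a 10,000-character English document
print(estimate_tokens(document))
```

Rounding up keeps the estimate on the cautious side, which is what you want when the number decides whether a prompt fits.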
One thing that trips people up
tiktoken’s encode() returns a list of integers (raw token IDs), not a list of word strings. If you want to see the actual text chunks the tokenizer created:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
text = "vector database"
for token_id in enc.encode(text):
    print(repr(enc.decode([token_id])))
# 'vector'
# ' database'
“vector database” tokenizes to 2 tokens. “vector databases” is also 2 tokens. But “vectordatabase” (no space) becomes 3 tokens. Whitespace boundaries matter in ways you wouldn’t predict just by looking at the text. Run the decoder on your key phrases once. It makes your token estimates more accurate and often explains strange model behavior around compound technical terms.
Token counting in production
Once you’re counting, the pattern is: count, decide, then send. Not: send, catch the error, investigate.
from google import genai
def safe_generate(client, model, contents, max_tokens=100_000):
    """Count before generating. Raise early if over limit."""
    token_count = client.models.count_tokens(
        model=model,
        contents=contents
    ).total_tokens
    if token_count > max_tokens:
        raise ValueError(
            f"Prompt is {token_count} tokens, exceeds {max_tokens} limit"
        )
    return client.models.generate_content(model=model, contents=contents)
Three places where this pattern belongs in a real app:
- Before processing user-uploaded content. If your app accepts document uploads, count the tokens before sending. Show a warning or truncate intelligently rather than returning a cryptic API error.
- In batch processing loops. Count each item, track the running total, and pause or throttle when you’re approaching the TPM limit.
- For cost estimation dashboards. Sum input tokens across a session to show estimates or enforce per-user usage caps. Useful if you’re running on paid-tier Gemini Pro and want to prevent runaway costs.
The pattern adds about 10 lines per call point. It prevents a class of production errors that are annoying to debug and expensive to discover at scale.
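For the batch-loop case, tracking a running total against the TPM limit might look like the sketch below. The class name `TokenBudget` is hypothetical, the limit in the example is a placeholder, and the fixed one-minute window is a simplification: providers may enforce rolling windows, so leave some headroom.

```python
import time


class TokenBudget:
    """Track tokens sent in the current one-minute window."""

    def __init__(self, tpm_limit, clock=time.monotonic, sleep=time.sleep):
        self.tpm_limit = tpm_limit
        self.clock = clock
        self.sleep = sleep
        self.window_start = clock()
        self.used = 0

    def reserve(self, tokens):
        """Wait until `tokens` fit in the current window, then record them."""
        if tokens > self.tpm_limit:
            raise ValueError("Single call exceeds the per-minute limit")
        elapsed = self.clock() - self.window_start
        if elapsed >= 60:
            # A new window has started; reset the running total.
            self.window_start = self.clock()
            self.used = 0
        elif self.used + tokens > self.tpm_limit:
            # Wait out the rest of the window, then start a fresh one.
            self.sleep(60 - elapsed)
            self.window_start = self.clock()
            self.used = 0
        self.used += tokens


# Placeholder limit of 250,000 TPM; counts would come from a real counter.
budget = TokenBudget(tpm_limit=250_000)
for count in [12_000, 30_000, 8_000]:
    budget.reserve(count)  # call this right before each generation request
print(budget.used)
```

The clock and sleep functions are injectable so the throttling logic can be tested without waiting a real minute.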
Try It Yourself
TinkerLLM’s tokens lesson walks through how tokenization works in practice, including why the same 100 words produce very different token counts in different languages and how to estimate your prompt budget before you hit a limit.
Open Lesson 11: Tokens: The Atomic Unit of Every LLM Interaction →
The exercises run against a live Gemini API call from your browser. You’ll see tokenization affect real model behavior. First 50 exercises are free, no card needed. You’ll need a free Gemini API key from Google AI Studio to run them.
FAQ
Does tiktoken work for Gemini or Claude?
No. tiktoken is OpenAI-specific. For Gemini, use the native count_tokens method from the Google Gen AI SDK (shown above). For Anthropic’s Claude, there’s a built-in client.messages.count_tokens() method on the Anthropic SDK. For providers without a native counter, the 4-characters-per-token approximation is your fallback for English text.
How many tokens can Gemini 2.5 Flash handle?
The context window is 1,048,576 tokens (1 million). In practice, very long contexts can degrade for tasks requiring attention across the full document. For most production workloads, keeping individual calls under 200,000 tokens is a reasonable target. Check the Gemini API model reference for the current official limits, since these change as Google updates models.
Does the Gemini countTokens call use quota?
It counts against your request rate limits, because it is a real API call. It produces no output tokens, and Google’s pricing documentation does not appear to list a separate charge for counting, though you should confirm that on the current pricing page. For routine pre-flight checks before large document calls, this is fine. If you’re calling it inside a tight loop on thousands of small strings, cache the results.
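Caching counts for repeated strings can be as simple as wrapping the counter in functools.lru_cache. In this sketch, `make_cached_counter` is a hypothetical helper and `fake_count` stands in for a real count_tokens request so the example runs offline.

```python
from functools import lru_cache


def make_cached_counter(count_fn, maxsize=4096):
    """Wrap a token-counting function so repeated strings are counted once."""
    @lru_cache(maxsize=maxsize)
    def cached(text):
        return count_fn(text)
    return cached


calls = []


def fake_count(text):
    # Stand-in for a real count_tokens request; records each call made.
    calls.append(text)
    return len(text) // 4


count = make_cached_counter(fake_count)
count("same prompt")
count("same prompt")
print(len(calls))  # only the first lookup reaches the counter
```

The cache key is the exact string, so this helps only when identical text recurs, such as a shared system prompt or template.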
Is tiktoken free to use?
Yes. tiktoken is open source under the MIT license and available on PyPI. It runs locally with no API calls. The encoding vocabulary files download from OpenAI’s servers on first use and cache locally after that. No API key, no usage cost, no account required.
Why does non-English text use more tokens?
Tokenizer vocabularies are built from training data, and English dominates most LLM pretraining datasets. The tokenizer has a fine-grained vocabulary for common English words, so most of them map to a single token. For underrepresented languages, the tokenizer falls back to smaller subword units and sometimes byte-level chunks, needing more tokens to represent the same meaning. A sentence that costs 12 tokens in English might cost 30-50 tokens in some other languages. This is a known issue, and model providers are actively improving tokenizer coverage for non-English languages.
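The byte-level fallback is easy to see from the raw encoding. When a byte-level tokenizer has no learned merges for a piece of text, it can end up emitting close to one token per UTF-8 byte, and many scripts need more bytes per character than English does. The sketch below only measures bytes, not tokens, so read it as a worst-case intuition; `utf8_bytes_per_char` is a hypothetical helper.

```python
def utf8_bytes_per_char(text):
    """Return the average number of UTF-8 bytes per character."""
    return len(text.encode("utf-8")) / len(text)


english = "hello"
hindi = "नमस्ते"  # a Hindi greeting

print(utf8_bytes_per_char(english))
print(utf8_bytes_per_char(hindi))
```

Real tokenizers merge frequent byte sequences, so actual counts land well below one token per byte. To get the true number for your text, count with the model’s own tokenizer.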
You’ve got tiktoken installed and the Gemini count endpoint wired up. The logical next step is seeing how token budget interacts with real model behavior. TinkerLLM Lesson 11 runs those exercises from your browser, with live Gemini API calls. First 50 exercises are free.
Delivery lead at Kalvium Labs with a background in instructional design. Writes concept explainers and process posts. Thinks about how people actually learn before jumping to solutions.