Context Windows Explained: Why Your Long Prompt Gets Cut Off
Context windows are your token budget per LLM call: input plus output combined. Here's how limits vary across models and four ways to stay under them.
TL;DR
- A context window is your total token budget for one LLM call: input tokens plus output tokens combined.
- Gemini 2.5 Flash has a 1,048,576-token window. GPT-4o has 128,000. Claude 3.5 Sonnet has 200,000. These are not interchangeable.
- Exceeding the limit causes an error or silent truncation depending on the provider. You don't always get a warning.
- Output tokens draw from the same budget as input tokens. Plan both sides.
- Four fixes when content doesn't fit: chunk it, summarize first, extract selectively, or use RAG.
You pasted a 50-page contract into Gemini and got an answer. But it missed a clause that was clearly on page 43. Or you got a blunt error: “prompt is too long.” Or the model answered the first half of your question and stopped. All three of those trace back to the same thing: the context window, and what happens when you hit its edge.
Understanding how context windows work in LLMs doesn’t take long. But it changes how you structure prompts, how you architect applications, and how you debug the weird half-answers you get from long documents. I’ve watched this click for a lot of people after about 10 minutes of hands-on work. The confusion usually comes from one wrong assumption about what the limit actually covers.
What a context window actually is
A context window is your total token budget for one LLM interaction. Not just your input. It covers everything the model can “see” and generate in a single call: your system prompt, your conversation history, any examples you included, the document you pasted, and the model’s own response.
Tokens are the unit, not characters or words. One token is roughly 4 characters of English prose, or about three-quarters of a word. In a typical tokenizer, “Context” is one token and “Contextualization” is about three. Numbers, punctuation, and whitespace often become their own tokens. A 10,000-word document is around 13,000 tokens of input.
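The rules of thumb above can be turned into a quick estimator. This is a rough sketch, not a tokenizer: the function names are made up for illustration, and the 0.75 words-per-token ratio is an assumption taken from the table further down (128,000 tokens ≈ 96,000 words). Real counts vary by model and by text.

```python
import math

CHARS_PER_TOKEN = 4      # rough rule of thumb for English prose
WORDS_PER_TOKEN = 0.75   # assumed ratio: 96,000 words ~ 128,000 tokens

def estimate_tokens_from_chars(text: str) -> int:
    """Estimate the token count from character length (heuristic only)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate the token count from a word count (heuristic only)."""
    return math.ceil(word_count / WORDS_PER_TOKEN)

print(estimate_tokens_from_chars("a" * 400))   # 100
print(estimate_tokens_from_words(96_000))      # 128000
```

Treat the result as a ballpark for planning. For anything close to a limit, use the provider's own token counter.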
So when a model says it has a 128,000-token context window, the combined total of everything you send plus everything it returns can’t exceed 128,000 tokens. Send 125,000 tokens of input and the model only has 3,000 tokens left to respond with. That’s usually where I see the confusion happen. People think they have room, then wonder why the response is truncated or unusually terse.
If you want to see the exact count for your specific text before sending, “How to Count Tokens Before Sending a Prompt” covers the tiktoken and Gemini API approaches in about 10 lines of code each.
Context window sizes across major models (2026)
These change when providers update their models. The Gemini API model reference and OpenAI models page are the authoritative sources. The table below reflects mid-2026 defaults:
| Model | Context window | Rough word equivalent |
|---|---|---|
| Gemini 2.5 Pro | 1,048,576 tokens | ~700,000 words |
| Gemini 2.5 Flash | 1,048,576 tokens | ~700,000 words |
| Claude 3.5 Sonnet | 200,000 tokens | ~150,000 words |
| Claude 3 Opus | 200,000 tokens | ~150,000 words |
| GPT-4o | 128,000 tokens | ~96,000 words |
| Llama 3.3 70B | 128,000 tokens | Varies by host |
Gemini’s 1-million-token window is genuinely unusual. Most of the field was at 8,000-32,000 tokens a few years ago. But a large context window doesn’t mean the model handles all parts of a long document equally well. That’s the gotcha you learn after you start relying on it for real work.
What actually happens when you exceed the limit
Different providers handle overflow differently. And none of them do it gracefully.
Error on submission. OpenAI returns a clear 400 error with a message like: “This model’s maximum context length is 128,000 tokens. Your messages resulted in 131,244 tokens.” You can catch this programmatically and handle it. At least you know it happened.
Silent truncation. Some providers and wrappers quietly drop the end of your input to fit. The model generates based on the truncated version and doesn’t tell you what it missed. If the important part of your document was at the end, you’ll get a confident, well-formatted, and wrong answer. This is the worst failure mode because you don’t know it happened unless you already know the right answer.
Degraded attention on long inputs. Even when the model accepts your full document, its attention isn’t uniform. Research on long-context models consistently shows reduced accuracy for content in the middle of very long inputs. You can technically fit a 700-page document into Gemini 2.5 Flash’s context window. But don’t assume page 1 and page 350 receive equal attention. In practice, critical information buried in the middle of large inputs gets lost more often than critical information at the start or end.
Context length is a ceiling, not a quality guarantee. I try to remind myself of this every time I’m tempted to just paste a big document and hope for the best.
The output token trap most people miss
Here’s the part that trips people up regularly. I’ve seen this catch developers in production more than once, including myself the first time I built a document analysis tool.
Output tokens come out of the same budget as input tokens. If you send 950,000 tokens to Gemini 2.5 Flash, the model only has 98,576 tokens left to respond. That sounds like a lot, but a detailed analysis of a 700-page document might need more than that, especially if you’re asking for a structured output with multiple sections.
Most providers set a separate maximum output token limit on top of the context window cap. For Gemini 2.5 Flash, the default max output is 8,192 tokens, though you can raise it to 65,536 via the API. For GPT-4o, the max output is 16,384 tokens.
The real calculation for a production application:
Available output = min(Max output limit, Total context window - Input tokens)
If your system prompt is 2,000 tokens, your document is 80,000 tokens, and your query is 50 tokens, you’ve used 82,050 tokens of input. On GPT-4o’s 128K window, that leaves 45,950 tokens of room, so the 16,384-token output cap is the limit you’d actually hit. Comfortable. But swap in a 120,000-token document and you’ve got 5,950 tokens for output. The model won’t necessarily error. It’ll just cut off.
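Here is the same arithmetic as a small helper. It's a minimal sketch: `output_budget` is a hypothetical name, and the GPT-4o figures are the ones quoted in this article, so check them against the provider's current documentation before relying on them.

```python
def output_budget(context_window: int, input_tokens: int, max_output: int) -> int:
    """Tokens the model can actually generate for this call.

    The budget is limited by whichever is smaller: the room left in the
    context window, or the provider's separate max-output cap.
    """
    remaining = context_window - input_tokens
    if remaining < 0:
        raise ValueError(f"Input exceeds the context window by {-remaining:,} tokens")
    return min(remaining, max_output)

GPT4O_CONTEXT = 128_000     # figure quoted in this article
GPT4O_MAX_OUTPUT = 16_384   # figure quoted in this article

small_input = 2_000 + 80_000 + 50    # system prompt + document + query
large_input = 2_000 + 120_000 + 50

print(output_budget(GPT4O_CONTEXT, small_input, GPT4O_MAX_OUTPUT))  # 16384
print(output_budget(GPT4O_CONTEXT, large_input, GPT4O_MAX_OUTPUT))  # 5950
```

In the first case the output cap is the binding limit. In the second, the context window is.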
Always plan both sides of the budget. I keep this in mind as a checklist item before I finalize any prompt for a production feature.
Four approaches when your content doesn’t fit
When the input is too long, you have four options. Which one works depends on your task.
1. Chunk and process separately. Split the document into pieces small enough to process individually. Run each piece, then combine the outputs. Works well for summarization or extraction where the answer doesn’t need the whole document in context simultaneously. The failure mode: information that spans chunk boundaries gets fragmented. A contract clause that starts on page 10 and ends on page 11 might get split, with each half interpreted without the other.
2. Summarize the non-essential sections first. If 80% of your document is background context and 20% is what the model actually needs, summarize the background in a first call, then combine that summary with the critical section for the second call. Two calls instead of one. Usually produces better results than sending the raw 80% as tokens the model barely uses.
3. Extract selectively before sending. You often don’t need the whole document. If you’re asking “what are the payment terms in this contract?”, you don’t need all 80 pages. Extract the relevant section with string matching, regex, or fuzzy search, and send only that. For structured documents with clear section headers, this is faster and more reliable than chunk-and-combine.
4. Use RAG for repeated queries over large document sets. For applications where users query large corpora repeatedly, RAG is the right architecture. Documents are embedded into a vector database. Each query retrieves only the relevant chunks. Only those chunks go into the context window. The model never processes the full corpus, just the pieces that match the current query. “What is RAG?” covers the full architecture if you want to understand it before building.
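Option 1 can be sketched in a few lines. This version adds overlapping chunks, which is one common way to soften the boundary problem described above, not something the approach requires. It splits on words as a stand-in for tokens, and the function name and default sizes are illustrative assumptions.

```python
def chunk_words(text: str, chunk_size: int = 1_000, overlap: int = 100) -> list[str]:
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words, so a clause that straddles
    a boundary appears whole in at least one chunk (if it is short enough).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = " ".join(str(i) for i in range(25))
for chunk in chunk_words(sample, chunk_size=10, overlap=2):
    print(chunk)
```

Word counts only approximate token counts, so leave headroom when picking `chunk_size` for a real model.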
For one-off large documents, I always start with option 3 before adding the complexity of RAG. It’s usually enough, and it’s much faster to implement.
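A minimal sketch of option 3 for a document with numbered section headings. The heading pattern (`2. Payment Terms` on its own line) is an assumption about the document's layout, and `extract_sections` is a hypothetical helper. Adjust the regex to match whatever your documents actually use.

```python
import re

# Assumed heading style: a number, a dot, then the title, on its own line.
HEADING = re.compile(r"^(\d+)\.\s+(.+)$", re.MULTILINE)

def extract_sections(document: str, keyword: str) -> list[str]:
    """Return every section whose heading contains the keyword (case-insensitive)."""
    matches = list(HEADING.finditer(document))
    sections = []
    for i, match in enumerate(matches):
        # A section runs from its heading to the next heading (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(document)
        if keyword.lower() in match.group(2).lower():
            sections.append(document[match.start():end].strip())
    return sections

contract = (
    "1. Parties\nAcme and Beta.\n\n"
    "2. Payment Terms\nNet 30 days.\nLate fee 2%.\n\n"
    "3. Termination\nEither party may terminate.\n"
)
print(extract_sections(contract, "payment"))
```

Note the limitation: a numbered list inside a section would also match this pattern, so a real document may need a stricter heading rule.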
A quick pre-flight check in Python
Before sending a large document, a token count call takes about 200ms and tells you what you’re working with:
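```python
from google import genai

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

# Replace with the text you plan to send.
your_document_text = "..."

# Count input tokens without running a generation call.
response = client.models.count_tokens(
    model="gemini-2.5-flash",
    contents=your_document_text,
)

max_context = 1_048_576  # Gemini 2.5 Flash context window
max_output = 65_536      # highest max-output setting for Gemini 2.5 Flash

# The output budget is capped by both the remaining window and max_output.
available_for_output = min(max_output, max_context - response.total_tokens)

print(f"Input tokens: {response.total_tokens:,}")
print(f"Available for output: {available_for_output:,}")

if available_for_output < 2_000:
    print("Warning: very limited output budget. Consider chunking.")
```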
You’re using input-token quota for the count call, but you’d use the same quota on the generation call anyway. The count call just tells you upfront whether the generation call will produce what you expect. I think of it as a pre-flight check: 200ms and a tiny bit of quota to avoid a silent failure at the worst possible time.
For Claude instead of Gemini, the Anthropic messages count tokens API works the same way.
Try It Yourself
The context window concept is one of those things that clicks when you observe it with real numbers rather than just read about it.
You’ve got the mental model. Now see the numbers in action. TinkerLLM Lesson 12 runs real prompts against Gemini from your browser. Bring your own Gemini API key (free from Google AI Studio), and the exercises run live.
Open Lesson 12: Context Windows: Memory, Limits & What Gets Forgotten →
FAQ
What’s the context window for Gemini 2.5 Flash?
1,048,576 tokens, which is roughly 700,000 words of English text. That’s enough to hold about 5 average novels or a 2,000-page PDF simultaneously. In practice, the ceiling you’ll hit before the context limit is usually the max output token setting (8,192 by default, up to 65,536 via API) or degraded attention on very long inputs. For documents under 100 pages, the context window itself won’t be your bottleneck.
Does the free tier have a smaller context window?
No. The context window size is identical on free and paid tiers for Gemini. What differs is rate limits: tokens-per-minute (TPM) and requests-per-minute (RPM). On the free tier, you can’t process a 1,000,000-token document in under a minute anyway because you’d hit the TPM ceiling first. For occasional large document calls, the free tier handles them fine. The current limits are on the Gemini API rate limits page.
Why does my model answer based on the wrong part of the document?
This is usually the “lost in the middle” problem. Models tend to pay more attention to content near the beginning and end of a long input. Information in the middle of a very long document gets less weight during generation. One fix: put the most important content at the start of your prompt, before any large document. Another fix: don’t send the whole document. Extract only the relevant sections using string matching or keyword search before you pass anything to the model.
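The first fix is mostly a matter of prompt assembly order. A small sketch, assuming a hypothetical `build_prompt` helper and section labels of my own choosing:

```python
def build_prompt(question: str, key_excerpts: list[str], document: str) -> str:
    """Put the question and the most important excerpts ahead of the long document."""
    parts = [f"Question: {question}"]
    if key_excerpts:
        bullet_list = "\n".join(f"- {excerpt}" for excerpt in key_excerpts)
        parts.append(f"Most relevant excerpts:\n{bullet_list}")
    # The bulk material goes last, after everything the model must not miss.
    parts.append(f"Full document:\n{document}")
    return "\n\n".join(parts)

prompt = build_prompt(
    "What are the payment terms?",
    ["Net 30 days.", "Late fee 2%."],
    "(long contract text here)",
)
print(prompt)
```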
Is a larger context window always better?
Not necessarily. Larger context windows cost more in latency and in API billing on paid tiers. The lost-in-the-middle problem is more severe at extreme context lengths. And filling a 1-million-token window to answer a question that needs 500 words of context is wasteful, and the extra material can actually degrade results. For most real applications, the question isn’t “how much can I fit?” but “how little can I send while still getting the right answer?” Usually, less context and more selective extraction beats brute-force full-document stuffing.
How much does a 1-million-token call cost?
On Gemini 2.5 Flash’s paid tier, input pricing is roughly $0.15 per million tokens for prompts under 200K tokens and $0.90 per million for longer ones. A 1-million-token document call runs about $0.90-$1.00 in input costs, before any output tokens. On the free tier, there’s no monetary cost but strict rate limits mean you can’t send that volume in a short window. For comparison, GPT-4o costs about $2.50 per million input tokens and caps at 128K, so a 1M-token call isn’t possible there regardless of budget.
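The arithmetic behind that estimate, as a sketch. The rates and the 200K threshold are the rough figures quoted in this answer, hard-coded as assumptions. They are not pulled from any pricing API and may not match current pricing, so treat the output as illustrative.

```python
# Assumed rates, taken from the rough figures quoted above (USD per million input tokens).
PRICE_SHORT_PER_M = 0.15         # prompts under the threshold
PRICE_LONG_PER_M = 0.90          # prompts at or above the threshold
LONG_PROMPT_THRESHOLD = 200_000  # tokens

def input_cost_usd(input_tokens: int) -> float:
    """Estimate the input cost of one call; the whole prompt is billed at one rate."""
    if input_tokens < LONG_PROMPT_THRESHOLD:
        rate = PRICE_SHORT_PER_M
    else:
        rate = PRICE_LONG_PER_M
    return input_tokens / 1_000_000 * rate

print(f"${input_cost_usd(100_000):.3f}")    # $0.015
print(f"${input_cost_usd(1_000_000):.2f}")  # $0.90
```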
My application needs to process documents larger than any context window. What do I do?
RAG is the standard answer, but you don’t always need the full architecture. Start with chunking: split the document into sections, extract the chunks most relevant to the query (using keyword matching or a simple embedding search), and send only those. For most document Q&A tasks, sending the 3-5 most relevant paragraphs produces better results than trying to stuff the whole document in. And it costs a fraction of the API quota. RAG with a proper vector database makes sense when you’re doing this at scale, for many documents, many users, many queries per day.
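The keyword-matching version of this fits in a short function. This is a deliberately naive sketch: it scores paragraphs by how many distinct query words they contain, with no stop-word handling or embeddings, and `top_paragraphs` is a hypothetical name.

```python
import re

def words_in(text: str) -> set[str]:
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_paragraphs(document: str, query: str, k: int = 3) -> list[str]:
    """Return up to k paragraphs sharing the most words with the query, in document order."""
    query_terms = words_in(query)
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    scored = [(len(query_terms & words_in(p)), i, p) for i, p in enumerate(paragraphs)]
    # Keep only paragraphs with at least one match; break score ties by position.
    best = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))[:k]
    return [p for _, _, p in sorted(best, key=lambda s: s[1])]

doc = (
    "Payment is due within 30 days.\n\n"
    "The office is in Berlin.\n\n"
    "Late payment incurs a fee."
)
print(top_paragraphs(doc, "payment fee", k=2))
```

Common words like “the” will inflate scores on real documents, which is one reason to move to embedding search once this stops being good enough.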
Context windows are one of those concepts that saves you hours of debugging once you actually understand them. If you want to see the mechanics in action rather than just read about them, TinkerLLM Lesson 12 exercises run these exact scenarios from your browser.
Delivery lead at Kalvium Labs with a background in instructional design. Writes concept explainers and process posts. Thinks about how people actually learn before jumping to solutions.