LLM Cost Optimization: Cut Your API Bill

Your staging environment ran 8,000 API calls over a weekend of testing. The bill was $4.12. You moved to production last Tuesday. The bill for that first week: $340. Same code. More users. The cost curve was steeper than the user curve.

LLM cost optimization isn’t about being cheap. It’s about the math working at scale. A prompt that costs $0.004 per call feels free. That same prompt fired 100,000 times a month costs $400. Add retrieval calls, embedding requests, and output tokens, and you’ve built something that quietly scales linearly with every new user.

Here are the 6 levers that actually move the number, in order of impact.

Why LLM Costs Surprise Developers

Token-based pricing is multiplicative, and most developers don’t feel it until it’s too late.

You pay for input tokens (your prompt plus context), output tokens (the model’s response), and the per-token rate varies by model tier. A Gemini Pro request with a 2,000-token prompt and a 500-token response costs around $0.013. That same request on Gemini Flash costs around $0.0013. One-tenth the price.

Do that 500,000 times a month, a modest production workload, not a large one, and you’re looking at $6,500 versus $650. The product behavior is the same. The model choice is different.

The second surprise is context accumulation. A chat session that starts with a 50-token message becomes a 3,000-token context window after 20 turns of history. If you’re including the full conversation in every API call, your input token count grows with every exchange, even if the new message is five words. Most developers notice this only after their first real cost report.

1. Match the Model to the Task

This is the highest-leverage change you can make, and it doesn’t require new infrastructure.

Not every API request needs your most capable model. Classify this email as spam or not-spam doesn’t need Gemini Pro. Translate this address doesn’t need GPT-4o. Summarize this paragraph into two sentences doesn’t need Claude 3.5 Sonnet.

Route requests by complexity:

Task type	Appropriate model tier
Classification, yes/no, extraction	Flash / Haiku / GPT-4o Mini
Summarization, translation, reformatting	Flash / Sonnet
Complex reasoning, code review, multi-step logic	Pro / Sonnet / GPT-4o
Long-document analysis with retrieval	Pro + caching

The 10× price difference between Flash and Pro tiers means you can afford to use the expensive tier only where it genuinely matters. On most production workloads, 60-70% of requests are simple enough for the cheapest tier. Routing them there cuts your bill before you touch anything else.

LiteLLM is a clean open-source proxy for doing this programmatically. It gives you a unified API surface across OpenAI, Anthropic, and Google, and lets you set routing rules per request type. It also handles fallbacks. If Flash fails on a complex task, you route to Pro as a fallback rather than returning an error.

2. Measure Tokens Before You Scale

You can’t optimize what you haven’t measured.

Most developers guess at their token counts. They’re usually wrong by a factor of 1.5 to 3. A system prompt that “feels like 200 tokens” often runs 400-600 once you factor in instruction padding, JSON schema overhead, and how the specific tokenizer treats your vocabulary.

Before you move a prompt to production, count it. The How to Count Tokens Before Sending a Prompt post covers the tools: tiktoken for OpenAI models, the Google Generative AI SDK’s count_tokens() method for Gemini, and Anthropic’s token counter endpoint.

Count both the prompt and a representative sample of outputs. Average output length is often the bigger unknown, especially for generation tasks where you haven’t capped verbosity.

Set a max_tokens (or max_output_tokens) limit on every call. If you don’t, you’re letting the model decide how much output to produce. That’s fine for open-ended generation. For structured extraction tasks where the answer is at most 50 tokens, it’s unnecessary spend.

3. Compress the Prompt

Most production prompts are 20-40% longer than they need to be.

Common sources of bloat:

Repeated instructions spread across a long system prompt and the user message
Verbose formatting guidance (“Please provide your answer in the following structured format…”) that a schema would replace
Full document text when only a section is relevant to the query
Old conversation turns that haven’t been pruned

You don’t compress by removing precision. You compress by removing redundancy. A system prompt with “You are a helpful assistant. Always be polite. Be concise. Format your answers clearly. Do not use jargon.” can often become “Be concise and plain-language.” if your users have already been filtered for appropriate contexts.

For RAG applications, fetch only the chunks you’ll actually use. A 10-chunk retrieval where 7 chunks aren’t relevant to the query is token spend that doesn’t help the answer. Better retrieval precision is prompt compression by another name.

You can also instruct the model to be brief explicitly. “Respond in 2-3 sentences unless more detail is necessary” changes output length without changing quality for most extraction and classification tasks.

4. Use Caching

Caching is the closest thing to free cost reduction in the LLM stack. You’re reusing work you already paid for.

Two types worth using:

Exact-match caching. If two requests are identical, return the cached response. This catches repeated system prompts, static lookup queries, and identical user inputs, which are more common than you’d expect in structured applications. A Redis layer in front of your API client handles this at the application level.

Native prompt caching. Anthropic’s prompt caching feature lets you cache the beginning of your context, typically your system prompt, and pay 90% less for those cached input tokens on subsequent requests. Google’s Gemini API has equivalent context caching. If your system prompt is 2,000 tokens and it’s the same for every user session, you’re paying 90% less on the biggest chunk of your input cost from the second request onward.

The caveat: cached responses can go stale. A product description cached at 9am may be wrong after a 10am catalog update. Set cache TTLs based on how frequently your underlying data changes. Stale output costs more in support and user trust than the token savings are worth.

5. Trim the Context Window

Every token in your context window costs money, including history the model doesn’t need.

For conversational applications, the most common cost mistake is including all previous turns in every API call. A 20-turn conversation where each turn averages 100 tokens means your 21st call includes 2,000 tokens of history before you’ve written the new message. That history cost scales linearly with session length.

Two approaches that work:

Sliding window. Keep only the last N turns (8-10 is usually enough for coherence). Drop the oldest turns as new ones arrive. Simple to implement, effective for most use cases.

Rolling summary. Compress old turns into a short summary, then include the summary plus the recent raw turns. You get more context at fewer tokens. The model can reference older information via the summary without you paying per-turn for all of it.

The Context Windows Explained: Why Your Long Prompt Gets Cut Off post covers how context limits work at the model level. Understanding the limit helps you see why trimming aggressively matters beyond cost. Context limits mean your expensive old history gets dropped anyway once the window fills. You might as well drop it intentionally and cheaply.

6. Batch Non-Urgent Requests

Real-time API calls are for interactive use cases. Not everything you’re calling is interactive.

Tagging a batch of uploaded documents? Not interactive. Generating product description variants for your catalog overnight? Not interactive. Summarizing support tickets from last week? Not interactive.

OpenAI’s Batch API processes asynchronous workloads at 50% of the standard price. You submit a batch, get it back within 24 hours. For anything that doesn’t need a response in under a few seconds, this halves your cost on that workload without any quality trade-off.

Google’s batch prediction endpoints and Anthropic’s message batches offer comparable savings. The pattern is the same: queue non-urgent work, batch it, process overnight, use the results the next day.

Actual savings compound with workload size. A team processing 100,000 document summaries per week at $0.002 per summary saves $100/week from batching alone. That’s $5,200/year on one workflow.

The Trade-offs That Come With Every Cut

LLM cost optimization isn’t cost-free. Every lever involves a trade-off.

Routing to a cheaper model produces worse output on complex tasks. You need to benchmark before routing, not after. Run your actual prompts through both model tiers, compare accuracy on your real test cases, then decide. “Flash is fine for this” should be an empirical statement, not an assumption.

Caching introduces staleness risk. Cache TTLs need to be shorter than your data update frequency, not longer. Getting this wrong means users get stale answers, and support volume goes up.

Context trimming can remove information the model needed. If your sliding window drops a user preference stated 12 turns ago, you’ll get an answer that ignores it. Test conversational coherence after trimming, especially for support or personalization use cases.

Prompt compression can strip instructions that were load-bearing even if they looked redundant. Test compressed prompts on your edge cases, not just your happy path.

None of this means don’t optimize. It means measure first, then optimize, then verify the quality held.

Try It Yourself

LLM cost optimization has hands-on exercises in Module 3.

Open Unit 29: Cost & Token Optimization →

Unit 29 covers: measuring token counts on real prompts, comparing output quality across model tiers, configuring prompt caching, and benchmarking before routing. Bring your own Gemini API key from Google AI Studio (it’s free). TinkerLLM has 176 exercises across 23 units. Module 1 (50 exercises) is free, no card needed.

FAQ

How much can LLM cost optimization realistically save?

For apps that haven’t been optimized, 50-80% is achievable. The biggest single win is usually model routing. If you’re using a Pro-tier model for everything, routing 60-70% of requests to Flash typically cuts your total API bill in half before you change anything else. The second-biggest win is caching your system prompt with Anthropic or Google’s native caching, especially if your system prompt is over 1,000 tokens and consistent across sessions.

Which is cheaper: GPT-4o, Claude 3.5, or Gemini Pro?

All three sit in roughly the same pricing tier for high-quality reasoning. But pricing changes often enough that the only reliable answer is to check the current pages directly: OpenAI pricing, Anthropic pricing, Google AI pricing. For production decisions, run your actual prompts through all three on a sample of real inputs and compare cost-per-correct-output, not just cost-per-token. Output quality differences can mean a cheaper model costs more in practice if it requires more retries or human review.

What’s the difference between prompt caching and output caching?

Prompt caching stores the beginning of your context (typically the system prompt) at the infrastructure level, so you pay less for those input tokens on subsequent requests that share that same prefix. Output caching stores the model’s response so identical requests return the cached result without a new API call. Both reduce cost, but they work at different layers. Prompt caching reduces input spend. Output caching eliminates the API call entirely for repeated identical requests. You can stack both.

Does TinkerLLM teach API cost management?

Yes. Unit 29 (Cost and Token Optimization) in Module 3 covers caching, model routing, and prompt compression through hands-on exercises with real API calls. The full curriculum has 176 exercises across 23 learning units and 3 modules. Module 1 (50 exercises, covering prompt engineering foundations) is free with your Gemini API key. Full access is ₹499 / $9 lifetime.

When should you not optimize for cost?

When quality loss isn’t acceptable for a specific task. Don’t route legal document analysis or safety-critical classification to a cheaper model because you want to cut costs. Don’t compress a system prompt that enforces safety guardrails. Don’t trim context from a conversation where remembering user preferences matters. Optimize where the quality trade-off is acceptable. For tasks where it isn’t, pay for the right model and treat it as a product decision, not a waste.

Stop guessing at your API bill. TinkerLLM’s Unit 29 runs real cost comparison exercises: prompt caching, model routing, token measurement, with your own Gemini key and real API calls. The first 50 exercises are free, no card needed.