
Gemini API Free Tier 2026: Limits and Rate Quotas

What the Gemini API free tier gives you in 2026: RPM, RPD, and TPM limits, how to handle 429 errors, and what breaks first when you scale.

Dharini S
May 7, 2026

TL;DR

  • Gemini's free tier has three limit types: RPM (requests per minute), RPD (requests per day), and TPM (tokens per minute). They fail differently and need different fixes.
  • As of 2026, Flash free tier is roughly 15 RPM and 1,500 RPD. Pro is tighter: around 5 RPM and under 200 RPD. Always check the official pricing page for current numbers.
  • A 429 error means you hit one of the three limits. Use exponential backoff with jitter, not immediate retries, or you'll make it worse.
  • Parallel requests and token-heavy prompts are the two patterns that hit limits fastest, usually before developers expect them to.
  • Moving to paid isn't about hitting quotas. It's about commercial use, data privacy, or proven usage you need to scale. Free tier is real capacity for learning and prototyping.

Your script worked fine in testing. Twenty calls, twenty clean responses, no problems. Then you wired it into something real, usage picked up, and at some point you started seeing 429 errors with no obvious pattern.

That’s when you learned the Gemini API free tier has limits. And that there are three different types of limits, each one failing in a different way.

This post explains what you’re working with on the free tier, what the numbers look like, and how to handle rate limits in code before they turn into silent failures in production.

What the Gemini API Free Tier Actually Is

Google calls it the Developer tier. It’s a real production API with genuine limits, not a sandbox that disappears after 30 days.

Two things distinguish it from paid tiers. First, your prompts and responses may be used for model improvement on the free tier. On paid tiers, Google doesn’t use your data for training. If you’re processing anything business-sensitive or building a commercial product, you need billing enabled regardless of whether you’re hitting quotas.

Second, the free tier explicitly excludes commercial use. If you’re charging users money or running a revenue-generating service, you need a paid plan. For learning, prototyping, or a side project that isn’t handling real user data, the free tier is genuinely capable. Going through all 247 TinkerLLM exercises won’t come close to daily limits at normal learning pace.

Gemini offers two main models on the free tier: Gemini 2.5 Flash and Gemini 2.5 Pro. Flash is faster and more generous with quotas. Pro is more capable but the free limits are much tighter. Most developers start with Flash and reach for Pro only when quality differences specifically justify it, like complex multi-step reasoning or long-document analysis.

The Three Limit Types: RPM, RPD, and TPM

This is where most 429 errors come from: there are three different ceilings, and from the outside they fail identically. All three return HTTP 429, but each calls for a different response.

RPM (Requests Per Minute)

This is the most visible limit. You can only make a certain number of API calls per minute, measured as a rolling window, not a fixed clock minute. As of early 2026, Gemini 2.5 Flash sits around 15 RPM on the free tier. Pro is roughly 5 RPM.

15 RPM sounds generous until you build something that processes items in a tight loop or fires parallel requests. If you send 20 items to asyncio.gather() at once, you’ll hit the RPM limit before the batch is halfway done.

RPD (Requests Per Day)

Even if you stay under the per-minute cap, there’s a daily ceiling. Free tier Flash allows roughly 1,500 requests per day. Pro is substantially lower.

RPD catches patterns that RPM misses. A script running at 5 requests per minute for 6 hours looks fine from a rate perspective but hits the daily limit around hour 5. You won’t know until the 429s start arriving.
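The arithmetic is worth sanity-checking before a long run. A quick sketch, using the illustrative free-tier figures from above (check the official docs for current limits):

```python
# Illustrative figures -- not official limits
rpd_limit = 1_500          # approximate free-tier Flash requests per day
rate_per_min = 5           # your script's steady request rate

minutes_to_limit = rpd_limit / rate_per_min
hours_to_limit = minutes_to_limit / 60

print(hours_to_limit)  # 5.0 -- a "slow" script still exhausts the day in 5 hours
```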

TPM (Tokens Per Minute)

This one surprises people the most. You can be well under the RPM limit and still get a 429 because your requests are large.

TPM counts the combined input and output tokens across all requests in a rolling minute. Free tier Flash is fairly generous on TPM (often over a million tokens per minute on paper). But if you’re sending large documents or context-heavy system prompts, individual calls can be large enough to push you against the TPM ceiling even at low request rates.

A script that summarizes 10,000-token documents will burn through TPM much faster than one sending short questions, even if the request count looks identical.
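A back-of-envelope budget check makes the difference concrete. The TPM figure and token counts below are illustrative assumptions, not official numbers:

```python
# How many calls per minute fit under a TPM ceiling?
TPM_LIMIT = 1_000_000       # illustrative free-tier Flash figure; check the docs

def calls_per_minute(input_tokens, output_tokens, tpm=TPM_LIMIT):
    """TPM counts input and output tokens combined."""
    return tpm // (input_tokens + output_tokens)

# Short Q&A: ~200 tokens in, ~300 out
qa_budget = calls_per_minute(200, 300)        # 2000 calls/min -- RPM binds first
# Document summarization: ~10,000 in, ~800 out
doc_budget = calls_per_minute(10_000, 800)    # 92 calls/min -- TPM is in play
```

For short prompts the RPM cap is the binding constraint by orders of magnitude; for document-sized prompts, TPM becomes a real ceiling at request rates a paid tier would otherwise allow.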

The Gemini API pricing page and rate limits documentation are the sources of truth for current numbers. They update periodically, so don’t trust secondhand figures, including the ones in this post, for anything you’re building against specific limits.

How to Read a 429 Error

When you hit a rate limit, the API returns HTTP 429 with a structured error body. The important part is the error message, which usually tells you which limit type was hit.

In Python with the google-genai SDK:

from google import genai
from google.genai import errors
import os

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

try:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents="Your prompt here"
    )
except errors.APIError as e:
    if e.code == 429:
        print(f"Rate limited: {e}")
        # The message usually specifies which quota was exceeded:
        # requests per minute, requests per day, or tokens per minute

The error message distinguishes RPM from RPD from TPM limits, which tells you whether you need a short backoff (RPM/TPM limits, which reset within a minute) or a long wait (RPD, which resets at midnight Pacific Time).

In Node.js with @google/genai:

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

try {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: "Your prompt here",
  });
} catch (err) {
  if (err.status === 429) {
    console.error("Rate limited:", err.message);
  }
}

What Breaks First (and Why)

Three patterns reliably hit rate limits before developers expect them.

Parallel requests. You have 50 documents to process. You use asyncio.gather() or Promise.all() and fire them all at once. Flash’s 15 RPM becomes 50 requests hitting in the same second. You’ll get 429s on most of the batch, and your retry logic will compound the problem if it isn’t careful.

The fix is a rate limiter in your async code. Keep a counter of requests made in the last 60 seconds, and sleep until space opens before firing the next one. Libraries like aiolimiter in Python or p-throttle in Node handle this cleanly. Don’t implement it from scratch if you can help it.
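The paragraph above describes the mechanism those libraries implement. Here is a minimal stdlib sketch of that rolling-window counter, useful for understanding what aiolimiter and p-throttle do under the hood (the class name and window sizes are illustrative, not either library's actual implementation):

```python
import asyncio
import time
from collections import deque

class RollingWindowLimiter:
    """Allow at most max_calls within any window_s-second span."""

    def __init__(self, max_calls: int, window_s: float):
        self.max_calls = max_calls
        self.window_s = window_s
        self.calls = deque()          # timestamps of recent calls
        self.lock = asyncio.Lock()

    async def acquire(self):
        # Sleeping while holding the lock serializes waiters,
        # which is exactly the throttling behavior we want here.
        async with self.lock:
            now = time.monotonic()
            while self.calls and now - self.calls[0] >= self.window_s:
                self.calls.popleft()  # drop timestamps outside the window
            if len(self.calls) >= self.max_calls:
                # Wait until the oldest call ages out (small cushion
                # guards against float rounding)
                await asyncio.sleep(self.window_s - (now - self.calls[0]) + 0.001)
                now = time.monotonic()
                while self.calls and now - self.calls[0] >= self.window_s:
                    self.calls.popleft()
            self.calls.append(now)

async def demo():
    # Tiny window (3 calls per second) so the effect is visible quickly;
    # for Flash's free tier you'd use something like (15, 60.0)
    limiter = RollingWindowLimiter(max_calls=3, window_s=1.0)
    start = time.monotonic()
    for _ in range(6):
        await limiter.acquire()       # a real app would call the API here
    return time.monotonic() - start

elapsed = asyncio.run(demo())         # roughly 1 second: the second batch waited
```

Wrap each API call in `acquire()` and a burst of 50 tasks drains at the configured rate instead of slamming the limit at once.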

Token-heavy prompts. You’re summarizing long documents. Each request includes a 10,000-token document plus a system prompt. Even at 5 requests per minute, you’re pushing 50,000+ tokens per minute and may hit TPM limits depending on your tier.

The fix is chunking: split large documents rather than sending them whole. A 10,000-token document split into four 2,500-token chunks reduces your per-request token load significantly. The output takes more work to stitch together, but you stay within limits. For most summarization tasks, chunking also produces better output because the model focuses on smaller, coherent sections rather than skimming a long document.
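A minimal chunker along these lines, using the rough 4-characters-per-token heuristic (the ratio and boundary logic are assumptions; a real tokenizer count is more accurate):

```python
def chunk_text(text: str, max_tokens: int = 2500, chars_per_token: int = 4):
    """Split text into pieces of roughly max_tokens each.

    Uses a chars/4 estimate and prefers paragraph or sentence
    boundaries so each chunk stays coherent for summarization.
    """
    max_chars = max_tokens * chars_per_token
    chunks = []
    while text:
        piece = text[:max_chars]
        if len(text) > max_chars:
            # Break at the last paragraph or sentence end, if one
            # falls in the back half of the window
            cut = max(piece.rfind("\n\n"), piece.rfind(". "))
            if cut > max_chars // 2:
                piece = text[:cut + 1]
        chunks.append(piece)
        text = text[len(piece):]
    return chunks
```

Summarize each chunk in its own request, then stitch (or re-summarize) the partial outputs.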

Retry storms. You hit a 429, catch the error, and immediately retry. The retry hits the same limit. You retry again. Now you have a feedback loop that makes the situation worse and burns through whatever remaining quota you had.

The fix is exponential backoff with jitter: wait 1 second, retry. If it fails, wait 2 seconds, retry. Then 4 seconds. Then 8. The jitter is a small random delay (0 to 500 milliseconds) added to each wait, so retries from multiple threads or processes don’t synchronize and hit the API in waves.

Handling Rate Limits in Code

Here’s a pattern that covers the common cases in Python:

import time
import random
import os
from google import genai
from google.genai import errors

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

def generate_with_backoff(prompt, max_retries=5):
    wait_seconds = 1
    for attempt in range(max_retries):
        try:
            return client.models.generate_content(
                model="gemini-2.5-flash",
                contents=prompt
            )
        except errors.APIError as e:
            if e.code != 429 or attempt == max_retries - 1:
                raise
            jitter = random.uniform(0, 0.5)
            time.sleep(wait_seconds + jitter)
            wait_seconds *= 2
    return None

And in Node.js:

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function generateWithBackoff(prompt, maxRetries = 5) {
  let waitMs = 1000;
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await ai.models.generateContent({
        model: "gemini-2.5-flash",
        contents: prompt,
      });
    } catch (err) {
      const isRateLimit = err.status === 429;
      const canRetry = i < maxRetries - 1;
      if (isRateLimit && canRetry) {
        const jitter = Math.random() * 500;
        await new Promise(r => setTimeout(r, waitMs + jitter));
        waitMs *= 2;
      } else {
        throw err;
      }
    }
  }
}

Both snippets follow the same pattern: catch the 429, wait with exponential growth, add random jitter, and cap retries. The cap matters. Without it, a script that has exhausted its daily quota will keep retrying forever against a closed door.

When to Move to Paid

Hitting quotas isn’t the main reason to enable billing. There are three better reasons.

Commercial use. If your app generates revenue or you’re processing data on behalf of paying customers, you need a paid tier. Free tier terms exclude commercial use. This is a legal distinction, not a quota one.

Data privacy. Free tier prompts may be used for model improvement. If you’re processing user data, business documents, financial records, or anything regulated, you need billing enabled so your data stays out of the training pipeline.

Proven scale. If your app is getting real users and you’re regularly bumping daily limits, that’s a validation signal. You’ve built something people use. Enable billing, set a budget alert in Google Cloud Console so a runaway script doesn’t create an unexpected bill, and move on. Your code doesn’t change when you move to paid, just the limits.

What doesn’t warrant moving to paid: hitting limits on a test script, one-off development work, or precautionary “just in case” provisioning. The free tier is real capacity. Use it until you’ve earned the reason to upgrade.

Try It Yourself

If you want to go beyond the basics and build production-grade LLM integrations, including streaming, error handling, rate limiting, and switching between Gemini, OpenAI, and Anthropic APIs, TinkerLLM covers this in Lesson 23.

Open Lesson 23: LLM APIs in Production →

You’ll need the full course for Module 3 content (₹499 / $9 lifetime). Module 1 has 50 free exercises covering the prompting foundation, and you can get your Gemini API key from Google AI Studio in about five minutes if you don’t have one yet.

FAQ

What’s the difference between a 429 “rate limited” and a 429 “quota exceeded”?

Both are HTTP 429, but the error body distinguishes them. Rate limited usually means you hit an RPM or TPM limit in the current rolling window. A short wait (10 to 60 seconds) will clear it. Quota exceeded usually means you’ve hit the daily RPD limit and you won’t recover until the quota resets, typically at midnight Pacific Time. The fix for RPM/TPM is exponential backoff. The fix for RPD is waiting until tomorrow or switching to a paid plan with higher daily quotas.
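A small heuristic can route the two cases to different recovery paths. The substring checks below are assumptions about how the messages are worded, which Google doesn't guarantee to keep stable:

```python
def classify_429(message: str) -> str:
    """Guess whether a 429 clears with a short backoff or a long wait."""
    msg = message.lower()
    if "per day" in msg or "daily" in msg:
        return "wait_until_reset"   # RPD: retrying soon is pointless
    return "backoff"                # RPM/TPM: rolling window clears in <60s
```

Feed it the 429's error message and branch: exponential backoff for `"backoff"`, a scheduled retry (or an alert) for `"wait_until_reset"`.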

Does using Flash vs Pro affect my rate limits?

Yes, significantly. Pro’s free-tier limits are much tighter, especially on RPD. Many developers use Flash for most requests and reach for Pro only when quality differences specifically matter, like complex reasoning or long-context processing. You can mix models in the same application with the same API key, and each model’s limits are tracked separately.

Do my daily limits reset at midnight?

Daily request limits (RPD) reset at midnight Pacific Time. Per-minute limits (RPM and TPM) use a rolling window rather than a fixed clock minute. If you send 15 requests in a 30-second burst, you’re rate-limited until 30 seconds later, not until the next clock minute starts. This rolling window behavior means bursting is worse than spreading, even if your hourly average looks fine.

Can I increase free tier limits without enabling billing?

Not meaningfully. Google doesn’t offer an expanded free tier. The path to higher limits is enabling billing in Google Cloud Console and using a paid-tier project. You only pay for usage beyond the free thresholds, and you can set budget alerts to cap unexpected spend. The Gemini API rate limits docs show current paid-tier limits by model.

Is there a way to monitor my quota usage before I hit a limit?

Partially. Google AI Studio shows a usage dashboard with request counts by day and model, which is the most reliable view of where you stand. The API doesn’t document stable quota headers you can read programmatically, so the practical early-warning system is client-side: a simple counter in your application that tracks requests per minute against your known RPM limit will tell you you’re approaching the ceiling before the 429s start arriving.


You’ve got the rate limits figured out. Now build something with the API. TinkerLLM Lesson 23 runs real production LLM integration patterns from your browser, including rate limiting and error handling. Module 1’s 50 exercises are free, no card needed.

Run your first exercise →

Tags: Gemini API, rate limits, Google AI Studio, free tier, API tutorial, Python
Dharini S, The Educator

Delivery lead at Kalvium Labs with a background in instructional design. Writes concept explainers and process posts. Thinks about how people actually learn before jumping to solutions.

LinkedIn

Want to try this yourself?

Open the TinkerLLM playground and experiment with real models. 50 exercises free.

Start Tinkering