LLM · Jun 12, 2026 · 13 min read

Gemini API Quota Exceeded: The Complete Fix Guide for Error 429 [2026]


Your Gemini API was humming along yesterday. Today, it's throwing 429 errors on every third request. Sound familiar?

You're not alone. Since December 2025, when Google quietly slashed free tier quotas without emailing a single developer, the Google AI Developer Forum has been flooded with complaints. Some paid-tier users are reporting they can only make 2–3 requests per day before hitting the wall. Others discovered they were being charged for requests that returned quota errors. The frustration is real, and the official documentation doesn't help much beyond a table of HTTP codes.

This guide covers everything: why error 429 happens (including some genuinely surprising causes), how to diagnose which limit you've actually hit, and working fixes for every scenario from the free tier to enterprise scale. If you're on a paid plan and still getting quota errors, there's a specific section for that too, because that's its own special kind of frustrating.

What Gemini API Error 429 Actually Means

Let's get the basics out of the way, because the error message itself is almost useless.

json

{
  "error": {
    "code": 429,
    "message": "Resource has been exhausted (e.g. check quota).",
    "status": "RESOURCE_EXHAUSTED"
  }
}

HTTP 429 means "too many requests" — the API is working fine, it's just rejecting you because you've exceeded an allocated quota. This is completely different from a 500-series error, which means something broke on Google's end. With 429, the fix isn't in your code logic; it's in how you manage request volume and timing.

The Gemini API enforces limits across four dimensions:

  • RPM (Requests Per Minute): how many calls you can make in any 60-second window
  • TPM (Tokens Per Minute): total input + output tokens processed per minute
  • RPD (Requests Per Day): a hard daily cap that resets at midnight Pacific Time
  • IPM (Images Per Minute): relevant only for image generation endpoints

The tricky part? Different limit types look identical in the error response. Knowing which one you've hit changes your fix completely.

The December 2025 Changes That Broke Everything

On December 7, 2025, Google made significant cuts to free tier quotas. No blog post. No email. No warning. Applications that had been running reliably for months started failing overnight.

The cuts were severe. For Gemini Flash, the free tier daily limit dropped from 250 requests per day to just 20 — a 92% reduction. Image generation on the free tier dropped to exactly zero IPM, meaning free accounts can't generate images at all. For Gemini 2.5 Pro, RPM was cut in half from 10 to 5.

Step 1: Diagnose Which Limit You've Actually Hit

Before applying any fix, figure out what you're dealing with. Applying the wrong solution wastes time.

Pattern 1 — RPM limit: You see errors in bursts, but requests succeed when you slow down. Errors clear within a minute. This is the most common pattern.

Pattern 2 — TPM limit: Errors happen even with infrequent requests, and they correlate with large prompts or long responses. If you're using Gemini with a 200K-token context window, remember that every request is burning tokens at scale — some developers are consuming their entire TPM allowance in a handful of calls.

Pattern 3 — RPD limit: Errors start appearing mid-day and increase throughout the afternoon. Everything works fine again after midnight Pacific Time. This is the daily cap.

Pattern 4 — "Ghost 429" on paid accounts: Your Google Cloud Console shows near-zero usage, but you're still getting 429 errors. This is a known bug. More on this below.

To check your actual quota consumption, go to Google Cloud Console → APIs & Services → Generative Language API → Quotas and Limits. This shows real-time usage against your limits. If the numbers don't add up, you've found a different problem.
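
If the console numbers are hard to interpret, a small client-side tracker can help you see which limit your own traffic is pushing against. The sketch below is illustrative only: `QuotaLimits` and `UsageTracker` are hypothetical names, the default limits are placeholders you should replace with the values shown for your project, and the daily count uses a rolling 24-hour window rather than the real midnight Pacific reset.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QuotaLimits:
    """Assumed limits; replace with the values shown for your project."""
    rpm: int = 15
    tpm: int = 250_000
    rpd: int = 1_500


@dataclass
class UsageTracker:
    limits: QuotaLimits
    # Each entry is (timestamp_in_seconds, tokens_used)
    events: deque = field(default_factory=deque)

    def record(self, tokens: int, now: Optional[float] = None) -> None:
        """Log one request and the tokens it consumed."""
        self.events.append((time.time() if now is None else now, tokens))

    def exceeded(self, now: Optional[float] = None) -> List[str]:
        """Return the names of the limits the recorded traffic exceeds."""
        now = time.time() if now is None else now
        last_minute = [(t, n) for t, n in self.events if now - t < 60]
        # Approximation: rolling 24 hours, not the midnight PT reset
        last_day = [(t, n) for t, n in self.events if now - t < 86_400]
        hits = []
        if len(last_minute) > self.limits.rpm:
            hits.append("RPM")
        if sum(n for _, n in last_minute) > self.limits.tpm:
            hits.append("TPM")
        if len(last_day) > self.limits.rpd:
            hits.append("RPD")
        return hits
```

Call `record()` after each request and check `exceeded()` when a 429 arrives. An empty list alongside repeated 429s points toward Pattern 4.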

Fix 1: Exponential Backoff (Works for Almost Everyone)

If you take one thing from this guide, make it this. Exponential backoff is the standard solution for transient rate limit errors, and it converts most 429 failures into eventual successes.

The idea: when a request fails with 429, wait a bit before retrying. Double the wait time on each subsequent failure. Add a small random jitter so all your clients don't slam the API simultaneously after a backoff period.
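
To make the schedule concrete, here is a minimal standard-library sketch of the delay calculation on its own. The function name and the defaults (1 second base, 60 second cap, 10% jitter) are illustrative assumptions, not values Google prescribes.

```python
import random


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    jitter_fraction: float = 0.1,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Double the wait on every failure, but never exceed the cap
    delay = min(base * (2 ** attempt), cap)
    # Add a little randomness so clients do not retry in lockstep
    return delay + random.uniform(0, delay * jitter_fraction)
```

With these defaults, attempts 0 through 6 wait roughly 1, 2, 4, 8, 16, 32, and 60 seconds before jitter is added.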

Python (using the tenacity library):

python

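from google.api_core import exceptions
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Retry only on quota errors (HTTP 429); other failures surface immediately.
@retry(
    retry=retry_if_exception_type(exceptions.ResourceExhausted),
    wait=wait_random_exponential(multiplier=1, max=60),
    stop=stop_after_attempt(5),
    reraise=True,  # re-raise the original error after the last attempt
)
def call_gemini(prompt: str, model_name: str = "gemini-1.5-flash"):
    model = genai.GenerativeModel(model_name)
    return model.generate_content(prompt).text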

# Usage
try:
    result = call_gemini("Summarize this document...")
    print(result)
except Exception as e:
    print(f"Failed after retries: {e}")

JavaScript/TypeScript:

typescript

import { GoogleGenerativeAI } from "@google/generative-ai";

async function callGeminiWithRetry(
  prompt: string,
  maxAttempts: number = 5
): Promise<string | null> {
  const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });
      const result = await model.generateContent(prompt);
      return result.response.text();
    } catch (error: any) {
      if (error.status === 429 && attempt < maxAttempts - 1) {
        const delay = Math.min(1000 * Math.pow(2, attempt), 60000);
        const jitter = Math.random() * delay * 0.1;
        await new Promise(r => setTimeout(r, delay + jitter));
      } else {
        throw error;
      }
    }
  }
  return null;
}

One important note: check for a Retry-After header on 429 responses. When Google includes it, it tells you how long to wait, and honoring that value is more efficient than a guessed backoff interval.
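
How you get at the response headers depends on your client library, so the sketch below assumes you already have them as a plain mapping. It handles both forms the HTTP spec allows for this header (a number of seconds or an HTTP date) and returns `None` when the header is missing or unparseable, so you can fall back to your own backoff. The function name is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_after_seconds(
    headers: Mapping[str, str],
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Return the wait time requested by a Retry-After header, if any."""
    value = None
    for name, raw in headers.items():
        if name.lower() == "retry-after":  # header names are case-insensitive
            value = raw.strip()
            break
    if not value:
        return None
    if value.isdigit():  # form 1: delay in seconds
        return float(value)
    try:  # form 2: an HTTP date
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A typical pattern is `wait = retry_after_seconds(headers) or computed_backoff`, so the server's hint wins whenever it is present.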

Fix 2: Switch to Gemini 1.5 Flash

This is the fastest fix and it costs you nothing.

Gemini 1.5 Flash wasn't touched in the December 2025 cuts. It still has 15 RPM and 1,500 RPD on the free tier. For many use cases — summarization, classification, Q&A on moderate-length documents — the capability difference between 1.5 Flash and 2.5 Pro is marginal. I've found it performs surprisingly well on structured tasks.

python

# Change this one line:
model = genai.GenerativeModel("gemini-2.5-pro")

# To this:
model = genai.GenerativeModel("gemini-1.5-flash")

That's it. If your application doesn't absolutely require 2.5's expanded reasoning, try this first.
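
If you expect to switch models more than once, it may be worth reading the model name from configuration instead of hardcoding it. A minimal sketch, assuming an environment variable named `GEMINI_MODEL` (a hypothetical name chosen for this example):

```python
import os
from typing import Mapping

DEFAULT_MODEL = "gemini-1.5-flash"


def resolve_model_name(env: Mapping[str, str] = os.environ) -> str:
    """Use GEMINI_MODEL if it is set and non-empty, else the default."""
    return env.get("GEMINI_MODEL", "").strip() or DEFAULT_MODEL
```

You would then pass `resolve_model_name()` to `genai.GenerativeModel(...)`, which lets you change models with a redeploy-free config change.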

Fix 3: Upgrade to Tier 1 (It's Actually Free)

Here's something a lot of developers don't realize: upgrading from free tier to Tier 1 is free. You just need to enable billing on your Google Cloud project.

Tier 1 jumps Gemini 2.5 Pro from 5 RPM to 300 RPM — a 60x increase. You don't pay anything until your usage exceeds the free tier allocations, and for most individual developers, that never happens.

The process:

  1. Go to Google Cloud Console
  2. Select your project → Billing → Link a billing account
  3. Go to Google AI Studio → your project → request tier upgrade
  4. Wait 24–48 hours for confirmation

One important caveat: rate limits are per project, not per API key. Creating additional API keys within the same project gives you nothing. If you want more quota, you either upgrade or distribute workload across multiple projects with separate billing.
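
If you do go the multi-project route, the distribution logic can be as simple as round-robin. The sketch below assumes each worker is a callable you have already bound to a separate project's client; `distribute` and the worker shape are hypothetical, not part of any SDK.

```python
from itertools import cycle
from typing import Any, Callable, List, Sequence


def distribute(
    items: Sequence[Any],
    workers: Sequence[Callable[[Any], Any]],
) -> List[Any]:
    """Send items to workers in round-robin order, one worker per project."""
    if not workers:
        raise ValueError("at least one worker is required")
    pool = cycle(workers)
    # Item 0 goes to worker 0, item 1 to worker 1, and so on, wrapping around
    return [next(pool)(item) for item in items]
```

Each project still has its own limits, so this spreads load rather than removing the need for backoff.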

Fix 4: The Google One AI Pro / Paid Tier Disconnect (2026 Bug)

This one is genuinely weird and has frustrated a lot of developers.

As of early 2026, paid Google One AI Premium subscriptions do not automatically provision paid-tier quotas in Google AI Studio. Even if your subscription shows "AI Pro," the underlying API key may be sitting on free tier limits — sometimes with limits showing as zero. Multiple developers have reported being able to make only 2–3 requests per day on paid accounts.

The fix, documented in Google's developer forums, is to manually link a billing account to your AI Studio project:

  1. Go to GCP Console and ensure your project has a billing account attached directly
  2. If you're a Google One subscriber, look for the "Google Developer Program Premium" perk in your benefits and manually claim it
  3. Navigate to AI Studio → API Keys → verify the tier shown next to your project

If your API key page still shows "Free Tier" after linking billing, that's the bug. The current workaround is to create a new Google Cloud project from scratch, link billing to it immediately, generate a new API key from that project, and switch to that key.

It's annoying. But it works.

Fix 5: Batch Requests and Add Spacing

For applications that process multiple items sequentially, two simple tactics reduce your RPM consumption dramatically.

Request spacing — add a delay between calls:

python

import time

def process_items(items: list, delay: float = 0.5) -> list:
    """0.5s delay caps you at 120 requests/minute; use about 4s for a 15 RPM free-tier limit."""
    model = genai.GenerativeModel("gemini-1.5-flash")
    results = []
    for item in items:
        results.append(model.generate_content(item).text)
        time.sleep(delay)
    return results
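
Rather than guessing the delay, you can derive it from the RPM limit you are working under. A small sketch; the function name and the 10% safety margin are assumptions for illustration:

```python
def min_delay_for_rpm(rpm_limit: int, safety_margin: float = 0.9) -> float:
    """Seconds to wait between requests to stay under rpm_limit.

    safety_margin below 1.0 targets a fraction of the limit, which
    leaves headroom for retries and clock skew.
    """
    if rpm_limit <= 0 or safety_margin <= 0:
        raise ValueError("rpm_limit and safety_margin must be positive")
    return 60.0 / (rpm_limit * safety_margin)
```

For a 15 RPM limit with no margin this gives 4 seconds between calls; with the default margin it is closer to 4.4 seconds.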

Request batching — instead of 10 API calls for 10 items, put them all in one prompt:

python

def batch_process(items: list, batch_size: int = 5) -> list:
    model = genai.GenerativeModel("gemini-1.5-flash")
    results = []
    
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        prompt = "Process each item and return results separated by '|||':\n"
        prompt += "\n---\n".join(batch)
        
        response = model.generate_content(prompt)
        # Strip whitespace around each piece so results are clean strings
        results.extend(part.strip() for part in response.text.split("|||"))
    
    return results

Batching can reduce API calls by 80% or more. The tradeoff is that you lose granular error handling per item, and you need to be careful about total token length per request.
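
One way to claw back some of that per-item safety is to check that the model returned as many pieces as you sent. The sketch below assumes the `|||` separator from the prompt above; `split_batch_response` is a hypothetical helper that returns `None` on a mismatch so the caller can re-run that batch one item at a time.

```python
from typing import List, Optional


def split_batch_response(
    text: str,
    expected: int,
    sep: str = "|||",
) -> Optional[List[str]]:
    """Split a batched response; return None if the piece count is wrong."""
    parts = [p.strip() for p in text.split(sep)]
    # Tolerate a trailing separator at the end of the response
    if parts and parts[-1] == "":
        parts.pop()
    return parts if len(parts) == expected else None
```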

For Production Apps: Full Error Handling

If you're running something user-facing, you need more than retry logic. Here's a complete Python class that handles primary and fallback models, rate limiting, and comprehensive logging:

python

import logging
import random
import time
from typing import Optional
import google.generativeai as genai
from google.api_core import exceptions

logger = logging.getLogger(__name__)

class GeminiClient:
    def __init__(
        self,
        api_key: str,
        primary_model: str = "gemini-1.5-flash",
        fallback_model: str = "gemini-1.5-flash-8b",
        max_retries: int = 5,
    ):
        genai.configure(api_key=api_key)
        self.primary = genai.GenerativeModel(primary_model)
        self.fallback = genai.GenerativeModel(fallback_model)
        self.max_retries = max_retries
        self._last_request = 0.0  # timestamp of the most recent request

    def _backoff(self, attempt: int) -> float:
        # Exponential delay capped at 60s, plus up to 10% random jitter
        delay = min(1.0 * (2 ** attempt), 60)
        return delay + random.uniform(0, delay * 0.1)

    def _throttle(self, min_gap: float = 0.2):
        elapsed = time.time() - self._last_request
        if elapsed < min_gap:
            time.sleep(min_gap - elapsed)
        self._last_request = time.time()

    def generate(self, prompt: str, try_fallback: bool = True) -> Optional[str]:
        self._throttle()

        for attempt in range(self.max_retries):
            try:
                return self.primary.generate_content(prompt).text

            except exceptions.ResourceExhausted as e:
                logger.warning(f"429 on attempt {attempt + 1}: {e}")

                # Try fallback on first failure
                if try_fallback and attempt == 0:
                    try:
                        logger.info("Switching to fallback model")
                        return self.fallback.generate_content(prompt).text
                    except exceptions.ResourceExhausted:
                        pass

                if attempt < self.max_retries - 1:
                    wait = self._backoff(attempt)
                    logger.info(f"Retrying in {wait:.1f}s")
                    time.sleep(wait)

            except exceptions.InvalidArgument as e:
                logger.error(f"Bad request, not retrying: {e}")
                return None

            except Exception as e:
                logger.error(f"Unexpected: {e}")
                if attempt < self.max_retries - 1:
                    time.sleep(self._backoff(attempt))

        logger.error("Exhausted all retries")
        return None

# Usage
client = GeminiClient(api_key="YOUR_KEY")
result = client.generate("Analyze this text...")

Token Optimization: The Underrated Fix

Many developers hit TPM limits without realizing why. A few optimizations can cut token usage by 30–60%:

Set max_output_tokens explicitly. Without a limit, the model may generate much longer responses than you need.

python

response = model.generate_content(
    prompt,
    generation_config={"max_output_tokens": 500}
)

Trim conversation history. In multi-turn chats, don't send the entire conversation every time. Keep the last N turns or implement a sliding window. Long context windows are where token consumption explodes.
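
A sliding window can be sketched in a few lines. This version keeps the most recent turns that fit both a turn count and a token budget. The turn shape (`{"role": ..., "text": ...}`) is an assumption for this example, and the four-characters-per-token estimate is a rough heuristic, not the real tokenizer.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Very rough estimate (about 4 characters per token); an assumption."""
    return max(1, len(text) // 4)


def trim_history(
    turns: List[Dict[str, str]],
    max_turns: int = 6,
    token_budget: int = 2000,
) -> List[Dict[str, str]]:
    """Keep the newest turns that fit within max_turns and token_budget."""
    recent = turns[-max_turns:] if max_turns > 0 else []
    kept: List[Dict[str, str]] = []
    used = 0
    # Walk backwards from the newest turn until the budget runs out
    for turn in reversed(recent):
        cost = estimate_tokens(turn["text"])
        if used + cost > token_budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```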

Cache repeated queries. If multiple users might ask the same thing, store and reuse responses rather than hitting the API each time.
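
As a starting point, an in-memory cache keyed on the model and prompt might look like the sketch below. `ResponseCache` is a hypothetical class; a real deployment would more likely use Redis or similar, and exact-match caching only helps when prompts repeat verbatim.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


class ResponseCache:
    """Tiny in-memory cache with a time-to-live, keyed on model + prompt."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str,
                    call: Callable[[str], str]) -> str:
        """Return a fresh cached response, or invoke `call` and store it."""
        key = self._key(model, prompt)
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = call(prompt)
        self._store[key] = (now, value)
        return value
```

Here `call` would be whatever function actually hits the API, for example a wrapper around `model.generate_content(prompt).text`.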

Which Fix Is Right for You?

Let's make this concrete:

"I'm on the free tier and keep hitting limits" → Switch to Gemini 1.5 Flash first. If that's not enough, enable billing for Tier 1 (free).

"I'm a developer building an MVP" → Enable billing for Tier 1 + implement exponential backoff. That combination handles almost every situation.

"I'm on a paid plan and still getting 429 errors" → Check if your project shows "Free Tier" in AI Studio. If so, you're likely hitting the Google One billing disconnect bug. Create a new project with billing linked from the start.

"My app is growing and I need guaranteed throughput" → Look at Tier 2 (requires $250 cumulative spend) or Vertex AI for enterprise SLAs. Vertex AI offers provisioned throughput, which is reserved capacity dedicated to your workload rather than a share of a common quota pool.

"I need higher limits right now without tier upgrades" → API aggregator services pool quota across multiple projects and accounts. They typically use OpenAI-compatible endpoints, making integration simple. The tradeoff is adding a third-party dependency, which matters for compliance-sensitive applications.

Prevention: Stop Playing Whack-a-Mole

Once you've fixed the immediate problem, here's how to avoid it recurring:

Set up monitoring. In Google Cloud Console, you can configure alerts when quota consumption hits 70–80%. Getting a Slack notification before you hit 100% is much better than finding out from user reports.
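
Console alerts are the primary tool here, but the same threshold idea can also be applied inside your app if you track usage yourself. A minimal sketch, where `quota_warnings` and the dictionary keys are hypothetical and the usage numbers come from your own counters:

```python
from typing import Dict, List


def quota_warnings(
    usage: Dict[str, float],
    limits: Dict[str, float],
    threshold: float = 0.8,
) -> List[str]:
    """List every limit whose usage is at or above the threshold fraction."""
    warnings = []
    for name, limit in limits.items():
        if limit <= 0:
            continue  # skip limits that are unset or zero
        ratio = usage.get(name, 0) / limit
        if ratio >= threshold:
            warnings.append(f"{name} at {ratio:.0%} of limit")
    return warnings
```

Anything this returns can be forwarded to your logger or chat webhook of choice.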

Know when daily quotas reset. RPD limits reset at midnight Pacific Time — UTC-8 in winter, UTC-7 in summer. If you're running batch jobs, schedule them to start just after midnight PT to get a full day's allocation.
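
For scheduling, it helps to compute the time remaining until the reset instead of hardcoding a UTC offset that changes twice a year. This sketch uses the standard `zoneinfo` module (Python 3.9+, and it needs time zone data available on the system); the function name is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")


def seconds_until_quota_reset(now: Optional[datetime] = None) -> float:
    """Seconds from `now` (timezone-aware) until the next midnight Pacific."""
    now_pt = (now or datetime.now(timezone.utc)).astimezone(PACIFIC)
    next_day = now_pt.date() + timedelta(days=1)
    next_midnight = datetime(
        next_day.year, next_day.month, next_day.day, tzinfo=PACIFIC
    )
    # Subtract in UTC so daylight saving transitions are counted correctly
    delta = next_midnight.astimezone(timezone.utc) - now_pt.astimezone(timezone.utc)
    return delta.total_seconds()
```

A batch job can sleep for this many seconds, plus a small buffer, before starting its run.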

Watch your context window usage. Gemini's 200K+ context window is powerful, but each request that uses a 200K context costs 200x more tokens than a 1K-token request. If your TPM limit is 250,000 and you're making requests with large contexts, you can burn through it in seconds.
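
The arithmetic is worth doing explicitly for your own workload. A quick sketch, using the figures from the paragraph above as inputs (the function name is illustrative):

```python
def max_requests_per_minute(tpm_limit: int, tokens_per_request: int) -> int:
    """How many requests of a given size fit inside a TPM limit."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return tpm_limit // tokens_per_request


# A 250,000 TPM limit fits one 200K-token request per minute,
# but 250 requests of 1K tokens each.
large = max_requests_per_minute(250_000, 200_000)
small = max_requests_per_minute(250_000, 1_000)
```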

Test rate limit handling in staging. Intentionally trigger 429 errors in your test environment to verify your backoff and fallback logic actually works. It's easy to write retry code that looks correct but fails in subtle ways under real conditions.
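
You don't need to hit the real API to do this. One approach is a fake model that fails a set number of times before succeeding. Everything in this sketch is a stand-in: `RateLimitError` substitutes for your SDK's 429 exception, and `call_with_retry` is a simplified retry loop in place of your real one.

```python
from typing import Callable


class RateLimitError(Exception):
    """Stand-in for the SDK's 429 exception (hypothetical)."""


class FlakyModel:
    """Fake model that raises a simulated 429 for the first N calls."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def generate(self, prompt: str) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise RateLimitError("429 RESOURCE_EXHAUSTED (simulated)")
        return f"ok: {prompt}"


def call_with_retry(
    model: FlakyModel,
    prompt: str,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = lambda seconds: None,  # no-op in tests
) -> str:
    for attempt in range(max_attempts):
        try:
            return model.generate(prompt)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            sleep(min(2 ** attempt, 60))
    raise RuntimeError("max_attempts must be at least 1")
```

Injecting `sleep` keeps the test fast: in staging you pass a no-op, in production you pass `time.sleep`.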

Frequently Asked Questions

Why am I getting 429 errors even with low usage?

Two likely causes. First, the "ghost 429" bug — your quota dashboard looks fine, but the rate limiter is miscalculating. This mainly affects newly upgraded accounts and usually resolves within 48 hours or after switching models. Second, you might be on the Google One AI Pro tier but your AI Studio project wasn't automatically provisioned with paid-tier limits. Check the project tier shown in AI Studio.

Do multiple API keys give me more quota?

No. Rate limits apply at the project level, not per key. Three API keys in the same project share the same quota pool. To increase limits, you need to upgrade your tier or split workload across multiple projects.

When exactly does the daily quota reset?

Midnight Pacific Time. That's UTC-8 during standard time (roughly November–March) and UTC-7 during daylight saving.

Is Gemini 1.5 Flash significantly worse than 2.5?

For many practical tasks — summarization, classification, Q&A on documents, code generation for straightforward problems — not noticeably. The gap is most apparent in complex multi-step reasoning, very long document analysis, and nuanced instruction following. Test your actual use case rather than assuming you need the newest model.

How much does Tier 1 actually cost?

Nothing, until you exceed the free tier allocations. Even then, Gemini 1.5 Flash costs roughly $0.075 per million input tokens. Most individual developers and small projects either stay within the free allocations or pay a few dollars per month.
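
To sanity-check your own bill, the input-side estimate is a one-liner. This sketch uses the rate quoted above as its default and ignores output tokens, which are priced separately; check current pricing before relying on the number.

```python
def estimate_input_cost(input_tokens: int, usd_per_million: float = 0.075) -> float:
    """Approximate input-token cost in USD at a given per-million rate."""
    return input_tokens / 1_000_000 * usd_per_million
```

At that rate, 10 million input tokens in a month comes to about $0.75.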

The Short Version

Gemini API 429 errors are fixable. Here's the decision tree:

  1. Implement exponential backoff regardless of anything else — it solves transient RPM issues and should be in every production app
  2. Switch to Gemini 1.5 Flash if you don't specifically need 2.5 capabilities — its free tier limits were untouched by the December 2025 cuts
  3. Enable billing for Tier 1 — it's free until you exceed free tier allocations, and it unlocks 60x more RPM
  4. Check for the paid-tier billing bug if you're on a Google One subscription and still hitting limits — manually link billing to your AI Studio project
  5. Consider Vertex AI if you need guaranteed uptime and can't tolerate rate limiting at all

The Gemini ecosystem has solutions at every scale. The December 2025 changes were disruptive, but the platform is still competitive — it just requires a bit more setup than it used to.

Sources:

Google AI Studio official documentation, Google AI Developer Forum, Google Cloud Console Quotas, community reports from the Google Gemini CLI GitHub repository.


About Eric Samuels

Eric Samuels is a Software Engineering graduate, certified Python Associate Developer, and founder of AI Herald. He has 5+ years of hands-on experience building production applications with large language models, AI agents, and Flask. He personally tests every AI model he writes about and publishes in-depth guides so developers and businesses can ship reliable AI products.
