How to Use the Gemini API in Python (Step by Step)
Install google-genai, make your first Gemini API call in Python, add system prompts, handle multi-turn chat, stream responses, and fix common errors.
TL;DR
- Use `google-genai`, the SDK Google recommends for Python in 2026. Not the older `google-generativeai`.
- Your API key goes in an environment variable. Hardcoding it in a script is how keys get leaked.
- System instructions, streaming, and multi-turn chat are three distinct patterns with different code structures.
- 400, 403, and 429 errors have different root causes and need different fixes. Never retry a 429 immediately.
- TinkerLLM Lesson 23 covers production Gemini API integration hands-on, including streaming and provider-switching.
You followed the Google AI Studio quickstart, got your API key, and started reading the Python docs for the Gemini API. Then you hit the first problem: there are two Python packages. They have similar names, similar README examples, and older tutorials use one while the current Google docs use the other. Nothing explains which to pick.
This is where most people get stuck on their first Gemini API Python call. Not the API itself. The setup. I’ve watched engineers spend an hour on this before writing a single line that actually runs.
Here’s the answer upfront: use google-genai. It’s the SDK Google recommends for Python in 2026. The older google-generativeai package still works but is being superseded. If you’re starting fresh, start with google-genai.
I put together this tutorial to cover every pattern you’ll actually need: making a basic call, adding system instructions, handling multi-turn conversations, streaming responses, and dealing with the three error types that come up most.
What You’ll Need Before Starting
Two things. A Gemini API key and Python 3.9 or newer.
If you don’t have an API key yet, get one from Google AI Studio in about five minutes. It’s free, no credit card required. The free tier is real enough for everything in this tutorial.
Run python --version to check your Python version. If you’re on 3.9 or newer, you’re fine. If you’re on something older, the SDK won’t install cleanly. The minimum has moved over time, so check the google-genai page on PyPI if pip refuses to install.
No Google Cloud account needed. No billing setup. No service account credentials. The API key from AI Studio is sufficient.
Install the SDK
pip install google-genai
If you run into dependency conflicts, create a virtual environment first:
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install google-genai
Don’t install google-generativeai alongside google-genai. They’re separate packages and mixing them creates import confusion that can take an hour to untangle. I’ve seen this derail a whole setup session before anyone writes actual API code. Pick google-genai. Stick with it.
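If you're not sure what's already in your environment, a quick check helps before you start debugging imports. This is a minimal sketch using only the standard library; `installed_gemini_sdks` is a name I made up for illustration, not part of either SDK.

```python
from importlib import metadata


def installed_gemini_sdks():
    """Return {distribution name: version or None} for both Gemini packages."""
    found = {}
    for name in ("google-genai", "google-generativeai"):
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in the current environment
            found[name] = None
    return found


for name, version in installed_gemini_sdks().items():
    print(f"{name}: {version or 'not installed'}")
```

If both lines show a version number, uninstall `google-generativeai` or start over in a fresh virtual environment.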
Set Your API Key as an Environment Variable
Don’t put your key directly in your script. Scripts get committed to version control. API keys in version control get found by automated scanners within hours.
The correct pattern:
export GEMINI_API_KEY="your-api-key-here"
On Windows (Command Prompt):
set GEMINI_API_KEY=your-api-key-here
On Windows (PowerShell):
$env:GEMINI_API_KEY="your-api-key-here"
Or use a .env file with python-dotenv if you’re building something with multiple configuration values:
pip install python-dotenv
from dotenv import load_dotenv
load_dotenv()
Then your API client initialization always looks the same:
import os
from google import genai
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
The key never touches source control. And you can rotate it without changing code.
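One rough edge: `os.environ["GEMINI_API_KEY"]` raises a bare `KeyError` when the variable is missing, which isn't much help to whoever runs the script next. A small wrapper can fail with a clearer message. This is a sketch; `load_api_key` is a hypothetical helper, and the `env` parameter exists only so you can test it without touching your real environment.

```python
import os


def load_api_key(env=None, name="GEMINI_API_KEY"):
    """Read the API key from the environment, failing with a readable error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or add it to .env first."
        )
    return key
```

You'd then initialize the client with `genai.Client(api_key=load_api_key())`.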
Your First Gemini API Call in Python
import os
from google import genai
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain what a context window is in two sentences."
)
print(response.text)
Run this and you’ll get a real response from Gemini. A few things to know:
gemini-2.5-flash is the right default model. It’s faster than gemini-2.5-pro, has a more generous free-tier quota, and handles most text tasks well. Use Pro only when you specifically need stronger multi-step reasoning on complex problems.
contents is the user message. For a single text turn, a plain string works. For multi-turn conversations (covered below), you pass a list.
response.text is the most common way to access the output. It returns the full generated text as a string. If you need the finish reason or safety ratings, they’re on response.candidates[0]. Token counts are on response.usage_metadata.
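If you want to log token usage per call, here's a minimal sketch. It assumes the response exposes `usage_metadata` with `prompt_token_count`, `candidates_token_count`, and `total_token_count`, which is what I'd expect from current `google-genai` releases. `token_usage` is my own helper name, and the fake response is a stand-in so the snippet runs without a network call.

```python
from types import SimpleNamespace


def token_usage(response):
    """Return prompt/output/total token counts, or None if usage is missing."""
    usage = getattr(response, "usage_metadata", None)
    if usage is None:
        return None
    return {
        "prompt": usage.prompt_token_count,
        "output": usage.candidates_token_count,
        "total": usage.total_token_count,
    }


# Stand-in object so the sketch runs offline; pass a real response instead.
fake_response = SimpleNamespace(
    usage_metadata=SimpleNamespace(
        prompt_token_count=12, candidates_token_count=40, total_token_count=52
    )
)
print(token_usage(fake_response))
```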
Add a System Instruction
A system instruction sets persistent context that applies to every response from the model: the persona it should play, constraints on its behavior, the format you expect.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What's the best way to handle errors in a production LLM integration?",
    config={
        "system_instruction": (
            "You are a senior backend engineer reviewing code at a startup. "
            "Be direct and specific. Name actual libraries. Keep answers under 150 words."
        )
    }
)
print(response.text)
The system instruction applies to this call but not to the next one. If you want a consistent persona across multiple calls, you pass the instruction every time. This caught me off guard the first time I used the API after spending time in AI Studio, where you set it once in the UI and it applies to the whole session. In the API, you own the state.
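Since you have to resend the instruction on every call, it's worth binding it once in a small wrapper. A sketch, assuming the same `generate_content` call shape used above; `make_asker` is a hypothetical helper, and the fake client only exists so the example runs offline.

```python
from types import SimpleNamespace


def make_asker(client, system_instruction, model="gemini-2.5-flash"):
    """Return a function that sends the same system instruction on every call."""
    def ask(prompt):
        response = client.models.generate_content(
            model=model,
            contents=prompt,
            config={"system_instruction": system_instruction},
        )
        return response.text
    return ask


# Stand-in client that records calls; swap in the real genai client.
calls = []

def _fake_generate_content(**kwargs):
    calls.append(kwargs)
    return SimpleNamespace(text="ok")

fake_client = SimpleNamespace(
    models=SimpleNamespace(generate_content=_fake_generate_content)
)

ask = make_asker(fake_client, "Be direct and specific.")
ask("first question")
ask("second question")
```

Both recorded calls carry the same instruction, which is the whole point: the persona lives in your code, not in the API.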
Multi-Turn Conversations
A single generate_content call is stateless. The model has no memory of previous calls. To build a conversation, you pass the full history yourself:
from google.genai import types
history = []
def chat(user_message):
    history.append(
        types.Content(role="user", parts=[types.Part(text=user_message)])
    )
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=history
    )
    reply = response.text
    history.append(
        types.Content(role="model", parts=[types.Part(text=reply)])
    )
    return reply
print(chat("What is tokenization?"))
print(chat("And how does it affect the cost of an API call?"))
The second question (“how does it affect cost”) makes sense because the model sees the full conversation history. Without the history, it’d give you a generic answer about API pricing with no connection to tokenization.
And because you own the history list, you control what the model remembers. You can trim it, summarize older turns, or drop exchanges when the context window starts getting large. The API doesn’t manage any of this for you. I find that understanding this early saves a lot of confusion later when conversations start degrading or hitting context limits.
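Trimming is the simplest of those options. Here's a minimal sketch that keeps only the most recent messages and makes sure the trimmed history still opens on a user turn. `trim_history` and `max_messages` are names I picked for illustration; the function only relies on each item having a `role` attribute, which the `types.Content` objects above do.

```python
from types import SimpleNamespace


def trim_history(history, max_messages=6):
    """Keep at most the last `max_messages` items, starting on a user turn."""
    trimmed = list(history[-max_messages:]) if max_messages > 0 else []
    # Drop leading model replies so the conversation doesn't open mid-exchange
    while trimmed and trimmed[0].role != "user":
        trimmed.pop(0)
    return trimmed


# Stand-in messages so the sketch runs offline
demo = [
    SimpleNamespace(role=role)
    for role in ("user", "model", "user", "model", "user")
]
print([m.role for m in trim_history(demo, max_messages=4)])
```

You'd pass `contents=trim_history(history)` instead of the full list. Counting messages is crude. Counting tokens is more accurate, but this is usually enough to stop a long session from growing without bound.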
Streaming Responses
For short prompts, waiting for the full response is fine. For longer outputs, streaming is better. The user sees tokens arriving as they’re generated instead of waiting for the whole response.
for chunk in client.models.generate_content_stream(
    model="gemini-2.5-flash",
    contents="Write a 200-word explanation of how RAG works."
):
    print(chunk.text, end="", flush=True)
print() # newline after the stream ends
The method is generate_content_stream instead of generate_content. Each chunk has a text attribute with the incremental output. The flush=True matters: without it, some terminals buffer the output and you see it all at once anyway.
In a web application, you’d push each chunk to the client over a server-sent events connection or a WebSocket. The API mechanics are identical.
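For the server-sent events case, the only extra work is wrapping each chunk in the SSE wire format. A sketch of that step, independent of any web framework: `sse_events` is a hypothetical helper, it assumes each chunk has a `text` attribute that may be empty, and signalling the end of the stream is left to your application.

```python
from types import SimpleNamespace


def sse_events(chunks):
    """Yield one SSE-formatted event string per non-empty text chunk."""
    for chunk in chunks:
        text = chunk.text
        if not text:
            continue  # skip chunks that carry no text
        # SSE requires a "data: " prefix on every line of a multi-line payload
        body = "".join(f"data: {line}\n" for line in text.split("\n"))
        yield body + "\n"  # a blank line terminates the event


# Stand-in chunks so the sketch runs offline
demo_chunks = [
    SimpleNamespace(text="Hello"),
    SimpleNamespace(text=None),
    SimpleNamespace(text="a\nb"),
]
for event in sse_events(demo_chunks):
    print(repr(event))
```

In a real handler you'd pass the `generate_content_stream(...)` iterator in place of `demo_chunks` and return the generator as a `text/event-stream` response.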
Adjusting Temperature and Output Length
The default generation settings work fine for most tasks. But two parameters are worth knowing:
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize this in 3 bullet points: ...",
    config={
        "temperature": 0.2,
        "max_output_tokens": 300,
    }
)
Temperature controls randomness. 0.0 makes the output close to deterministic: the same input gives the same, or very nearly the same, output each time. 1.0 gives much more variation. For factual tasks (summarization, classification, extraction), I default to 0.1 to 0.3. For creative tasks, 0.7 to 1.0.
max_output_tokens caps response length. Useful when you need brief answers or when token costs matter. Without a cap, the model decides how much to write. In my experience, that’s usually two or three times more than you actually need for short tasks.
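One thing to watch with a cap: when the model hits it, the response just stops, sometimes mid-sentence. Here's a sketch for detecting that, assuming the finish reason is reported as `MAX_TOKENS` (the value I'd expect from the SDK's finish reason enum). `was_truncated` is my own helper name, and the fake responses are stand-ins so it runs offline.

```python
from types import SimpleNamespace


def was_truncated(response):
    """Return True if the first candidate stopped because of the token cap."""
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return False
    reason = candidates[0].finish_reason
    # Works for an enum member (via .name) or a plain string
    name = getattr(reason, "name", str(reason))
    return name == "MAX_TOKENS"


# Stand-in responses so the sketch runs offline
cut_off = SimpleNamespace(candidates=[SimpleNamespace(finish_reason="MAX_TOKENS")])
complete = SimpleNamespace(candidates=[SimpleNamespace(finish_reason="STOP")])
print(was_truncated(cut_off), was_truncated(complete))
```

If it returns True, either raise the cap or tighten the prompt so the answer fits.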
Try It Yourself
If you want hands-on exercises on production Gemini API integration, TinkerLLM’s Lesson 23 covers it end-to-end: streaming, multi-modal inputs, switching between Gemini, OpenAI, and Anthropic APIs, and rate limit handling patterns.
Open Lesson 23: LLM APIs in Production →
Lesson 23 is Module 3 content (₹499 / $9 lifetime for the full course). Module 1 has 50 free exercises covering the prompt engineering foundation, no card needed. TinkerLLM is BYOK: your own Gemini API key from Google AI Studio, stored in your browser, never on our servers.
Handling Common Errors
Three error types cover most of what you’ll run into.
400: Invalid argument. Usually a malformed request. Check your contents structure if you’re passing a list, and double-check the model name. gemini-2.5-flash works. gemini-flash-2.5 doesn’t: a misspelled model name is rejected with a model not found message, usually reported as a 404 rather than a 400, so read the error text and not just the status code.
403: Permission denied. Your API key is invalid, scoped to the wrong APIs, or you’re requesting a model that requires billing enabled. I’ve seen this happen when someone copies a Cloud service account key instead of the AI Studio API key. Check the key in Google AI Studio and verify it’s active.
429: Rate limited. You’ve hit a requests-per-minute, requests-per-day, or tokens-per-minute ceiling. The full breakdown of Gemini’s free-tier limits and how to handle each type is in the Gemini API rate limits guide. The core rule: never retry a 429 immediately. Use exponential backoff.
import time
import random
from google.genai import errors

def generate_with_retry(prompt, max_retries=4):
    wait = 1
    for attempt in range(max_retries):
        try:
            return client.models.generate_content(
                model="gemini-2.5-flash",
                contents=prompt
            )
        except errors.APIError as e:
            # Only retry rate limits (HTTP 429). Re-raise everything else,
            # and give up once the retry budget is spent.
            if e.code != 429 or attempt == max_retries - 1:
                raise
            time.sleep(wait + random.uniform(0, 0.5))
            wait *= 2
Catch the 429, wait with exponential growth, add random jitter so parallel threads don’t synchronize and hit the API in waves, cap the retries. Without the cap, a persistent daily limit exhaustion will keep your script running against a closed door.
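To see what that wait schedule actually looks like, you can compute it without calling the API at all. A small sketch that mirrors the loop above; `backoff_delays` and its parameters are hypothetical names, and `rng` is injectable only so the jitter can be switched off for inspection.

```python
import random


def backoff_delays(max_retries=4, base=1.0, factor=2.0, jitter=0.5,
                   rng=random.uniform):
    """Return the sleep durations between attempts for a capped backoff."""
    delays = []
    wait = base
    for _ in range(max_retries - 1):  # no sleep after the final attempt
        delays.append(wait + rng(0, jitter))
        wait *= factor
    return delays


print(backoff_delays(jitter=0))  # base schedule without jitter
print(backoff_delays())          # same schedule with random jitter added
```

With four attempts you sleep three times, and the total worst-case wait is a little over seven seconds. That's a useful number to know before you put this behind a user-facing request.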
FAQ
What’s the difference between google-genai and google-generativeai?
Two separate Python packages from Google. google-generativeai was the original Gemini API client, released in late 2023. google-genai is the newer SDK, recommended by Google since 2025 and aligned with the current API design. Tutorials from 2023 or early 2024 will use google-generativeai with import google.generativeai as genai. Current Google docs point to google-genai with from google import genai. Both work today, but for new projects, use google-genai.
Do I need a Google Cloud account to use the Gemini API from Python?
No. The Gemini API with a key from Google AI Studio works without a Google Cloud project, billing setup, or service account credentials. The key from AI Studio is sufficient for the free tier. If you need commercial use, data privacy guarantees, or higher quotas, you’ll enable billing, but that’s still through AI Studio’s account settings, not a separate Google Cloud project.
How much does calling the Gemini API in Python cost?
On the free tier, it’s free within quotas: roughly 15 requests per minute and 1,500 requests per day for Gemini 2.5 Flash as of 2026. At normal development pace, the free tier handles all the learning and prototyping you’ll do. Paid pricing is per million tokens. Flash is substantially cheaper than Pro. Check the official Gemini pricing page for current rates since they change periodically.
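The per-million-token arithmetic is simple enough to sketch. The prices in the example call are placeholders I made up to show the calculation, not real Gemini rates, so plug in the numbers from the pricing page. `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Estimate the cost of one call from token counts and per-million prices."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_million
    output_cost = (output_tokens / 1_000_000) * output_price_per_million
    return input_cost + output_cost


# Placeholder prices (NOT real rates): 1.00 per million input tokens,
# 2.00 per million output tokens, in whatever currency you're billed in.
print(estimate_cost(1_000_000, 500_000, 1.0, 2.0))
```

Input and output tokens are typically priced differently, which is why the function takes two rates.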
How do I switch between gemini-2.5-flash and gemini-2.5-pro?
Change the model parameter. That’s the entire change. "gemini-2.5-flash" becomes "gemini-2.5-pro" and everything else stays identical. The SDK, the response format, and all the patterns in this post work against both models. The differences are quality (Pro reasons better on complex tasks), latency (Pro is slower), and free-tier quota (Pro is much tighter). Start with Flash, reach for Pro when quality differences specifically matter.
What’s the right way to pass a system instruction alongside conversation history?
Pass the system instruction in the config dict and the conversation history in contents. They’re separate fields and don’t interfere with each other:
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=history,  # your history list
    config={
        "system_instruction": "You are a concise technical assistant."
    }
)
The system instruction applies to the entire conversation, not just one turn. You’ll need to pass it on every call since the API is stateless.
You’ve got the Gemini API working in Python. Now build something with it. TinkerLLM Lesson 23 takes you into production patterns: streaming, multi-modal inputs, switching providers, and rate limit handling. Module 1’s 50 exercises are free, no card needed.
Delivery lead at Kalvium Labs with a background in instructional design. Writes concept explainers and process posts. Thinks about how people actually learn before jumping to solutions.