Concepts · 10 min read

Vector Databases Explained: Why LLM Apps Need Them

Vector databases find semantically similar text using embeddings. Here's how they work, why SQL can't do this, and which one to pick for your LLM app.

Dharini S
May 5, 2026

TL;DR

  • Vector databases store embeddings, not text. They find semantically similar content, not exact keyword matches.
  • An embedding converts text into a list of 768 to 3072 numbers. Similar texts produce similar numbers.
  • Every RAG system needs one: the vector database is the retrieval half of Retrieval-Augmented Generation.
  • pgvector works if you already run Postgres. Chroma for prototypes. Pinecone or Weaviate for production scale.
  • You don't need one for summarization, extraction, or chatbots with short context. You need one when the model is missing facts it can't hallucinate past.

You built a customer support chatbot using Gemini. It handles general questions well. But when a user asks about a product update that shipped last month, the model either invents an answer or admits it doesn’t know. Your product documentation lives across 400 Notion pages. No amount of prompting fits 400 pages into a context window.

I’ve seen this exact problem come up every time a team tries to build a knowledge-connected LLM product. The chatbot works until it doesn’t. And the moment it fails, someone asks “why can’t it just look this up?” The answer: it can, but only if you wire in the right retrieval layer.

This is the problem a vector database exists to solve. If you’re building anything that connects an LLM to a real knowledge base, you need to understand what a vector database actually does, why a regular SQL database can’t do it, and when you need one at all.

What Makes a Vector Database Different

A regular database stores facts. A vector database stores meaning.

Here’s the concrete distinction. Search a SQL table for “running shoes” and you get rows containing the literal words “running shoes.” Search a vector database for “running shoes” and you might also get results for “trail sneakers,” “athletic footwear,” and “jogging trainers,” because those items have similar meaning even though the words are different.

That’s semantic search. And it’s what makes vector databases useful for LLM applications, where users ask questions in their own words, not in the exact phrasing your documents use.

The mechanism behind it is embeddings. Before text can live in a vector database, you convert it into a numerical representation of its meaning.

How Embeddings Turn Text into Numbers

An embedding model reads your text and outputs a vector: a list of floating-point numbers, typically 768 to 3072 values depending on the model. These numbers encode semantic meaning.

The key property: semantically similar texts produce vectors that are numerically close to each other. “Running shoes” and “jogging trainers” produce vectors that point in roughly the same direction in this high-dimensional space. “Running shoes” and “tax filing deadline” produce vectors pointing in completely different directions.

You can think of it as coordinates. Every piece of text has a location. Similar texts cluster together. Your search query is a location lookup: “find me everything near this point.”

Embedding models you’ll actually use:

  • text-embedding-004 (Google, free within API rate limits): good default for most applications
  • text-embedding-3-small and text-embedding-3-large (OpenAI, pay-per-token): 3-large is higher quality, 3-small is cheaper for high-volume indexing
  • nomic-embed-text (Nomic, open-source): runs locally, no API cost, slightly lower quality

The embedding model doesn’t have to be from the same provider as your LLM. You can embed with Google models and generate with Claude, or vice versa. The only hard constraint: the model you use to embed your documents at indexing time must be the same model you use to embed queries at search time. Switch embedding models and you have to re-embed your entire corpus.
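To make that concrete, here’s a minimal sketch of embedding one chunk with Google’s google-generativeai Python package (the article’s default model). The API key is a placeholder and the sample text is invented:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder, not a real key

# Embed one document chunk. The same call, with the same model name,
# must be used later for queries, or the vectors won't be comparable.
result = genai.embed_content(
    model="models/text-embedding-004",
    content="Trail sneakers with reinforced soles shipped in the May update.",
)
vector = result["embedding"]
print(len(vector))  # 768 floats for this model
```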

How Similarity Search Works

Once you have embeddings, the core operation in a vector database is finding the nearest neighbors to a query vector.

The most common distance function is cosine similarity. It measures the angle between two vectors, not their absolute distance. A score of 1.0 means the vectors point in exactly the same direction (identical meaning). A score of 0 means they’re perpendicular (unrelated topics). Negative scores indicate semantic opposition, which is rare in practice.
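In code, cosine similarity is a one-liner over the raw vectors. A minimal sketch with NumPy; the 3-dimensional toy vectors stand in for real 768-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Angle-based similarity: 1.0 = same direction, 0.0 = perpendicular."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

shoes = [0.9, 0.1, 0.0]        # toy stand-ins for real embeddings
trainers = [0.85, 0.2, 0.05]   # similar meaning -> similar direction
taxes = [0.0, 0.1, 0.95]       # unrelated topic -> different direction

print(cosine_similarity(shoes, trainers))  # ~0.99, nearly identical
print(cosine_similarity(shoes, taxes))     # ~0.01, unrelated
```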

When you query a vector database:

  1. Your query text gets embedded using the same model you used for your documents.
  2. The database computes cosine similarity between your query vector and every stored vector.
  3. It returns the top-k most similar document chunks.
  4. Those chunks get injected into your LLM prompt as context.
  5. The LLM generates a grounded response based on what was retrieved.

This is the retrieval half of Retrieval-Augmented Generation. If you haven’t read the full RAG explainer, What is RAG? Retrieval-Augmented Generation Explained Simply covers the complete pipeline from user query to final response.

Try It Yourself

💡 Try this hands-on: TinkerLLM’s RAG unit walks through the embedding, storage, and retrieval steps. You run the queries yourself and see similarity scores return in real time.

Open Lesson 16: RAG: Giving LLMs a Knowledge Base They Were Never Trained On

The first 50 exercises on TinkerLLM are free, no card needed. Module 1 covers prompt engineering foundations; Module 2 is where RAG and vector search live.

Which Vector Database Should You Pick

The right choice depends on where you are in your build. There’s no single best option.

| Database | Best for | Deployment | Rough scale ceiling |
| --- | --- | --- | --- |
| Chroma | Local prototyping, dev | Embedded or Docker | ~100K vectors |
| pgvector | Already on Postgres | Self-hosted extension | ~10M vectors with HNSW |
| Pinecone | Managed production | Fully managed cloud | Billions (with pods) |
| Weaviate | Open-source production | Self-hosted or cloud | 100M+ |
| Qdrant | Fast open-source prod | Self-hosted or cloud | 100M+ |
| Milvus | High-volume enterprise | Self-hosted | Billions |

A few practical notes:

If you’re already on Postgres, start with pgvector. One extension, one new column type, one extra query operator. You get vector search without adding a new piece of infrastructure. For most applications under a few million documents, it performs well enough that there’s no reason to switch. The HNSW index in pgvector is within 5-10% of Pinecone’s recall at similar query speeds for datasets under 10M vectors.
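As a rough illustration of how little surface area that adds, here’s a hedged sketch using psycopg (v3) and the pgvector Python package. The table name, connection string, and schema are assumptions for illustration, not a prescribed setup:

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=app", autocommit=True)  # hypothetical DSN
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # adapts numpy arrays to the vector column type

# One table, one extra column type (768 dims matches text-embedding-004).
conn.execute(
    "CREATE TABLE IF NOT EXISTS chunks ("
    "id bigserial PRIMARY KEY, content text, embedding vector(768))"
)
# HNSW index with cosine distance.
conn.execute(
    "CREATE INDEX IF NOT EXISTS chunks_hnsw ON chunks "
    "USING hnsw (embedding vector_cosine_ops)"
)

query_vec = np.random.rand(768)  # stand-in for a real query embedding
rows = conn.execute(  # <=> is pgvector's cosine-distance operator
    "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 5",
    (query_vec,),
).fetchall()
```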

Chroma is where tutorials start for a reason. It runs locally in Python with no server required. Great for experimenting and prototyping. It won’t scale to production workloads, but you don’t need it to for initial builds. Start there, learn the pattern, then switch to pgvector or Pinecone once you have something real.

Pinecone makes sense when you want infrastructure managed for you. The cost model scales with query volume and storage rather than server costs. It becomes expensive above a few million queries per month, but below that threshold the managed-service value is real: no index tuning, no capacity planning, no index rebuild when your dataset grows.

The Part People Get Wrong About Performance

The expensive operation in a RAG pipeline isn’t the LLM call. It’s the embedding step at indexing time.

When you add documents to your vector database, every document chunk needs to be embedded first. Splitting at 512 tokens with overlap, a 400-page knowledge base produces a few hundred chunks; a large documentation corpus can easily run to 100,000. Embedding 100,000 chunks takes time and costs money (if you’re using a paid embedding model). The embedding model you choose affects latency and cost across your entire pipeline, not just search quality.

I’ve watched teams optimize their LLM call latency down to 800ms and then discover their indexing pipeline takes 4 hours whenever they update the knowledge base. The bottleneck was always the embedding step. If I had to pick one thing to benchmark early, it’s embedding throughput.
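A quick way to get that number is to time the embedding calls directly. A minimal sketch, again assuming the google-generativeai package and a placeholder key; the chunks here are stand-ins for output from your real splitter:

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

chunks = ["sample chunk text " * 30] * 20  # stand-ins, ~120 words each

start = time.perf_counter()
for chunk in chunks:
    genai.embed_content(model="models/text-embedding-004", content=chunk)
rate = len(chunks) / (time.perf_counter() - start)
print(f"{rate:.1f} chunks/sec -> {100_000 / rate / 3600:.1f} h per 100K chunks")
```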

For most prototypes, Google’s text-embedding-004 is the right default. It’s free within generous API rate limits, the quality is good, and you can swap it later if you need higher recall on specific domains. When you swap embedding models, you have to re-embed your entire corpus. That’s a trivial migration on a small dataset and a painful one on a large one.

Also: the index type matters at scale. HNSW (Hierarchical Navigable Small World) is the standard choice for most vector databases. It’s an approximate nearest neighbor algorithm that trades a small amount of recall for very fast search. The “approximate” part means it doesn’t guarantee finding the single best match, but in practice it finds results in the top-0.1% by similarity in milliseconds at millions of documents. For exact nearest neighbor search on small datasets, flat indexes work; for anything over ~100K vectors, use HNSW.
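To see HNSW outside any particular database, the hnswlib library exposes the index directly. A sketch over random vectors standing in for real embeddings:

```python
import hnswlib
import numpy as np

dim, n = 768, 10_000
data = np.random.rand(n, dim).astype(np.float32)  # stand-in embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))
index.set_ef(50)  # search-time knob: higher = better recall, slower queries

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)  # approximate top-5
```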

When You Don’t Need a Vector Database

Not every LLM application needs a vector database. This is where the pattern gets over-applied.

You don’t need one for:

  • Summarization. You’re giving the model a document and asking for a shorter version. That’s entirely in-context.
  • Data extraction. Pulling structured information from receipts, emails, or forms. No retrieval needed.
  • Short conversations. If your context window comfortably holds the full conversation history plus system instructions, there’s nothing to retrieve.
  • Creative tasks. Story generation, copywriting, brainstorming. These don’t require external knowledge.
  • Classification or routing. Categorizing user inputs doesn’t need a knowledge base.

You need one when:

  • Your knowledge base is too large to fit in a context window.
  • The model needs information it wasn’t trained on: recent events, proprietary internal data, real-time updates.
  • Users ask unstructured questions against structured knowledge, like querying a 500-page product manual.
  • Hallucinations are a real risk and you want to ground responses in verified documents. Vector search gives you citations you can point to.

The question I ask is: “Is the model missing knowledge it needs to answer correctly?” If yes, you need retrieval. If the model already knows everything it needs, adding a vector database adds latency and complexity for no benefit. The AI Hallucinations explainer covers the four root causes of model errors in detail, which helps you figure out when retrieval fixes the problem and when it doesn’t.

The Index-Then-Query Flow

Every vector database application has two phases. Getting them in the right order matters.

Indexing phase (offline, runs once or on update):

  1. Load your source documents (PDFs, Notion pages, database rows).
  2. Split them into chunks (typically 256-512 tokens with 10-20% overlap so chunks don’t lose context at the boundary).
  3. Embed each chunk using your embedding model.
  4. Store the embedding plus the original text and metadata in the vector database.

Query phase (online, runs per user request):

  1. Receive the user’s question.
  2. Embed the question using the same embedding model.
  3. Query the vector database for the top-k nearest chunks.
  4. Inject those chunks into your LLM prompt.
  5. Generate a response.
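Here are both phases end to end with Chroma, since it needs no server. A minimal sketch: the documents and IDs are invented, and Chroma’s default embedding function stands in for whichever model you’ve standardized on (in production, pass your own so indexing and querying use the same model):

```python
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient() to keep data
collection = client.create_collection("docs")

# Indexing phase: chunks are embedded and stored with text and metadata.
collection.add(
    ids=["c1", "c2"],
    documents=[
        "Trail sneakers ship with reinforced soles.",
        "Quarterly tax filings are due April 15.",
    ],
    metadatas=[{"source": "catalog"}, {"source": "finance-wiki"}],
)

# Query phase: the question is embedded with the same model, top-k returned.
results = collection.query(query_texts=["running shoes"], n_results=1)
print(results["documents"][0])  # the chunks you inject into the LLM prompt
```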

The chunk size decision in step 2 matters more than most tutorials mention. Small chunks (128-256 tokens) give precise retrieval but lose context. Large chunks (1024+ tokens) preserve context but return too much irrelevant information to the LLM. I’ve seen both extremes hurt retrieval quality in production. Most teams land at 512 tokens with overlap as a starting point and adjust based on retrieval quality. And I’d suggest measuring recall against a small test set before going to production, not after.
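For reference, here’s overlap-aware splitting in its simplest form. A sketch that splits on whitespace; word counts only approximate tokens, so a real pipeline would count with a tokenizer such as tiktoken:

```python
def chunk_text(text, chunk_size=512, overlap=64):
    """Fixed-size chunks with ~12% overlap so boundary text appears twice."""
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), step)
    ]
```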

FAQ

Do I need a vector database for a basic RAG setup?

Yes, some form of one. Even a simple RAG system needs to store and search embeddings. For a prototype, you can use Chroma running in memory, which requires no server setup. For anything you’re deploying to real users, you want persistence: a self-hosted Chroma with a SQLite backend, pgvector on an existing Postgres instance, or a managed service like Pinecone. The “basic” RAG approach with in-memory Chroma is fine for learning and demos; it loses all your indexed data on restart.

Is pgvector good enough for production?

Yes, for most production workloads. pgvector with the HNSW index handles tens of millions of vectors on a decent Postgres instance and is running in production at companies shipping real products. The trade-off is that your database server handles both transactional queries and vector search simultaneously. Under heavy concurrent vector load, dedicated vector databases outperform pgvector. But that’s a scale problem most teams encounter well after launch, not before. Start with pgvector if you’re already on Postgres; migrate only when you have evidence you need to.

How much does it cost to index a large knowledge base?

The cost depends on your embedding model. Google’s text-embedding-004 is free up to 150 requests/minute and then $0.000025 per 1K characters (as of 2026). A 400-page document converted to text is roughly 200,000 tokens or 1M characters: under $0.03 to embed the entire thing. OpenAI’s text-embedding-3-large at $0.00013 per 1K tokens is more expensive but higher quality. For most initial builds, Google’s model is free enough that cost isn’t a factor. Re-embedding is where costs accumulate: if you index 1M documents and then switch embedding models, you pay to embed all 1M again.

Can I use a vector database without an LLM?

Yes. Vector databases are useful any time you need semantic search: product recommendation engines, image similarity search (using image embeddings), document deduplication, or cross-language search where you embed in a multilingual model. The LLM use case is the most discussed right now, but the underlying technology is general. You’re storing dense vector representations of things and finding similar things. The “things” don’t have to be text, and the goal doesn’t have to be RAG.

How is vector search different from keyword search?

Keyword search (like a SQL LIKE query or Elasticsearch’s default mode) matches documents that contain the exact words in your query. Vector search matches documents with similar meaning, regardless of word overlap. A keyword search for “purchase receipt” won’t find a document that says “payment confirmation” even though they mean the same thing. A vector search would find it, because the embeddings for those phrases are close together. The limitation of vector search is that it can retrieve semantically similar documents that are wrong for your specific query. Hybrid search, which combines keyword and vector results, often performs better than either alone in production. Weaviate and pgvector both support hybrid search natively.
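If your database doesn’t do the fusion for you, reciprocal rank fusion is one common way to merge the two result lists. A minimal sketch; the doc IDs are hypothetical and k=60 is the conventional default:

```python
def reciprocal_rank_fusion(keyword_ranked, vector_ranked, k=60):
    """Score each doc by summed 1/(k + rank) across both ranked ID lists."""
    scores = {}
    for ranking in (keyword_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# d2 ranks well in both lists, so it beats docs that top only one list.
print(reciprocal_rank_fusion(["d1", "d2", "d3"], ["d2", "d9", "d1"]))
```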


Stop reading about vector databases. Try it. The first 50 exercises on TinkerLLM are free, no card needed.

Open the playground →

vector databases · embeddings · RAG · semantic search · LLM fundamentals · pgvector · Pinecone
Dharini S · The Educator

Delivery lead at Kalvium Labs with a background in instructional design. Writes concept explainers and process posts. Thinks about how people actually learn before jumping to solutions.

LinkedIn

Want to try this yourself?

Open the TinkerLLM playground and experiment with real models. 50 exercises free.

Start Tinkering