- Vector databases store and index embeddings to enable fast semantic similarity search over unstructured data.
- They power NLP and RAG by acting as an external memory layer that combines vector distance with metadata filters.
- Dedicated engines, vector-enabled SQL databases and lightweight libraries like VDB cover different needs for scale and control.
- ANN algorithms such as HNSW and distance metrics such as L2 and cosine strongly influence precision, latency and resource usage.

This article walks through the vector database landscape with a special focus on lightweight, on‑prem options: what a vector DB actually is, how it differs from a plain vector index, how it powers NLP and RAG, which engines and extensions are worth considering (from Milvus and Qdrant to PostgreSQL pgvector and embedded libraries like VDB), and how distance metrics and ANN algorithms influence both quality and performance.
What is a vector database and why does it matter?
Traditional relational databases shine at structured data in rows and columns, but they struggle when you throw huge volumes of unstructured content at them. Loading PDFs, chat logs, images or sensor data into a classic SQL schema and then preparing them for AI is not only tedious, it is computationally inefficient when you need semantic similarity instead of exact matches.
Vector databases solve this by working directly with dense vectors instead of just tokens or keywords. Rather than asking “does this field contain the word smartphone?”, you ask “which stored vectors are closest to the query embedding?”, and the system returns semantically related items even if they do not share the exact same wording.
This shift from keyword matching to similarity in vector space is what enables semantic search, robust recommendations and powerful retrieval‑augmented generation (RAG). Companies can now combine their traditional business data with “semantic memory” in a single architecture, either via dedicated vector engines or by enabling vector types inside existing databases.
Vectors, embeddings and the problem they actually solve
At the core of any vector database are vectors: ordered lists of numbers that locate an item in a multi‑dimensional space. Each vector corresponds to an object – a sentence, a paragraph, an image, a product, a user profile – encoded along dozens, hundreds or even thousands of dimensions learned by a machine learning model.
Different embedding models define different vector spaces and dimensionalities. Some might output 384‑dimensional vectors, others 768 or more; as dimension grows, the representation can capture richer nuances but also becomes more challenging to index efficiently. Vector databases specialise in handling precisely this: long floating‑point vectors at scale.
The real pain they solve is the rigidity of traditional keyword search on unstructured data. A classic search for “smartphone” will miss documents that only mention “cell phone” or “mobile device”; typo‑tolerant keyword search helps a bit, but it still cannot truly understand that “mid‑century modern house with natural light” is a style, not a literal phrase you will find in every listing.
By storing embeddings, a vector database allows similarity search: queries and documents are both vectors, and closeness in that space stands in for semantic relatedness. That is why a search for “cell phone” can retrieve documents that only mention “smartphone”; their embeddings land in the same region of the space, even with different surface forms.
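As a quick illustration, here is a minimal sketch using the sentence-transformers library (the all-MiniLM-L6-v2 model and the example phrases are arbitrary choices): the query “cell phone” should score highest against the smartphone document even though the two share no words.

```python
# Minimal sketch: embed a query and a few documents, then compare them in vector space.
# Assumes the sentence-transformers package; any sentence embedding model would do.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

query = "cell phone"
documents = [
    "smartphone reviews and comparisons",
    "mid-century modern house with natural light",
    "quarterly sales report",
]

query_vec = model.encode(query, normalize_embeddings=True)
doc_vecs = model.encode(documents, normalize_embeddings=True)

# Cosine similarity: higher means semantically closer, even without shared words.
scores = util.cos_sim(query_vec, doc_vecs)[0]
for doc, score in zip(documents, scores):
    print(f"{float(score):.3f}  {doc}")
```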
Vector index vs full vector database
It is useful to separate the idea of a “vector index” from that of a full‑blown vector database. Both deal with vectors, but they address different layers of the problem and come with different feature sets.
A vector index is a data structure optimised for nearest‑neighbour search. You give it a set of vectors and a query vector, and it tells you which stored items are closest. Libraries like FAISS are great at this; they implement efficient algorithms for approximate nearest neighbour (ANN) search and clustering, but they are not full database systems.
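To make the distinction concrete, here is roughly what a bare index library gives you, sketched with FAISS's exact L2 index (the data below is random and purely illustrative):

```python
# A bare vector index: FAISS stores only the vectors and answers nearest-neighbour
# queries; IDs, metadata, persistence and access control are left to the application.
import numpy as np
import faiss

dim = 128
rng = np.random.default_rng(0)
vectors = rng.random((10_000, dim), dtype=np.float32)

index = faiss.IndexFlatL2(dim)   # exact L2 search; FAISS also offers ANN variants (IVF, HNSW, PQ)
index.add(vectors)               # positions 0..n-1 act as implicit IDs

query = rng.random((1, dim), dtype=np.float32)
distances, ids = index.search(query, 5)   # returns distances and indices of the 5 closest vectors
print(ids[0], distances[0])
```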
A vector database, in contrast, wraps those indexes with database capabilities such as metadata storage, schema management, security, resource management, concurrency control, failure recovery, and integration with broader data ecosystems. It is where organisations keep both embeddings and the original objects (or references to them), not just the index structures.
Enterprise‑ready vector databases also expose query languages and APIs that combine vector similarity with filters on structured attributes. You might query “documents similar to this paragraph, where project = X and created_at is within the last 30 days”, something that is hard to do cleanly with an index library alone.
Some modern relational systems have become “vector‑enabled databases” by adding native vector types. Oracle Database and MySQL, for example, now support vectors alongside classic numeric and text fields. That lets you keep business records and embeddings in one engine, avoiding consistency headaches between a separate vector store and your primary database.
How vector databases power NLP and generative AI
Semantic search is one of the most visible use cases. Instead of brittle keyword matching, you embed both the user query and all indexed documents, then retrieve those whose vectors are closest. The system can handle synonyms, paraphrases and even slightly off‑topic but contextually relevant phrasing, dramatically improving relevance over plain text search.
This semantic layer also reduces the impact of typos and noisy language. The user does not have to phrase a query perfectly; as long as the overall meaning is similar, the embedding model places the query near the correct documents and the vector DB surfaces them.
Efficient embedding management is another key role. Vector databases are optimised to store, index and retrieve huge volumes of text embeddings generated by large models; they let applications treat this as a fast, queryable “memory bank” that can be accessed in milliseconds, rather than a collection of files or ad‑hoc arrays in some application process. Generating those embeddings in the first place typically relies on optimised model runtimes and hardware accelerators to be practical at scale.
In practice, this shows up in several NLP applications: chatbots and AI assistants use vector DBs to look up relevant parts of prior conversations or documentation; Q&A systems convert documentation into embeddings and answer complex questions by retrieving and synthesising the right passages; sentiment and intent analysis benefit from richer semantic relationships encoded in the vectors; recommendation engines infer similarity between items and users based on their embedding space proximity.
Vector search in retrieval‑augmented generation (RAG)
Retrieval‑augmented generation (RAG) combines vector search with large language models to tame issues like hallucinations and stale knowledge. LLMs have a fixed training cutoff and cannot see your proprietary documents unless you explicitly provide them at inference time.
The typical RAG pipeline starts by chunking your knowledge base into smaller segments – for instance 200-500 words per chunk for text – and then encoding each chunk into an embedding vector using a chosen model. These vectors, together with metadata like titles, tags or source URLs, are stored in a vector database.
When a user asks a question, the system embeds the query with the same model and performs a similarity search against the stored embeddings. The top‑k closest chunks are assumed to be “about” the question and are retrieved in milliseconds, thanks to the DB’s ANN indexes.
The retrieved chunks are then prepended or otherwise injected into the LLM prompt. This is the “augmentation” part: the model receives both the original user request and several relevant pieces of external context, which helps it ground its answer in facts rather than guesswork.
Finally, the LLM generates a response conditioned on this retrieved context. Because the database content can be updated continuously, RAG allows LLMs to answer using up‑to‑date, domain‑specific information without retraining the model itself, and reduces hallucinations by anchoring outputs in actual documents.
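A compressed sketch of that pipeline is shown below; embed(), vector_store and llm are placeholders for whichever embedding model, vector store client and LLM API you actually use.

```python
# Sketch of a RAG pipeline. embed(), vector_store and llm are placeholders
# for your embedding model, vector database client and LLM API.

def chunk(text, max_words=300):
    """Naive chunker: split a document into roughly 300-word segments."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def index_corpus(documents, embed, vector_store):
    """Ingestion: chunk, embed and store each segment together with its metadata."""
    for doc in documents:
        for i, segment in enumerate(chunk(doc["text"])):
            vector_store.add(
                vector=embed(segment),
                payload={"text": segment, "source": doc["source"], "chunk": i},
            )

def answer(question, embed, vector_store, llm, k=5):
    """Query: embed the question, retrieve top-k chunks, then augment the prompt."""
    hits = vector_store.search(vector=embed(question), top_k=k)
    context = "\n\n".join(hit["payload"]["text"] for hit in hits)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```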
How similarity search actually works
Under the hood, vector search is about comparing a query vector to many stored vectors and ranking them by a distance or similarity score. The challenge is doing this quickly and accurately when you have millions or billions of vectors in high dimensions.
The basic steps are consistent across engines. First, you vectorise your data: text, images, audio or other content are fed through an embedding model to produce vectors. Next, you store those vectors in the database, often together with IDs and metadata, and build one or more ANN indexes on top.
At query time, the user input is also embedded into a vector. The database then uses the index to find approximate nearest neighbours with respect to a chosen metric – cosine similarity, Euclidean distance, inner product or others – and returns the top matches along with their similarity scores.
Results are usually ranked by similarity score so that the closest vectors appear first. Many engines also support hybrid queries, where you filter by metadata (for example price range, location, category) while simultaneously optimising for vector similarity, giving you more business‑aware results.
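Stripped of indexing, the core computation is just a ranked comparison of the query against every stored vector; a brute-force NumPy version (random data, illustrative only) looks like this.

```python
# Brute-force similarity search: compare the query against every stored vector
# and rank by cosine similarity. ANN indexes exist to avoid this linear scan.
import numpy as np

def cosine_top_k(stored: np.ndarray, query: np.ndarray, k: int = 5):
    stored_norm = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = stored_norm @ query_norm      # cosine similarity per stored vector
    top = np.argsort(-scores)[:k]          # indices of the k best matches
    return top, scores[top]

vectors = np.random.rand(100_000, 384).astype(np.float32)
query = np.random.rand(384).astype(np.float32)
ids, scores = cosine_top_k(vectors, query)
print(ids, scores)
```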
To make all of this fast at scale, modern vector databases rely on approximate nearest neighbour algorithms. They trade a tiny bit of recall for huge improvements in speed and memory usage, which is acceptable for most real‑world AI applications.
Key ANN algorithms: HNSW, LSH and Product Quantization
Hierarchical Navigable Small World (HNSW) is one of the most widely used ANN algorithms in vector databases. It organises vectors into multiple graph layers: upper layers have few nodes and long‑range connections, lower layers get progressively denser, and the bottom layer contains every vector.
During search, HNSW starts from an entry point on the top layer and greedily walks towards closer neighbours, moving down layers as it refines the search. This layered graph structure yields an efficient balance between recall and latency, which is why HNSW powers engines like Milvus, Qdrant and others.
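If you want to experiment with HNSW outside a full database, the standalone hnswlib library exposes the same index structure directly; a minimal sketch with arbitrary data and settings:

```python
# HNSW index via hnswlib: M controls graph connectivity, ef_construction and ef
# trade build/query time against recall.
import numpy as np
import hnswlib

dim, num_elements = 384, 100_000
data = np.random.rand(num_elements, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

index.set_ef(50)  # query-time search breadth: higher = better recall, slower queries
query = np.random.rand(1, dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels, distances)
```

Raising M and ef improves recall at the cost of memory and latency, which is essentially the same trade‑off that database engines expose through their HNSW index settings.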
Locality‑Sensitive Hashing (LSH) takes a different approach, using hash functions that map similar vectors into the same buckets with high probability. Unlike traditional hashing that tries to avoid collisions, LSH embraces them for similar items. Multiple hash tables are built so that each query only needs to inspect candidates from matching buckets instead of the full dataset.
This effectively reduces dimensionality while preserving neighbourhood structure in a probabilistic way. LSH can be very attractive for high‑dimensional data when you need extremely fast candidate generation and can tolerate approximate results.
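A toy illustration of the idea for cosine similarity uses random hyperplanes: each vector is reduced to a short bit signature, and only vectors sharing that signature are compared (real systems build several such tables to keep recall high).

```python
# Toy LSH for cosine similarity: random hyperplanes turn each vector into a short
# bit signature; similar vectors tend to land in the same bucket.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_planes = 384, 16
planes = rng.normal(size=(n_planes, dim))     # one random hyperplane per signature bit

def bucket_key(vec):
    bits = (planes @ vec) > 0                 # which side of each hyperplane the vector falls on
    return bits.tobytes()                     # hashable bucket key

vectors = rng.normal(size=(50_000, dim))
table = defaultdict(list)
for i, v in enumerate(vectors):
    table[bucket_key(v)].append(i)

query = vectors[123] + 0.05 * rng.normal(size=dim)   # a slightly perturbed version of item 123
candidates = table[bucket_key(query)]                 # only this bucket is scanned, not the full set
print(len(candidates), 123 in candidates)             # production LSH uses multiple tables to boost recall
```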
Product Quantization (PQ) focuses on compressing vectors to save memory and accelerate distance computations. It splits each high‑dimensional vector into several subvectors, then quantises each subspace separately and stores only the IDs of the closest centroids, forming a short code.
This compression can reduce memory usage by over 90% while still enabling distance estimation. Although PQ is lossy and may reduce search precision slightly, it is extremely powerful for massive collections where RAM is the main bottleneck, and is a staple in tools like FAISS and some vector DB backends.
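A stripped-down sketch of the encoding step, splitting 128-dimensional vectors into 8 subvectors and quantising each subspace with k-means (scikit-learn here; all sizes are arbitrary):

```python
# Product Quantization sketch: each vector is split into sub-vectors, and each
# sub-vector is replaced by the ID of its nearest centroid in that subspace.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, dim, n_sub, n_centroids = 10_000, 128, 8, 256
sub_dim = dim // n_sub
vectors = rng.random((n, dim)).astype(np.float32)

codebooks, codes = [], np.empty((n, n_sub), dtype=np.uint8)
for s in range(n_sub):
    sub = vectors[:, s * sub_dim:(s + 1) * sub_dim]
    km = KMeans(n_clusters=n_centroids, n_init=4, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_)
    codes[:, s] = km.labels_                  # one byte per subspace

# 128 float32 values (512 bytes) per vector shrink to 8 bytes of centroid IDs;
# distances are later estimated from lookup tables built over the codebooks.
print(vectors.nbytes, "->", codes.nbytes)
```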
Distance metrics: Euclidean vs cosine and friends
The quality of your vector search also depends heavily on the distance or similarity metric you choose. Two of the most common choices are Euclidean distance (L2) and cosine similarity (or its complement, cosine distance).
Euclidean distance measures the straight‑line distance between two points in n‑dimensional space. For vectors P and Q, it is the square root of the sum of squared coordinate differences. Shorter distance means greater similarity, and its range goes from 0 (identical vectors) to infinity.
This metric is sensitive to magnitude. If one vector is much longer than another – for example, representing a longer document or larger feature values – Euclidean distance will reflect that, even if both vectors point roughly in the same direction. It works well when absolute scale carries semantic meaning, e.g. physical coordinates or continuous numeric features where size matters.
Cosine similarity, in contrast, looks at the angle between two vectors, not their length. It is the dot product divided by the product of vector norms. Many practical systems use cosine distance = 1 − cosine similarity, where 0 means identical direction and larger values mean more dissimilarity.
Because it ignores magnitude, cosine similarity is ideal when orientation encodes semantics. In text applications, two documents on the same topic – one short and one long – should still be considered very similar; cosine makes that happen, whereas Euclidean distance might penalise the longer document just for having bigger counts.
In high‑dimensional, sparse spaces typical of NLP, cosine similarity tends to behave more robustly than Euclidean distance. The “curse of dimensionality” makes all Euclidean distances start to look similar in very high dimensions, which can reduce discriminative power. Cosine operates on the normalised vectors and often yields more meaningful similarity ordering for text embeddings.
Choosing a metric is ultimately about what you want “similarity” to mean in your domain. If scale is important – for example, anomaly detection based on magnitude of deviation – Euclidean can be appropriate. If thematic closeness or directional alignment matters more than length, cosine is typically the better fit. Some databases also expose inner product as a metric, which is closely related to cosine when vectors are normalised.
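A quick numeric check makes the difference tangible: scaling a vector changes its Euclidean distance to a reference vector but leaves the cosine similarity untouched.

```python
# Magnitude vs direction: lengthening a vector changes L2 distance but not cosine.
import numpy as np

def l2(a, b):
    return float(np.linalg.norm(a - b))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

short_doc = np.array([1.0, 2.0, 3.0])   # e.g. term counts of a short document
long_doc = 4 * short_doc                # same topic, four times longer

print(l2(short_doc, long_doc))      # ~11.22: Euclidean sees a large gap
print(cosine(short_doc, long_doc))  # 1.0: identical direction, i.e. "same topic"
```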
Popular vector databases and vector‑enabled systems
The ecosystem of vector storage options has exploded, ranging from fully managed cloud services to self‑hosted open‑source engines and library‑style solutions. The right choice depends on your scale, budget, operational constraints and how tightly you want to integrate with existing data infrastructure.
Dedicated vector databases are built from the ground up for high‑throughput similarity search. They usually support multiple ANN indexes, sophisticated compression schemes, rich metadata filtering and production‑grade clustering and failover.
Milvus is a prime example of a powerful open‑source vector database designed for large‑scale workloads. It targets machine learning, deep learning, similarity search and recommendation systems, and supports GPU acceleration, distributed queries and a variety of indexing methods such as IVF, HNSW and PQ.
This configurability lets you balance recall, latency and storage footprint according to your needs. Milvus is well‑suited for enterprises with billions of vectors, multilingual content and stringent performance requirements, and integrates smoothly into complex data platforms.
Other dedicated engines fill slightly different niches:
- Pinecone focuses on fully managed cloud deployments with tight SLAs and strong metadata capabilities;
- Weaviate offers an open‑source engine with GraphQL APIs, built‑in vectorisers and hybrid keyword + vector search;
- Qdrant provides a fast open‑source vector search service with advanced ANN methods and flexible filtering;
- Chroma targets simpler use cases and experimentation with an easy developer experience;
- Vespa excels at hybrid search and ranking that mix structured fields, text and vectors;
- Deep Lake concentrates on multimodal datasets like image and video where tight integration with ML frameworks is key.
At the same time, general‑purpose databases have started to adopt vector features rather than ceding the space completely. For organisations already invested in SQL or document stores, this can be a pragmatic way to add semantic search without standing up a separate system.
PostgreSQL with the pgvector extension is one of the most popular paths here. Pgvector introduces a VECTOR type that stores fixed‑dimension vectors directly in Postgres tables and exposes similarity operators for Euclidean distance, inner product and cosine distance.
That means you can create a table like embeddings(id SERIAL PRIMARY KEY, vector VECTOR(768)), index it, and then run queries of the form “give me the 5 closest vectors to a given query vector, ordered by L2 distance”, all in standard SQL. The extension supports indexes for reasonably high dimensions and plugs nicely into frameworks like LangChain.
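Sketched with psycopg2 (the connection string, table and index settings are illustrative), the flow looks roughly like this; <-> is pgvector's L2 distance operator, with <=> and <#> covering cosine distance and negative inner product.

```python
# pgvector sketch (table and column names are illustrative).
import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS embeddings (
        id      SERIAL PRIMARY KEY,
        content TEXT,
        vector  VECTOR(768)
    );
""")
# An approximate index (IVFFlat here); pgvector also supports HNSW indexes.
cur.execute("""
    CREATE INDEX IF NOT EXISTS embeddings_vector_idx
    ON embeddings USING ivfflat (vector vector_l2_ops) WITH (lists = 100);
""")

query_vec = "[" + ",".join(["0.01"] * 768) + "]"   # placeholder query embedding
cur.execute(
    "SELECT id, content FROM embeddings ORDER BY vector <-> %s::vector LIMIT 5;",
    (query_vec,),
)
print(cur.fetchall())
conn.commit()
```

With IVFFlat, the index is normally built after the table has been populated so that the list centroids reflect the actual data distribution.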
The big upside of pgvector is simplicity and consolidation. Your transactional data, analytics tables and embeddings all live in one engine, with one backup and security story. The trade‑off is that Postgres is not purpose‑built for billion‑vector workloads, so at extreme scale or ultra‑low latency requirements, a dedicated vector DB will generally outperform it.
Elasticsearch and OpenSearch can also be turned into vector‑aware systems via k‑NN plugins. If your team already runs a search cluster for logs or full‑text, enabling vector fields might be enough to prototype semantic search without re‑architecting. MongoDB has joined the trend too, integrating vector search into its document‑oriented ecosystem for lighter‑weight use cases.
Embedded and lightweight options: VDB and on‑prem scenarios
Not every project needs (or can afford) a distributed, enterprise‑grade vector database. For many founders and teams building MVPs, research tools or on‑device applications, a lightweight, embedded library is far more attractive.
VDB is an example of such a lightweight solution: a header‑only C library that implements core vector search functionality. It ships under an Apache 2.0 licence and can be dropped directly into C or C++ applications without exotic dependencies besides optional pthreads for multithreading.
The core feature set covers what most early‑stage products need. VDB supports multiple similarity metrics (cosine, Euclidean, inner product), multithreaded search to exploit multi‑core CPUs, basic persistence so you can save and reload indexes from disk, and official Python bindings so you can integrate it into the typical AI stack.
Because it is header‑only, integration is about as simple as it gets: include the headers in your project, compile, generate embeddings with your favourite model (OpenAI, Cohere, Sentence Transformers, etc.), push them into VDB with associated IDs or metadata, and query for the top‑k nearest neighbours when serving requests.
This design plays really well with on‑premise or edge deployments. If you are building a LangChain + ChatGPT style app but want to keep everything behind your own firewall, an embedded library avoids external dependencies and vendor lock‑in. For IoT or edge devices where cloud latency is unacceptable, having the vector store compiled into your binary is a big win.
There are, of course, trade‑offs: VDB does not attempt to replace a full enterprise DB. It relies on exact (brute‑force) search rather than sophisticated ANN graphs or quantisation, so query time scales linearly with dataset size. For tens of thousands or even a few hundred thousand vectors, that is often acceptable, especially with multithreading; for tens of millions, you will likely hit limits unless you shard or introduce your own indexing layer.
Real‑world hybrid search: joining vectors and metadata
In practice, almost every production use case combines vector similarity with strict filters on structured attributes. Users rarely want “the most similar thing in the whole corpus”; they want “similar, but also respecting these constraints”.
Consider a real‑estate search application where users describe the feel of a home – “mid‑century modern with lots of natural light” – while also requiring hard constraints like “3 bedrooms”, “under $800,000” and “in district A”. A plain vector search would happily return a gorgeous 2‑million‑dollar mid‑century villa in the wrong school district; plain SQL filters would never understand the style query.
Engines like AlloyDB for PostgreSQL illustrate how to address this with in‑line filtering. AlloyDB combines Postgres compatibility with Google’s scalable infrastructure, integrates pgvector as a first‑class extension, and augments it with a ScaNN‑based vector index for fast similarity search.
Its in‑line filtering means that the vector index and SQL metadata filters are applied in a single pass. Rather than doing a vector search, then filtering out non‑matching rows afterwards, AlloyDB checks numeric and categorical constraints as it traverses the vector index, avoiding wasted work and latency penalties.
The end result is a hybrid search that returns houses that match both aesthetic preferences and hard filters within milliseconds. This pattern generalises to e‑commerce (style + price + stock), content discovery (topic + language + region), and essentially any domain where “vibe” must coexist with strict business rules.
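Outside AlloyDB, the same hybrid pattern can be written against any pgvector-enabled Postgres; the sketch below (schema and values are invented for illustration) keeps the hard filters and the vector ordering in a single query so the engine can combine them.

```python
# Hybrid search sketch: hard metadata filters and vector similarity in one query
# (pgvector syntax; the listings table and its columns are illustrative).
import psycopg2

conn = psycopg2.connect("dbname=realestate")
cur = conn.cursor()

style_vec = "[" + ",".join(["0.01"] * 768) + "]"   # embedding of the style query
cur.execute(
    """
    SELECT id, address, price
    FROM listings
    WHERE bedrooms >= %s
      AND price <= %s
      AND district = %s
    ORDER BY style_embedding <=> %s::vector        -- cosine distance to the style query
    LIMIT 10;
    """,
    (3, 800_000, "A", style_vec),
)
print(cur.fetchall())
```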
From embeddings to production applications
Once you have chosen a storage approach, the high‑level flow for building vector‑based features is reasonably consistent, whether you are using Milvus, Qdrant, PostgreSQL + pgvector, Elasticsearch k‑NN or a lightweight library like VDB.
First, you generate embeddings for your corpus. For text, that could be documentation, knowledge bases, tickets, emails or chat logs; for images and multimodal data, you would use suitable vision or multimodal models. Each item becomes a vector plus any metadata you care about.
Next, you store embeddings in the chosen vector store together with identifiers and metadata. In a vector DB, this usually means creating a collection or table with vector and metadata fields; in VDB, it might be an in‑memory index backed by on‑disk snapshots.
At query time, you embed the user input with the same model and issue a similarity search. The database returns the top‑k most similar vectors, and you look up the underlying items (documents, products, images) using their IDs or stored payloads.
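With a developer-friendly store such as Chroma, that store-and-query loop is only a few lines (the collection name, documents and metadata below are invented); by default Chroma embeds the documents itself, though you can pass precomputed vectors instead.

```python
# Store-and-query sketch with Chroma (collection and documents are illustrative).
import chromadb

client = chromadb.Client()                        # in-memory; use PersistentClient for disk
collection = client.create_collection(name="kb")

collection.add(
    ids=["doc-1", "doc-2"],
    documents=["How to reset your password", "Quarterly revenue summary"],
    metadatas=[{"topic": "support"}, {"topic": "finance"}],
)

results = collection.query(
    query_texts=["I forgot my login credentials"],
    n_results=1,
    where={"topic": "support"},                   # metadata filter + vector similarity
)
print(results["ids"], results["distances"])
```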
For RAG, you pass the retrieved content as additional context to your LLM. For recommendation systems, you use the neighbours directly as candidates to rank. For analytics or anomaly detection, you can aggregate distances and neighbours to understand patterns and outliers.
Vector databases also make it easier to operationalise embedding models in a robust way. Instead of manually handling files or ad‑hoc arrays, you get proper resource management, scaling knobs, security controls and query languages that let you express complex similarity + filter queries cleanly. These operational concerns include monitoring, tracing and governance for production LLMs and vectors, as described in the layers of AI observability.
When combined with generative AI, this stack enables experiences that feel personalised, grounded in your own data and capable of evolving as your corpus grows. Whether you choose a heavyweight distributed DB or a lightweight on‑prem library, the conceptual pieces – embeddings, similarity metrics, ANN or exact search, and metadata filters – remain the same and form the backbone of modern AI applications.
As AI systems become more conversational, multimodal and context‑hungry, the role of vector databases as a semantic memory layer will only deepen; understanding how vectors are stored, indexed and compared is fast becoming a core skill for anyone building serious applications with language and vision models.