Vector Search Internals: HNSW, IVF, and Product Quantization
HNSW: Hierarchical Navigable Small Worlds
Performing an exact nearest-neighbor search across millions of high-dimensional vectors is too slow for real-time applications. Hierarchical Navigable Small Worlds (HNSW) is a graph-based Approximate Nearest Neighbor (ANN) algorithm that achieves an excellent balance between search speed, recall, and memory usage. It builds on the principles of Navigable Small Worlds (NSW) and adds a critical enhancement: a hierarchical, multi-layer graph structure.
The Multi-Layer Architecture
HNSW builds multiple layers of graphs. The bottom layer (Layer 0) contains every data point in the index. Each layer above it holds a progressively smaller subset. The topmost layer contains only a handful of nodes, one of which serves as the global entry point.
Higher layers act as long-range expressways — they let the search jump quickly across the vector space. Lower layers provide fine-grained local navigation. The search always starts at the top and descends, zooming in on the target region layer by layer.
Each node exists in its assigned layer and all layers below it. A node assigned to Layer 2 is also present in Layers 1 and 0. Connections exist only within each layer. The search transitions downward by reusing the closest node found in layer l as the entry point for layer l - 1.
Graph Construction Algorithm
Building an HNSW index is an incremental process. Each new vector p is inserted
one at a time. The steps below walk through what happens when you insert a single point.
Step 1: Assign a Max Layer
A maximum layer L_max is drawn randomly using an exponentially decaying
probability: P(layer) ∝ e^(-layer / mL). Most points land in Layer 0
only. A small fraction reach Layer 1 or higher. This keeps upper layers sparse and
fast to traverse. The factor mL controls how sparse those upper layers
are. The default is mL = 1 / ln(M).
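A minimal sketch of this layer draw (the function name and the M value are illustrative; libraries such as hnswlib perform an equivalent floor(-ln(U) × mL) draw internally):
import math, random
def assign_max_layer(M=16):
    # mL = 1 / ln(M); layer = floor(-ln(U) * mL) with U ~ Uniform(0, 1]
    mL = 1.0 / math.log(M)
    u = 1.0 - random.random()  # in (0, 1], avoids log(0)
    return int(-math.log(u) * mL)
# With M=16: roughly 94% of points stay at Layer 0, ~6% reach Layer 1+, ~0.4% reach Layer 2+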
Step 2: Find the Entry Point
The algorithm starts at the designated entry point of the topmost existing layer.
For a new point being inserted into layers 0 through L_max, insertion
begins at the highest layer currently in the graph. If L_max exceeds the
current graph height, the new point becomes the new global entry point.
Step 3: Greedy Search (Top Down)
Starting from the entry point, the algorithm navigates greedily toward p
in the current layer — always moving to the neighbor that is closest to p.
For layers above L_max, only 1 nearest neighbor is tracked (no connections
are made). The closest node found in layer l becomes the entry point for
layer l - 1.
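A sketch of that per-layer greedy step, assuming a hypothetical adjacency map neighbors (node id to neighbor ids) and a dist(q, node) helper:
def greedy_search_layer(q, entry, neighbors, dist):
    # Repeatedly move to whichever neighbor of the current node is closest to q,
    # stopping when no neighbor improves on the current distance.
    current, best = entry, dist(q, entry)
    improved = True
    while improved:
        improved = False
        for nb in neighbors[current]:
            d = dist(q, nb)
            if d < best:
                current, best, improved = nb, d, True
    return current  # reused as the entry point for the layer below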
Step 4: Collect Candidates (efConstruction)
For layers at or below L_max, the search maintains a dynamic priority queue
of the closest candidates found so far. The size of this queue is controlled by
efConstruction. A larger value explores more neighbors, producing a
higher-quality graph (better recall at query time) at the cost of longer build time.
Typical values range from 64 to 512.
Step 5: Establish Connections (M)
From the candidate list, the algorithm selects neighbors for p and creates
bidirectional edges. The parameter M caps the maximum number of outgoing edges per node per layer. Layer 0 often uses Mmax0 = 2M to improve base
connectivity. If adding p would push a neighbor over its limit, that
neighbor drops its furthest existing connection.
Search Algorithm
Searching for the k nearest neighbors of a query vector q mirrors
insertion — but the graph is read-only. No connections are modified.
- Start at the top. Use the global entry point of the topmost layer.
- Greedy descent. In each layer, navigate greedily toward q. Track only the single closest node found (for layers above Layer 0).
- Refine at Layer 0. Use the closest node from Layer 1 as the entry point. Run a more exhaustive search controlled by efSearch — a candidate queue that tracks the best nodes seen so far.
- Return top k. Once the queue cannot be improved further, return the k closest vectors from the final candidate list.
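In practice, the whole build-and-search procedure is a few library calls. A minimal sketch with hnswlib (the random data and all parameter values are illustrative):
import numpy as np
import hnswlib
dim, n = 768, 100_000
data = np.random.rand(n, dim).astype(np.float32)  # placeholder vectors
index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=128)
index.add_items(data)
index.set_ef(64)  # efSearch; must be >= k
labels, distances = index.knn_query(data[:5], k=10)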
efSearch directly controls the speed-recall tradeoff at query time. Adjust it below to see the effect on latency and recall:
efSearch Impact Simulator
How are these values calculated?
Recall@10 — modelled as a logarithmic saturation curve calibrated to published HNSW benchmarks (ANN-Benchmarks, SIFT-1M). Each additional candidate yields diminishing recall gains:
recall = min(99.8, 60 + 32 × log(efSearch / 10) / log(512 / 10))
Latency — each additional candidate adds a fixed per-node evaluation cost (edge traversal + distance computation). Modelled as linear:
latency_ms = 0.4 + efSearch × 0.028
Nodes Evaluated — each candidate causes ~6 neighbor expansions on average for a typical M=16 graph:
nodes_evaluated ≈ efSearch × 6.2
All values are illustrative approximations for a 1M-vector, 768-dim index with M=16, efConstruction=128 on commodity hardware. Actual numbers vary by dataset, hardware, and HNSW implementation.
efSearch must be ≥ k. If you request k=10 nearest neighbors, efSearch must be at least 10. Setting it lower produces undefined behavior in most implementations.
Core Parameters and Tuning
HNSW exposes four key parameters. Each controls a distinct dimension of index quality. The cards below summarize each parameter's role and typical values.
M — Max Connections per Layer
Controls graph density. Higher M improves recall and robustness but increases index size and build time significantly.
Typical range: 5 – 48
efConstruction — Build-time Candidate Queue
Larger value produces a better-quality graph (higher recall at query time) at the cost of longer build time. Does not affect index size.
Typical range: 64 – 512
efSearch — Query-time Candidate Queue
Directly trades search latency for recall. Must be ≥ k. Tuned at runtime without rebuilding the index.
Typical range: k – 512
mL — Level Normalization Factor
Controls layer assignment probability. Larger mL promotes more nodes into upper layers, making them denser; smaller mL keeps them sparser. Usually left at the default.
Default: 1 / ln(M)
Advantages and Disadvantages
| Dimension | Assessment | Rating |
|---|---|---|
| Search Speed | State-of-the-art QPS at high recall targets across most datasets | Excellent |
| Recall | Achieves 99%+ recall with high efSearch; tunable at runtime | Excellent |
| Memory Usage | Graph links add significant overhead — typically on the order of 100+ bytes per vector at default M | Moderate |
| Build Time | Slow at high efConstruction and M; scales super-linearly with dataset size | Moderate |
| Incremental Inserts | Supports adding new points without full rebuild | Good |
| Deletions | Marked as deleted (lazy); graph degrades over many deletes without compaction | Weak |
Inverted File Index (IVF) Variations
IVF takes a different approach from graph-based methods. It partitions the vector space into distinct regions using clustering, then restricts each search to only the most relevant regions. This dramatically reduces the number of distance calculations needed per query.
How IVF Works
During index construction, IVF runs k-means clustering on the dataset to produce
k centroids. Each data vector is then assigned to its nearest centroid, forming
k inverted lists (also called buckets or cells). At query time:
- The query vector q is compared against all k centroids.
- The nprobe nearest centroids are selected.
- Only the vectors in those nprobe lists are searched with full distance calculations.
- The closest results from those lists are returned.
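A minimal IVFFlat sketch with FAISS showing the train / add / probe flow (the random data, nlist=256, and nprobe=8 are illustrative):
import numpy as np
import faiss
d, nlist = 768, 256
xb = np.random.rand(100_000, d).astype('float32')  # placeholder database vectors
quantizer = faiss.IndexFlatL2(d)        # used to assign vectors to their nearest centroid
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                         # k-means over xb to learn the nlist centroids
index.add(xb)                           # fill the inverted lists
index.nprobe = 8                        # number of lists scanned per query
D, I = index.search(xb[:5], 10)         # full-precision distances inside the probed lists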
Tune nprobe below to explore how it affects recall and search cost:
nprobe Impact Simulator
How are these values calculated?
Recall@10 — modelled as a logarithmic saturation curve. With 256 total lists, probing more lists captures more of the true nearest neighbors, but with diminishing returns:
recall = min(99.5, 55 + 38 × log(nprobe) / log(256))
Dataset Scanned — each probed list contains roughly 1/k of the total dataset. With 256 lists:
coverage_% = (nprobe / 256) × 100
Latency — two costs sum up: scanning centroids to find the nearest nprobe (fixed cost), and scanning vectors inside those lists (linear in nprobe):
latency_ms = 0.3 + nprobe × 0.31
All values are illustrative approximations for a 1M-vector, 768-dim IVFFlat index with 256 centroids on commodity hardware. Actual numbers vary by dataset, hardware, and implementation.
IVFPQ: Combining IVF with Product Quantization
IVFFlat stores full float32 vectors inside each inverted list. For large datasets, this is memory-prohibitive. IVFPQ addresses this by compressing the vectors inside each list using Product Quantization (PQ). Only compact PQ codes are stored, not original vectors.
At query time, the search inside each probed list uses Asymmetric Distance Computation (ADC): the query vector stays full-precision, while distances to database vectors are approximated via precomputed lookup tables over the PQ codes. ADC is significantly faster than full-vector distance calculation.
IVFADC is another name for IVFPQ, explicitly calling out the Asymmetric Distance Computation step. Symmetric Distance Computation (SDC) also quantizes the query before comparison — faster, but less accurate than ADC. ADC is the default in most production systems.
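A comparable IVFPQ sketch with FAISS; compared to IVFFlat, only the index class and the PQ parameters change (the values are illustrative, and d must be divisible by the number of sub-quantizers):
import numpy as np
import faiss
d, nlist, m_pq, nbits = 768, 256, 48, 8    # 48 sub-quantizers x 8 bits -> 48-byte codes
xb = np.random.rand(100_000, d).astype('float32')  # placeholder database vectors
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m_pq, nbits)
index.train(xb)                            # trains both the coarse centroids and the PQ codebooks
index.add(xb)                              # stores only PQ codes + ids in the inverted lists
index.nprobe = 8
D, I = index.search(xb[:5], 10)            # ADC against PQ codes inside the probed lists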
Choosing Between IVFFlat and IVFPQ
| Criteria | IVFFlat | IVFPQ |
|---|---|---|
| Memory usage | High (full vectors) | Low (PQ codes, 10–100x smaller) |
| Search accuracy | Exact within probed lists | Approximate (quantization error) |
| Search speed | Slower inner-list scan | Faster via ADC lookup tables |
| Complexity | Simple | Requires PQ training step |
| Best for | Moderate datasets, high accuracy needed | Very large datasets, memory-constrained |
Product Quantization (PQ) Mechanics
A 768-dimensional float32 vector requires 3,072 bytes. Storing 100 million such vectors takes roughly 300 GB of RAM. Product Quantization (PQ) compresses these vectors to a few dozen bytes each while preserving enough information for fast approximate distance calculations.
Vector Decomposition
PQ splits a D-dimensional vector x into M non-overlapping
sub-vectors, each of dimension D/M. For example, a 768-dim vector with M=48
produces 48 sub-vectors of dimension 16 each.
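A minimal training-and-encoding sketch with scikit-learn's MiniBatchKMeans follows; train_data stands in for a representative sample of real vectors.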
# D=768, M=48, k*=256 (1 byte per code)
import numpy as np
from sklearn.cluster import MiniBatchKMeans
D, M, K_STAR = 768, 48, 256
sub_dim = D // M # 16
# Placeholder training sample (in practice, a representative subset of your real vectors)
train_data = np.random.rand(100_000, D).astype(np.float32)
# Train M independent codebooks on training data
codebooks = []
for m in range(M):
sub_vectors = train_data[:, m * sub_dim : (m+1) * sub_dim]
kmeans = MiniBatchKMeans(n_clusters=K_STAR, random_state=42)
kmeans.fit(sub_vectors)
codebooks.append(kmeans)
# Encode a single vector x -> M bytes
def pq_encode(x):
code = np.empty(M, dtype=np.uint8)
for m in range(M):
sub = x[m * sub_dim : (m+1) * sub_dim]
code[m] = codebooks[m].predict([sub])[0]
return code # 48 bytes vs. 3072 bytes original → 64x compression
Subspace Quantization and Codebooks
For each of the M subspaces, PQ trains a separate codebook of k* centroids
using k-means. Each sub-vector is then replaced by the index (0–255 when k*=256) of its
nearest centroid in that subspace's codebook. The entire original vector reduces to
M bytes — 48 bytes for the 768-dim example, versus the original 3,072 bytes.
This is a 64x compression ratio.
Approximate Distance Calculation
The key performance gain of PQ is not just storage compression — it is fast distance
approximation. For a query vector q, PQ precomputes a lookup table for
each of the M subspaces: the distance from q's sub-vector to every one
of the k* centroids in that subspace.
This creates M tables, each with k* entries. Approximating the distance to any database vector then costs exactly M table lookups and M-1 additions — no floating-point multiplication over D dimensions. For D=768 and M=48, this replaces 768 multiply-add operations with 48 integer lookups.
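Continuing the PQ sketch above (reusing M, K_STAR, sub_dim, and codebooks), a minimal ADC lookup-table computation for squared L2 distance:
def build_lut(q):
    # One table per subspace: lut[m][c] = squared distance from q's m-th sub-vector
    # to centroid c of codebook m
    lut = np.empty((M, K_STAR), dtype=np.float32)
    for m in range(M):
        q_sub = q[m * sub_dim : (m+1) * sub_dim]
        centroids = codebooks[m].cluster_centers_   # shape (K_STAR, sub_dim)
        lut[m] = ((centroids - q_sub) ** 2).sum(axis=1)
    return lut
def adc_distance(lut, code):
    # M table lookups + additions; no per-dimension arithmetic on the database vector
    return lut[np.arange(M), code].sum()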
ADC vs SDC. Asymmetric Distance Computation (ADC) keeps the query full-precision. Symmetric Distance Computation (SDC) also quantizes the query, enabling all-centroid precomputation, but adds quantization error to the query side. SDC is faster but less accurate. ADC is the default in FAISS and most production systems.
Trade-offs
PQ introduces a tunable accuracy-compression tradeoff via two parameters:
- M (number of subspaces): More subspaces means shorter sub-vectors, finer quantization, and higher recall — but the code length grows (M bytes per vector) and codebook training time increases linearly.
- k* (centroids per codebook): More centroids means finer resolution within each subspace. k*=256 is the standard choice — it fits each code in one byte (8 bits), which is byte-aligned and SIMD-friendly. Going to k*=512 needs 9 bits and complicates packing.
PQ is not an indexing structure. It does not provide sub-linear search on its own. You always combine it with an index: IVFPQ uses PQ to compress vectors inside IVF lists, and HNSW+PQ compresses vectors stored at graph nodes. PQ handles memory and speed. HNSW or IVF handles structure and navigation.
Other Graph-Based ANN Methods
HNSW is not the only graph-based ANN algorithm. NSG and Vamana represent distinct design philosophies for building proximity graphs, each targeting different constraints.
NSG: Navigating Spreading-Out Graphs
NSG focuses on building a graph with strong directional coverage. Rather than connecting each node purely to its nearest neighbors, NSG selects neighbors that provide diverse angular coverage relative to the node — the "spreading-out" property. This prevents the graph from clustering too tightly around local neighborhoods, which would trap greedy search in local optima.
Construction starts from an initial approximate k-NN graph. A designated navigation node serves as the global entry point for all searches. For each node, NSG iteratively selects neighbors by checking whether a candidate improves reachability from the navigation node. After construction, a refinement pass ensures global connectivity.
NSG can achieve high recall with compact graphs — fewer edges than HNSW at equivalent recall targets in some benchmarks. The tradeoff: construction is more complex and the single navigation node can create hotspot congestion under high concurrency.
Vamana Algorithm
Vamana is the core index behind DiskANN, designed for datasets too large to fit in RAM. It explicitly optimizes the graph structure for greedy search performance, not just proximity.
Construction is iterative. Vamana starts with a random graph of fixed out-degree, then
runs multiple refinement passes. In each pass, it simulates a greedy search for each
node and uses a RobustPrune heuristic to update neighborhood sets.
RobustPrune drops a candidate neighbor p_c when it is already covered by an
already-selected neighbor p_n, i.e. when α · d(p_n, p_c) ≤ d(p, p_c) for the node p
being processed and a pruning parameter α ≥ 1, which suppresses redundant
short-range edges in favor of long-range navigational edges.
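A sketch of that pruning step as described in the DiskANN paper; the node ids, the dist(a, b) helper, and the defaults for alpha and R are illustrative:
def robust_prune(p, candidates, dist, alpha=1.2, R=64):
    # Pick up to R neighbors for node p, dropping candidates already covered by a
    # closer selected neighbor (scaled by alpha >= 1, which favors longer edges)
    candidates = set(candidates) - {p}
    neighbors = []
    while candidates and len(neighbors) < R:
        p_star = min(candidates, key=lambda c: dist(p, c))   # closest remaining candidate
        neighbors.append(p_star)
        candidates = {c for c in candidates if alpha * dist(p_star, c) > dist(p, c)}
    return neighbors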
Because Vamana targets disk storage, it batches neighbor list reads to minimize random I/O. The resulting graph achieves state-of-the-art recall at billion-scale datasets that reside entirely on SSDs.
Choosing Between Graph Methods
| Dimension | HNSW | NSG | Vamana |
|---|---|---|---|
| Structure | Multi-layer hierarchy | Flat, diverse neighbors | Flat, iteratively optimized |
| Build approach | Single-pass, incremental | Refinement from k-NN graph | Multi-pass optimization |
| Build time | Moderate | Varies by heuristic | Slow (multiple passes) |
| Memory | High (multi-layer links) | Lower (flat, sparse) | Disk-resident; low RAM |
| Best for | In-memory, general purpose | Memory-sensitive in-memory | Billion-scale disk datasets |
| Ecosystem maturity | Excellent (FAISS, hnswlib, Weaviate, Qdrant) | Research / niche | DiskANN, Azure AI Search |
Selecting the Right ANN Algorithm
No single algorithm wins across all dimensions. Your choice must be driven by your specific constraints: dataset size, memory budget, recall target, latency budget, and update frequency.
Core Trade-off Dimensions
- Recall: What fraction of true top-k neighbors do you need? 95%? 99%?
- Latency: Do you need sub-millisecond P99, or is 10ms acceptable?
- Memory: Can the entire index fit in RAM, or must it live on disk?
- Build time: Is index construction a one-time cost or a frequent event?
- Updateability: Do vectors change frequently? Are deletions common?
Algorithm Characteristics at a Glance
- HNSW: Best recall-speed tradeoff in-memory. High RAM cost. Good for incremental inserts. Weak on deletes.
- IVFFlat: Simple and accurate within probed cells. Memory scales with dataset size. Needs nprobe tuning.
- IVFPQ: Essential when RAM is constrained. 10–100x memory reduction. Accepts quantization recall loss.
- DiskANN (Vamana): Billion-scale disk-resident search. High recall at low RAM. Requires SSD with good random read latency.
- NSG: Compact flat graph. Lower memory than HNSW at similar recall. Less ecosystem support.
- Brute-force (flat) search: Perfect recall by definition. Linear scan. Only viable below ~50K vectors or with GPU brute-force.
Making the Decision
- Define your recall floor. If you need 99%+ recall, HNSW with high efSearch is your starting point. If 90% recall is acceptable, IVF variants open up.
- Check your memory budget. If the index must fit under a fixed RAM ceiling, calculate index size first. For HNSW: roughly M × 8 × N bytes in graph links alone, plus vector storage (a worked estimate follows this list). IVFPQ compresses vectors to M_pq bytes each and has no graph overhead.
- Check your dataset size. Below 1M vectors, HNSW in-memory is almost always correct. Above 100M vectors, consider IVFPQ or DiskANN.
- Benchmark on your data. Theoretical comparisons guide the starting point. Only benchmarks on your actual query distribution and vectors reveal the true recall-latency curve.
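As referenced in the memory-budget step above, a back-of-envelope estimate under simple assumptions (float32 vectors, ~8 bytes of link storage per HNSW connection, M_pq-byte PQ codes; real overheads vary by library):
N, D = 10_000_000, 768
M, M_pq = 16, 48
vectors_gb = N * D * 4 / 1e9        # raw float32 storage: ~30.7 GB
hnsw_links_gb = N * M * 8 / 1e9     # HNSW graph links alone: ~1.3 GB
ivfpq_codes_gb = N * M_pq / 1e9     # IVFPQ codes (plus small centroid tables): ~0.5 GB
print(f"vectors {vectors_gb:.1f} GB, HNSW links {hnsw_links_gb:.1f} GB, IVFPQ codes {ivfpq_codes_gb:.2f} GB")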
Filtering and updates matter. If your queries include metadata filters (e.g., "similar to X but only from tenant Y"), pre-filtering with IVF's natural partitioning is often more efficient than post-filtering HNSW results. For high-update workloads, consider indexes that support real-time compaction or segment-based structures rather than a single monolithic HNSW graph.