Vector Search Internals: HNSW, IVF, and Product Quantization
HNSW: Hierarchical Navigable Small Worlds
Performing an exact nearest-neighbor search across millions of high-dimensional vectors is too slow for real-time applications. Hierarchical Navigable Small Worlds (HNSW) is a graph-based Approximate Nearest Neighbor (ANN) algorithm that achieves an excellent balance between search speed, recall, and memory usage. It builds on the principles of Navigable Small Worlds (NSW) and adds a critical enhancement: a hierarchical, multi-layer graph structure.
The Multi-Layer Architecture
HNSW builds multiple layers of graphs. The bottom layer (Layer 0) contains every data point in the index. Each layer above it holds a progressively smaller subset. The topmost layer contains only a handful of nodes, one of which serves as the global entry point.
Higher layers act as long-range expressways — they let the search jump quickly across the vector space. Lower layers provide fine-grained local navigation. The search always starts at the top and descends, zooming in on the target region layer by layer.
Each node exists in its assigned layer and all layers below it. A node assigned to Layer 2 is also present in Layers 1 and 0. Connections exist only within each layer. The search transitions downward by reusing the closest node found in layer l as the entry point for layer l - 1.
Graph Construction Algorithm
Building an HNSW index is an incremental process. Each new vector p is inserted
one at a time. The steps below walk through what happens when you insert a single point.
Step 1: Assign a Max Layer
A maximum layer L_max is drawn randomly using an exponentially decaying
probability: P(layer) ∝ e^(-layer / mL). Most points land in Layer 0
only. A small fraction reach Layer 1 or higher. This keeps upper layers sparse and
fast to traverse. The factor mL controls how sparse those upper layers
are. The default is mL = 1 / ln(M).
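A minimal sketch of this layer draw (the function name and the M value are illustrative; libraries such as hnswlib perform an equivalent floor(-ln(U) × mL) draw internally):
import math, random
def assign_max_layer(M=16):
    # mL = 1 / ln(M); layer = floor(-ln(U) * mL) with U ~ Uniform(0, 1]
    mL = 1.0 / math.log(M)
    u = 1.0 - random.random()  # in (0, 1], avoids log(0)
    return int(-math.log(u) * mL)
# With M=16: roughly 94% of points stay at Layer 0, ~6% reach Layer 1+, ~0.4% reach Layer 2+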
Step 2: Find the Entry Point
The algorithm starts at the designated entry point of the topmost existing layer.
For a new point being inserted into layers 0 through L_max, insertion
begins at the highest layer currently in the graph. If L_max exceeds the
current graph height, the new point becomes the new global entry point.
Step 3: Greedy Search (Top Down)
Starting from the entry point, the algorithm navigates greedily toward p
in the current layer — always moving to the neighbor that is closest to p.
For layers above L_max, only 1 nearest neighbor is tracked (no connections
are made). The closest node found in layer l becomes the entry point for
layer l - 1.
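A sketch of that per-layer greedy step, assuming a hypothetical adjacency map neighbors (node id to neighbor ids) and a dist(q, node) helper:
def greedy_search_layer(q, entry, neighbors, dist):
    # Repeatedly move to whichever neighbor of the current node is closest to q,
    # stopping when no neighbor improves on the current distance.
    current, best = entry, dist(q, entry)
    improved = True
    while improved:
        improved = False
        for nb in neighbors[current]:
            d = dist(q, nb)
            if d < best:
                current, best, improved = nb, d, True
    return current  # reused as the entry point for the layer below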
Step 4: Collect Candidates (efConstruction)
For layers at or below L_max, the search maintains a dynamic priority queue
of the closest candidates found so far. The size of this queue is controlled by
efConstruction. A larger value explores more neighbors, producing a
higher-quality graph (better recall at query time) at the cost of longer build time.
Typical values range from 64 to 512.
Step 5: Establish Connections (M)
From the candidate list, the algorithm selects neighbors for p and creates
bidirectional edges. The parameter M caps the maximum number of outgoing edges per node per layer. Layer 0 often uses Mmax0 = 2M to improve base
connectivity. If adding p would push a neighbor over its limit, that
neighbor drops its furthest existing connection.
Search Algorithm
Searching for the k nearest neighbors of a query vector q mirrors
insertion — but the graph is read-only. No connections are modified.
- Start at the top. Use the global entry point of the topmost layer.
- Greedy descent. In each layer, navigate greedily toward q. Track only the single closest node found (for layers above Layer 0).
- Refine at Layer 0. Use the closest node from Layer 1 as the entry point. Run a more exhaustive search controlled by efSearch — a candidate queue that tracks the best nodes seen so far.
- Return top k. Once the queue cannot be improved further, return the k closest vectors from the final candidate list.
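In practice, the whole build-and-search procedure is a few library calls. A minimal sketch with hnswlib (the random data and all parameter values are illustrative):
import numpy as np
import hnswlib
dim, n = 768, 100_000
data = np.random.rand(n, dim).astype(np.float32)  # placeholder vectors
index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=128)
index.add_items(data)
index.set_ef(64)  # efSearch; must be >= k
labels, distances = index.knn_query(data[:5], k=10)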
efSearch directly controls the speed-recall tradeoff at query time. Adjust it below to see the effect on latency and recall:
efSearch Impact Simulator
How are these values calculated?
Recall@10 — modelled as a logarithmic saturation curve calibrated to published HNSW benchmarks (ANN-Benchmarks, SIFT-1M). Each additional candidate yields diminishing recall gains:
recall = min(99.8, 60 + 32 × log(efSearch / 10) / log(512 / 10))
Latency — each additional candidate adds a fixed per-node evaluation cost (edge traversal + distance computation). Modelled as linear:
latency_ms = 0.4 + efSearch × 0.028
Nodes Evaluated — each candidate causes ~6 neighbor expansions on average for a typical M=16 graph:
nodes_evaluated ≈ efSearch × 6.2
All values are illustrative approximations for a 1M-vector, 768-dim index with M=16, efConstruction=128 on commodity hardware. Actual numbers vary by dataset, hardware, and HNSW implementation.
efSearch must be ≥ k. If you request k=10 nearest neighbors, efSearch must be at least 10. Setting it lower produces undefined behavior in most implementations.
Core Parameters and Tuning
HNSW exposes four key parameters. Each controls a distinct dimension of index quality. The cards below summarize each parameter's role and typical values.
M — Max Connections per Layer
Controls graph density. Higher M improves recall and robustness but increases index size and build time significantly.
Typical range: 5 – 48
efConstruction — Build-time Candidate Queue
Larger value produces a better-quality graph (higher recall at query time) at the cost of longer build time. Does not affect index size.
Typical range: 64 – 512
efSearch — Query-time Candidate Queue
Directly trades search latency for recall. Must be ≥ k. Tuned at runtime without rebuilding the index.
Typical range: k – 512
mL — Level Normalization Factor
Controls layer assignment probability. Larger mL promotes more nodes into upper layers, making them denser; smaller mL keeps them sparser. Usually left at the default.
Default: 1 / ln(M)
Advantages and Disadvantages
| Dimension | Assessment | Rating |
|---|---|---|
| Search Speed | State-of-the-art QPS at high recall targets across most datasets | Excellent |
| Recall | Achieves 99%+ recall with high efSearch; tunable at runtime | Excellent |
| Memory Usage | Graph links add significant overhead — typically on the order of 100+ bytes per vector at default M | Moderate |
| Build Time | Slow at high efConstruction and M; scales super-linearly with dataset size | Moderate |
| Incremental Inserts | Supports adding new points without full rebuild | Good |
| Deletions | Marked as deleted (lazy); graph degrades over many deletes without compaction | Weak |
Inverted File Index (IVF) Variations
IVF takes a different approach from graph-based methods. It partitions the vector space into distinct regions using clustering, then restricts each search to only the most relevant regions. This dramatically reduces the number of distance calculations needed per query.
How IVF Works
During index construction, IVF runs k-means clustering on the dataset to produce
k centroids. Each data vector is then assigned to its nearest centroid, forming
k inverted lists (also called buckets or cells). At query time:
- The query vector q is compared against all k centroids.
- The nprobe nearest centroids are selected.
- Only the vectors in those nprobe lists are searched with full distance calculations.
- The closest results from those lists are returned.
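A minimal IVFFlat sketch with FAISS showing the train / add / probe flow (the random data, nlist=256, and nprobe=8 are illustrative):
import numpy as np
import faiss
d, nlist = 768, 256
xb = np.random.rand(100_000, d).astype('float32')  # placeholder database vectors
quantizer = faiss.IndexFlatL2(d)        # used to assign vectors to their nearest centroid
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                         # k-means over xb to learn the nlist centroids
index.add(xb)                           # fill the inverted lists
index.nprobe = 8                        # number of lists scanned per query
D, I = index.search(xb[:5], 10)         # full-precision distances inside the probed lists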
Tune nprobe below to explore how it affects recall and search cost:
nprobe Impact Simulator
How are these values calculated?
Recall@10 — modelled as a logarithmic saturation curve. With 256 total lists, probing more lists captures more of the true nearest neighbors, but with diminishing returns:
recall = min(99.5, 55 + 38 × log(nprobe) / log(256))
Dataset Scanned — each probed list contains roughly 1/k of the total dataset. With 256 lists:
coverage_% = (nprobe / 256) × 100
Latency — two costs sum up: scanning centroids to find the nearest nprobe (fixed cost), and scanning vectors inside those lists (linear in nprobe):
latency_ms = 0.3 + nprobe × 0.31
All values are illustrative approximations for a 1M-vector, 768-dim IVFFlat index with 256 centroids on commodity hardware. Actual numbers vary by dataset, hardware, and implementation.
IVFPQ: Combining IVF with Product Quantization
IVFFlat stores full float32 vectors inside each inverted list. For large datasets, this is memory-prohibitive. IVFPQ addresses this by compressing the vectors inside each list using Product Quantization (PQ). Only compact PQ codes are stored, not original vectors.
At query time, the search inside each probed list uses Asymmetric Distance Computation (ADC): the query vector stays full-precision, while distances to database vectors are approximated via precomputed lookup tables over the PQ codes. ADC is significantly faster than full-vector distance calculation.
IVFADC is another name for IVFPQ, explicitly calling out the Asymmetric Distance Computation step. Symmetric Distance Computation (SDC) also quantizes the query before comparison — faster, but less accurate than ADC. ADC is the default in most production systems.
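A comparable IVFPQ sketch with FAISS; compared to IVFFlat, only the index class and the PQ parameters change (the values are illustrative, and d must be divisible by the number of sub-quantizers):
import numpy as np
import faiss
d, nlist, m_pq, nbits = 768, 256, 48, 8    # 48 sub-quantizers x 8 bits -> 48-byte codes
xb = np.random.rand(100_000, d).astype('float32')  # placeholder database vectors
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m_pq, nbits)
index.train(xb)                            # trains both the coarse centroids and the PQ codebooks
index.add(xb)                              # stores only PQ codes + ids in the inverted lists
index.nprobe = 8
D, I = index.search(xb[:5], 10)            # ADC against PQ codes inside the probed lists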
Choosing Between IVFFlat and IVFPQ
| Criteria | IVFFlat | IVFPQ |
|---|---|---|
| Memory usage | High (full vectors) | Low (PQ codes, 10–100x smaller) |
| Search accuracy | Exact within probed lists | Approximate (quantization error) |
| Search speed | Slower inner-list scan | Faster via ADC lookup tables |
| Complexity | Simple | Requires PQ training step |
| Best for | Moderate datasets, high accuracy needed | Very large datasets, memory-constrained |
Product Quantization (PQ) Mechanics
A 768-dimensional float32 vector requires 3,072 bytes. Storing 100 million such vectors takes roughly 300 GB of RAM. Product Quantization (PQ) compresses these vectors to a few dozen bytes each while preserving enough information for fast approximate distance calculations.
Vector Decomposition
PQ splits a D-dimensional vector x into M non-overlapping
sub-vectors, each of dimension D/M. For example, a 768-dim vector with M=48
produces 48 sub-vectors of dimension 16 each.
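A minimal training-and-encoding sketch with scikit-learn's MiniBatchKMeans follows; train_data stands in for a representative sample of real vectors.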
# D=768, M=48, k*=256 (1 byte per code)
import numpy as np
from sklearn.cluster import MiniBatchKMeans
D, M, K_STAR = 768, 48, 256
sub_dim = D // M # 16
# Placeholder training sample (in practice, a representative subset of your real vectors)
train_data = np.random.rand(100_000, D).astype(np.float32)
# Train M independent codebooks on training data
codebooks = []
for m in range(M):
sub_vectors = train_data[:, m * sub_dim : (m+1) * sub_dim]
kmeans = MiniBatchKMeans(n_clusters=K_STAR, random_state=42)
kmeans.fit(sub_vectors)
codebooks.append(kmeans)
# Encode a single vector x -> M bytes
def pq_encode(x):
code = np.empty(M, dtype=np.uint8)
for m in range(M):
sub = x[m * sub_dim : (m+1) * sub_dim]
code[m] = codebooks[m].predict([sub])[0]
return code # 48 bytes vs. 3072 bytes original → 64x compression
Subspace Quantization and Codebooks
For each of the M subspaces, PQ trains a separate codebook of k* centroids
using k-means. Each sub-vector is then replaced by the index (0–255 when k*=256) of its
nearest centroid in that subspace's codebook. The entire original vector reduces to
M bytes — 48 bytes for the 768-dim example, versus the original 3,072 bytes.
This is a 64x compression ratio.
Approximate Distance Calculation
The key performance gain of PQ is not just storage compression — it is fast distance
approximation. For a query vector q, PQ precomputes a lookup table for
each of the M subspaces: the distance from q's sub-vector to every one
of the k* centroids in that subspace.
This creates M tables, each with k* entries. Approximating the distance to any database vector then costs exactly M table lookups and M-1 additions — no floating-point multiplication over D dimensions. For D=768 and M=48, this replaces 768 multiply-add operations with 48 integer lookups.
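Continuing the PQ sketch above (reusing M, K_STAR, sub_dim, and codebooks), a minimal ADC lookup-table computation for squared L2 distance:
def build_lut(q):
    # One table per subspace: lut[m][c] = squared distance from q's m-th sub-vector
    # to centroid c of codebook m
    lut = np.empty((M, K_STAR), dtype=np.float32)
    for m in range(M):
        q_sub = q[m * sub_dim : (m+1) * sub_dim]
        centroids = codebooks[m].cluster_centers_   # shape (K_STAR, sub_dim)
        lut[m] = ((centroids - q_sub) ** 2).sum(axis=1)
    return lut
def adc_distance(lut, code):
    # M table lookups + additions; no per-dimension arithmetic on the database vector
    return lut[np.arange(M), code].sum()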
ADC vs SDC. Asymmetric Distance Computation (ADC) keeps the query full-precision. Symmetric Distance Computation (SDC) also quantizes the query, enabling all-centroid precomputation, but adds quantization error to the query side. SDC is faster but less accurate. ADC is the default in FAISS and most production systems.
Trade-offs
PQ introduces a tunable accuracy-compression tradeoff via two parameters:
- M (number of subspaces): More subspaces means shorter sub-vectors, finer quantization, and higher recall — but the code length grows (M bytes per vector) and codebook training time increases linearly.
- k* (centroids per codebook): More centroids means finer resolution within each subspace. k*=256 is the standard choice — it fits each code in one byte (8 bits), which is byte-aligned and SIMD-friendly. Going to k*=512 needs 9 bits and complicates packing.
PQ is not an indexing structure. It does not provide sub-linear search on its own. You always combine it with an index: IVFPQ uses PQ to compress vectors inside IVF lists, and HNSW+PQ compresses vectors stored at graph nodes. PQ handles memory and speed. HNSW or IVF handles structure and navigation.
Other Graph-Based ANN Methods
HNSW is not the only graph-based ANN algorithm. NSG and Vamana represent distinct design philosophies for building proximity graphs, each targeting different constraints.
NSG: Navigating Spreading-Out Graphs
NSG focuses on building a graph with strong directional coverage. Rather than connecting each node purely to its nearest neighbors, NSG selects neighbors that provide diverse angular coverage relative to the node — the "spreading-out" property. This prevents the graph from clustering too tightly around local neighborhoods, which would trap greedy search in local optima.
Construction starts from an initial approximate k-NN graph. A designated navigation node serves as the global entry point for all searches. For each node, NSG iteratively selects neighbors by checking whether a candidate improves reachability from the navigation node. After construction, a refinement pass ensures global connectivity.
NSG can achieve high recall with compact graphs — fewer edges than HNSW at equivalent recall targets in some benchmarks. The tradeoff: construction is more complex and the single navigation node can create hotspot congestion under high concurrency.
Vamana Algorithm
Vamana is the core index behind DiskANN, designed for datasets too large to fit in RAM. It explicitly optimizes the graph structure for greedy search performance, not just proximity.
Construction is iterative. Vamana starts with a random graph of fixed out-degree, then
runs multiple refinement passes. In each pass, it simulates a greedy search for each
node and uses a RobustPrune heuristic to update neighborhood sets.
RobustPrune drops a candidate neighbor p_c when it is already covered by an
already-selected neighbor p_n, i.e. when α · d(p_n, p_c) ≤ d(p, p_c) for the node p
being processed and a pruning parameter α ≥ 1, which suppresses redundant
short-range edges in favor of long-range navigational edges.
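A sketch of that pruning step as described in the DiskANN paper; the node ids, the dist(a, b) helper, and the defaults for alpha and R are illustrative:
def robust_prune(p, candidates, dist, alpha=1.2, R=64):
    # Pick up to R neighbors for node p, dropping candidates already covered by a
    # closer selected neighbor (scaled by alpha >= 1, which favors longer edges)
    candidates = set(candidates) - {p}
    neighbors = []
    while candidates and len(neighbors) < R:
        p_star = min(candidates, key=lambda c: dist(p, c))   # closest remaining candidate
        neighbors.append(p_star)
        candidates = {c for c in candidates if alpha * dist(p_star, c) > dist(p, c)}
    return neighbors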
Because Vamana targets disk storage, it batches neighbor list reads to minimize random I/O. The resulting graph achieves state-of-the-art recall at billion-scale datasets that reside entirely on SSDs.
Choosing Between Graph Methods
| Dimension | HNSW | NSG | Vamana |
|---|---|---|---|
| Structure | Multi-layer hierarchy | Flat, diverse neighbors | Flat, iteratively optimized |
| Build approach | Single-pass, incremental | Refinement from k-NN graph | Multi-pass optimization |
| Build time | Moderate | Varies by heuristic | Slow (multiple passes) |
| Memory | High (multi-layer links) | Lower (flat, sparse) | Disk-resident; low RAM |
| Best for | In-memory, general purpose | Memory-sensitive in-memory | Billion-scale disk datasets |
| Ecosystem maturity | Excellent (FAISS, hnswlib, Weaviate, Qdrant) | Research / niche | DiskANN, Azure AI Search |
Selecting the Right ANN Algorithm
No single algorithm wins across all dimensions. Your choice must be driven by your specific constraints: dataset size, memory budget, recall target, latency budget, and update frequency.
Core Trade-off Dimensions
- Recall: What fraction of true top-k neighbors do you need? 95%? 99%?
- Latency: Do you need sub-millisecond P99, or is 10ms acceptable?
- Memory: Can the entire index fit in RAM, or must it live on disk?
- Build time: Is index construction a one-time cost or a frequent event?
- Updateability: Do vectors change frequently? Are deletions common?
Algorithm Characteristics at a Glance
- HNSW: Best recall-speed tradeoff in-memory. High RAM cost. Good for incremental inserts. Weak on deletes.
- IVFFlat: Simple and accurate within probed cells. Memory scales with dataset size. Needs nprobe tuning.
- IVFPQ: Essential when RAM is constrained. 10–100x memory reduction. Accepts quantization recall loss.
- DiskANN (Vamana): Billion-scale disk-resident search. High recall at low RAM. Requires SSD with good random read latency.
- NSG: Compact flat graph. Lower memory than HNSW at similar recall. Less ecosystem support.
- Brute-force (flat) search: Perfect recall by definition. Linear scan. Only viable below ~50K vectors or with GPU brute-force.
Making the Decision
- Define your recall floor. If you need 99%+ recall, HNSW with high efSearch is your starting point. If 90% recall is acceptable, IVF variants open up.
- Check your memory budget. If the index must fit under a fixed RAM ceiling, calculate index size first. For HNSW: roughly M × 8 × N bytes in graph links alone, plus vector storage (a worked estimate follows this list). IVFPQ compresses vectors to M_pq bytes each and has no graph overhead.
- Check your dataset size. Below 1M vectors, HNSW in-memory is almost always correct. Above 100M vectors, consider IVFPQ or DiskANN.
- Benchmark on your data. Theoretical comparisons guide the starting point. Only benchmarks on your actual query distribution and vectors reveal the true recall-latency curve.
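As referenced in the memory-budget step above, a back-of-envelope estimate under simple assumptions (float32 vectors, ~8 bytes of link storage per HNSW connection, M_pq-byte PQ codes; real overheads vary by library):
N, D = 10_000_000, 768
M, M_pq = 16, 48
vectors_gb = N * D * 4 / 1e9        # raw float32 storage: ~30.7 GB
hnsw_links_gb = N * M * 8 / 1e9     # HNSW graph links alone: ~1.3 GB
ivfpq_codes_gb = N * M_pq / 1e9     # IVFPQ codes (plus small centroid tables): ~0.5 GB
print(f"vectors {vectors_gb:.1f} GB, HNSW links {hnsw_links_gb:.1f} GB, IVFPQ codes {ivfpq_codes_gb:.2f} GB")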
Filtering and updates matter. If your queries include metadata filters (e.g., "similar to X but only from tenant Y"), pre-filtering with IVF's natural partitioning is often more efficient than post-filtering HNSW results. For high-update workloads, consider indexes that support real-time compaction or segment-based structures rather than a single monolithic HNSW graph.