PDX: A Library for Fast Vector Search and Indexing

Index millions of vectors in seconds. Search them in milliseconds.

Why PDX?

⚡ 30x faster index building thanks to SuperKMeans.
⚡ Sub-millisecond similarity search, up to 10x faster than FAISS IVF.
⚡ Up to 30x faster exhaustive search.
🔍 Efficient filtered search.
Query latency competitive with HNSW, with the ease of use of IVF.

Our secret sauce

PDX is a data layout that transposes vectors in a column-major order. This layout unleashes the true potential of dimension pruning.

Pruning means avoiding checking all the dimensions of a vector to determine if it is a neighbour of a query, accelerating index construction and similarity search by factors.

Use Cases and Benchmarking

Check ./BENCHMARKING.md.

Usage

from pdxearch import IndexPDXIVFTreeSQ8

data = ... # Numpy 2D matrix
query = ... # Numpy 1D array
d = 1024
knn = 20

index = IndexPDXIVFTreeSQ8(num_dimensions=d)
index.build(data)

ids, dists = index.search(query, knn)

IndexPDXIVFTreeSQ8 is our fastest index that will give you the best performance. It is a two-level IVF index with 8-bit quantization.

Check our examples for fully working examples in Python and our benchmarks for fully working examples in C++. We support Flat (float32) and Quantized (8-bit) indexes, as well as the most common distance metrics.

Installation

We provide Python bindings for ease of use. Soon, we will be available on PyPI.

Prerequisites

Clang 17, CMake 3.26
OpenMP
A BLAS implementation
Python 3 (only for Python bindings)

Installation Steps

git clone --recurse-submodules https://github.com/cwida/PDX
cd PDX

pip install .

# Run the examples under `/examples`
# pdx_simple.py creates an IVF index with FAISS on random data
# Then, it compares the search performance of PDX and FAISS
python ./examples/pdx_simple.py

For a more comprehensive installation and compilation guide, check INSTALL.md.

Getting the Best Performance

Check INSTALL.md.

Roadmap

We are actively developing Super K-Means and accepting contributions! Check CONTRIBUTING.md

The Data Layout

PDX is a transposed layout (a.k.a. columnar, or decomposed layout), meaning that the dimensions of different vectors are stored sequentially. This decomposition occurs within a block (e.g., a cluster in an IVF index).

We have evolved our layout from the one presented in our publication to reduce random access, and adapted it to work with 8-bit and (in the future) 1-bit vectors.

`float32`

For float32, the first 25% of the dimensions are fully decomposed. We refer to this as the "vertical block." The rest (75%) are decomposed into subvectors of 64 dimensions. We refer to this as the "horizontal block." The vertical block is used for efficient pruning, and the horizontal block is accessed on the candidates that were not pruned. This horizontal block is still decomposed every 64 dimensions. The idea behind this is that we still have a chance to prune the few remaining candidates every 64 dimensions.

The following image shows this layout. Storage is sequential from left to right, and from top to bottom.

`8 bits`

Smaller data types are not friendly to PDX, as we must accumulate distances on wider types, resulting in asymmetry. We can work around this by changing the PDX layout. For 8 bits, the vertical block is decomposed every 4 dimensions. This allows us to use dot-product instructions (VPDPBUSD on x86 and UDOT/SDOT on NEON) to calculate L2 or IP kernels while still benefiting from PDX. The horizontal block remains decomposed every 64 dimensions.

Citation

If you use PDX for your research, consider citing us:

@article{kuffo2025pdx,
  title={PDX: A Data Layout for Vector Similarity Search},
  author={Kuffo, Leonardo and Krippner, Elena and Boncz, Peter},
  journal={Proceedings of the ACM on Management of Data},
  volume={3},
  number={3},
  pages={1--26},
  year={2025},
  publisher={ACM New York, NY, USA}
}

Name		Name	Last commit message	Last commit date
Latest commit History 48 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
examples		examples
extern		extern
include/pdx		include/pdx
python		python
scripts		scripts
tests		tests
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
BENCHMARKING.md		BENCHMARKING.md
CMakeLists.txt		CMakeLists.txt
CONTRIBUTING.md		CONTRIBUTING.md
INSTALL.md		INSTALL.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDX: A Library for Fast Vector Search and Indexing

Index millions of vectors in seconds. Search them in milliseconds.

Why PDX?

Our secret sauce

Use Cases and Benchmarking

Usage

Installation

Prerequisites

Installation Steps

Getting the Best Performance

Roadmap

The Data Layout

`float32`

`8 bits`

Citation

About

Uh oh!

Releases 3

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

License

cwida/PDX

Folders and files

Latest commit

History

Repository files navigation

PDX: A Library for Fast Vector Search and Indexing

Index millions of vectors in seconds. Search them in milliseconds.

Why PDX?

Our secret sauce

Use Cases and Benchmarking

Usage

Installation

Prerequisites

Installation Steps

Getting the Best Performance

Roadmap

The Data Layout

float32

8 bits

Citation

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

`float32`

`8 bits`

Packages