Tutorial

This tutorial shows the basic usage of Reconfigurable Inverted Index (Rii) [Matsui18]; a fast and memory-efficient approximate nearest neighbor search method.

Rii has two key features:

  • Subset search: Rii enables efficient search on a subset of the whole database, regardless of the size of the subset.
  • Reconfiguration: Rii remains fast, even after many new items are added, because the data structure is dynamically adjusted for the current number of database items.

Based on the well-known inverted file with product quantization (PQ) approach (so called IVFADC or IVFPQ) [Jegou11], the data layout of Rii is designed such that an item can be fetched by its identifier with a cost of O(1). This simple but critical modification enables us to search over a subset of the dataset efficiently by switching to a linear PQ scan if the size of the subset is small. Owing to this linear layout, the granularity of a coarse assignment step can easily be controlled by running clustering again over the dataset whenever the user wishes. This means that the data structure can be adjusted dynamically after new items are added.

Th core part of Rii is implemented in C++11, and bound to the Python interface by pybind11. Encoding is automatically parallelized by OpenMP. The package is tested on linux with g++. It should work on Mac with clang (without OpenMP) as well, but not fully tested.

Basic of Rii

Let us first prepare 10,000 128-dim database vectors to be indexed, 1,000 128-dim vectors for training, and a 128-dim query vector. They must be np.ndarray with np.float32. Our objective is to find similar vectors to the query from the database vectors efficiently.

import rii
import nanopq
import numpy as np

N, Nt, D = 10000, 1000, 128
X = np.random.random((N, D)).astype(np.float32)  # 10,000 128-dim vectors to be searched
Xt = np.random.random((Nt, D)).astype(np.float32)  # 1,000 128-dim vectors for training
q = np.random.random((D,)).astype(np.float32)  # a 128-dim vector

First, a PQ/OPQ codec (nanopq.PQ or nanopq.OPQ) needs to be prepared. This codec will be used to encode/decode vectors. Note that PQ refers to the usual Product Quantization [Jegou11]. OPQ means Optimized Product Quantization [Ge14], which is an extended version of PQ. Compared to PQ, OPQ is little bit slower for encoding/searching but slightly more accurate.

# Prepare a PQ/OPQ codec with M=32 sub spaces
codec = nanopq.PQ(M=32, Ks=256, verbose=True).fit(vecs=Xt)  # Trained using Xt

Here, M is a parameter to control the runtime, accuracy, and memory-consumption. Each input vector is divided by M parts later (hence D must be dividable by M). With a larger M value, the search becomes more accurate but slower with a larger memory footprint. Another parameter Ks can be 256 for usual cases. See the tutorial of nanopq for more details about the parameter selection of the codec. Note that you can use X or the part of X for training if you cannot prepare training vectors Xt. For example: codec = nanopq.PQ(M=32).fit(vecs=X[:1000])

Let’s instantiate a search class, rii.Rii, with the trained codec as a fine quantizer:

# Instantiate a Rii class with the codec
e = rii.Rii(fine_quantizer=codec)

The codec will be stored in rii.Rii.fine_quantizer. You can see its codewords via rii.Rii.codewords.

Next, let us add the vectors X into the rii instance by running rii.Rii.add_configure():

# Add vectors
e.add_configure(vecs=X, nlist=None, iter=5)

Inside this function, rii.Rii.add() and rii.Rii.reconfigure() are called:

  • rii.Rii.add()
  • rii.Rii.reconfigure()
    • For the fast search, an inverted index structure is created by this function.
    • The PQ-codes are groupted into several clusters via PQk-means [Matsui17]. You can access the resultant cluster centers via rii.Rii.coarse_centers. The assignment for each PQ-code to its nearest center is stored on rii.Rii.posting_lists.
    • The number of centers is denoted by the parameter nlist. The default value is None, where nlist is set to sqrt(N) automatically as suggested here. The number of iteration for the clustering process is specified by iter.

Make sure that you must call rii.Rii.add_configure() (not rii.Rii.add()) for the first data addition. It is because you need to create coarse centers (posting lists). Note that, if you would like to add vectors sequentially, please refer this; Adding vectors sequentially

Hint

By the way, you can construct a codec at the same time as the instantiation of the Rii class if you want to write them in one line.

e = rii.Rii(fine_quantizer=nanopq.PQ(M=32).fit(vecs=Xt))
e.add_configure(vecs=X)

Furthermore, you can even construct the class and add the vectors in the same line by chaining functions.

e = rii.Rii(fine_quantizer=nanopq.PQ(M=32).fit(vecs=Xt)).add_configure(vecs=X)

Finally, we can run a search for a given query vector q.

# Search
ids, dists = e.query(q=q, topk=3, L=None, target_ids=None, sort_target_ids=True, method='auto')
print(ids, dists)  # e.g., [7484 8173 1556] [15.06257439 15.38533878 16.16935158]

See the docstring rii.Rii.query() for the details of each parameter. I recommend running the search with the default parameters first. For parameter tuning, please refer Guideline for search for more details.

Data addition and reconfiguration

Although there exist many fast ANN algorithms, almost all methods are optimized for an initial item set. It is not always clear how the search performance degrades when many items are newly added. Rii provides a reconfigure function, by which the search remains fast even after many vectors are newly added.

Let us first show how to add new vectors. Suppose a Rii instance is constructed with 10,000 items. Given the constructed Rii instance, you can call rii.Rii.add() to add new vectors. The search can be conducted by rii.Rii.query(). This works well when X2 is small enough:

# Suppose e was constructed with 10,000 PQ-codes.

# Add new vectors
X2 = np.random.random((1000, D)).astype(np.float32)
e.add(vecs=X2)  # Now N is 11000
e.query(q=q)  # Ok. (0.12 msec / query)

However, if you add quite a lot of vectors, the search might become slower. It is because the data structure has been optimized for the initial items (N=10000).

X3 = np.random.random((1000000, D)).astype(np.float32)
e.add(vecs=X3)  # A lot. Now N is 1011000
e.query(q=q)  # Slower (0.96 msec/query)

In such case, you can run rii.Rii.reconfigure(). That updates the data structure (re-computes the coarse centers and posting lsits), making the search faster.

e.reconfigure(nlist=None, iter=5)
e.query(q=q)  # Ok. (0.21 msec / query)

Note that, if you want, the above addition and reconfiguration can be achieved at the same time with one line by rii.Rii.add_configure():

X3 = np.random.random((1000000, D)).astype(np.float32)
e.add_configure(vecs=X3, nlist=None, iter=5)

I/O by pickling

The rii class supports pickling. You can read/write an instance easily.

import pickle

with open('rii.pkl', 'wb') as f:
    pickle.dump(e, f)

with open('rii.pkl', 'rb') as f:
    e_dumped = pickle.load(f)  # e_dumped is identical to e

Utility functions

There are some utility functions: rii.Rii.print_params(), rii.Rii.clear(), and rii.Rii.merge().

# Print the current parameters
e.print_params()

# Delete all PQ-codes and posting lists. fine_quantizer is kept.
e.clear()

# You can merge two Rii instances if they have the same fine_quantizer
e1 = rii.Rii(fine_quantizer=codec)
e2 = rii.Rii(fine_quantizer=codec)
e1.add_reconfigure(vecs=X1)
e2.add_reconfigure(vecs=X2)
e1.merge(e2)  # e1 will have (PQ-codes of) both X1 and X2

More examples

See more advanced examples as follows

[Jegou11](1, 2)
  1. Jegou, M. Douze, and C. Schmid, “Product Quantization for Nearest Neighbor Search”, IEEE TPAMI 2011
[Ge14]
  1. Ge, K. He, Q. Ke, and J. Sun, “Optimized Product Quantization”, IEEE TPAMI 2014
[Matsui17]
  1. Matsui, K. Ogaki, T. Yamasaki, and K. Aizawa, “PQk-means: Billion-scale Clustering for Product-quantized Codes”, ACM Multimedia 2017