This tutorial shows the basic usage of Reconfigurable Inverted Index (Rii) [Matsui18], a fast and memory-efficient approximate nearest neighbor search method.
Rii has two key features:
- Subset search: Rii enables efficient search on a subset of the whole database, regardless of the size of the subset.
- Reconfiguration: Rii remains fast, even after many new items are added, because the data structure is dynamically adjusted for the current number of database items.
Based on the well-known inverted file with product quantization (PQ) approach (so-called IVFADC or IVFPQ) [Jegou11], the data layout of Rii is designed such that an item can be fetched by its identifier at a cost of O(1). This simple but critical modification enables us to search over a subset of the dataset efficiently, by switching to a linear PQ scan when the subset is small. Owing to this linear layout, the granularity of the coarse assignment step can easily be controlled by running the clustering again over the dataset whenever the user wishes. This means that the data structure can be adjusted dynamically after new items are added.
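To make the idea above concrete, here is a minimal sketch (not Rii's actual implementation) of why a contiguous layout enables subset search: with all PQ-codes stored in one array, `codes[i]` is fetched in O(1), so a small subset can be searched by a linear PQ scan using asymmetric distance computation (ADC). All names below are illustrative.

```python
import numpy as np

# Toy setup: N items, M sub-spaces, Ks centroids per sub-space, Ds dims each
N, M, Ks, Ds = 1000, 4, 256, 8
rng = np.random.default_rng(0)
codebooks = rng.random((M, Ks, Ds), dtype=np.float32)          # PQ codebooks
codes = rng.integers(0, Ks, size=(N, M), dtype=np.uint8)       # one PQ-code per item

def adc_scan(query, subset_ids):
    # Distance lookup table: dtable[m, k] = ||q_m - c_{m,k}||^2
    q = query.reshape(M, Ds)
    dtable = ((q[:, None, :] - codebooks) ** 2).sum(axis=2)    # shape (M, Ks)
    sub = codes[subset_ids]                                     # O(1) row fetches
    dists = dtable[np.arange(M), sub].sum(axis=1)               # ADC per item
    return subset_ids[np.argsort(dists)]                        # ids, nearest first

query = rng.random(M * Ds, dtype=np.float32)
ranked = adc_scan(query, np.array([3, 141, 592, 653]))
```

The key point is the last two lines of `adc_scan`: the cost is linear in the subset size only, independent of N.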
The core part of Rii is implemented in C++11 and bound to a Python interface via pybind11. Encoding is automatically parallelized by OpenMP. The package is tested on Linux with g++. It should also work on Mac with clang (without OpenMP), but it is not fully tested.
Basics of Rii¶
Let us first prepare 10,000 128-dim database vectors to be indexed, 1,000 128-dim vectors for training, and a 128-dim query vector. They must be np.ndarray with np.float32. Our objective is to find similar vectors to the query from the database vectors efficiently.
```python
import rii
import nanopq
import numpy as np

N, Nt, D = 10000, 1000, 128
X = np.random.random((N, D)).astype(np.float32)   # 10,000 128-dim vectors to be searched
Xt = np.random.random((Nt, D)).astype(np.float32) # 1,000 128-dim vectors for training
q = np.random.random((D,)).astype(np.float32)     # a 128-dim query vector
```
First, a PQ/OPQ codec (nanopq.PQ or nanopq.OPQ) needs to be prepared. This codec will be used to encode/decode vectors. Note that PQ refers to the usual Product Quantization [Jegou11]. OPQ means Optimized Product Quantization [Ge14], which is an extended version of PQ. Compared to PQ, OPQ is a little slower for encoding/searching but slightly more accurate.
```python
# Prepare a PQ/OPQ codec with M=32 sub-spaces
codec = nanopq.PQ(M=32, Ks=256, verbose=True).fit(vecs=Xt)  # Trained using Xt
```
M is a parameter that controls the trade-off among runtime, accuracy, and memory consumption. Each input vector is later divided into M parts (hence D must be divisible by M). With a larger M, the search becomes more accurate but slower, with a larger memory footprint. Ks can be 256 for usual cases. See the tutorial of nanopq for more details about parameter selection for the codec.
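A quick back-of-the-envelope calculation illustrates the memory side of this trade-off (an illustration only, not an exact account of Rii's overhead): with Ks=256, each sub-vector identifier fits in one byte, so a D-dim float32 vector compresses to M bytes.

```python
# Memory per vector: raw float32 storage vs. its PQ-code
D, M = 128, 32
raw_bytes = D * 4              # 512 bytes for the original float32 vector
pq_bytes = M * 1               # 32 bytes: one uint8 code per sub-space (Ks=256)
compression = raw_bytes // pq_bytes  # 16x smaller
```

Doubling M doubles the code size (and the accuracy tends to improve), which is the trade-off described above.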
Note that you can use X, or a part of X, for training if you cannot prepare training vectors Xt. For example:

```python
codec = nanopq.PQ(M=32).fit(vecs=X[:1000])
```
Let's instantiate a search class, rii.Rii, with the trained codec as a fine quantizer:
```python
# Instantiate a Rii class with the codec
e = rii.Rii(fine_quantizer=codec)
```
Next, let us add the vectors X into the rii instance:
```python
# Add vectors
e.add_configure(vecs=X, nlist=None, iter=5)
```
- For fast search, this function creates an inverted index structure.
- The PQ-codes are grouped into several clusters via PQk-means [Matsui17]. You can access the resultant cluster centers via rii.Rii.coarse_centers. The assignment of each PQ-code to its nearest center is stored in the posting lists.
- The number of centers is denoted by the parameter nlist. The default value is None, in which case nlist is set to sqrt(N) automatically, as suggested here. The number of iterations for the clustering process is specified by iter.
- Make sure to call add_configure for the first data addition, because the coarse centers (posting lists) need to be created first.
Note that, if you would like to add vectors sequentially, please refer to: Adding vectors sequentially.
By the way, you can construct the codec at the same time as instantiating the Rii class if you want to write them in one line.
```python
e = rii.Rii(fine_quantizer=nanopq.PQ(M=32).fit(vecs=Xt))
e.add_configure(vecs=X)
```
Furthermore, you can even construct the class and add the vectors in the same line by chaining functions.
```python
e = rii.Rii(fine_quantizer=nanopq.PQ(M=32).fit(vecs=Xt)).add_configure(vecs=X)
```
Finally, we can run a search for a given query vector:
```python
# Search
ids, dists = e.query(q=q, topk=3, L=None, target_ids=None, sort_target_ids=True, method='auto')
print(ids, dists)  # e.g., [7484 8173 1556] [15.06257439 15.38533878 16.16935158]
```
See the docstring of rii.Rii.query() for the details of each parameter. I recommend running the search with the default parameters first. For parameter tuning, please refer to Guideline for search for more details.
The search can be conducted on a subset of the whole PQ-codes. Such subset search is practically important. For example, in image search, we can filter out unrelated images by checking their tags, and then run a feature-based search to find images similar to the query.
A subset is simply specified by a numpy array of identifiers:
```python
# The search can be conducted over a subset of the database
target_ids = np.array([85, 132, 236, 551, 694, 728, 992, 1234])  # Specified by IDs
ids, dists = e.query(q=q, topk=3, target_ids=target_ids, sort_target_ids=True)
print(ids, dists)  # e.g., [728 85 132] [14.80522156 15.92787838 16.28690338]
```
As can be seen in the resulting identifiers ids, the search result includes only the items specified by target_ids. Note that:

- target_ids must be a np.ndarray of integer identifiers.
- Do not include duplicate identifiers in target_ids; the behavior is undefined.
- The target identifiers must be sorted before the search (see Sec. 4.2 in [Matsui18] for details). By default, sort_target_ids is True, which means target_ids is sorted inside rii.Rii.query(), so you do not need to sort it manually. This works well in practice when target_ids is not too large. If target_ids contains a lot of identifiers, however, the sorting could become slower than the search itself. In such a case, you can sort target_ids beforehand and pass it with sort_target_ids=False. This is the complete procedure explained in the paper.
Some examples of subset-search are:
```python
# Because target_ids is not sorted, sort_target_ids must be True (default behavior)
e.query(q=q, topk=1, target_ids=np.array([345, 23, 994, 425]))

# The search is run on the 1st to 1000th items.
# Since target_ids is already sorted, you can set the sort flag to False.
e.query(q=q, topk=1, target_ids=np.arange(1000), sort_target_ids=False)

# Search for several queries with a large target_ids. In such a case,
# it is redundant to sort inside the query function every time; sort only once.
target_ids = np.array([44432, 32786, ..., 9623])  # Lots of identifiers
target_ids = np.sort(target_ids)  # Sort beforehand
for q in Q:
    # Here, the sort flag is off for efficient search
    e.query(q=q, topk=1, target_ids=target_ids, sort_target_ids=False)
```
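As a concrete version of the tag-filtering scenario mentioned above, the target ids can be derived from metadata. The `tags` array below is made-up metadata, not part of Rii; only the final `e.query(...)` call (shown as a comment) is the actual Rii API.

```python
import numpy as np

# Hypothetical per-item tags for a database of N items (toy data)
N = 10000
rng = np.random.default_rng(0)
tags = rng.choice(np.array(['cat', 'dog', 'car']), size=N)

# Keep only the ids whose tag matches; np.where yields ids in ascending order
target_ids = np.where(tags == 'cat')[0]

# Since the ids are already sorted, the sort flag can be turned off:
# ids, dists = e.query(q=q, topk=3, target_ids=target_ids, sort_target_ids=False)
```

Any filtering step that produces sorted ids (range selection, `np.where`, a sorted index lookup) lets you skip the in-query sort safely.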
Data addition and reconfiguration¶
Although there exist many fast ANN algorithms, almost all methods are optimized for an initial item set. It is not always clear how the search performance degrades when many items are newly added. Rii provides a reconfigure function, by which the search remains fast even after many vectors are newly added.
Let us first show how to add new vectors.
Suppose a Rii instance e was constructed with 10,000 items. Given the constructed instance, you can call rii.Rii.add() to add new vectors, and the search can be conducted by rii.Rii.query() as usual. This works well when X2 is small enough:
```python
# Suppose e was constructed with 10,000 PQ-codes.
# Add new vectors
X2 = np.random.random((1000, D)).astype(np.float32)
e.add(vecs=X2)  # Now N is 11000
e.query(q=q)    # Ok. (0.12 msec/query)
```
However, if you add quite a lot of vectors, the search might become slower, because the data structure was optimized for the initial items (N=10000).
```python
X3 = np.random.random((1000000, D)).astype(np.float32)
e.add(vecs=X3)  # A lot. Now N is 1011000
e.query(q=q)    # Slower (0.96 msec/query)
```
In such a case, you can run rii.Rii.reconfigure(). This updates the data structure (re-computing the coarse centers and posting lists), making the search fast again:
```python
e.reconfigure(nlist=None, iter=5)
e.query(q=q)  # Ok. (0.21 msec/query)
```
Note that, if you want, the above addition and reconfiguration can be achieved at the same time in one line with add_configure:
```python
X3 = np.random.random((1000000, D)).astype(np.float32)
e.add_configure(vecs=X3, nlist=None, iter=5)
```
I/O by pickling¶
The Rii class supports pickling, so you can read/write an instance easily:
```python
import pickle

with open('rii.pkl', 'wb') as f:
    pickle.dump(e, f)

with open('rii.pkl', 'rb') as f:
    e_dumped = pickle.load(f)  # e_dumped is identical to e
```
Rii also provides several utility functions:

```python
# Print the current parameters
e.print_params()

# Delete all PQ-codes and posting lists. fine_quantizer is kept.
e.clear()

# You can merge two Rii instances if they share the same fine_quantizer
e1 = rii.Rii(fine_quantizer=codec)
e2 = rii.Rii(fine_quantizer=codec)
e1.add_configure(vecs=X1)
e2.add_configure(vecs=X2)
e1.merge(e2)  # e1 will have (PQ-codes of) both X1 and X2
```
See the following pages for more advanced examples.
References:

- [Jegou11] H. Jégou, M. Douze, and C. Schmid, "Product Quantization for Nearest Neighbor Search", IEEE TPAMI, 2011.
- [Ge14] T. Ge, K. He, Q. Ke, and J. Sun, "Optimized Product Quantization", IEEE TPAMI, 2014.
- [Matsui17] Y. Matsui, K. Ogaki, T. Yamasaki, and K. Aizawa, "PQk-means: Billion-scale Clustering for Product-quantized Codes", ACM Multimedia, 2017.
- [Matsui18] Y. Matsui, R. Hinami, and S. Satoh, "Reconfigurable Inverted Index", ACM Multimedia, 2018.