API Reference

Reconfigurable Inverted Index (Rii)

class rii.Rii(fine_quantizer)

Reconfigurable Inverted Index (Rii) [Matsui18].

Fast and memory efficient approximate nearest neighbor search method based on IVFADC. In addition to the usual ANN search, Rii provides two useful functionalities: (1) efficient search over a subset of the whole dataset, and (2) reconfiguration of the data structure such that the search remains fast even after many items are newly added.

[Matsui18]
  1. Matsui, R. Hinami, and S. Satoh, “Reconfigurable Inverted Index”, ACM Multimedia 2018
Parameters:fine_quantizer (object) – The instance for encoding/decoding vectors. nanopq.PQ or nanopq.OPQ. This must have been already trained.
fine_quantizer

The instance for encoding/decoding vectors. nanopq.PQ or nanopq.OPQ.

Type:object
threshold

The threshold function for the subset-search. An instance of np.poly1d class. Given L, compute the threshold for the selection of the search method: S_thre = threshold(L).

Type:object
M

The number of sub-spaces for PQ/OPQ

Type:int
Ks

The number of codewords for each sub-space, typically 256.

Type:int
N

The number of vectors (PQ-codes)

Type:int
nlist

The number of posting lists

Type:int
codewords

The codewords for PQ/OPQ, where shape=(M, Ks, D/M) with dtype=np.float32. Note that codewords[m][ks] specifies ks-th codeword (Ds-dim) for m-th subspace.

Type:np.ndarray
coarse_centers

The centers for coarse assignments, where as they are also PQ-codes. Note that shape=(nlist, M) with dtype=np.uint8. If coarse_centers have not been created yet (i.e., reconfigure() or add_configure() has not been called), this returns None.

Type:np.ndarray
codes

The database PQ-codes, where shape=(N, M) with dtype=np.uint8. Accessing this property would be slow because the whole data is converted online from std::vector<unsigned char> in the cpp-instance to np.array. If vectors have not been added yet (i.e., add() or add_configure() has not been called), this returns None.

Type:np.ndarray
posting_lists

The posting lists for assignment of identifiers. List of list. Note that len(posting_lists) == nlist, and posting_lists[no] contains the IDs of PQ-codes (a list of int), where their nearest coarse_center is no-th one.

Type:list
verbose

Verbose flag. Rewritable.

Type:bool
L0

The average length of each posting list, i.e., L0 == int(round(N/nlist)). If nlist=0, return None.

Type:int
reconfigure(nlist=None, iter=5)

Given a new nlist, update the coarse_centers and posting_lists by grouping the stored PQ-codes (codes) into nlist clusters.

You can call this if you find the search becomes slower after you add too many new items. After the reconfiguration, the threshold is also updated. See Alg. 3 in the paper for more details.

Parameters:
  • nlist (int) – The new number of posting lists. If None is specified (dafault value), nlist=sqrt(N) is automatically set.
  • iter (int) – The number of iteration for pqk-means to update coarse_centers.
add(vecs, update_posting_lists='auto')

Push back new vectors to the system. More specifically, this function (1) encodes the new vectors into PQ/OPQ-codes, (2) pushes back them to the existing database PQ/OPQ-codes, and (3) finds the nearest coarse-center for each code and updates the corresponding posting list. Note that (3) is done only if (a) update_posting_lists=True or (b) update_posting_lists='auto' and 0 < nlist.

In usual cases, update_posting_lists should be True. It will be False only if (1) this is the first call of data addition, i.e., coarse_centers have not been created yet. This is the case when add() is called inside add_configure(). Usually, you don’t need to care about this case. (2) you plan to call add() several times (possibly because you read data in a batch way) and then run reconfigure(). In such case, you can skip updating posting lists because they will be again updated in reconfigure().

If you set ‘auto’ for update_posting_lists, it will be automatically set correctly, i.e., True is set if the posting lists have alreay been created (0 < nlist), and False is set otherwise.

Parameters:
  • vecs (np.ndarray) – The new vectors with the shape=(Nv, D) and dtype=np.float32, where Nv is the number of new vectors.
  • update_posting_lists (bool or str) – True or False or ‘auto’. If True, posting_lists will be updated. This should be True for usual cases.
add_configure(vecs, nlist=None, iter=5)

Run add() (with update_postig_lists=False) and reconfigure(). To add vectors for the first time, please call this.

Parameters:
  • vecs (np.ndarray) – The new vectors with the shape=(Nv, D) and dtype=np.float32, where Nv is the number of new vectors.
  • nlist (int) – The new number of posting lists. If None is specified (dafault value), nlist=sqrt(N) is automatically set.
  • iter (int) – The number of iteration for pqk-means to update coarse_centers.
Returns:

self

merge(engine, update_posting_lists='auto')

Given another Rii instance (engine), merge its PQ-codes (engine.codes) to this instance. IDs for new PQ-codes are automatically assigned. For example, if self.N = 100, the IDs of the merged PQ-codes would be 100, 101, …

The original posting lists of this instance will be kept maintained; new PQ-codes will be simply added over that. Thus it might be better to call reconfigure() after many new codes are merged.

Parameters:
  • engine (rii.Rii) – Rii instance. engine.fine_quantizer should be the same as self.fine_quantizer
  • update_posting_lists (bool or str) – True or False or ‘auto’. If True, posting_lists will be updated. This should be True for usual cases.
query(q, topk=1, L=None, target_ids=None, sort_target_ids=True, method='auto')

Given a query vector, run the approximate nearest neighbor search over the stored PQ-codes. This functions returns the identifiers and the distances of topk nearest PQ-codes to the query.

The search can be conducted over a subset of the whole PQ-codes by specifying target_ids. For example, if target_ids=np.array([23, 567, 998]), the search result would be the items with these identifiers, sorted by the distance to the query.

Inside this function, the algorithm for the search is selected either from PQ-linear-scan (see Alg. 1 in the paper) or inverted-index (see Alg. 2 in the paper). This is specified by method, by setting ‘linear’ or ‘ivf’. If ‘auto’ is set, the faster one is automatically selected (See Alg. 3 in the paper for more details).

See Guideline for search for tips of the parameter selection.

Parameters:
  • q (np.ndarray) – The query vector with the shape=(D, ) and dtype=np.float32.
  • topk (int) – The number of PQ-codes to be returned. The default value is 1.
  • L (int) – The number of PQ-codes for the candidates of distance evaluation. With a higher L value, the accuracy is boosted but the runtime gets slower. The default value is a minimum multiple of L0 that covers topk. This is typically L0. Note that L is used only if the search method is inverted-index.
  • target_ids (np.ndarray) – The target identifiers with the shape=(S, ) and dtype=np.int64, where S can be any scalar. The default value is None, then the search is run over the whole dataset. Note that target_ids does not need to be sorted if sort_target_ids==True, where it will be sorted inside this function automatically. Otherwise please sort target_ids before calling this function.
  • sort_target_ids (bool) – The default value is True. If True, target_ids will be sorted automatically inside this function before running the search. If False, target_ids will not be sorted. Note that target_ids must be sorted before the search, so please sort it by yourself if you set sort_target_ids as False.
  • method (str) – The search algorithm to be used: ‘linear’, ‘ivf’, or ‘auto’.
Returns:

The result (nearest items) of the search. The first one is the identifiers of the items, with the shape=(topk, ) and dtype=int64. The second one is the distances of the items to the query, with the shape=(topk, ) and dtype=float64

Return type:

(np.ndarray, np.ndarray)

clear()

Clear all data, i.e., (1) coarse_centers, (2) PQ-codes, (3) posting_lists, and (4) threshold function. Note that codewords are kept.

print_params()

Print all parameters with some verbose information for debugging.