API Reference
Reconfigurable Inverted Index (Rii)
- class rii.Rii(fine_quantizer)
Reconfigurable Inverted Index (Rii) [Matsui18].
Fast and memory efficient approximate nearest neighbor search method based on IVFADC. In addition to the usual ANN search, Rii provides two useful functionalities: (1) efficient search over a subset of the whole dataset, and (2) reconfiguration of the data structure such that the search remains fast even after many items are newly added.
[Matsui18]Matsui, R. Hinami, and S. Satoh, “Reconfigurable Inverted Index”, ACM Multimedia 2018
- Parameters:
fine_quantizer (object) – The instance for encoding/decoding vectors. nanopq.PQ or nanopq.OPQ. This must have been already trained.
- fine_quantizer
The instance for encoding/decoding vectors. nanopq.PQ or nanopq.OPQ.
- Type:
object
- threshold
The threshold function for the subset-search. An instance of np.poly1d class. Given
L, compute the threshold for the selection of the search method:S_thre = threshold(L).- Type:
object
- property M
The number of sub-spaces for PQ/OPQ
- Type:
int
- property Ks
The number of codewords for each sub-space, typically 256.
- Type:
int
- property N
The number of vectors (PQ-codes)
- Type:
int
- property nlist
The number of posting lists
- Type:
int
- property codewords
The codewords for PQ/OPQ, where shape=(M, Ks, D/M) with dtype=np.float32. Note that
codewords[m][ks]specifies ks-th codeword (Ds-dim) for m-th subspace.- Type:
np.ndarray
- property coarse_centers
The centers for coarse assignments, where as they are also PQ-codes. Note that shape=(nlist, M) with dtype=np.uint8. If coarse_centers have not been created yet (i.e.,
reconfigure()oradd_configure()has not been called), this returns None.- Type:
np.ndarray
- property codes
The database PQ-codes, where shape=(N, M) with dtype=np.uint8. Accessing this property would be slow because the whole data is converted online from std::vector<unsigned char> in the cpp-instance to np.array. If vectors have not been added yet (i.e.,
add()oradd_configure()has not been called), this returns None.- Type:
np.ndarray
- property posting_lists
The posting lists for assignment of identifiers. List of list. Note that
len(posting_lists) == nlist, andposting_lists[no]contains the IDs of PQ-codes (a list of int), where their nearest coarse_center is no-th one.- Type:
list
- property verbose
Verbose flag. Rewritable.
- Type:
bool
- property L0
The average length of each posting list, i.e.,
L0 == int(round(N/nlist)). If nlist=0, return None.- Type:
int
- reconfigure(nlist=None, iter=5)
Given a new
nlist, update thecoarse_centersandposting_listsby grouping the stored PQ-codes (codes) intonlistclusters.You can call this if you find the search becomes slower after you add too many new items. After the reconfiguration, the
thresholdis also updated. See Alg. 3 in the paper for more details.- Parameters:
nlist (int) – The new number of posting lists. If None is specified (dafault value), nlist=sqrt(N) is automatically set.
iter (int) – The number of iteration for pqk-means to update
coarse_centers.
- add(vecs, update_posting_lists='auto')
Push back new vectors to the system. More specifically, this function (1) encodes the new vectors into PQ/OPQ-codes, (2) pushes back them to the existing database PQ/OPQ-codes, and (3) finds the nearest coarse-center for each code and updates the corresponding posting list. Note that (3) is done only if (a)
update_posting_lists=Trueor (b)update_posting_lists='auto'and 0 <nlist.In usual cases, update_posting_lists should be True. It will be False only if (1) this is the first call of data addition, i.e., coarse_centers have not been created yet. This is the case when
add()is called insideadd_configure(). Usually, you don’t need to care about this case. (2) you plan to calladd()several times (possibly because you read data in a batch way) and then runreconfigure(). In such case, you can skip updating posting lists because they will be again updated inreconfigure().If you set ‘auto’ for update_posting_lists, it will be automatically set correctly, i.e., True is set if the posting lists have alreay been created (0 < nlist), and False is set otherwise.
- Parameters:
vecs (np.ndarray) – The new vectors with the shape=(Nv, D) and dtype=np.float32, where Nv is the number of new vectors.
update_posting_lists (bool or str) – True or False or ‘auto’. If True,
posting_listswill be updated. This should be True for usual cases.
- add_configure(vecs, nlist=None, iter=5)
Run
add()(withupdate_postig_lists=False) andreconfigure(). To add vectors for the first time, please call this.- Parameters:
vecs (np.ndarray) – The new vectors with the shape=(Nv, D) and dtype=np.float32, where Nv is the number of new vectors.
nlist (int) – The new number of posting lists. If None is specified (dafault value), nlist=sqrt(N) is automatically set.
iter (int) – The number of iteration for pqk-means to update
coarse_centers.
- Returns:
self
- merge(engine, update_posting_lists='auto')
Given another Rii instance (engine), merge its PQ-codes (engine.codes) to this instance. IDs for new PQ-codes are automatically assigned. For example, if self.N = 100, the IDs of the merged PQ-codes would be 100, 101, …
The original posting lists of this instance will be kept maintained; new PQ-codes will be simply added over that. Thus it might be better to call
reconfigure()after many new codes are merged.- Parameters:
engine (rii.Rii) – Rii instance. engine.fine_quantizer should be the same as self.fine_quantizer
update_posting_lists (bool or str) – True or False or ‘auto’. If True,
posting_listswill be updated. This should be True for usual cases.
- query(q, topk=1, L=None, target_ids=None, sort_target_ids=True, method='auto')
Given a query vector, run the approximate nearest neighbor search over the stored PQ-codes. This functions returns the identifiers and the distances of
topknearest PQ-codes to the query.The search can be conducted over a subset of the whole PQ-codes by specifying
target_ids. For example, iftarget_ids=np.array([23, 567, 998]), the search result would be the items with these identifiers, sorted by the distance to the query.Inside this function, the algorithm for the search is selected either from PQ-linear-scan (see Alg. 1 in the paper) or inverted-index (see Alg. 2 in the paper). This is specified by
method, by setting ‘linear’ or ‘ivf’. If ‘auto’ is set, the faster one is automatically selected (See Alg. 3 in the paper for more details).See Guideline for search for tips of the parameter selection.
- Parameters:
q (np.ndarray) – The query vector with the shape=(D, ) and dtype=np.float32.
topk (int) – The number of PQ-codes to be returned. The default value is 1.
L (int) – The number of PQ-codes for the candidates of distance evaluation. With a higher
Lvalue, the accuracy is boosted but the runtime gets slower. The default value is a minimum multiple ofL0that coverstopk. This is typicallyL0. Note thatLis used only if the search method is inverted-index.target_ids (np.ndarray) – The target identifiers with the shape=(S, ) and dtype=np.int64, where S can be any scalar. The default value is None, then the search is run over the whole dataset. Note that
target_idsdoes not need to be sorted ifsort_target_ids==True, where it will be sorted inside this function automatically. Otherwise please sorttarget_idsbefore calling this function.sort_target_ids (bool) – The default value is True. If True,
target_idswill be sorted automatically inside this function before running the search. If False,target_idswill not be sorted. Note thattarget_idsmust be sorted before the search, so please sort it by yourself if you setsort_target_idsas False.method (str) – The search algorithm to be used: ‘linear’, ‘ivf’, or ‘auto’.
- Returns:
The result (nearest items) of the search. The first one is the identifiers of the items, with the shape=(topk, ) and dtype=int64. The second one is the distances of the items to the query, with the shape=(topk, ) and dtype=float64
- Return type:
(np.ndarray, np.ndarray)
- clear()
Clear all data, i.e., (1) coarse_centers, (2) PQ-codes, (3) posting_lists, and (4) threshold function. Note that codewords are kept.
- print_params()
Print all parameters with some verbose information for debugging.