API Reference¶
Reconfigurable Inverted Index (Rii)¶

class
rii.
Rii
(fine_quantizer)¶ Reconfigurable Inverted Index (Rii) [Matsui18].
Fast and memory efficient approximate nearest neighbor search method based on IVFADC. In addition to the usual ANN search, Rii provides two useful functionalities: (1) efficient search over a subset of the whole dataset, and (2) reconfiguration of the data structure such that the search remains fast even after many items are newly added.
[Matsui18]  Matsui, R. Hinami, and S. Satoh, “Reconfigurable Inverted Index”, ACM Multimedia 2018
Parameters: fine_quantizer (object) – The instance for encoding/decoding vectors. nanopq.PQ or nanopq.OPQ. This must have been already trained. 
fine_quantizer
¶ The instance for encoding/decoding vectors. nanopq.PQ or nanopq.OPQ.
Type: object

threshold
¶ The threshold function for the subsetsearch. An instance of np.poly1d class. Given
L
, compute the threshold for the selection of the search method:S_thre = threshold(L)
.Type: object

M
¶ The number of subspaces for PQ/OPQ
Type: int

Ks
¶ The number of codewords for each subspace, typically 256.
Type: int

N
¶ The number of vectors (PQcodes)
Type: int

nlist
¶ The number of posting lists
Type: int

codewords
¶ The codewords for PQ/OPQ, where shape=(M, Ks, D/M) with dtype=np.float32. Note that
codewords[m][ks]
specifies ksth codeword (Dsdim) for mth subspace.Type: np.ndarray

coarse_centers
¶ The centers for coarse assignments, where as they are also PQcodes. Note that shape=(nlist, M) with dtype=np.uint8. If coarse_centers have not been created yet (i.e.,
reconfigure()
oradd_configure()
has not been called), this returns None.Type: np.ndarray

codes
¶ The database PQcodes, where shape=(N, M) with dtype=np.uint8. Accessing this property would be slow because the whole data is converted online from std::vector<unsigned char> in the cppinstance to np.array. If vectors have not been added yet (i.e.,
add()
oradd_configure()
has not been called), this returns None.Type: np.ndarray

posting_lists
¶ The posting lists for assignment of identifiers. List of list. Note that
len(posting_lists) == nlist
, andposting_lists[no]
contains the IDs of PQcodes (a list of int), where their nearest coarse_center is noth one.Type: list

verbose
¶ Verbose flag. Rewritable.
Type: bool

L0
¶ The average length of each posting list, i.e.,
L0 == int(round(N/nlist))
. If nlist=0, return None.Type: int

reconfigure
(nlist=None, iter=5)¶ Given a new
nlist
, update thecoarse_centers
andposting_lists
by grouping the stored PQcodes (codes
) intonlist
clusters.You can call this if you find the search becomes slower after you add too many new items. After the reconfiguration, the
threshold
is also updated. See Alg. 3 in the paper for more details.Parameters:  nlist (int) – The new number of posting lists. If None is specified (dafault value), nlist=sqrt(N) is automatically set.
 iter (int) – The number of iteration for pqkmeans to update
coarse_centers
.

add
(vecs, update_posting_lists='auto')¶ Push back new vectors to the system. More specifically, this function (1) encodes the new vectors into PQ/OPQcodes, (2) pushes back them to the existing database PQ/OPQcodes, and (3) finds the nearest coarsecenter for each code and updates the corresponding posting list. Note that (3) is done only if (a)
update_posting_lists=True
or (b)update_posting_lists='auto'
and 0 <nlist
.In usual cases, update_posting_lists should be True. It will be False only if (1) this is the first call of data addition, i.e., coarse_centers have not been created yet. This is the case when
add()
is called insideadd_configure()
. Usually, you don’t need to care about this case. (2) you plan to calladd()
several times (possibly because you read data in a batch way) and then runreconfigure()
. In such case, you can skip updating posting lists because they will be again updated inreconfigure()
.If you set ‘auto’ for update_posting_lists, it will be automatically set correctly, i.e., True is set if the posting lists have alreay been created (0 < nlist), and False is set otherwise.
Parameters:  vecs (np.ndarray) – The new vectors with the shape=(Nv, D) and dtype=np.float32, where Nv is the number of new vectors.
 update_posting_lists (bool or str) – True or False or ‘auto’. If True,
posting_lists
will be updated. This should be True for usual cases.

add_configure
(vecs, nlist=None, iter=5)¶ Run
add()
(withupdate_postig_lists=False
) andreconfigure()
. To add vectors for the first time, please call this.Parameters:  vecs (np.ndarray) – The new vectors with the shape=(Nv, D) and dtype=np.float32, where Nv is the number of new vectors.
 nlist (int) – The new number of posting lists. If None is specified (dafault value), nlist=sqrt(N) is automatically set.
 iter (int) – The number of iteration for pqkmeans to update
coarse_centers
.
Returns: self

merge
(engine, update_posting_lists='auto')¶ Given another Rii instance (engine), merge its PQcodes (engine.codes) to this instance. IDs for new PQcodes are automatically assigned. For example, if self.N = 100, the IDs of the merged PQcodes would be 100, 101, …
The original posting lists of this instance will be kept maintained; new PQcodes will be simply added over that. Thus it might be better to call
reconfigure()
after many new codes are merged.Parameters:  engine (rii.Rii) – Rii instance. engine.fine_quantizer should be the same as self.fine_quantizer
 update_posting_lists (bool or str) – True or False or ‘auto’. If True,
posting_lists
will be updated. This should be True for usual cases.

query
(q, topk=1, L=None, target_ids=None, sort_target_ids=True, method='auto')¶ Given a query vector, run the approximate nearest neighbor search over the stored PQcodes. This functions returns the identifiers and the distances of
topk
nearest PQcodes to the query.The search can be conducted over a subset of the whole PQcodes by specifying
target_ids
. For example, iftarget_ids=np.array([23, 567, 998])
, the search result would be the items with these identifiers, sorted by the distance to the query.Inside this function, the algorithm for the search is selected either from PQlinearscan (see Alg. 1 in the paper) or invertedindex (see Alg. 2 in the paper). This is specified by
method
, by setting ‘linear’ or ‘ivf’. If ‘auto’ is set, the faster one is automatically selected (See Alg. 3 in the paper for more details).See Guideline for search for tips of the parameter selection.
Parameters:  q (np.ndarray) – The query vector with the shape=(D, ) and dtype=np.float32.
 topk (int) – The number of PQcodes to be returned. The default value is 1.
 L (int) – The number of PQcodes for the candidates of distance evaluation.
With a higher
L
value, the accuracy is boosted but the runtime gets slower. The default value is a minimum multiple ofL0
that coverstopk
. This is typicallyL0
. Note thatL
is used only if the search method is invertedindex.  target_ids (np.ndarray) – The target identifiers with the shape=(S, ) and dtype=np.int64, where S can
be any scalar. The default value is None, then the search is run over the whole dataset.
Note that
target_ids
does not need to be sorted ifsort_target_ids==True
, where it will be sorted inside this function automatically. Otherwise please sorttarget_ids
before calling this function.  sort_target_ids (bool) – The default value is True. If True,
target_ids
will be sorted automatically inside this function before running the search. If False,target_ids
will not be sorted. Note thattarget_ids
must be sorted before the search, so please sort it by yourself if you setsort_target_ids
as False.  method (str) – The search algorithm to be used: ‘linear’, ‘ivf’, or ‘auto’.
Returns: The result (nearest items) of the search. The first one is the identifiers of the items, with the shape=(topk, ) and dtype=int64. The second one is the distances of the items to the query, with the shape=(topk, ) and dtype=float64
Return type: (np.ndarray, np.ndarray)

clear
()¶ Clear all data, i.e., (1) coarse_centers, (2) PQcodes, (3) posting_lists, and (4) threshold function. Note that codewords are kept.

print_params
()¶ Print all parameters with some verbose information for debugging.