Embeddings are quantized to int8 when written to `clusters/`. What effect does this quantization have on our
- search metrics
- throughput/communication
- other?
Should we be considering any other quantization schemes?
Current approach:
Document:
Embeddings are scaled with reference to a set of documents.
For us, this is done per cluster:
```python
reduced_embeddings = drm.run_pca(self.pca_components, cluster_embeddings)
```
and again for the centroids:
```python
reduced_centroids = drm.run_pca(self.pca_components, centroids)
```
The quantization method:
```python
# Adaptive scaling to fit within int8 range
data_min = np.min(transformed)
data_max = np.max(transformed)
data_range = max(abs(data_min), abs(data_max))
# TODO: generalise quantisation
scale_factor = 127.0 / data_range
quantized = np.clip(np.round(transformed * scale_factor), -127, 127)
```
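For reference, the same scheme as a self-contained sketch (helper names are mine, not from the repo; the guard against an all-zero `data_range`, which the snippet above would divide by, is an addition):

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric int8 quantization with a single scale for the whole array."""
    data_range = float(np.max(np.abs(x)))
    scale = 127.0 / data_range if data_range > 0 else 1.0  # guard zero range
    q = np.clip(np.round(x * scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate inverse; per-entry error is at most 0.5 / scale."""
    return q.astype(np.float32) / scale
```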
NB: before quantization, PCA is run with reference to the entire set of documents. It is applied per cluster, but with shared weights.
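In other words, roughly this shape (a sketch with sklearn's PCA standing in for `drm.run_pca`, whose internals aren't shown here; the data and component count are made up):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
all_embeddings = rng.normal(size=(1_000, 768))  # stand-in corpus
clusters = {0: all_embeddings[:400], 1: all_embeddings[400:]}

# Fit once on the full corpus so every cluster shares the same weights.
pca = PCA(n_components=192)
pca.fit(all_embeddings)

# Apply the shared projection per cluster, then quantize as above.
reduced = {cid: pca.transform(embs) for cid, embs in clusters.items()}
```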
Query:
Per-query scaling, with no reference to the document scaling.
`arc-tiptoe/src/arc_tiptoe/search/query_processor.py`, lines 216 to 223 at `5ce4188`:
```python
# Quantize embedding
data_min = np.min(embedding_reduced)
data_max = np.max(embedding_reduced)
data_range = max(abs(data_min), abs(data_max))
scale = 127.0 / data_range
embedding_quantized = np.clip(
    np.round(embedding_reduced * scale), -127, 127
).astype(np.int8)
```
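One way to gauge the first question (effect on search metrics): quantize both sides with their own scales, recover approximate scores by dividing the int8 inner product by the product of the two scales, and check how the ranking shifts. A rough harness on synthetic data, not code from the repo:

```python
import numpy as np

def quantize(x: np.ndarray):
    scale = 127.0 / np.max(np.abs(x))
    return np.clip(np.round(x * scale), -127, 127).astype(np.int8), scale

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 192)).astype(np.float32)  # one "cluster"
query = rng.normal(size=192).astype(np.float32)

q_docs, s_docs = quantize(docs)      # shared document scale, as per cluster
q_query, s_query = quantize(query)   # per-query scale, as in the snippet

exact = docs @ query
approx = (q_docs.astype(np.int32) @ q_query.astype(np.int32)) / (s_docs * s_query)

k = 10
overlap = len(set(np.argsort(-exact)[:k]) & set(np.argsort(-approx)[:k]))
print(f"top-{k} overlap: {overlap}/{k}")
print("max abs score error:", np.max(np.abs(exact - approx)))
```

Note that both scales are constants for a given query and cluster, so within-cluster ranking is only perturbed by rounding; the differing per-cluster document scales matter once scores are compared across clusters.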
Alternatives
- per-axis scaling? (see the sketch after this list)
- use the document scaling as the reference when scaling queries (i.e. use the same weights). For example, if we keep per-cluster scaling, we could keep a reference set of weights per cluster and apply these to the query before sending it to each cluster.
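A minimal sketch of the per-axis idea, with the per-dimension scales fit on documents and reused for queries so both sides share the same grid (all names hypothetical):

```python
import numpy as np

def fit_axis_scales(docs: np.ndarray) -> np.ndarray:
    """One symmetric scale per dimension, fit on the document side."""
    per_axis_max = np.max(np.abs(docs), axis=0)
    return 127.0 / np.where(per_axis_max > 0, per_axis_max, 1.0)

def quantize_with_scales(x: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Queries may exceed the document range on some axis, hence the clip.
    return np.clip(np.round(x * scales), -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
docs = rng.normal(size=(1_000, 192))
scales = fit_axis_scales(docs)
q_docs = quantize_with_scales(docs, scales)
q_query = quantize_with_scales(rng.normal(size=192), scales)  # same weights
```

One caveat: if both sides are scaled per axis, the raw int8 dot product computes a weighted inner product (axis `i` picks up a factor of `scales[i]**2`), so recovering the true score needs per-axis dequantization rather than a single division.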