Effect of quantization and alternate schemes #34

@lannelin

Description

Embeddings are quantized to int8 when written to clusters/. What effect does this quantization have on our:

  1. search metrics
  2. throughput/communication
  3. other?

Should we be considering any other quantization schemes?

Current approach:

Document:

Embeddings are scaled with reference to a set of documents. For us, this is done per cluster:

reduced_embeddings = drm.run_pca(self.pca_components, cluster_embeddings)

and again for the centroids:

reduced_centroids = drm.run_pca(self.pca_components, centroids)

Quantization method:

# Adaptive scaling to fit within int8 range
data_min = np.min(transformed)
data_max = np.max(transformed)
data_range = max(abs(data_min), abs(data_max))
# TODO: generalise quantisation
scale_factor = 127.0 / data_range
quantized = np.clip(np.round(transformed * scale_factor), -127, 127)

NB: before quantization, PCA is fitted with reference to the entire set of documents; it is applied per cluster, but with shared weights.
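
For reference, a minimal end-to-end sketch of the document-side path described above, assuming float32 numpy embeddings. sklearn's PCA is used here only as a stand-in for drm.run_pca, and all_document_embeddings, clusters, pca_components and quantize_int8 are hypothetical names, not from the codebase:

import numpy as np
from sklearn.decomposition import PCA

def quantize_int8(transformed):
    # Symmetric per-array scaling into the int8 range, as in the excerpt above.
    data_range = max(abs(np.min(transformed)), abs(np.max(transformed)))
    scale_factor = 127.0 / data_range
    quantized = np.clip(np.round(transformed * scale_factor), -127, 127).astype(np.int8)
    return quantized, scale_factor

# Fit PCA once, with reference to the entire document set (shared weights) ...
pca = PCA(n_components=pca_components).fit(all_document_embeddings)

# ... then reduce and quantize each cluster separately.
for cluster_embeddings in clusters:
    reduced = pca.transform(cluster_embeddings)
    quantized, scale = quantize_int8(reduced)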

Query:

Per-query scaling, with no reference to the document scaling.

# Quantize embedding
data_min = np.min(embedding_reduced)
data_max = np.max(embedding_reduced)
data_range = max(abs(data_min), abs(data_max))
scale = 127.0 / data_range
embedding_quantized = np.clip(
    np.round(embedding_reduced * scale), -127, 127
).astype(np.int8)
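
To put a rough number on item 1 (search metrics), one possible check, not from the issue, is to score a query against one cluster with the float embeddings and with the int8 versions and compare the rankings. reduced_embeddings, quantized, embedding_reduced and embedding_quantized refer to the snippets above; top_k is a hypothetical helper:

import numpy as np

def top_k(scores, k=10):
    return np.argsort(-scores)[:k]

# Scores in the original float space vs. the quantized space.
float_scores = reduced_embeddings @ embedding_reduced
int8_scores = quantized.astype(np.int32) @ embedding_quantized.astype(np.int32)

# Fraction of the top-10 results preserved after quantization.
overlap = len(set(top_k(float_scores)) & set(top_k(int8_scores))) / 10
print(f"top-10 overlap after quantization: {overlap:.2f}")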

Alternatives

  • Per-axis scaling?
  • Use the document reference when scaling queries (i.e. use the same weights). For example, if keeping per-cluster scaling, we could keep a reference set of scale factors per cluster and apply these to the query before sending it to each cluster (see the sketch below).
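
A sketch of what the two alternatives could look like, under the same assumptions as the snippets above; per_axis_quantize, quantize_query_with_reference and cluster_scale are hypothetical names:

import numpy as np

def per_axis_quantize(transformed):
    # One scale factor per PCA dimension instead of a single global factor.
    axis_range = np.max(np.abs(transformed), axis=0)
    scale = 127.0 / axis_range
    quantized = np.clip(np.round(transformed * scale), -127, 127).astype(np.int8)
    return quantized, scale

def quantize_query_with_reference(embedding_reduced, cluster_scale):
    # Reuse the per-cluster document scale factor(s) so the query and the
    # documents in that cluster share the same int8 grid.
    return np.clip(np.round(embedding_reduced * cluster_scale), -127, 127).astype(np.int8)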
