Open
Description
Currently, we compute the distances between every instance and every cluster centroid with a single matrix multiplication of the instance and centroid matrices. The resulting distance matrix requires n_instances x n_centroids memory. When more than one thread is used, several such matrices are in flight at once, resulting in large memory use.
Alternatively, we could compute the centroid distances one instance at a time. This needs only n_centroids of working memory per thread, but is more CPU-intensive.
The reductive API should offer both approaches.
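
To make the trade-off concrete, here is a minimal sketch of how the two approaches could sit behind one switch. It is not the current reductive code: `DistanceStrategy`, `assign`, and the helper functions are hypothetical names, instances and centroids are plain `Vec<f32>` rows, and the full-matrix variant fills the matrix with plain loops instead of the matrix multiplication we actually use.

```rust
/// Hypothetical strategy switch; not part of the existing reductive API.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum DistanceStrategy {
    /// Materialize the full n_instances x n_centroids distance matrix.
    FullMatrix,
    /// Compute the centroid distances for one instance at a time.
    PerInstance,
}

/// Squared Euclidean distance between two vectors of equal length.
fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Index of the smallest value in a row (assumed to be non-empty).
fn argmin(row: &[f32]) -> usize {
    let mut best = 0;
    for (i, &d) in row.iter().enumerate() {
        if d < row[best] {
            best = i;
        }
    }
    best
}

/// Assign every instance to its nearest centroid.
pub fn assign(
    instances: &[Vec<f32>],
    centroids: &[Vec<f32>],
    strategy: DistanceStrategy,
) -> Vec<usize> {
    match strategy {
        DistanceStrategy::FullMatrix => {
            // Extra memory: n_instances * n_centroids distances held at once.
            let distances: Vec<Vec<f32>> = instances
                .iter()
                .map(|inst| {
                    centroids
                        .iter()
                        .map(|c| squared_distance(inst, c))
                        .collect()
                })
                .collect();
            distances.iter().map(|row| argmin(row)).collect()
        }
        DistanceStrategy::PerInstance => {
            // Extra memory: a single buffer of n_centroids distances, reused
            // for every instance.
            let mut row = vec![0.0f32; centroids.len()];
            instances
                .iter()
                .map(|inst| {
                    for (d, c) in row.iter_mut().zip(centroids) {
                        *d = squared_distance(inst, c);
                    }
                    argmin(&row)
                })
                .collect()
        }
    }
}
```

Both variants return the same assignments, e.g. `assign(&instances, &centroids, DistanceStrategy::PerInstance)`; only the peak memory and the amount of work that can be handed to an optimized matrix multiplication differ.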