[ENH]: Load HNSW index without disk intermediary #5159
Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving:
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
Enable In-Memory Loading of HNSW Index from S3 (No Disk Intermediary)

This PR refactors the HNSW index loading pipeline to support direct in-memory loading of index data from S3, rather than writing and reading intermediate files to/from disk. It updates interface contracts, adds a new code path for loading buffers directly, introduces relevant changes throughout the index provider and index types, and bumps the hnswlib dependency to a specific commit supporting this feature.

Key Changes
- Added `PersistentIndex::load_from_hnsw_data` API (with trait and implementations) to allow direct in-memory index loading.

Affected Areas
- rust/index/src/hnsw_provider.rs

This summary was automatically generated by @propel-code-bot
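The summary above names a new `PersistentIndex::load_from_hnsw_data` entry point. As a rough sketch of what such a contract could look like — note that `HnswData`, `IndexConfig`, and `HnswIndex` below are simplified stand-ins, and the real hnswlib types and signature may differ:

```rust
// Stand-in for the raw index buffers hnswlib exposes after an S3 fetch.
struct HnswData {
    header_buffer: Vec<u8>,
    data_level0_buffer: Vec<u8>,
    length_buffer: Vec<u8>,
    link_list_buffer: Vec<u8>,
}

struct IndexConfig {
    dimensionality: usize,
}

struct HnswIndex {
    dimensionality: usize,
    ef_search: usize,
}

trait PersistentIndex: Sized {
    /// Build an index directly from in-memory buffers, skipping disk I/O.
    fn load_from_hnsw_data(
        data: HnswData,
        config: &IndexConfig,
        ef_search: usize,
    ) -> Result<Self, String>;
}

impl PersistentIndex for HnswIndex {
    fn load_from_hnsw_data(
        data: HnswData,
        config: &IndexConfig,
        ef_search: usize,
    ) -> Result<Self, String> {
        if data.header_buffer.is_empty() {
            return Err("empty header buffer".to_string());
        }
        // A real implementation would deserialize all four buffers here.
        Ok(HnswIndex {
            dimensionality: config.dimensionality,
            ef_search,
        })
    }
}
```

The key design point is that the loader consumes buffers rather than a filesystem path, which is what makes the disk intermediary removable.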
8 Jobs Failed:
- PR checks / Lint
- PR checks / Rust tests / test-benches (blacksmith-16vcpu-ubuntu-2204, --bench get)
- PR checks / Python tests / test-cluster-rust-frontend (3.9, chromadb/test/property/test_embeddings.py)
- PR checks / Rust tests / Integration test ci_k8s_integration 2
- PR checks / Rust tests / test (blacksmith-8vcpu-ubuntu-2204)
- PR checks / Rust tests / test-long
- PR checks / all-required-pr-checks-passed

1 job failed running on non-Blacksmith runners. Summary: 1 successful workflow, 1 failed workflow.
Last updated: 2025-07-29 15:10:29 UTC
```rust
None => match HnswIndex::load_from_hnsw_data(
    self.fetch_hnsw_segment(&new_id, prefix_path)
        .await
        .map_err(|e| Box::new(HnswIndexProviderForkError::FileError(*e)))?,
    &index_config,
    ef_search,
    new_id,
) {
```
[CriticalError]

The logic for loading the index within the `fork` method appears to be incorrect. It attempts to fetch the segment from remote storage using `new_id`, but the index for `new_id` doesn't exist in storage yet; the files for `source_id` have just been copied to a local directory. The previous implementation using `HnswIndex::load(storage_path_str, ...)` correctly loaded the index from this new local directory. Since `fork` is intended to create a mutable, file-backed copy of an index, it seems the original approach of loading from the local path should be restored.
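The reviewer's point can be illustrated with a minimal, self-contained sketch: `fork` first materializes the source index's files into a fresh local directory for the new copy, so the subsequent load must target that local path, not remote storage. `copy_index_files` and `load_from_local_path` below are hypothetical stand-ins, not the repository's actual helpers:

```rust
use std::fs;
use std::path::Path;

// Step 1 of a fork: copy every file of the source index into a fresh
// local directory that will back the new (mutable) index copy.
fn copy_index_files(source_dir: &Path, target_dir: &Path) -> std::io::Result<()> {
    fs::create_dir_all(target_dir)?;
    for entry in fs::read_dir(source_dir)? {
        let entry = entry?;
        fs::copy(entry.path(), target_dir.join(entry.file_name()))?;
    }
    Ok(())
}

// Step 2: load from that local directory (stand-in for
// `HnswIndex::load(storage_path_str, ...)`). Here we just verify the
// expected component files are all present and return how many we found.
fn load_from_local_path(dir: &Path, files: &[&str]) -> Result<usize, String> {
    let mut found = 0;
    for f in files {
        if dir.join(f).exists() {
            found += 1;
        } else {
            return Err(format!("missing {f}"));
        }
    }
    Ok(found)
}
```

Fetching by the new ID from remote storage at step 2 would fail (or race), because nothing has been uploaded under that ID yet.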
```rust
let hnsw_data = self.fetch_hnsw_segment(source_id, prefix_path).await?;
let getters = [
    |hnsw_data: &hnswlib::HnswData| Arc::new(Vec::from(hnsw_data.header_buffer())),
    |hnsw_data: &hnswlib::HnswData| Arc::new(Vec::from(hnsw_data.data_level0_buffer())),
    |hnsw_data: &hnswlib::HnswData| Arc::new(Vec::from(hnsw_data.length_buffer())),
    |hnsw_data: &hnswlib::HnswData| Arc::new(Vec::from(hnsw_data.link_list_buffer())),
];

for (file, getter) in FILES.iter().zip(getters) {
    let file_path = index_storage_path.join(file);
    self.copy_bytes_to_local_file(&file_path, getter(&hnsw_data))
        .await?;
}
Ok(())
}
```
[PerformanceOptimization]

This function now fetches the entire HNSW segment into an in-memory `HnswData` object before writing the individual files to disk. The previous implementation streamed each file directly. For large indexes, this change could significantly increase memory usage during the `fork` operation. Was this change intentional? If `fork` still needs to write to disk, perhaps restoring the previous file-by-file download logic for this function would be more memory-efficient.
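The file-by-file alternative the comment alludes to can be sketched as follows: instead of materializing the whole segment as one in-memory object, stream each component file from its source reader straight to its local destination. The file names and `stream_to_local` helper below are illustrative assumptions, not the repository's actual code:

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

// Illustrative component-file names for one HNSW segment.
const FILES: [&str; 4] = ["header.bin", "data_level0.bin", "length.bin", "link_lists.bin"];

// Stream one component from any reader (e.g. a per-file storage download)
// to a local file. io::copy uses a fixed-size internal buffer, so peak
// memory stays constant regardless of the file's size.
fn stream_to_local(mut source: impl Read, dest_path: &Path) -> io::Result<u64> {
    let mut dest = File::create(dest_path)?;
    io::copy(&mut source, &mut dest)
}
```

Looping `stream_to_local` over `FILES` keeps `fork`'s memory footprint bounded, whereas buffering the whole `HnswData` first costs memory proportional to the index size.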
```rust
None => match HnswIndex::load_from_hnsw_data(
    self.fetch_hnsw_segment(id, prefix_path)
        .await
        .map_err(|e| Box::new(HnswIndexProviderOpenError::FileError(*e)))?,
    &index_config,
    ef_search,
    *id,
) {
```
[PerformanceOptimization]

This change successfully loads the index from memory, which aligns with the PR's goal. However, the `open` function still contains calls that write the index to a temporary directory on disk (`create_dir_all`, `load_hnsw_segment_into_directory`) before this memory-based loading occurs. These disk operations now seem redundant. To fully load without a disk intermediary and improve efficiency, you could remove the calls to `create_dir_all`, `load_hnsw_segment_into_directory`, and `purge_one_id` from this function.
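The control flow the comment suggests for `open` reduces to: check the cache, otherwise fetch the segment buffers and build the index purely in memory, with no directory setup or cleanup in between. A minimal stand-in sketch (the `Cache` type and `open` function below are hypothetical, not the provider's real API):

```rust
use std::collections::HashMap;

// Stand-in cache mapping an index id to an already-opened index
// (represented here as a String for simplicity).
struct Cache {
    entries: HashMap<u64, String>,
}

fn open(cache: &mut Cache, id: u64) -> String {
    match cache.entries.get(&id) {
        Some(index) => index.clone(),
        None => {
            // Previously: create_dir_all + load_hnsw_segment_into_directory
            // (+ purge_one_id) would run here. With in-memory loading the
            // fetched buffers feed the loader directly, so those disk
            // steps can be dropped.
            let index = format!("index-{id}"); // stands in for load_from_hnsw_data
            cache.entries.insert(id, index.clone());
            index
        }
    }
}
```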
Description of changes

(WIP)

Test plan

How are these changes tested?
- `pytest` for Python
- `yarn test` for JS
- `cargo test` for Rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?