A Streamlit application that demonstrates the differences between IVF (Inverted File Index) and HNSW (Hierarchical Navigable Small World) indexing methods for approximate nearest neighbor search.
This app allows you to:
- Compare performance metrics between IVF and HNSW methods
- Adjust parameters to see their impact on search performance
- Visualize differences in build time, query time, recall, and memory usage
- Learn when to use each method for your specific use cases
- Python 3.8+ (Python 3.12+ recommended for best compatibility with the latest dependencies)
- pip
- Clone this repository:
git clone https://github.com/yourusername/ivf-index.git
cd ivf-index
- Create and activate a virtual environment (recommended):
# On macOS/Linux
python -m venv venv
source venv/bin/activate
# On Windows
python -m venv venv
venv\Scripts\activate
- Install the required dependencies:
pip install -r requirements.txt
Note: If you encounter installation errors, try updating pip first:
pip install --upgrade pip setuptools wheel
Run the Streamlit app with:
streamlit run app.py
The app will open in your default web browser at http://localhost:8501.
- Interactive Parameters: Adjust dataset size, dimensions, and algorithm-specific parameters
- Performance Metrics: Compare build time, query time, recall, and memory usage
- Visualization: Bar charts showing relative performance
- Educational Content: Explanations of how each algorithm works and when to use them
- The app loads and processes the 20 Newsgroups dataset, converting text to TF-IDF vectors
- PCA is applied to reduce dimensionality to a manageable size
- Both IVF and HNSW indices are built with user-specified parameters
- Random query vectors are selected from the dataset
- Search is performed with both methods and compared to an exact (brute force) search
- Results are displayed as metrics and visualizations
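The comparison against exact search in the steps above is usually summarized as recall@k: the fraction of the true k nearest neighbors that the approximate index also returned. The sketch below shows that calculation with the standard library only. The function names `brute_force_knn` and `recall_at_k` are hypothetical and chosen for illustration; they are not taken from `app.py`, and the "approximate" result here is a hand-made stand-in rather than the output of a real index.

```python
import math
import random


def brute_force_knn(vectors, query, k):
    """Exact k nearest neighbours by scanning every vector (the ground truth)."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], query))
    return order[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate search also returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(200)]
query = vectors[17]  # queries are drawn from the dataset itself

exact = brute_force_knn(vectors, query, k=5)
# Stand-in for an approximate index (assumption): it misses one true neighbour.
approx = exact[:4] + [next(i for i in range(200) if i not in exact)]
print(recall_at_k(approx, exact))  # 4 of the 5 true neighbours found -> 0.8
```

Averaging this value over many queries gives a single recall figure that can be set against build time and query time for each index.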
- IVF (Inverted File Index): clustering-based approach
- Partitions vectors into Voronoi cells
- Searches only within relevant clusters
- Memory-efficient but requires careful parameter tuning
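To make the IVF idea concrete, here is a minimal standard-library sketch: a few k-means iterations define the Voronoi cells, each vector id is filed under its nearest centroid, and a query scans only the closest cells. The names `build_ivf`, `search_ivf`, `nlist`, and `nprobe` are assumptions made for this illustration (the last two follow common usage for "number of cells" and "cells probed per query"); this is not the code the app runs, and a real library would be far more optimized.

```python
import math
import random


def nearest(points, target):
    """Index of the point closest to target (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], target))


def build_ivf(vectors, nlist, iters=10, seed=0):
    """Run a few k-means iterations, then file each vector id under its nearest centroid."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    for _ in range(iters):
        cells = [[] for _ in range(nlist)]
        for v in vectors:
            cells[nearest(centroids, v)].append(v)
        for c, members in enumerate(cells):
            if members:  # keep the old centroid if a cell ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    inverted_lists = [[] for _ in range(nlist)]
    for i, v in enumerate(vectors):
        inverted_lists[nearest(centroids, v)].append(i)
    return centroids, inverted_lists


def search_ivf(vectors, centroids, inverted_lists, query, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    by_distance = sorted(range(len(centroids)),
                         key=lambda c: math.dist(centroids[c], query))
    candidates = [i for c in by_distance[:nprobe] for i in inverted_lists[c]]
    candidates.sort(key=lambda i: math.dist(vectors[i], query))
    return candidates[:k]


random.seed(1)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(500)]
centroids, inverted_lists = build_ivf(vectors, nlist=10)
query = vectors[42]

few = search_ivf(vectors, centroids, inverted_lists, query, k=5, nprobe=1)
full = search_ivf(vectors, centroids, inverted_lists, query, k=5, nprobe=10)
```

With `nprobe` equal to `nlist`, every cell is scanned and the result matches exact search. Lower values scan fewer vectors but can miss neighbors that fall in an unprobed cell, which is the tuning trade-off mentioned above.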
- HNSW (Hierarchical Navigable Small World): graph-based approach with multiple layers
- Creates "highways" for fast traversal
- Generally higher recall and faster queries than IVF
- More memory intensive but handles high dimensions well
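The core routine of a graph index can be sketched as a best-first search over a proximity graph. The example below is deliberately simplified and is not full HNSW: it uses a single layer, and it builds the graph by brute force instead of by incremental insertion. HNSW runs a search of this shape on each layer, with the sparse upper layers acting as the "highways". The names `build_knn_graph`, `search_graph`, `m`, and `ef` are assumptions for this illustration, not the app's code.

```python
import heapq
import math
import random


def build_knn_graph(vectors, m):
    """Link every vector to its m nearest neighbours, and add the reverse links."""
    graph = {i: set() for i in range(len(vectors))}
    for i, v in enumerate(vectors):
        order = sorted(range(len(vectors)), key=lambda j: math.dist(vectors[j], v))
        for j in [j for j in order if j != i][:m]:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def search_graph(vectors, graph, query, entry, k, ef):
    """Best-first search that keeps the ef closest nodes seen so far."""
    d0 = math.dist(vectors[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]  # min-heap: closest unexpanded node first
    best = [(-d0, entry)]       # max-heap via negation: worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:
            break  # the closest open candidate is worse than every kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = math.dist(vectors[nb], query)
            if len(best) < ef or d_nb < -best[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(best, (-d_nb, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current worst result
    return [i for _, i in sorted((-d, i) for d, i in best)][:k]


random.seed(2)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(300)]
graph = build_knn_graph(vectors, m=8)
result = search_graph(vectors, graph, vectors[42], entry=0, k=5, ef=32)
```

A larger `ef` keeps more candidates open and tends to raise recall at the cost of more distance computations. The stored neighbor links are also where the extra memory of a graph index comes from.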
Contributions are welcome! Please feel free to submit a Pull Request.