[AI] Vector Database Integration with ChromaDB #1807

@ad-m-ss

Ticket Information

Context & Background

Implement ChromaDB integration for storing and retrieving vector embeddings of article chunks. This provides the semantic search capabilities for the AI article indexing system, enabling similarity-based article recommendations.

Reference Documents:

  • Phase 1 Implementation Plan: docs/ai/phase1-implementation.rst

Requirements & Acceptance Criteria

  • Install and configure ChromaDB with persistent storage
  • Set up collection management for article embeddings
  • Implement vector storage operations for article chunks
  • Create similarity search functionality with configurable parameters
  • Implement collection management and maintenance operations
  • Document backup and recovery procedures for vector data
  • Create comprehensive unit tests for all vector operations
  • Set up monitoring for vector database performance

Implementation Steps

1. ChromaDB Installation and Configuration

  • Add ChromaDB to project requirements (version >= 0.4.0)
  • Create knowledge/services/vector_db.py with VectorDatabase class
  • Initialize ChromaDB persistent client with the following settings:
    • Persist Directory: Configurable storage path (default: "./chroma_db")
    • Anonymized Telemetry: Disabled for privacy
    • Allow Reset: Disabled in production, enabled for development
    • Collection Name: "article_chunks" for legal article chunks
    • Collection Metadata: Description and configuration information
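
A minimal sketch of how these settings could be wired together, assuming ChromaDB 0.4.x or later. `VectorDBConfig` and `open_collection` are hypothetical names used only for illustration; `chromadb.PersistentClient`, `Settings(anonymized_telemetry=..., allow_reset=...)` and `get_or_create_collection` are the ChromaDB calls it relies on. The `"hnsw:space"` metadata key is one way to select the distance metric at creation time, and the description string is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorDBConfig:
    """Settings from the list above; the field names are illustrative."""

    persist_directory: str = "./chroma_db"
    collection_name: str = "article_chunks"
    allow_reset: bool = False        # enable only in development
    distance_metric: str = "cosine"  # passed as the "hnsw:space" metadata key


def open_collection(config: VectorDBConfig):
    """Open (or create) the persistent collection described by `config`."""
    # Imported lazily so the config class is usable without chromadb installed.
    import chromadb
    from chromadb.config import Settings

    client = chromadb.PersistentClient(
        path=config.persist_directory,
        settings=Settings(
            anonymized_telemetry=False,
            allow_reset=config.allow_reset,
        ),
    )
    collection = client.get_or_create_collection(
        name=config.collection_name,
        metadata={
            "description": "Embeddings of legal article chunks",  # placeholder text
            "hnsw:space": config.distance_metric,
        },
    )
    return client, collection
```

In a Django project the `VectorDBConfig` values would presumably come from settings, with `allow_reset` tied to the development flag rather than hard-coded.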

2. Vector Storage Operations

Create vector storage functionality with the following methods:

  • Single Embedding Storage: Store individual chunk embedding with metadata
    • Accept chunk_id, embedding vector, content text, and metadata dictionary
    • Store in ChromaDB collection with proper ID mapping
  • Batch Embedding Storage: Store multiple embeddings efficiently
    • Process arrays of embeddings, documents, metadata, and IDs
    • Use ChromaDB batch add operation for performance
  • Update Operations: Update existing chunk embeddings
    • Support updating embedding, content, and metadata for existing chunks
  • Delete Operations: Remove embeddings from the collection
    • Single chunk deletion by chunk ID
    • Bulk deletion for all chunks of a specific article
    • Support metadata-based filtering for deletions
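
The storage operations above could look roughly like the following sketch. The function names and the `article_id` metadata key are assumptions, not part of the ticket; `collection` is expected to be a ChromaDB collection, whose `add`, `update` and `delete` methods accept the keyword arguments used here.

```python
from typing import Any, Mapping, Sequence


def store_chunks(
    collection: Any,
    ids: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    documents: Sequence[str],
    metadatas: Sequence[Mapping[str, Any]],
    batch_size: int = 100,
) -> int:
    """Add chunks in fixed-size batches and return how many were stored."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    if len({len(ids), len(embeddings), len(documents), len(metadatas)}) != 1:
        raise ValueError("ids, embeddings, documents and metadatas must have equal length")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            ids=list(ids[start:end]),
            embeddings=[list(vector) for vector in embeddings[start:end]],
            documents=list(documents[start:end]),
            metadatas=[dict(meta) for meta in metadatas[start:end]],
        )
    return len(ids)


def update_chunk(collection, chunk_id, embedding, document, metadata) -> None:
    """Overwrite the embedding, content and metadata of an existing chunk."""
    collection.update(
        ids=[chunk_id],
        embeddings=[list(embedding)],
        documents=[document],
        metadatas=[dict(metadata)],
    )


def delete_chunk(collection, chunk_id: str) -> None:
    """Remove a single chunk by its ID."""
    collection.delete(ids=[chunk_id])


def delete_article_chunks(collection, article_id: int) -> None:
    """Remove every chunk whose metadata carries the given article_id."""
    collection.delete(where={"article_id": article_id})
```

Storing a single embedding is the one-element case of `store_chunks`, so a separate code path may not be needed.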

3. Similarity Search Implementation

Implement similarity search functionality with the following features:

  • Basic Similarity Search: Query collection with embedding vector
    • Accept query embedding, result count, minimum similarity threshold
    • Support metadata filters for refined search results
    • Return results with similarity scores, content, and metadata
  • Advanced Search Options: Enhanced search capabilities
    • Configurable distance metric (cosine by default; ChromaDB's own default is squared L2, so the metric must be set when the collection is created)
    • Result filtering by minimum similarity threshold
    • Metadata-based filtering (article_id, categories, tags, date ranges)
  • Specialized Search Methods: Domain-specific search functionality
    • Legal case recommendation system with appropriate similarity thresholds
    • Article-to-article similarity for related content discovery
    • Category-based search with legal domain filtering
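
As a rough sketch of the basic search path, assuming the collection uses cosine space (where ChromaDB reports distance as 1 minus cosine similarity). `search_similar`, `build_where` and the metadata keys `article_id`, `category` and `published_ts` are hypothetical; the `$and` and `$gte` operators are part of ChromaDB's `where` filter syntax, and range operators apply to numeric values, so dates are assumed to be stored as Unix timestamps.

```python
from typing import Any, Optional, Sequence


def build_where(
    article_id: Optional[int] = None,
    category: Optional[str] = None,
    published_after: Optional[int] = None,
) -> Optional[dict]:
    """Combine the optional filters into a ChromaDB `where` clause."""
    clauses = []
    if article_id is not None:
        clauses.append({"article_id": article_id})
    if category is not None:
        clauses.append({"category": category})
    if published_after is not None:
        clauses.append({"published_ts": {"$gte": published_after}})
    if not clauses:
        return None
    # "$and" needs at least two operands, so a single clause is returned bare.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


def search_similar(
    collection: Any,
    query_embedding: Sequence[float],
    n_results: int = 10,
    min_similarity: float = 0.0,
    where: Optional[dict] = None,
) -> list:
    """Query the collection and keep hits at or above `min_similarity`."""
    query_kwargs = {
        "query_embeddings": [list(query_embedding)],
        "n_results": n_results,
        "include": ["documents", "metadatas", "distances"],
    }
    if where is not None:
        query_kwargs["where"] = where
    raw = collection.query(**query_kwargs)

    # Results are nested per query; a single query was sent, so take index 0.
    rows = zip(raw["ids"][0], raw["distances"][0], raw["documents"][0], raw["metadatas"][0])
    hits = []
    for chunk_id, distance, document, metadata in rows:
        similarity = 1.0 - distance  # valid for cosine space only
        if similarity >= min_similarity:
            hits.append(
                {
                    "chunk_id": chunk_id,
                    "similarity": similarity,
                    "content": document,
                    "metadata": metadata,
                }
            )
    return hits
```

Because the threshold is applied after the query, fewer than `n_results` hits may come back; callers that need a full page could over-fetch and trim.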

4. Collection Management and Maintenance

Implement collection management with the following capabilities:

  • Collection Statistics: Monitor collection health and usage
    • Total chunk count, collection metadata, last update timestamps
    • Performance metrics and storage usage information
  • Maintenance Operations: Collection optimization and cleanup
    • Collection reset functionality (development/testing only)
    • Orphaned embedding cleanup for deleted articles
    • Collection optimization for improved query performance
  • Backup and Recovery: Data protection and disaster recovery
    • Collection backup to external storage with timestamps
    • Restore functionality from backup files
    • Incremental backup support for large collections
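
One possible shape for the backup, restore and orphan-cleanup pieces is sketched below. It assumes a full backup is a timestamped copy of the persist directory taken while no writer is active; that is a simplification, not a documented ChromaDB backup procedure, and incremental backups are not covered. All function names and the `article_id` metadata key are hypothetical; `collection.get(include=["metadatas"])` is the ChromaDB call used to list stored records.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Iterable, Optional


def backup_persist_dir(persist_dir, backup_root, now: Optional[datetime] = None) -> Path:
    """Copy the persist directory into a new timestamped backup folder."""
    moment = now or datetime.now(timezone.utc)
    target = Path(backup_root) / ("chroma_backup_" + moment.strftime("%Y%m%dT%H%M%SZ"))
    shutil.copytree(persist_dir, target)  # fails if the target already exists
    return target


def restore_persist_dir(backup_dir, persist_dir) -> None:
    """Replace the persist directory with the contents of a backup."""
    persist_path = Path(persist_dir)
    if persist_path.exists():
        shutil.rmtree(persist_path)
    shutil.copytree(backup_dir, persist_path)


def find_orphaned_chunk_ids(collection: Any, live_article_ids: Iterable[int]) -> list:
    """Return IDs of chunks whose article_id is no longer a live article."""
    live = set(live_article_ids)
    records = collection.get(include=["metadatas"])
    return [
        chunk_id
        for chunk_id, metadata in zip(records["ids"], records["metadatas"])
        if (metadata or {}).get("article_id") not in live
    ]
```

The IDs returned by `find_orphaned_chunk_ids` can then be passed to the collection's `delete(ids=...)`. For large collections the full `get` would likely need to be paged.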

5. Django Integration Service

Create knowledge/services/embedding_service.py with Django-specific functionality:

  • Django Settings Integration: Use Django configuration for ChromaDB paths
  • Model Integration: Work with Article and ArticleChunk Django models
  • Article Chunk Storage: Store all chunks for a complete article
    • Extract embeddings from ArticleChunk models
    • Include comprehensive metadata: article_id, title, chunk_index, timestamps, categories, tags
    • Use batch operations for efficient storage of multiple chunks
  • Search Interface: Provide Django-friendly search interface
    • Query method accepting text or embeddings
    • Return Django-compatible result objects
    • Integration with search logging and analytics
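
To illustrate the chunk-storage step, the sketch below turns chunk objects into the parallel lists a ChromaDB batch add expects. The attribute names on the chunk and article objects (`chunk_index`, `embedding`, `content`, `article.pk`, `article.title`, `article.updated_at`, `article.category_names`, `article.tag_names`) and the ID format are assumptions about the `Article` and `ArticleChunk` models, which the ticket does not specify. Categories and tags are joined into strings on the assumption that metadata values must be scalars.

```python
from typing import Any, Iterable


def chunk_to_record(chunk: Any) -> dict:
    """Build the ID, embedding, document and metadata for one chunk."""
    article = chunk.article
    metadata = {
        "article_id": article.pk,
        "title": article.title,
        "chunk_index": chunk.chunk_index,
        # Numeric timestamp so that range filters can be applied later.
        "updated_ts": int(article.updated_at.timestamp()),
        # Lists are flattened to comma-separated strings (scalar values only).
        "categories": ",".join(sorted(article.category_names)),
        "tags": ",".join(sorted(article.tag_names)),
    }
    return {
        "id": f"article-{article.pk}-chunk-{chunk.chunk_index}",
        "embedding": list(chunk.embedding),
        "document": chunk.content,
        "metadata": metadata,
    }


def article_to_batch(chunks: Iterable[Any]) -> dict:
    """Collect all chunks of an article into keyword arguments for a batch add."""
    records = [chunk_to_record(chunk) for chunk in chunks]
    return {
        "ids": [record["id"] for record in records],
        "embeddings": [record["embedding"] for record in records],
        "documents": [record["document"] for record in records],
        "metadatas": [record["metadata"] for record in records],
    }
```

With a real collection this would be used as `collection.add(**article_to_batch(chunks))`. Deriving the ID from the article primary key and chunk index keeps re-indexing idempotent when combined with an update or upsert.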

6. Management Commands for Vector Operations

Create knowledge/management/commands/index_articles.py with the following features:

  • Command Arguments: Flexible indexing options
    • --article-id: Index specific article by ID (optional)
    • --reset: Reset collection before indexing (development flag)
    • --batch-size: Configurable batch size for processing (default 50)
  • Processing Integration: Use ProcessingLog for monitoring and tracking
  • Batch Processing: Efficient processing of large article collections
    • Query articles with existing embeddings
    • Process in configurable batches to manage memory usage
    • Track progress and performance metrics
  • Error Handling: Comprehensive error handling with logging
    • Integration with Sentry for error tracking
    • Processing log updates for success/failure states
    • Graceful handling of individual article failures
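
The command's argument handling and failure isolation might be structured as in this sketch. It is written against plain `argparse` so it runs without Django; in the real command, `add_index_arguments` would be called from `BaseCommand.add_arguments(self, parser)`, which receives an argparse-compatible parser. The helper names and the `index_one` callback are hypothetical, and the `ProcessingLog` and Sentry hooks are left out because their interfaces are not described in the ticket.

```python
import argparse
import logging
from typing import Callable, Sequence, Tuple

logger = logging.getLogger(__name__)


def add_index_arguments(parser: argparse.ArgumentParser) -> None:
    """Register the options listed above on an argparse-style parser."""
    parser.add_argument("--article-id", type=int, default=None,
                        help="Index only the article with this ID")
    parser.add_argument("--reset", action="store_true",
                        help="Reset the collection before indexing (development only)")
    parser.add_argument("--batch-size", type=int, default=50,
                        help="Number of articles processed per batch")


def index_articles(
    article_ids: Sequence[int],
    index_one: Callable[[int], None],
    batch_size: int = 50,
) -> Tuple[int, list]:
    """Index articles batch by batch; one failure does not stop the run."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    indexed, failed = 0, []
    for start in range(0, len(article_ids), batch_size):
        batch = article_ids[start:start + batch_size]
        for article_id in batch:
            try:
                index_one(article_id)
                indexed += 1
            except Exception:
                # Log with traceback and continue with the next article.
                logger.exception("Indexing failed for article %s", article_id)
                failed.append(article_id)
        logger.info("Processed %d of %d articles", start + len(batch), len(article_ids))
    return indexed, failed
```

Returning the list of failed IDs gives the command something concrete to write into the processing log and makes a retry of only the failed articles straightforward.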

Code Changes Required

  • Create knowledge/services/vector_db.py with ChromaDB client
  • Create knowledge/services/embedding_service.py for Django integration
  • Add ChromaDB configuration to Django settings
  • Create management commands for vector operations
  • Update Article and ArticleChunk models with vector-related methods
  • Add comprehensive unit tests for vector operations
  • Create integration tests for search functionality

External Documentation

Deliverables

  1. ChromaDB integration with persistent storage
  2. Vector storage and retrieval operations
  3. Similarity search functionality
  4. Collection management and maintenance tools
  5. Django service integration layer
  6. Management commands for vector operations
  7. Comprehensive test suite for vector operations
  8. Backup and recovery procedures documentation
  9. Performance monitoring and optimization guidelines

Performance Requirements

  • Store embeddings for 10,000+ article chunks
  • Similarity search response time under 500 ms when returning 10 results
  • Batch operations should handle 100+ embeddings efficiently
  • Collection should support concurrent read operations
  • Backup operations should complete within 10 minutes
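
A simple way to check the latency targets above is to time repeated calls and look at the tail rather than the mean. The helper below is a hypothetical sketch using only the standard library; `operation` stands for any zero-argument callable, such as a lambda wrapping a 10-result similarity search, and the 95th percentile uses the nearest-rank method.

```python
import math
import time
from typing import Callable


def measure_latency_ms(operation: Callable[[], object], repeats: int = 20) -> dict:
    """Run `operation` repeatedly and summarize wall-clock latency in ms."""
    if repeats < 1:
        raise ValueError("repeats must be positive")
    samples = []
    for _ in range(repeats):
        started = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - started) * 1000.0)
    samples.sort()
    # Nearest-rank percentile: the smallest sample covering 95% of the runs.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "mean_ms": sum(samples) / len(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }
```

A test could then assert that `measure_latency_ms(run_search)["p95_ms"] < 500` against a collection seeded with 10,000 chunks, where `run_search` is whatever search call the project ends up exposing.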

Next Steps
