Open
Description
Ticket Information
- Assigned Team: Engineering Team
- Dependencies: [AI] Content Ingestion Pipeline #1806 (Content Ingestion Pipeline)
Context & Background
Implement ChromaDB integration for storing and retrieving vector embeddings of article chunks. This provides the semantic search capabilities for the AI article indexing system, enabling similarity-based article recommendations.
Reference Documents:
- Phase 1 Implementation Plan:
docs/ai/phase1-implementation.rst
Requirements & Acceptance Criteria
- Install and configure ChromaDB with persistent storage
- Set up collection management for article embeddings
- Implement vector storage operations for article chunks
- Create similarity search functionality with configurable parameters
- Implement collection management and maintenance operations
- Document backup and recovery procedures for vector data
- Create comprehensive unit tests for all vector operations
- Set up monitoring for vector database performance
Implementation Steps
1. ChromaDB Installation and Configuration
- Add ChromaDB to project requirements (version >= 0.4.0)
- Create knowledge/services/vector_db.py with VectorDatabase class
- Initialize ChromaDB persistent client with the following settings:
- Persist Directory: Configurable storage path (default: "./chroma_db")
- Anonymized Telemetry: Disabled for privacy
- Allow Reset: Disabled in production, enabled for development
- Collection Name: "article_chunks" for legal article chunks
- Collection Metadata: Description and configuration information
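The settings above could be wired up roughly as follows. This is a minimal sketch, not the final VectorDatabase class: the VectorDBConfig dataclass, its field names, and open_collection are hypothetical, and the "hnsw:space" metadata key is assumed to be how the distance metric is selected in the ChromaDB version being installed.

```python
from dataclasses import dataclass, field


@dataclass
class VectorDBConfig:
    """Hypothetical settings holder; field names are illustrative only."""
    persist_directory: str = "./chroma_db"
    collection_name: str = "article_chunks"
    anonymized_telemetry: bool = False  # disabled for privacy
    allow_reset: bool = False           # enable only in development
    collection_metadata: dict = field(default_factory=lambda: {
        "description": "Embeddings of legal article chunks",
        "hnsw:space": "cosine",         # assumed key for the distance metric
    })


def open_collection(config: VectorDBConfig):
    """Open the persistent client and return (client, collection)."""
    # Imported lazily so the config can be built without chromadb installed.
    import chromadb
    from chromadb.config import Settings

    client = chromadb.PersistentClient(
        path=config.persist_directory,
        settings=Settings(
            anonymized_telemetry=config.anonymized_telemetry,
            allow_reset=config.allow_reset,
        ),
    )
    collection = client.get_or_create_collection(
        name=config.collection_name,
        metadata=config.collection_metadata,
    )
    return client, collection
```

A development environment would pass VectorDBConfig(allow_reset=True); production keeps the default.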
2. Vector Storage Operations
Create vector storage functionality with the following methods:
- Single Embedding Storage: Store individual chunk embedding with metadata
- Accept chunk_id, embedding vector, content text, and metadata dictionary
- Store in ChromaDB collection with proper ID mapping
- Batch Embedding Storage: Store multiple embeddings efficiently
- Process arrays of embeddings, documents, metadata, and IDs
- Use ChromaDB batch add operation for performance
- Update Operations: Update existing chunk embeddings
- Support updating embedding, content, and metadata for existing chunks
- Delete Operations: Remove embeddings from the collection
- Single chunk deletion by chunk ID
- Bulk deletion for all chunks of a specific article
- Support metadata-based filtering for deletions
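A sketch of the batch storage and article-level deletion described above. The function names and the batch size of 100 are assumptions; the collection argument is expected to be a ChromaDB collection, and only its add and delete methods are used.

```python
from typing import Any, Dict, List, Sequence


def store_chunks(
    collection,
    ids: Sequence[str],
    embeddings: Sequence[List[float]],
    documents: Sequence[str],
    metadatas: Sequence[Dict[str, Any]],
    batch_size: int = 100,
) -> int:
    """Add embeddings to the collection in fixed-size batches.

    Returns the number of chunks submitted.
    """
    if not (len(ids) == len(embeddings) == len(documents) == len(metadatas)):
        raise ValueError("ids, embeddings, documents and metadatas must have equal length")

    stored = 0
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            ids=list(ids[start:end]),
            embeddings=list(embeddings[start:end]),
            documents=list(documents[start:end]),
            metadatas=list(metadatas[start:end]),
        )
        stored += len(ids[start:end])
    return stored


def delete_article_chunks(collection, article_id: int) -> None:
    """Delete every chunk whose metadata carries the given article_id."""
    # Relies on article_id having been written into each chunk's metadata.
    collection.delete(where={"article_id": article_id})
```

Single-chunk deletion would call collection.delete(ids=[chunk_id]) instead of filtering on metadata.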
3. Similarity Search Implementation
Implement similarity search functionality with the following features:
- Basic Similarity Search: Query collection with embedding vector
- Accept query embedding, result count, minimum similarity threshold
- Support metadata filters for refined search results
- Return results with similarity scores, content, and metadata
- Advanced Search Options: Enhanced search capabilities
- Configurable distance metrics (cosine similarity default)
- Result filtering by minimum similarity threshold
- Metadata-based filtering (article_id, categories, tags, date ranges)
- Specialized Search Methods: Domain-specific search functionality
- Legal case recommendation system with appropriate similarity thresholds
- Article-to-article similarity for related content discovery
- Category-based search with legal domain filtering
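The basic search with a minimum similarity threshold might look like the sketch below. It assumes the collection was created with cosine distance, in which case ChromaDB returns a distance equal to 1 minus the cosine similarity; search_similar and the shape of the returned dictionaries are hypothetical.

```python
from typing import Any, Dict, List, Optional


def search_similar(
    collection,
    query_embedding: List[float],
    n_results: int = 10,
    min_similarity: float = 0.0,
    where: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    """Query the collection and keep hits at or above min_similarity."""
    raw = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        where=where,  # optional metadata filter, e.g. {"article_id": 42}
        include=["documents", "metadatas", "distances"],
    )

    hits = []
    # Results are nested one list per query; a single query was sent, so use index 0.
    rows = zip(raw["ids"][0], raw["documents"][0], raw["metadatas"][0], raw["distances"][0])
    for chunk_id, document, metadata, distance in rows:
        similarity = 1.0 - distance  # valid for cosine distance only
        if similarity >= min_similarity:
            hits.append({
                "id": chunk_id,
                "content": document,
                "metadata": metadata,
                "similarity": similarity,
            })
    return hits
```

Because the threshold is applied after the query, fewer than n_results hits may come back; a caller that needs a full page could request more results than it intends to show.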
4. Collection Management and Maintenance
Implement collection management with the following capabilities:
- Collection Statistics: Monitor collection health and usage
- Total chunk count, collection metadata, last update timestamps
- Performance metrics and storage usage information
- Maintenance Operations: Collection optimization and cleanup
- Collection reset functionality (development/testing only)
- Orphaned embedding cleanup for deleted articles
- Collection optimization for improved query performance
- Backup and Recovery: Data protection and disaster recovery
- Collection backup to external storage with timestamps
- Restore functionality from backup files
- Incremental backup support for large collections
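One simple way to implement the timestamped backup is to archive the persist directory. The sketch below assumes that approach and that no process is writing to the directory while the archive is created; the function names and the archive naming scheme are hypothetical, and incremental backups are not covered.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def backup_persist_directory(persist_dir, backup_dir) -> Path:
    """Write a timestamped .tar.gz of the persist directory and return its path."""
    source = Path(persist_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"persist directory not found: {source}")

    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = target_dir / f"chroma_backup_{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store files under the directory's own name so restore recreates it.
        tar.add(source, arcname=source.name)
    return archive


def restore_backup(archive, restore_parent) -> Path:
    """Extract a backup archive into restore_parent and return that path."""
    parent = Path(restore_parent)
    parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(parent)
    return parent
```

Restoring into a fresh parent directory and then pointing the persistent client at the extracted folder avoids overwriting a live collection.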
5. Django Integration Service
Create knowledge/services/embedding_service.py with Django-specific functionality:
- Django Settings Integration: Use Django configuration for ChromaDB paths
- Model Integration: Work with Article and ArticleChunk Django models
- Article Chunk Storage: Store all chunks for a complete article
- Extract embeddings from ArticleChunk models
- Include comprehensive metadata: article_id, title, chunk_index, timestamps, categories, tags
- Use batch operations for efficient storage of multiple chunks
- Search Interface: Provide Django-friendly search interface
- Query method accepting text or embeddings
- Return Django-compatible result objects
- Integration with search logging and analytics
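The chunk-to-payload step of the service could be sketched as below. ChunkRecord is a plain stand-in for the ArticleChunk model, whose real fields are not specified here, and the ID format is an assumption. Categories and tags are joined into strings and the timestamp is stored as epoch seconds, on the assumption that ChromaDB metadata values must be scalars and that date-range filters need a numeric value.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List


@dataclass
class ChunkRecord:
    """Stand-in for an ArticleChunk row; field names are assumed."""
    article_id: int
    title: str
    chunk_index: int
    content: str
    embedding: List[float]
    updated_at: datetime
    categories: List[str]
    tags: List[str]


def to_chroma_payload(chunks: List[ChunkRecord]) -> Dict[str, List[Any]]:
    """Convert chunk records into keyword arguments for collection.add()."""
    payload: Dict[str, List[Any]] = {
        "ids": [], "embeddings": [], "documents": [], "metadatas": [],
    }
    for chunk in chunks:
        payload["ids"].append(f"article-{chunk.article_id}-chunk-{chunk.chunk_index}")
        payload["embeddings"].append(chunk.embedding)
        payload["documents"].append(chunk.content)
        payload["metadatas"].append({
            "article_id": chunk.article_id,
            "title": chunk.title,
            "chunk_index": chunk.chunk_index,
            "updated_at": chunk.updated_at.timestamp(),  # epoch seconds
            "categories": ",".join(chunk.categories),    # flattened list
            "tags": ",".join(chunk.tags),                # flattened list
        })
    return payload
```

The result can be passed straight through as collection.add(**payload), or split into batches first for large articles.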
6. Management Commands for Vector Operations
Create knowledge/management/commands/index_articles.py with the following features:
- Command Arguments: Flexible indexing options
- --article-id: Index specific article by ID (optional)
- --reset: Reset collection before indexing (development flag)
- --batch-size: Configurable batch size for processing (default 50)
- Processing Integration: Use ProcessingLog for monitoring and tracking
- Batch Processing: Efficient processing of large article collections
- Query articles with existing embeddings
- Process in configurable batches to manage memory usage
- Track progress and performance metrics
- Error Handling: Comprehensive error handling with logging
- Integration with Sentry for error tracking
- Processing log updates for success/failure states
- Graceful handling of individual article failures
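The command arguments and the per-article failure handling could be sketched as follows. The add_arguments function mirrors the method of the same name on a Django BaseCommand, which receives an argparse parser; index_in_batches and its index_one callback are hypothetical, and the ProcessingLog and Sentry hooks are left out.

```python
import argparse
from typing import Callable, List, Sequence, Tuple


def add_arguments(parser: argparse.ArgumentParser) -> None:
    """Register the index_articles options on an argparse parser."""
    parser.add_argument("--article-id", type=int, default=None,
                        help="Index only the article with this ID")
    parser.add_argument("--reset", action="store_true",
                        help="Reset the collection before indexing (development only)")
    parser.add_argument("--batch-size", type=int, default=50,
                        help="Number of articles to process per batch")


def index_in_batches(
    article_ids: Sequence[int],
    batch_size: int,
    index_one: Callable[[int], None],
) -> Tuple[List[int], List[Tuple[int, str]]]:
    """Index articles batch by batch, continuing past individual failures.

    Returns (succeeded_ids, failures) where failures holds (id, error message).
    """
    succeeded: List[int] = []
    failed: List[Tuple[int, str]] = []
    for start in range(0, len(article_ids), batch_size):
        for article_id in article_ids[start:start + batch_size]:
            try:
                index_one(article_id)
                succeeded.append(article_id)
            except Exception as exc:  # one bad article must not stop the run
                failed.append((article_id, str(exc)))
    return succeeded, failed
```

In the real command, the except branch is where the processing log entry would be marked as failed and the exception reported.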
Code Changes Required
- Create knowledge/services/vector_db.py with ChromaDB client
- Create knowledge/services/embedding_service.py for Django integration
- Add ChromaDB configuration to Django settings
- Create management commands for vector operations
- Update Article and ArticleChunk models with vector-related methods
- Add comprehensive unit tests for vector operations
- Create integration tests for search functionality
Deliverables
- ChromaDB integration with persistent storage
- Vector storage and retrieval operations
- Similarity search functionality
- Collection management and maintenance tools
- Django service integration layer
- Management commands for vector operations
- Comprehensive test suite for vector operations
- Backup and recovery procedures documentation
- Performance monitoring and optimization guidelines
Performance Requirements
- Store embeddings for 10,000+ article chunks
- Similarity search response time < 500ms for 10 results
- Batch operations should handle 100+ embeddings efficiently
- Collection should support concurrent read operations
- Backup operations should complete within 10 minutes
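To check a target such as the 500 ms search budget, a small timing helper may be enough for a first pass. This is a sketch only: measure_latency_ms is a hypothetical name, and a proper benchmark against a collection of 10,000+ chunks would still be needed.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(func: Callable[[], object], runs: int = 20) -> Dict[str, float]:
    """Call func repeatedly and report median and worst-case latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }
```

A test could wrap a 10-result similarity search in a lambda and assert that median_ms stays under 500.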
Next Steps
- Upon completion, enable [AI] OpenAI Embedding Generation System #1808 (OpenAI Embedding Generation)
- Enable [AI] Search API Development #1809 (Search API Development)
- Schedule performance testing with infrastructure team