A Streamlit application that demonstrates semantic chunking of text using sentence embeddings and similarity search. The app splits text into sentences, creates embeddings using Sentence Transformers, and visualizes semantic relationships between sentences.
- Interactive text input
- Adjustable similarity threshold
- Visualization of semantic relationships between sentences
- Automatic sentence clustering based on semantic similarity
- Interactive visualization with hover details
- Create a new Python virtual environment:

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install the required packages:

      pip install -r requirements.txt

- streamlit: Web application framework
- sentence-transformers: For generating sentence embeddings
- qdrant-client: Vector similarity search
- nltk: Natural language processing tools
- plotly: Interactive visualizations
- scikit-learn: For dimensionality reduction (t-SNE)
- numpy: Numerical computations
- torch: Required by sentence-transformers
- Start the Streamlit app:

      streamlit run app.py

- Enter your text in the text area or use the provided sample text
- Adjust the similarity threshold slider:
  - Higher values (e.g., 0.8-0.9) create more granular chunks
  - Lower values (e.g., 0.5-0.7) create larger, more inclusive chunks
- Click "Process Text" to analyze and visualize the semantic relationships
- Explore the visualization:
  - Each point represents a sentence
  - Connected points are semantically similar sentences
  - Hover over points to see the full sentence text
  - Colors indicate different semantic clusters
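
To make the threshold behaviour concrete, here is a minimal sketch of one simple grouping rule: a new chunk starts whenever the cosine similarity between neighbouring sentences falls below the threshold. This rule, the function names `cosine` and `chunk_by_threshold`, and the toy 2-D vectors are all assumptions for illustration. The app itself works on real sentence embeddings and may group sentences differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def chunk_by_threshold(vectors, threshold):
    """Group consecutive sentence indices; split where similarity drops below threshold.

    Hypothetical helper for illustration, not taken from app.py.
    """
    if not vectors:
        return []
    chunks = [[0]]
    for i in range(1, len(vectors)):
        if cosine(vectors[i - 1], vectors[i]) >= threshold:
            chunks[-1].append(i)   # similar enough: extend the current chunk
        else:
            chunks.append([i])     # too different: start a new chunk
    return chunks


# Toy 2-D vectors standing in for real sentence embeddings (assumption).
toy_vectors = [(1.0, 0.0), (0.9, 0.3), (0.6, 0.8), (0.0, 1.0)]

loose = chunk_by_threshold(toy_vectors, 0.6)   # one large chunk
strict = chunk_by_threshold(toy_vectors, 0.9)  # three smaller chunks
print(len(loose), len(strict))
```

With these toy vectors, the neighbouring similarities are roughly 0.95, 0.82, and 0.80, so a threshold of 0.6 keeps everything together while 0.9 splits after the second sentence and again after the third.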
- Text Processing:
  - Splits input text into sentences using NLTK
  - Generates embeddings using the all-MiniLM-L6-v2 model
  - Stores embeddings in Qdrant for similarity search
- Chunking:
  - Uses cosine similarity to find related sentences
  - Groups sentences based on similarity threshold
  - Creates chunks of semantically related sentences
- Visualization:
  - Reduces embedding dimensionality using t-SNE
  - Creates interactive plot using Plotly
  - Shows relationships between sentences through connections
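
The connections and cluster colors described above can be derived from pairwise similarities. The sketch below is one possible approach, not the code in app.py: it lists every sentence pair whose cosine similarity meets the threshold, then labels connected groups with a small union-find. The names `similarity_edges` and `cluster_labels` are hypothetical, and the toy 2-D vectors stand in for all-MiniLM-L6-v2 embeddings. In the app, the similarity lookup goes through Qdrant rather than a Python loop.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_edges(vectors, threshold):
    """Return (i, j, score) for every sentence pair at or above the threshold."""
    edges = []
    for i, j in combinations(range(len(vectors)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            edges.append((i, j, score))
    return edges


def cluster_labels(n, edges):
    """Label connected components so each cluster can get its own color."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for i, j, _ in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)  # merge the two components

    # Renumber component roots to 0, 1, 2, ... in order of first appearance.
    label_of = {}
    return [label_of.setdefault(find(i), len(label_of)) for i in range(n)]


# Toy 2-D vectors standing in for real sentence embeddings (assumption).
toy_vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (0.7, 0.7)]

edges = similarity_edges(toy_vectors, threshold=0.9)
labels = cluster_labels(len(toy_vectors), edges)
print([(i, j) for i, j, _ in edges], labels)
```

Each edge would become a line between two points in the plot, and each label a color. The last toy vector sits between the two groups but is not similar enough to either at a threshold of 0.9, so it ends up in a cluster of its own.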
- Currently uses in-memory storage (not persistent)
- Limited to processing text that fits in memory
- Visualization may become cluttered with very large texts
- Requires manual tuning of similarity threshold
- Add persistent storage for embeddings
- Implement more sophisticated chunking algorithms
- Add support for document upload
- Improve visualization for large texts
- Add export functionality for chunks
- Add support for multiple languages
Feel free to submit issues and enhancement requests!