Semantic Text Chunking with Streamlit

A Streamlit application that demonstrates semantic chunking of text using sentence embeddings and similarity search. The app splits text into sentences, creates embeddings using Sentence Transformers, and visualizes semantic relationships between sentences.

Features

Interactive text input
Adjustable similarity threshold
Visualization of semantic relationships between sentences
Automatic sentence clustering based on semantic similarity
Interactive visualization with hover details

Installation

Create a new Python virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install the required packages:

pip install -r requirements.txt

Dependencies

streamlit: Web application framework
sentence-transformers: For generating sentence embeddings
qdrant-client: Vector similarity search
nltk: Natural language processing tools
plotly: Interactive visualizations
scikit-learn: For dimensionality reduction (t-SNE)
numpy: Numerical computations
torch: Required by sentence-transformers

Usage

Start the Streamlit app:

streamlit run app.py

Enter your text in the text area or use the provided sample text
Adjust the similarity threshold slider:
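- Higher values (e.g., 0.8-0.9) require sentences to be more similar before they are grouped, producing smaller, more granular chunks
- Lower values (e.g., 0.5-0.7) group more loosely related sentences, producing larger, more inclusive chunks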
Click "Process Text" to analyze and visualize the semantic relationships
Explore the visualization:
- Each point represents a sentence
- Connected points are semantically similar sentences
- Hover over points to see the full sentence text
- Colors indicate different semantic clusters
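
To build intuition for the threshold slider, here is a minimal pure-Python sketch. It is not the code in app.py: it assumes a simplified rule in which a new chunk starts wherever the cosine similarity between two adjacent sentences falls below the threshold. The function names and the 2-D toy vectors are hypothetical stand-ins for real sentence embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def count_chunks(vectors, threshold):
    """Count chunks when a break is placed between adjacent vectors
    whose similarity is below the threshold (assumed rule, for illustration)."""
    if not vectors:
        return 0
    breaks = sum(
        1 for prev, cur in zip(vectors, vectors[1:]) if cosine(prev, cur) < threshold
    )
    return breaks + 1


# Hypothetical toy "embeddings" that drift gradually from one topic to another.
toy_vectors = [[1.0, 0.0], [0.9, 0.3], [0.6, 0.6], [0.2, 0.9], [0.0, 1.0]]

# Sweep the threshold and record how many chunks each value produces.
chunk_counts = {t: count_chunks(toy_vectors, t) for t in (0.5, 0.85, 0.9, 0.99)}
for threshold, count in chunk_counts.items():
    print(f"threshold={threshold}: {count} chunk(s)")
```

With these toy vectors the chunk count rises from 1 at a threshold of 0.5 to 5 at 0.99, which mirrors the slider guidance above: stricter thresholds split the text more finely.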

How It Works

Text Processing:
- Splits input text into sentences using NLTK
- Generates embeddings using the all-MiniLM-L6-v2 model
- Stores embeddings in Qdrant for similarity search
Chunking:
- Uses cosine similarity to find related sentences
- Groups sentences based on similarity threshold
- Creates chunks of semantically related sentences
Visualization:
- Reduces embedding dimensionality using t-SNE
- Creates interactive plot using Plotly
- Shows relationships between sentences through connections
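
The chunking step can be sketched without any of the dependencies. This is an illustration under stated assumptions, not the implementation in app.py: it assumes two sentences are "connected" when their cosine similarity meets the threshold, and that a chunk is a connected group of such sentences. The names cosine_similarity and chunk_by_similarity are hypothetical, and 2-D toy vectors stand in for the 384-dimensional embeddings that all-MiniLM-L6-v2 produces.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def chunk_by_similarity(embeddings, threshold):
    """Group sentence indices into chunks (assumed rule: connected components
    of the graph linking pairs with similarity >= threshold)."""
    n = len(embeddings)
    parent = list(range(n))  # union-find forest, one node per sentence

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical toy embeddings: sentences 0-1 share one topic, 2-3 another.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

chunks = chunk_by_similarity(toy_embeddings, threshold=0.8)
print(chunks)
```

In the app, the pairwise comparison would be served by Qdrant's similarity search rather than a nested loop; the all-pairs loop here is quadratic in the number of sentences and is only meant to make the grouping rule visible.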

Limitations

Currently stores embeddings in memory only, so they are not persisted between sessions
Limited to processing text that fits in memory
Visualization may become cluttered with very large texts
Requires manual tuning of similarity threshold

Future Improvements

Add persistent storage for embeddings
Implement more sophisticated chunking algorithms
Add support for document upload
Improve visualization for large texts
Add export functionality for chunks
Add support for multiple languages

Contributing

Feel free to submit issues and enhancement requests!
