myriel-io/semantic-chunking

Semantic Text Chunking with Streamlit

A Streamlit application that demonstrates semantic chunking of text using sentence embeddings and similarity search. The app splits text into sentences, creates embeddings using Sentence Transformers, and visualizes semantic relationships between sentences.

Features

  • Interactive text input
  • Adjustable similarity threshold
  • Visualization of semantic relationships between sentences
  • Automatic sentence clustering based on semantic similarity
  • Interactive visualization with hover details

Installation

  1. Create a new Python virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  2. Install the required packages:
pip install -r requirements.txt

Dependencies

  • streamlit: Web application framework
  • sentence-transformers: For generating sentence embeddings
  • qdrant-client: Vector similarity search
  • nltk: Natural language processing tools
  • plotly: Interactive visualizations
  • scikit-learn: For dimensionality reduction (t-SNE)
  • numpy: Numerical computations
  • torch: Required by sentence-transformers

Usage

  1. Start the Streamlit app:
streamlit run app.py
  2. Enter your text in the text area or use the provided sample text

  3. Adjust the similarity threshold slider:

    • Higher values (e.g., 0.8-0.9) create more granular chunks
    • Lower values (e.g., 0.5-0.7) create larger, more inclusive chunks
  4. Click "Process Text" to analyze and visualize the semantic relationships

  5. Explore the visualization:

    • Each point represents a sentence
    • Connected points are semantically similar sentences
    • Hover over points to see the full sentence text
    • Colors indicate different semantic clusters
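
To make the effect of the threshold on the connections concrete, here is a minimal, standard-library sketch. It is not taken from app.py: the helper names `cosine` and `similar_pairs` are hypothetical, and the 2-D vectors are toy stand-ins for real sentence embeddings. It assumes two points are connected when their cosine similarity meets the threshold.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_pairs(embeddings, threshold):
    """Return (i, j) index pairs whose cosine similarity meets the threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(embeddings)), 2)
        if cosine(embeddings[i], embeddings[j]) >= threshold
    ]


# Toy 2-D vectors standing in for real sentence embeddings (assumption).
toy = [[1.0, 0.0], [0.9, 0.1], [0.6, 0.8], [0.0, 1.0]]

loose = similar_pairs(toy, 0.5)   # lower threshold: more connections
strict = similar_pairs(toy, 0.9)  # higher threshold: fewer connections
print(len(loose), len(strict))
```

With these toy vectors the lower threshold connects four pairs while the higher one keeps only the closest pair, which is why raising the slider tends to break the text into smaller groups.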

How It Works

  1. Text Processing:

    • Splits input text into sentences using NLTK
    • Generates embeddings using the all-MiniLM-L6-v2 model
    • Stores embeddings in an in-memory Qdrant collection for similarity search
  2. Chunking:

    • Uses cosine similarity to find related sentences
    • Groups sentences based on similarity threshold
    • Creates chunks of semantically related sentences
  3. Visualization:

    • Reduces embedding dimensionality using t-SNE
    • Creates interactive plot using Plotly
    • Shows relationships between sentences through connections
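
The chunking step can be sketched as follows. This is one plausible reading of "groups sentences based on similarity threshold", not necessarily what app.py does: it assumes a new chunk starts whenever the cosine similarity between consecutive sentences drops below the threshold. The function name `chunk_by_threshold` is hypothetical, and the embeddings are toy vectors rather than all-MiniLM-L6-v2 output.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def chunk_by_threshold(sentences, embeddings, threshold):
    """Group consecutive sentences into chunks.

    A sentence joins the current chunk when its similarity to the
    previous sentence meets the threshold; otherwise it starts a new chunk.
    """
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        if cosine(prev, curr) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return chunks


# Toy data: two sentences about pets, two about finance (assumption).
sentences = ["Cats purr.", "Kittens meow.", "Stocks fell.", "Markets dropped."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]

chunks = chunk_by_threshold(sentences, embeddings, 0.7)
for chunk in chunks:
    print(" ".join(chunk))
```

In the real app the embeddings would come from the sentence-transformers model and the similarity lookups from Qdrant; the grouping logic is kept separate here so it can be checked on small inputs.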

Limitations

  • Currently uses in-memory storage (not persistent)
  • Limited to processing text that fits in memory
  • Visualization may become cluttered with very large texts
  • Requires manual tuning of similarity threshold

Future Improvements

  • Add persistent storage for embeddings
  • Implement more sophisticated chunking algorithms
  • Add support for document upload
  • Improve visualization for large texts
  • Add export functionality for chunks
  • Add support for multiple languages

Contributing

Feel free to submit issues and enhancement requests!
