myriel-io/qdrant-text-search

PDF Document Search Engine

A semantic search engine built with Streamlit and Qdrant. Users upload PDF documents, which are automatically split into chunks and embedded; queries are then matched against those chunks by embedding similarity rather than by exact keywords.

Features

  • PDF document upload and processing
  • Automatic text extraction and chunking
  • Semantic search using sentence transformers
  • In-memory vector database with Qdrant
  • Interactive web interface
  • Real-time similarity scores
  • Expandable search results
  • Document source tracking

Prerequisites

  • Python 3.8 or higher
  • pip (Python package installer)

Installation

  1. Clone the repository:
      git clone https://github.com/yourusername/pdf-search-engine.git
      cd pdf-search-engine
  2. Create a virtual environment (recommended):
      python -m venv venv
      source venv/bin/activate  # On Windows use: venv\Scripts\activate
  3. Install the required dependencies:
      pip install -r requirements.txt

Usage

  1. Start the application:
      streamlit run app.py
  2. Open your web browser and navigate to the URL shown in the terminal (typically http://localhost:8501)

  3. Using the application:

    • Upload PDF documents using the sidebar
    • Wait for the document to be processed (chunked and embedded)
    • Enter search text in the main area
    • Adjust the number of results using the slider
    • Click "Search" to find relevant text chunks
    • View results with similarity scores and expand to see full content

Technical Details

Components

  • Frontend: Streamlit
  • Vector Database: Qdrant (in-memory mode)
  • Text Embeddings: SentenceTransformers (all-MiniLM-L6-v2)
  • PDF Processing: pypdf
  • Vector Size: 384 dimensions
  • Similarity Metric: Cosine similarity
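
To make the similarity metric concrete, here is a minimal pure-Python sketch of cosine similarity. It is an illustration only: in the application the scoring is done by Qdrant, and the three-dimensional vectors below are toy stand-ins (an assumption for readability) for the 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The cosine is undefined for a zero vector; treat it as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors, not real embeddings.
query = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]   # a scaled copy of the query
unrelated = [3.0, 0.0, -1.0]       # orthogonal to the query

print(cosine_similarity(query, same_direction))  # approximately 1.0
print(cosine_similarity(query, unrelated))       # 0.0
```

Because the score depends only on direction, not magnitude, a long chunk and a short chunk with the same meaning can receive similar scores.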

Text Processing

  • Automatic text extraction from PDFs
  • Text chunking with configurable:
    • Chunk size (default: 1000 characters)
    • Overlap between chunks (default: 100 characters)
  • Chunks preferentially break at sentence boundaries rather than mid-sentence
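
The chunking behaviour described above can be sketched roughly as follows. This is not the code from app.py: the function name `chunk_text` and the break heuristic (cut at the last ". ", "! " or "? " inside the window) are assumptions chosen for illustration; only the defaults of 1000 and 100 characters come from this README.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into overlapping chunks, preferring sentence-end breaks."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Find the last sentence terminator inside the current window.
            cut = max(text.rfind(mark, start, end) for mark in (". ", "! ", "? "))
            # Accept it only if the chunk stays longer than the overlap,
            # so that the next window always moves forward.
            if cut > start + overlap:
                end = cut + 1  # keep the punctuation mark in this chunk
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        # Step back by `overlap` characters so neighbouring chunks share context.
        start = end - overlap
    return chunks


sample = "First sentence here. Second sentence here. Third one."
for i, chunk in enumerate(chunk_text(sample, chunk_size=30, overlap=5)):
    print(i, repr(chunk))
```

The overlap means the tail of one chunk is repeated at the head of the next, which helps a query match text that straddles a chunk boundary.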

Search Features

  • Semantic similarity search
  • Configurable number of results
  • Results include:
    • Similarity scores
    • Source document information
    • Chunk index
    • Expandable text content
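
As a rough illustration of what a result set might look like, the sketch below ranks scored hits and builds a short preview for the collapsed view. The field names (`score`, `source`, `chunk_index`, `text`), the helper names, and the sample data are all hypothetical; the actual payload layout in app.py may differ.

```python
import heapq


def top_k(hits, k=5):
    """Return the k highest-scoring hits, best first."""
    return heapq.nlargest(k, hits, key=lambda hit: hit["score"])


def preview(text, limit=80):
    """Shorten text for a collapsed result; the full text is shown on expand."""
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + "..."


# Made-up hits for illustration only.
hits = [
    {"score": 0.81, "source": "report.pdf", "chunk_index": 4,
     "text": "Quarterly revenue grew in all regions."},
    {"score": 0.35, "source": "notes.pdf", "chunk_index": 0,
     "text": "Meeting agenda and attendees."},
    {"score": 0.67, "source": "report.pdf", "chunk_index": 9,
     "text": "Costs were flat compared with the previous quarter."},
]

for rank, hit in enumerate(top_k(hits, k=2), start=1):
    print(f"{rank}. score={hit['score']:.2f} "
          f"{hit['source']} (chunk {hit['chunk_index']}): "
          f"{preview(hit['text'], limit=40)}")
```

Here `k` plays the role of the results slider in the interface.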

Project Structure

pdf-search-engine/
├── app.py              # Main application file
├── requirements.txt    # Python dependencies
└── README.md           # Documentation

Limitations

  • In-memory database (data not persisted between restarts)
  • Only one PDF is processed at a time
  • Limited to text content from PDFs
  • No authentication or user management
  • Basic error handling
  • No parallel processing for large documents

Future Improvements

Potential enhancements:

  • Persistent storage using Qdrant's on-disk local mode
  • Batch PDF uploads
  • Support for other document formats (DOCX, TXT, etc.)
  • Advanced search filters
  • User authentication
  • Progress bar for document processing
  • Parallel processing for large documents
  • Custom chunk size configuration in UI
  • Export search results
  • Document metadata extraction
  • OCR support for scanned PDFs

Contributing

Feel free to submit issues, fork the repository, and create pull requests for any improvements.

License

This project is licensed under the MIT License - see the LICENSE file for details.
