A semantic search engine built with Streamlit and Qdrant. It lets users upload PDF documents, automatically splits them into chunks, and searches the content semantically using embeddings.
- PDF document upload and processing
- Automatic text extraction and chunking
- Semantic search using sentence transformers
- In-memory vector database with Qdrant
- Interactive web interface
- Real-time similarity scores
- Expandable search results
- Document source tracking
Requirements:
- Python 3.8 or higher
- pip (Python package installer)
- Clone the repository:

      git clone https://github.com/yourusername/pdf-search-engine.git
      cd pdf-search-engine

- Create a virtual environment (recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows use: venv\Scripts\activate

- Install the required dependencies:

      pip install -r requirements.txt

- Start the application:

      streamlit run app.py
- Open your web browser and navigate to the URL shown in the terminal (typically http://localhost:8501)
- Using the application:
  - Upload PDF documents using the sidebar
  - Wait for the document to be processed (chunked and embedded)
  - Enter search text in the main area
  - Adjust the number of results using the slider
  - Click "Search" to find relevant text chunks
  - View results with similarity scores and expand to see full content
- Frontend: Streamlit
- Vector Database: Qdrant (in-memory mode)
- Text Embeddings: SentenceTransformers (all-MiniLM-L6-v2)
- PDF Processing: pypdf
- Vector Size: 384 dimensions
- Similarity Metric: Cosine similarity
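
To make the similarity metric concrete, here is a minimal pure-Python sketch of cosine similarity and top-k ranking. It is an illustration only, not the application's code: in the app, Qdrant computes the scores over 384-dimensional vectors. The function names and the payload keys (`source`, `chunk_index`, `text`, `vector`) are assumptions made for this example, and the toy vectors have 3 dimensions for readability.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], chunks: List[Dict], k: int = 3) -> List[Tuple[float, Dict]]:
    """Score every chunk against the query and return the k best, highest first."""
    scored = [(cosine_similarity(query, chunk["vector"]), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical stored chunks; real vectors would come from all-MiniLM-L6-v2.
    stored = [
        {"source": "a.pdf", "chunk_index": 0, "text": "first", "vector": [1.0, 0.0, 0.0]},
        {"source": "a.pdf", "chunk_index": 1, "text": "second", "vector": [0.0, 1.0, 0.0]},
        {"source": "b.pdf", "chunk_index": 0, "text": "third", "vector": [1.0, 1.0, 0.0]},
    ]
    for score, chunk in top_k([1.0, 0.0, 0.0], stored, k=2):
        print(f"{score:.3f}  {chunk['source']}  chunk {chunk['chunk_index']}")
```

A brute-force scan like this is fine for a handful of chunks. A vector database replaces it with an index so that search stays fast as the collection grows.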
- Automatic text extraction from PDFs
- Intelligent text chunking with customizable:
  - Chunk size (default: 1000 characters)
  - Overlap between chunks (default: 100 characters)
  - Smart break points at sentence boundaries
- Semantic similarity search
- Configurable number of results
- Results include:
  - Similarity scores
  - Source document information
  - Chunk index
  - Expandable text content
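
The chunking behavior listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not necessarily how `app.py` implements it: the function name `chunk_text` is hypothetical, and a sentence boundary is approximated as `.`, `!`, or `?` followed by a space. The defaults mirror the documented 1000-character chunk size and 100-character overlap.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into overlapping chunks, ending at a sentence boundary when possible."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Find the last sentence ending inside the window (assumed heuristic).
            boundary = max(text.rfind(mark, start, end) for mark in (". ", "! ", "? "))
            # Accept it only if the chunk stays longer than the overlap,
            # which guarantees the next start position moves forward.
            if boundary - start >= overlap:
                end = boundary + 1  # keep the punctuation mark in this chunk
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == len(text):
            break
        # Step back so consecutive chunks share `overlap` characters.
        start = end - overlap
    return chunks


if __name__ == "__main__":
    sample = "Alpha beta gamma. " * 10
    for i, piece in enumerate(chunk_text(sample, chunk_size=50, overlap=10)):
        print(i, repr(piece))
```

The overlap means a sentence that straddles a chunk edge still appears intact in at least one chunk, at the cost of storing some text twice.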
pdf-search-engine/
├── app.py # Main application file
├── requirements.txt # Python dependencies
└── README.md # Documentation
Current limitations:
- In-memory database (data is not persisted between restarts)
- Only one PDF is processed at a time
- Limited to text content from PDFs
- No authentication or user management
- Basic error handling
- No parallel processing for large documents
Potential enhancements:
- Persistent storage using Qdrant's file mode
- Batch PDF uploads
- Support for other document formats (DOCX, TXT, etc.)
- Advanced search filters
- User authentication
- Progress bar for document processing
- Parallel processing for large documents
- Custom chunk size configuration in UI
- Export search results
- Document metadata extraction
- OCR support for scanned PDFs
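
As a rough sketch of the first enhancement, the Qdrant Python client can run in local on-disk mode by passing `path=` instead of `":memory:"`. The collection name `pdf_chunks`, the helper names, and the `qdrant_data` directory are assumptions for illustration; only the 384-dimension vector size and cosine distance come from this README.

```python
from pathlib import Path
from typing import Dict, Optional, Union

COLLECTION_NAME = "pdf_chunks"  # hypothetical name, not taken from app.py
VECTOR_SIZE = 384               # output size of all-MiniLM-L6-v2


def client_kwargs(persist_dir: Optional[Union[str, Path]] = None) -> Dict[str, str]:
    """Choose in-memory mode (no directory given) or local on-disk mode."""
    if persist_dir is None:
        return {"location": ":memory:"}
    return {"path": str(persist_dir)}


def open_client(persist_dir: Optional[Union[str, Path]] = None):
    """Open a Qdrant client and create the collection if it is missing."""
    # Imported lazily so this module loads even without qdrant-client installed.
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams

    client = QdrantClient(**client_kwargs(persist_dir))
    existing = {c.name for c in client.get_collections().collections}
    if COLLECTION_NAME not in existing:
        client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
        )
    return client


if __name__ == "__main__":
    # Data written through this client is stored under ./qdrant_data.
    client = open_client("qdrant_data")
    print([c.name for c in client.get_collections().collections])
```

With a directory supplied, previously indexed chunks would survive an application restart, so documents would not need to be uploaded and embedded again.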
Feel free to submit issues, fork the repository, and create pull requests for any improvements.
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments:
- Qdrant for the vector database
- Streamlit for the web interface
- SentenceTransformers for text embeddings
- pypdf for PDF processing