PDF Document Search Engine

A semantic search engine built with Streamlit and Qdrant. Users upload PDF documents, which are automatically split into chunks and embedded, and can then search the content by meaning rather than by exact keywords.

Features

PDF document upload and processing
Automatic text extraction and chunking
Semantic search using sentence transformers
In-memory vector database with Qdrant
Interactive web interface
Real-time similarity scores
Expandable search results
Document source tracking

Prerequisites

Python 3.8 or higher
pip (Python package installer)

Installation

Clone the repository:

git clone https://github.com/yourusername/pdf-search-engine.git
cd pdf-search-engine

Create a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate

Install the required dependencies:

pip install -r requirements.txt

Usage

Start the application:

streamlit run app.py

Open your web browser and navigate to the URL shown in the terminal (typically http://localhost:8501).
Using the application:
- Upload PDF documents using the sidebar
- Wait for the document to be processed (chunked and embedded)
- Enter search text in the main area
- Adjust the number of results using the slider
- Click "Search" to find relevant text chunks
- View results with similarity scores and expand to see full content

Technical Details

Components

Frontend: Streamlit
Vector Database: Qdrant (in-memory mode)
Text Embeddings: SentenceTransformers (all-MiniLM-L6-v2)
PDF Processing: pypdf
Vector Size: 384 dimensions
Similarity Metric: Cosine similarity
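
The components above could be wired together roughly as follows. This is a minimal sketch, not the code in app.py: the collection name pdf_chunks, the payload field names, and the function names are illustrative assumptions. It also assumes a recent qdrant-client release that provides query_points (older releases expose search instead).

```python
from typing import Any, Dict, List, Tuple

# Values taken from the component list above.
MODEL_NAME = "all-MiniLM-L6-v2"
VECTOR_SIZE = 384
COLLECTION = "pdf_chunks"  # hypothetical collection name


def make_payload(source: str, chunk_index: int, text: str) -> Dict[str, Any]:
    """Metadata stored next to each vector so results can cite their origin."""
    return {"source": source, "chunk_index": chunk_index, "text": text}


def build_index(chunks: List[str], source: str):
    """Embed chunks and store them in an in-memory Qdrant collection."""
    # Imported here so this module loads before the dependencies are installed.
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(MODEL_NAME)
    client = QdrantClient(":memory:")  # in-memory mode: nothing is written to disk
    client.create_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
    )
    vectors = model.encode(chunks)  # one 384-dimensional vector per chunk
    client.upsert(
        collection_name=COLLECTION,
        points=[
            PointStruct(id=i, vector=vec.tolist(), payload=make_payload(source, i, text))
            for i, (text, vec) in enumerate(zip(chunks, vectors))
        ],
    )
    return client, model


def search(client, model, query: str, limit: int = 5) -> List[Tuple[float, Dict[str, Any]]]:
    """Return (score, payload) pairs for the chunks closest to the query."""
    hits = client.query_points(
        collection_name=COLLECTION,
        query=model.encode(query).tolist(),
        limit=limit,
    ).points
    return [(hit.score, hit.payload) for hit in hits]
```

With the dependencies installed, build_index(chunks, "report.pdf") followed by search(client, model, "some question") would return the best-matching chunks along with their cosine similarity scores.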

Text Processing

Automatic text extraction from PDFs
Intelligent text chunking:
- Configurable chunk size (default: 1000 characters)
- Configurable overlap between chunks (default: 100 characters)
- Break points aligned to sentence boundaries
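
The extraction and chunking steps can be sketched as below. The function names and the exact boundary rule (break after the last ". ", "! ", or "? " inside the window) are assumptions made for illustration; the logic in app.py may differ. The defaults match the values listed above.

```python
from typing import List


def extract_text(pdf_file) -> str:
    """Concatenate the text of every page in a PDF (path or file-like object)."""
    from pypdf import PdfReader  # deferred so chunk_text works without pypdf

    reader = PdfReader(pdf_file)
    # extract_text() can yield nothing for image-only pages, hence the fallback.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into overlapping chunks, preferring sentence boundaries."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")

    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Look for the last sentence end inside the current window.
            window = text[start:end]
            cut = max(window.rfind(". "), window.rfind("! "), window.rfind("? "))
            # Only break there if the next chunk still moves forward.
            if cut > overlap:
                end = start + cut + 1
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        # Step back by `overlap` characters so neighbouring chunks share context.
        start = end - overlap
    return chunks
```

For example, chunk_text(extract_text("report.pdf")) would produce chunks of at most 1000 characters, where report.pdf is a placeholder file name.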

Search Features

Semantic similarity search
Configurable number of results
Results include:
- Similarity scores
- Source document information
- Chunk index
- Expandable text content
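
To show what the similarity score represents, here is a small pure-Python stand-in for the ranking that Qdrant performs. The record layout (vector, source, chunk_index, text) is an assumption chosen to mirror the result fields listed above, and the brute-force scan is for illustration only.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], records: List[Dict[str, Any]], k: int = 5) -> List[Dict[str, Any]]:
    """Score every record against the query and return the k best matches."""
    scored = [
        {
            "score": cosine_similarity(query_vec, rec["vector"]),
            "source": rec["source"],
            "chunk_index": rec["chunk_index"],
            "text": rec["text"],
        }
        for rec in records
    ]
    # Higher cosine similarity means the chunk is closer in meaning to the query.
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:k]
```

The k argument plays the role of the result-count slider, and each returned dictionary carries the fields shown for a search result.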

Project Structure

pdf-search-engine/
├── app.py              # Main application file
├── requirements.txt    # Python dependencies
└── README.md          # Documentation

Limitations

In-memory database (data not persisted between restarts)
Processes only one PDF at a time
Limited to text content from PDFs
No authentication or user management
Basic error handling
No parallel processing for large documents

Future Improvements

Potential enhancements:

Persistent storage using Qdrant's file mode
Batch PDF uploads
Support for other document formats (DOCX, TXT, etc.)
Advanced search filters
User authentication
Progress bar for document processing
Parallel processing for large documents
Custom chunk size configuration in UI
Export search results
Document metadata extraction
OCR support for scanned PDFs

Contributing

Feel free to submit issues, fork the repository, and create pull requests for any improvements.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Qdrant for the vector database
Streamlit for the web interface
SentenceTransformers for text embeddings
pypdf for PDF processing
