A comprehensive Python application that converts raw PDF documents into vectorized data for AI-powered document retrieval and search.
- PDF Upload & Processing: Web interface for uploading PDF documents
- Text Extraction: Robust text extraction from PDFs using PyMuPDF
- Intelligent Chunking: Smart text segmentation using LangChain's text splitters
- Vector Embeddings: Generate embeddings using Google's text-embedding-004 model
- Vector Storage: Persistent storage using ChromaDB
- Semantic Search: Find similar content using vector similarity search
- RESTful API: Complete API for document processing and search
- Web Interface: Simple HTML interface for easy document uploads
- Python 3.8 or higher
- Google AI API key (for embedding generation)
- Clone the repository:
  git clone https://github.com/Lokesh1028/Document-QA-RAG.git
  cd Document-QA-RAG
- Install dependencies:
  pip install -r requirements.txt
- Configure environment variables:
  cp config.example .env
  # Edit the .env file and add your Google AI API key
- Get a Google AI API key:
  - Visit Google AI Studio
  - Create a new API key
  - Add it to your .env file
- Start the application:
  python main.py
  Or using uvicorn directly:
  uvicorn main:app --reload
- Access the web interface: Open your browser and go to http://localhost:8000
- Upload a PDF:
  - Use the web interface to upload a PDF document
  - The system will automatically process and vectorize the document
- Search documents:
  - Use the /search/ endpoint to find similar content
  - Example: http://localhost:8000/search/?query=your%20search%20query&limit=5
GET /
- Description: Web interface for document uploads
- Response: HTML upload page

POST /upload-pdf/
- Description: Upload and process a PDF document
- Parameters:
  - file: PDF file (multipart/form-data)
- Response: Processing result with document ID and statistics

GET /search/
- Description: Search for similar document content
- Parameters:
  - query: Search query text
  - limit: Maximum number of results (default: 5)
- Response: List of similar document chunks with metadata

GET /status/
- Description: Get system status and configuration
- Response: System configuration and database statistics

Health check
- Description: Health check endpoint
- Response: System health status
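Search queries that contain spaces or other special characters need to be percent-encoded before they go into the URL. The following is a minimal sketch of building a search URL from Python. It assumes the server runs at the default local address used in this README; `build_search_url` and `BASE_URL` are hypothetical names for illustration, not part of the project.

```python
from urllib.parse import urlencode

# Assumption: the default local address used elsewhere in this README.
BASE_URL = "http://localhost:8000"


def build_search_url(query: str, limit: int = 5, base_url: str = BASE_URL) -> str:
    """Return a /search/ URL with the query parameters safely encoded."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # urlencode escapes spaces and reserved characters in the values.
    params = urlencode({"query": query, "limit": limit})
    return f"{base_url}/search/?{params}"


if __name__ == "__main__":
    print(build_search_url("machine learning", limit=3))
    # http://localhost:8000/search/?query=machine+learning&limit=3
```

The resulting URL can be passed to any HTTP client or pasted into a browser.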
Upload a PDF:
curl -X POST "http://localhost:8000/upload-pdf/" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@your_document.pdf"

Search documents:
curl -X GET "http://localhost:8000/search/?query=machine%20learning&limit=3"

The system follows a pipeline architecture:
PDF Upload → Text Extraction → Text Chunking → Embedding Generation → Vector Storage
    ↓               ↓                ↓                  ↓                   ↓
 FastAPI         PyMuPDF         LangChain        Google AI API         ChromaDB
- Web Layer: FastAPI application with upload endpoints
- Text Processing: PyMuPDF for PDF text extraction
- Chunking: LangChain's RecursiveCharacterTextSplitter
- Embedding: Google's text-embedding-004 model
- Storage: ChromaDB for vector storage and retrieval
Environment variables (configure in .env file):
| Variable | Description | Default |
|---|---|---|
| GOOGLE_API_KEY | Google AI API key (required) | None |
| CHROMADB_PATH | Path to ChromaDB storage | ./vector_db |
| COLLECTION_NAME | ChromaDB collection name | pdf_documents |
| CHUNK_SIZE | Text chunk size in characters | 1000 |
| CHUNK_OVERLAP | Overlap between chunks | 100 |
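As an illustration of how these variables and defaults fit together, here is a minimal sketch that reads them from the process environment. It is not the project's actual configuration code: the `Settings` class and `load_settings` function are hypothetical, and the sketch assumes the .env file has already been loaded into the environment (for example by python-dotenv).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables in the table above."""
    google_api_key: Optional[str]
    chromadb_path: str
    collection_name: str
    chunk_size: int
    chunk_overlap: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, falling back to the documented defaults."""
    settings = Settings(
        google_api_key=env.get("GOOGLE_API_KEY"),  # required, no default
        chromadb_path=env.get("CHROMADB_PATH", "./vector_db"),
        collection_name=env.get("COLLECTION_NAME", "pdf_documents"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "100")),
    )
    # An overlap as large as the chunk itself would never advance through the text.
    if settings.chunk_overlap >= settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    return settings


if __name__ == "__main__":
    print(load_settings())
```

Passing the environment as a parameter keeps the function easy to test with a plain dictionary.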
When you upload a PDF, the system:
- Validates the file type and saves it temporarily
- Extracts text from all pages using PyMuPDF
- Chunks the text into manageable pieces with overlap
- Generates vector embeddings for each chunk using Google AI
- Stores the vectors and metadata in ChromaDB
- Returns processing statistics and document ID
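The chunking step above can be sketched as a fixed-size sliding window with overlap. This is a simplified stand-in, not LangChain's RecursiveCharacterTextSplitter, which additionally tries to split on separators such as paragraph and line breaks. The function name `chunk_text` is hypothetical; the default sizes mirror the CHUNK_SIZE and CHUNK_OVERLAP defaults.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> List[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
    # ['abcd', 'defg', 'ghij']
```

The overlap means that a sentence cut at a chunk boundary still appears with some of its context in the next chunk.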
The search functionality:
- Converts your query into a vector embedding
- Searches the vector database for similar content
- Returns the most relevant document chunks
- Includes metadata and similarity scores
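To make the idea of vector similarity search concrete, here is a brute-force sketch that ranks stored chunks by cosine similarity to a query vector. It is only an illustration: ChromaDB performs this lookup internally with its own index and configured distance metric, which this README does not specify. The names `cosine_similarity` and `top_k` are hypothetical, and the tiny two-dimensional vectors stand in for real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    stored: Sequence[Tuple[str, Sequence[float]]],
    limit: int = 5,
) -> List[Tuple[str, float]]:
    """Return up to `limit` (chunk_text, score) pairs, most similar first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


if __name__ == "__main__":
    chunks = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
    for text, score in top_k([1.0, 0.0], chunks, limit=2):
        print(text, round(score, 3))
```

A real vector database avoids scoring every stored vector by using an approximate nearest-neighbor index, but the ranking principle is the same.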
Run in development mode:
uvicorn main:app --reload --host 0.0.0.0 --port 8000

Check system status:
curl http://localhost:8000/status/

Currently supports:
- PDF documents (.pdf)
Common Issues:
- Missing Google API Key:
  - Error: "Google API key not configured"
  - Solution: Set GOOGLE_API_KEY in your .env file
- ChromaDB Initialization Failed:
  - Check write permissions for the database directory
  - Ensure sufficient disk space
- PDF Processing Errors:
  - Verify the PDF is not corrupted
  - Check if the PDF is password-protected
- Out of Memory:
  - Reduce CHUNK_SIZE for large documents
  - Process smaller files or increase system memory
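When diagnosing PDF processing errors, a quick first check is whether the file is a PDF at all. PDF files carry a `%PDF-` header near the start of the file, so the sketch below looks for that marker before upload. It is a hypothetical helper, not part of this project, and it does not detect deeper corruption or password protection.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes contain a %PDF- header near the start."""
    # Be lenient about a few leading junk bytes by scanning the first 1024 bytes.
    return b"%PDF-" in data[:1024]


def file_looks_like_pdf(path: str) -> bool:
    """Read the beginning of a file and apply the header check."""
    with open(path, "rb") as handle:
        return looks_like_pdf(handle.read(1024))


if __name__ == "__main__":
    print(looks_like_pdf(b"%PDF-1.7\n..."))   # True
    print(looks_like_pdf(b"<html></html>"))   # False
```

A file that fails this check was most likely renamed or truncated, which rules out the server as the source of the error.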
- FastAPI: Modern web framework for building APIs
- PyMuPDF: PDF text extraction
- LangChain: Text processing and chunking
- ChromaDB: Vector database for storage and retrieval
- Google Generative AI: Embedding generation
- python-dotenv: Environment variable management
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is open source. Please check the license file for more details.
For support and questions:
- Check the troubleshooting section
- Review the API documentation
- Open an issue on the repository
Built with ❤️ using Python, FastAPI, and modern AI technologies