Document Vectorization System

A Python application that converts PDF documents into vector embeddings for semantic document retrieval and search.

🚀 Features

- PDF Upload & Processing: Web interface for uploading PDF documents
- Text Extraction: Text extraction from PDFs using PyMuPDF
- Intelligent Chunking: Text segmentation using LangChain's text splitters
- Vector Embeddings: Embeddings generated with Google's text-embedding-004 model
- Vector Storage: Persistent storage using ChromaDB
- Semantic Search: Find similar content using vector similarity search
- RESTful API: API for document processing and search
- Web Interface: Simple HTML interface for document uploads

📋 Prerequisites

Python 3.8 or higher
Google AI API key (for embedding generation)

🛠️ Installation

Clone the repository:

git clone https://github.com/Lokesh1028/Document-QA-RAG.git
cd Document-QA-RAG

Install dependencies:
```
pip install -r requirements.txt
```

Configure environment variables:

cp config.example .env
# Edit .env file and add your Google AI API key

Get Google AI API Key:
- Visit Google AI Studio
- Create a new API key
- Add it to your .env file

🚦 Quick Start

Start the application:
```
python main.py
```
Or using uvicorn directly:
```
uvicorn main:app --reload
```
Access the web interface: Open your browser and go to http://localhost:8000
Upload a PDF:
- Use the web interface to upload a PDF document
- The system will automatically process and vectorize the document
Search documents:
- Use the /search/ endpoint to find similar content
- Example: http://localhost:8000/search/?query=your%20search%20query&limit=5 (spaces in the query must be URL-encoded)

📖 API Documentation

Endpoints

`GET /`

Description: Web interface for document uploads
Response: HTML upload page

`POST /upload-pdf/`

Description: Upload and process a PDF document
Parameters:
- file: PDF file (multipart/form-data)
Response: Processing result with document ID and statistics

`GET /search/`

Description: Search for similar document content
Parameters:
- query: Search query text
- limit: Maximum number of results (default: 5)
Response: List of similar document chunks with metadata

`GET /status/`

Description: Get system status and configuration
Response: System configuration and database statistics

`GET /health/`

Description: Health check endpoint
Response: System health status

Example Usage

Upload a PDF:

curl -X POST "http://localhost:8000/upload-pdf/" \
     -H "accept: application/json" \
     -H "Content-Type: multipart/form-data" \
     -F "file=@/path/to/document.pdf"

Search documents:

curl -X GET "http://localhost:8000/search/?query=machine%20learning&limit=3"
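
The same search request can be prepared from Python. The sketch below only builds the URL, using the standard library to encode the query text; the `build_search_url` helper and its `base_url` default are illustrative assumptions, not part of the project.

```python
from urllib.parse import urlencode


def build_search_url(query: str, limit: int = 5,
                     base_url: str = "http://localhost:8000") -> str:
    """Return a /search/ URL with `query` and `limit` safely encoded.

    Hypothetical helper: the endpoint path and parameter names come from
    the API documentation above, everything else is an assumption.
    """
    # urlencode escapes spaces and reserved characters in the query text.
    params = urlencode({"query": query, "limit": limit})
    return f"{base_url}/search/?{params}"


if __name__ == "__main__":
    print(build_search_url("machine learning", limit=3))
```

The resulting string can be passed to `urllib.request.urlopen` or any HTTP client once the server is running.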

🏗️ Architecture

The system follows a pipeline architecture:

PDF Upload → Text Extraction → Text Chunking → Embedding Generation → Vector Storage
     ↓              ↓              ↓                    ↓                ↓
  FastAPI      PyMuPDF      LangChain        Google AI API      ChromaDB

Components

Web Layer: FastAPI application with upload endpoints
Text Processing: PyMuPDF for PDF text extraction
Chunking: LangChain's RecursiveCharacterTextSplitter
Embedding: Google's text-embedding-004 model
Storage: ChromaDB for vector storage and retrieval

⚙️ Configuration

Environment variables (configure in .env file):

| Variable | Description | Default |
|---|---|---|
| `GOOGLE_API_KEY` | Google AI API key (required) | None |
| `CHROMADB_PATH` | Path to ChromaDB storage | `./vector_db` |
| `COLLECTION_NAME` | ChromaDB collection name | `pdf_documents` |
| `CHUNK_SIZE` | Text chunk size in characters | `1000` |
| `CHUNK_OVERLAP` | Overlap between chunks, in characters | `100` |
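
As an illustration of how these variables and defaults might be read, here is a minimal sketch using only the standard library. The `Settings` class and `load_settings` function are hypothetical names; the project's actual loading code (which uses python-dotenv) may look different.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed above."""
    google_api_key: Optional[str]
    chromadb_path: str
    collection_name: str
    chunk_size: int
    chunk_overlap: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from `env`, falling back to the documented defaults."""
    return Settings(
        google_api_key=env.get("GOOGLE_API_KEY"),  # required, no default
        chromadb_path=env.get("CHROMADB_PATH", "./vector_db"),
        collection_name=env.get("COLLECTION_NAME", "pdf_documents"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "100")),
    )
```

Passing the environment in as a mapping keeps the function easy to test with a plain dictionary.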

🔍 Processing Pipeline

When you upload a PDF, the system:

1. Validates the file type and saves it temporarily
2. Extracts text from all pages using PyMuPDF
3. Chunks the text into manageable pieces with overlap
4. Generates vector embeddings for each chunk using Google AI
5. Stores the vectors and metadata in ChromaDB
6. Returns processing statistics and document ID
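
To make the chunking step concrete, the sketch below splits text into fixed-size character windows that overlap by a set amount. It is a deliberately simplified stand-in: the project uses LangChain's RecursiveCharacterTextSplitter, which also tries to break on separators such as paragraphs and sentences. The `chunk_text` function is a hypothetical name.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000,
               chunk_overlap: int = 100) -> List[str]:
    """Split `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so content near
    a boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

With the default settings (`CHUNK_SIZE=1000`, `CHUNK_OVERLAP=100`), each window starts 900 characters after the previous one.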

📊 Vector Search

The search functionality:

Converts your query into a vector embedding
Searches the vector database for similar content
Returns the most relevant document chunks
Includes metadata and similarity scores
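
The ranking idea behind these steps can be sketched with a brute-force cosine similarity search over plain lists of floats. This is an illustration only: ChromaDB performs the lookup with its own index and a configurable distance metric, and the `cosine_similarity` and `top_k` functions are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunk_vecs: Sequence[Sequence[float]],
          limit: int = 5) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs for the `limit` best matches."""
    scored = [(i, cosine_similarity(query_vec, vec))
              for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```

In the real system the query vector and the chunk vectors would both come from the same embedding model, so that their similarity scores are comparable.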

🛠️ Development

Run in development mode:

uvicorn main:app --reload --host 0.0.0.0 --port 8000

Check system status:

curl http://localhost:8000/status/

📝 Supported File Types

Currently supports:

PDF documents (.pdf)
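
A file-type check for uploads might combine the extension with the PDF header bytes, as in the minimal sketch below. The `looks_like_pdf` function is a hypothetical helper, and the project's own validation may be stricter or looser than this.

```python
from pathlib import Path

# PDF files begin with this marker, followed by the version number.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(filename: str, head: bytes) -> bool:
    """Return True if the name ends in .pdf and the content starts like a PDF.

    `head` is the first few bytes of the uploaded file.
    """
    has_pdf_suffix = Path(filename).suffix.lower() == ".pdf"
    return has_pdf_suffix and head.startswith(PDF_MAGIC)
```

Checking the leading bytes catches files that were merely renamed to `.pdf`, though it does not guarantee that the rest of the file is well-formed.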

🔧 Troubleshooting

Common Issues:

Missing Google API Key:
- Error: "Google API key not configured"
- Solution: Set GOOGLE_API_KEY in your .env file
ChromaDB Initialization Failed:
- Check write permissions for the database directory
- Ensure sufficient disk space
PDF Processing Errors:
- Verify the PDF is not corrupted
- Check if the PDF is password-protected
Out of Memory:
- Reduce CHUNK_SIZE for large documents
- Process smaller files or increase system memory
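
The first two issues above can be checked before starting the server. The following sketch assumes a hypothetical `preflight` helper that receives the environment as a mapping and the database directory as a path; it is not part of the project.

```python
import os
import tempfile
from typing import List, Mapping


def preflight(env: Mapping[str, str], db_path: str) -> List[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems: List[str] = []

    # Missing Google API key
    if not env.get("GOOGLE_API_KEY"):
        problems.append("GOOGLE_API_KEY is not set")

    # Database directory must exist (or be creatable) and be writable
    try:
        os.makedirs(db_path, exist_ok=True)
        with tempfile.TemporaryFile(dir=db_path):
            pass  # creating a temporary file proves write access
    except OSError as exc:
        problems.append(f"cannot write to {db_path}: {exc}")

    return problems
```

Calling `preflight(os.environ, "./vector_db")` and printing the returned list gives a quick summary of what needs fixing.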

📚 Dependencies

FastAPI: Modern web framework for building APIs
PyMuPDF: PDF text extraction
LangChain: Text processing and chunking
ChromaDB: Vector database for storage and retrieval
Google Generative AI: Embedding generation
python-dotenv: Environment variable management

🤝 Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

📄 License

This project is open source. Please check the license file for more details.

🆘 Support

For support and questions:

Check the troubleshooting section
Review the API documentation
Open an issue on the repository

Built with ❤️ using Python, FastAPI, and modern AI technologies
