Security Warning: Never share your `.env` file or API keys. The `.env` file is gitignored by default, and sensitive credentials should always be kept private.
This project is a modular, well-documented implementation of the LangChain "Chat With Your Data" tutorial. Each step is a separate script, so you can learn and experiment with each concept locally.
- Document Loading: Load data from PDFs, web pages, and (optionally) YouTube. See `src/load_documents.py`.
- Text Splitting: Break documents into manageable chunks using different splitters. See `src/split_text.py`.
- Embeddings: Convert text to vector representations and compare semantic similarity. See `src/embeddings.py`.
- Vector Stores: Store and retrieve document embeddings efficiently with Chroma. See `src/vector_store.py`.
- Question Answering: Build QA chains to answer questions about your documents, with a custom prompt. See `src/qa_chain.py`.
- Python 3.9+
- An OpenAI API key (add to `.env`)
- (Optional) A PDF file at `data/test.pdf` for the PDF loading demo
- LangChain v0.1+ and [langchain_community]
- Clone this repo and `cd` into it.
- Copy `.env.example` to `.env` and add your OpenAI API key and any other required environment variables.
- (Optional) Place a PDF at `data/test.pdf` for PDF loading.
- Install dependencies: `pip install -r requirements.txt`
- (Recommended) Install and run ruff for linting and uv for dependency management:

```shell
pip install ruff uv
ruff check src/

# (Optional) Compile requirements.txt from requirements.in
uv pip compile requirements.in --output-file requirements.txt
```
Scripts must be run in order, as each step saves output for the next. All scripts use `utils.py` to load environment variables from `.env` (using python-dotenv).
```shell
python src/load_documents.py  # Load and preview documents (saves pickles/docs.pkl)
```
- Loads PDF and web documents, prints a preview.
```shell
python src/split_text.py  # Split documents into chunks (saves pickles/splits.pkl)
```
- Splits documents using RecursiveCharacterTextSplitter and saves the result.
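The real script uses LangChain's RecursiveCharacterTextSplitter, which splits on separators like paragraphs and sentences first. As a rough illustration of the core idea (fixed-size chunks with overlap, so context isn't lost at boundaries), a deliberately simplified splitter:

```python
def split_text(text, chunk_size=1000, chunk_overlap=150):
    """Split text into overlapping fixed-size chunks.
    Simplified illustration only; RecursiveCharacterTextSplitter
    additionally respects paragraph/sentence boundaries."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap means the tail of each chunk reappears at the head of the next, which helps retrieval when an answer straddles a chunk boundary.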
```shell
python src/embeddings.py  # Generate and compare embeddings
```
- Generates OpenAI embeddings, compares semantic similarity, and embeds document chunks.
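Semantic similarity between two embedding vectors is typically measured with cosine similarity. The vectors below are hand-made for illustration; the real script gets them from OpenAI's embeddings API:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: 1.0 means the same
    direction (very similar meaning), values near 0 mean unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

With real OpenAI embeddings, sentences about the same topic score noticeably higher against each other than against unrelated text.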
```shell
python src/vector_store.py  # Create and query a vector store (saves to database/)
```
- Loads splits, creates a Chroma vector store, runs a sample query, and persists the DB automatically in the `database/` directory.
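Conceptually, a vector store pairs each chunk with its embedding and answers queries by nearest-neighbor search. This toy in-memory class illustrates the idea only; the real script uses Chroma, which also indexes efficiently and persists to disk:

```python
import math

class MiniVectorStore:
    """Toy in-memory illustration of what Chroma does: store
    (embedding, text) pairs and retrieve the texts whose embeddings
    are most similar to a query embedding. Not for real use."""

    def __init__(self):
        self._entries = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self._entries.append((vector, text))

    def similarity_search(self, query_vec, k=3):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        # Rank stored texts by similarity to the query, best first.
        ranked = sorted(self._entries, key=lambda e: cos(query_vec, e[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]
```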
```shell
python src/qa_chain.py  # Run a question-answering chain
```
- Loads the Chroma vector store from `database/`, sets up a custom prompt, and answers a sample question using a RetrievalQA chain.
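Under the hood, a "stuff"-style RetrievalQA chain concatenates the retrieved chunks into the context slot of a prompt template before calling the LLM. A sketch of that assembly step (the template text here is illustrative, not the actual prompt in `src/qa_chain.py`):

```python
def build_qa_prompt(question, retrieved_chunks):
    """Assemble a 'stuff'-style QA prompt: put all retrieved chunks
    into one context block. Illustrative template only."""
    template = (
        "Use the following context to answer the question. "
        "If you don't know the answer, say you don't know.\n\n"
        "Context:\n{context}\n\n"
        "Question: {question}\n"
        "Answer:"
    )
    context = "\n\n".join(retrieved_chunks)
    return template.format(context=context, question=question)
```

The chain then sends the assembled prompt to the LLM; everything the model sees about your documents comes through that context block, which is why chunking and retrieval quality matter so much.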
Each script is commented for learning. See the source for details and experiment with your own data!
- Intermediate outputs are saved in the `pickles/` directory (e.g., `docs.pkl`, `splits.pkl`).
- The persistent vector store is saved in the `database/` directory (Chroma DB and related files).
- Both `pickles/` and `database/` are gitignored and safe to delete if you want to reset the workflow.
- Prompts: You can edit the prompt in `src/qa_chain.py` to change the style or constraints of the answers.
- Document Sources: Add more loaders in `src/load_documents.py` as needed (see the LangChain docs for options).
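One way to keep document sources extensible is a small registry that maps file extensions to loader functions, so adding a source is a one-line change. This pattern is a suggestion, not how the repo is currently structured, and the loader names below are hypothetical placeholders (real code would call LangChain loaders such as PyPDFLoader):

```python
from pathlib import Path

def load_pdf(path):
    """Placeholder; real code would use a LangChain PDF loader."""
    return f"pdf:{path}"

def load_text(path):
    """Placeholder; real code would use a LangChain text loader."""
    return f"text:{path}"

# Map extensions to loaders; register new document sources here.
LOADERS = {".pdf": load_pdf, ".txt": load_text, ".md": load_text}

def load_any(path):
    loader = LOADERS.get(Path(path).suffix.lower())
    if loader is None:
        raise ValueError(f"No loader registered for {path}")
    return loader(path)
```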
- If you see `OPENAI_API_KEY not set in .env file.`, check your `.env` file.
- If you get file-not-found errors, ensure you ran the previous step and that the required files exist.
- For PDF loading, make sure `data/test.pdf` exists.
- Chroma DB is persisted automatically on any change (no need to call `persist` manually).
Inspired by DeepLearning.AI - LangChain Chat With Your Data