srijanshukla18/ita-kg

Income Tax Act Knowledge Graph + RAG System

A hybrid system combining Knowledge Graphs and Retrieval-Augmented Generation (RAG) for intelligent querying of the Indian Income Tax Act.

Why This Approach?

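Traditional RAG systems struggle with legal documents because they retrieve passages in isolation and miss the interconnected nature of legal provisions. This system addresses that by pairing RAG with a knowledge graph of the Act's sections and their cross-references.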

RAG Alone Fails At:

  • "What sections reference Section 80C?"
  • "Show me all exemptions available for senior citizens"
  • "What penalties apply if I violate Section 44AD?"
  • "How does Section 10 relate to Section 80?"

Knowledge Graph Excels Because Tax Law Has:

  • Sections that reference other sections
  • Definitions used across multiple places
  • Conditions and thresholds (income slabs, age limits)
  • Exemptions with eligibility criteria
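
To make the first question above concrete, here is a minimal sketch of why explicit cross-references help: once "section A cites section B" is stored as an edge, "what references Section 80C?" becomes a reverse lookup rather than a similarity search. The `references` mapping is illustrative data, not the contents of the repository's sample file, and `build_reverse_index` is a hypothetical helper.

```python
from collections import defaultdict

# Illustrative data only: section -> sections it cites.
references = {
    "80TTB": ["80C"],
    "139": ["44AD"],
    "271F": ["139"],
}

def build_reverse_index(refs):
    """Invert a 'cites' mapping into a 'cited by' mapping."""
    cited_by = defaultdict(list)
    for source, targets in refs.items():
        for target in targets:
            cited_by[target].append(source)
    return dict(cited_by)

cited_by = build_reverse_index(references)
# Sections that reference Section 80C, found without any text search.
print(cited_by.get("80C", []))
```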

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Tax Act Text  │───▶│  Parser Module   │───▶│  Knowledge Graph│
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                        │
                                ▼                        │
                       ┌──────────────────┐              │
                       │  Vector Store    │              │
                       │     (RAG)        │              │
                       └──────────────────┘              │
                                │                        │
                                ▼                        ▼
                       ┌──────────────────────────────────────┐
                       │     Hybrid Query System              │
                       │  • KG Queries (relationships)        │
                       │  • RAG Queries (content)             │
                       │  • Hybrid Queries (both)             │
                       └──────────────────────────────────────┘

Quick Start

Prerequisites

  • Python 3.8+
  • Neo4j Database (local or cloud)
  • OpenAI API Key (optional, for enhanced responses)

Installation

Option 1: Using Makefile (Recommended)

git clone <repository>
cd ita-kg

# Setup virtual environment and install dependencies
make setup

# Configure environment
cp .env.example .env
# Edit .env with your credentials

# Run with virtual environment
make run

# Or run demo
make demo

Option 2: Using Docker

git clone <repository>
cd ita-kg

# Configure environment
cp .env.example .env
# Edit .env with your credentials

# Start services (Neo4j + app)
make docker-up

# View logs
make docker-logs

Option 3: Manual Setup

git clone <repository>
cd ita-kg
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Setup Neo4j manually
docker run --name neo4j -p7474:7474 -p7687:7687 -d \
    -e NEO4J_AUTH=neo4j/your_password \
    neo4j:latest

python main.py

Available Make Commands

make help           # Show all available commands
make setup          # Create venv and install dependencies
make run            # Run main.py using virtual environment
make demo           # Run demo.py using virtual environment
make docker-up      # Start Docker services
make docker-logs    # View logs from Docker services
make docker-down    # Stop Docker services
make clean          # Remove venv and Docker volumes

Usage Examples

Knowledge Graph Queries (Relationships)

# What sections reference Section 80C?
response = query_system.query("What sections reference Section 80C?")
# Returns: Sections that reference Section 80C:
# • Section 80TTB: Deduction in respect of interest on deposits...

# Find related sections
response = query_system.query("What sections are related to Section 139?")

Hybrid Queries (Structure + Content)

# Eligibility questions
response = query_system.query("What exemptions are available for senior citizens?")
# Combines: KG to find exemption sections + RAG for senior citizen content

# Category queries
response = query_system.query("What deductions are available?")

RAG Queries (Content-Based)

# Detailed explanations
response = query_system.query("Explain Section 44AD for presumptive taxation")

# Specific information
response = query_system.query("What is the penalty for not filing returns?")

Project Structure

ita-kg/
├── tax_parser.py              # Income Tax Act text parser
├── knowledge_graph.py         # Neo4j Knowledge Graph builder
├── hybrid_query_system.py     # Query routing and processing
├── main.py                    # Interactive system
├── demo.py                    # Capabilities demonstration
├── sample_income_tax_act.txt  # Sample tax act data
├── requirements.txt           # Python dependencies
├── .env.example              # Environment template
└── README.md                 # This file

Core Components

1. Tax Parser (tax_parser.py)

  • Extracts sections, titles, and content
  • Identifies cross-references between sections
  • Classifies section types (exemption, deduction, penalty)
  • Extracts key concepts and definitions
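
The extraction steps above can be sketched roughly as follows. This is not the code in tax_parser.py; it is a minimal illustration that assumes the "Section X - Title" header format described under Adding More Data and treats any in-text mention of "section N" as a cross-reference. The regular expressions, the `parse_sections` name, and the sample text are all assumptions.

```python
import re

# Assumed header format: "Section 80C - Title" at the start of a line.
HEADER_RE = re.compile(r"^Section\s+(\d+[A-Z]*)\s*-\s*(.+)$", re.MULTILINE)
# Any in-text mention such as "section 80C" or "Section 139".
REF_RE = re.compile(r"\b[Ss]ection\s+(\d+[A-Z]*)\b")

def parse_sections(text):
    """Split text into sections and collect the sections each one mentions."""
    headers = list(HEADER_RE.finditer(text))
    sections = []
    for i, match in enumerate(headers):
        number, title = match.group(1), match.group(2).strip()
        # A section body runs until the next header (or end of text).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[match.end():end].strip()
        refs = sorted({r for r in REF_RE.findall(body) if r != number})
        sections.append(
            {"number": number, "title": title, "content": body, "references": refs}
        )
    return sections

# Made-up sample text in the assumed format.
sample = """Section 80C - Deduction for certain investments
Life insurance premia and provident fund contributions qualify.

Section 80TTB - Deduction in respect of interest on deposits
This is in addition to the deduction under section 80C.
"""

for section in parse_sections(sample):
    print(section["number"], section["title"], section["references"])
```

Sub-clause references such as "section 10(10D)" are not handled by this pattern; a real parser would need a richer grammar.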

2. Knowledge Graph (knowledge_graph.py)

  • Creates Neo4j nodes for sections
  • Builds REFERENCES relationships
  • Adds concept categorization
  • Provides graph analytics
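
As a hedged illustration of the first two bullets, the sketch below generates Cypher `MERGE` statements for section nodes and `REFERENCES` edges, and shows one way to send them through the official `neo4j` Python driver. The `Section` label, the `number`, `title`, and `type` properties, and both function names are assumptions; knowledge_graph.py may model the graph differently.

```python
def section_statements(sections):
    """Yield (cypher, params) pairs creating Section nodes, then REFERENCES edges."""
    for s in sections:
        yield (
            "MERGE (s:Section {number: $number}) "
            "SET s.title = $title, s.type = $type",
            {"number": s["number"], "title": s["title"], "type": s["type"]},
        )
    for s in sections:
        for target in s["references"]:
            # MERGE on both ends so an edge to a not-yet-parsed section still works.
            yield (
                "MERGE (a:Section {number: $src}) "
                "MERGE (b:Section {number: $dst}) "
                "MERGE (a)-[:REFERENCES]->(b)",
                {"src": s["number"], "dst": target},
            )

def write_graph(uri, user, password, sections):
    """Run the statements against Neo4j (requires the neo4j package and a live server)."""
    from neo4j import GraphDatabase  # imported lazily so the sketch loads without it

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            for cypher, params in section_statements(sections):
                session.run(cypher, params)

# Illustrative input in the shape a parser might produce.
example_sections = [
    {"number": "80C", "title": "Deductions", "type": "deduction", "references": []},
    {"number": "80TTB", "title": "Interest on deposits", "type": "deduction",
     "references": ["80C"]},
]
statements = list(section_statements(example_sections))
```

Using `MERGE` rather than `CREATE` keeps a rebuild idempotent: rerunning the load does not duplicate nodes or edges.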

3. Hybrid Query System (hybrid_query_system.py)

  • Routes queries based on type:
    • KG: Reference/relationship queries
    • RAG: Content/explanation queries
    • Hybrid: Complex eligibility queries
  • Combines results for comprehensive answers
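
One simple way to implement this kind of routing is keyword matching on the question. The sketch below is an assumption about how such a router could look, not the logic in hybrid_query_system.py; the pattern lists and the `classify_query` name are hypothetical.

```python
import re

# Hypothetical keyword patterns for each route.
KG_PATTERNS = [r"\breferenc", r"\brelated? to\b", r"\baffected\b"]
HYBRID_PATTERNS = [r"\bexemptions?\b", r"\bdeductions?\b", r"\beligib"]

def classify_query(question):
    """Return 'kg', 'hybrid', or 'rag' for a natural-language question."""
    q = question.lower()
    if any(re.search(p, q) for p in KG_PATTERNS):
        return "kg"       # relationship or reference questions
    if any(re.search(p, q) for p in HYBRID_PATTERNS):
        return "hybrid"   # category or eligibility questions
    return "rag"          # everything else: content and explanation

for question in [
    "What sections reference Section 80C?",
    "What exemptions are available for senior citizens?",
    "Explain Section 44AD for presumptive taxation",
]:
    print(classify_query(question), "<-", question)
```

Checking the relationship patterns first means a question like "What sections reference the exemptions in Section 10?" goes to the graph rather than to the hybrid path.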

Interactive Features

Query Types Supported:

  1. Reference Tracking: "What references Section X?"
  2. Relationship Discovery: "What sections are related to X?"
  3. Category Queries: "Show all deduction sections"
  4. Eligibility Analysis: "What exemptions for senior citizens?"
  5. Content Explanation: "Explain presumptive taxation"
  6. Impact Analysis: "If Section X changes, what's affected?"
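
Impact analysis (item 6) amounts to following references backwards: if Section X changes, every section that cites X, directly or through a chain of citations, may be affected. A minimal breadth-first sketch, assuming a "cited by" mapping has already been built; the data and the `affected_sections` name are illustrative.

```python
from collections import deque

def affected_sections(cited_by, changed):
    """Return every section that directly or transitively references `changed`."""
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in cited_by.get(current, []):
            # Skip already-visited sections so reference cycles terminate.
            if dependent not in seen and dependent != changed:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen)

# Illustrative data: section -> sections that cite it.
cited_by = {"44AD": ["139"], "139": ["271F"], "271F": []}
print(affected_sections(cited_by, "44AD"))
```

In Neo4j the same question can be expressed as a variable-length path match over `REFERENCES` edges instead of a traversal in application code.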

Graph Analytics:

  • Section count by type
  • Cross-reference statistics
  • Concept distribution
  • Reference network analysis
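
The first two analytics can be approximated in a few lines once sections are available as dictionaries. This is a sketch under assumed field names (`type`, `references`), not the repository's analytics code, and the sample sections are made up.

```python
from collections import Counter

def graph_stats(sections):
    """Summarise section types and cross-reference counts."""
    by_type = Counter(s["type"] for s in sections)
    # How often each section is cited by others.
    in_degree = Counter(t for s in sections for t in s["references"])
    return {
        "sections_by_type": dict(by_type),
        "total_references": sum(len(s["references"]) for s in sections),
        "most_referenced": in_degree.most_common(1),
    }

# Illustrative input only.
sample_sections = [
    {"number": "80C", "type": "deduction", "references": []},
    {"number": "80D", "type": "deduction", "references": ["80C"]},
    {"number": "80TTB", "type": "deduction", "references": ["80C"]},
    {"number": "271F", "type": "penalty", "references": ["139"]},
]
stats = graph_stats(sample_sections)
print(stats)
```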

Sample Queries

The system comes with sample data covering key sections:

  • Exemptions: Sections 10, 10A
  • Deductions: Sections 80C, 80D, 80TTB
  • Penalties: Sections 271F, 271B
  • Procedures: Sections 139, 44AD
  • Definitions: Section 2

Try these queries:

• "What sections reference Section 80C?"
• "What exemptions are available for senior citizens?"
• "Show me all penalty sections"
• "What is Section 44AD about?"
• "Which sections mention agricultural income?"

Key Advantages

  1. Cross-Reference Navigation: Navigate the web of legal references
  2. Structured Categorization: Find all exemptions/deductions instantly
  3. Impact Analysis: See what's affected when sections change
  4. Context-Aware Responses: Combine structure with content
  5. Scalable: Add more legal documents to the same graph

Configuration

Environment Variables (.env)

NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_password
# Optional: enables enhanced responses
OPENAI_API_KEY=your_api_key
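
For reference, here is one way application code might read these variables once they are in the process environment (for example after a dotenv loader or `docker compose` has exported them). The `load_neo4j_config` function and its defaults are assumptions for illustration; the repository's own loading code may differ.

```python
import os

def load_neo4j_config(env=os.environ):
    """Read connection settings, failing early if the password is missing."""
    password = env.get("NEO4J_PASSWORD")
    if not password:
        raise RuntimeError("NEO4J_PASSWORD is not set; check your .env file")
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": password,
        # Optional: None simply means enhanced responses are unavailable.
        "openai_api_key": env.get("OPENAI_API_KEY"),
    }

# Passing a plain dict makes the function easy to try without touching real settings.
config = load_neo4j_config({"NEO4J_PASSWORD": "your_password"})
print(config["uri"], config["username"])
```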

Adding More Data

  1. Add sections to sample_income_tax_act.txt
  2. Follow the format: Section X - Title
  3. The parser will automatically extract references and relationships
  4. Run the system to rebuild the knowledge graph

Troubleshooting

Neo4j Connection Issues:

# Check if Neo4j is running
curl http://localhost:7474

# Verify credentials in .env file
# Make sure bolt port (7687) is accessible
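
If `curl` succeeds on 7474 but the application still cannot connect, it can help to test the Bolt port directly. The following is a small sketch using only the standard library; `port_open` is a hypothetical helper, and a successful TCP connection only shows the port is reachable, not that the credentials are valid.

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # 7687 is the Bolt port used in the NEO4J_URI example.
    print("bolt reachable:", port_open("localhost", 7687))
```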

Import Errors:

# Reinstall dependencies
pip install -r requirements.txt

# Check Python version (3.8+ required)
python --version

Query Issues:

  • Check Neo4j database has data: MATCH (n) RETURN count(n)
  • Verify section format in source text
  • Check logs for parsing errors

Performance

  • Graph Build: ~1-2 seconds for sample data
  • Query Response: ~100-500ms average
  • Memory Usage: ~50MB for sample dataset
  • Scalability: Tested with 1000+ sections

Future Enhancements

  • Add more legal documents (Companies Act, GST Act)
  • Enhanced NLP for better reference extraction
  • Web interface with graph visualization
  • Multi-language support
  • Advanced analytics and insights

License

MIT License - Feel free to use for educational and commercial purposes.
