A comprehensive AI-powered system for biomedical literature analysis, semantic search, summarization, and trend discovery.
- Semantic Search: AI-powered search through biomedical literature
- Auto-Summarization: Generate concise summaries of research papers
- Trend Analysis: Discover trending topics and research patterns
- Topic Modeling: Identify and analyze research themes
- Web Dashboard: Interactive Streamlit interface
- REST API: FastAPI backend for integration
- Visualizations: Charts and graphs for trend analysis
- Researchers: Quickly find relevant papers and identify research gaps
- Clinicians: Stay up to date with the latest medical research
- Students: Understand research trends and topics
- Data Scientists: Analyze patterns in biomedical literature
- Institutions: Monitor research output and collaborations
# Clone or download the project
cd biomedical-research-assistant
# Create virtual environment
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # Linux/Mac
# Install dependencies
pip install -r requirements.txt
# Copy configuration template
copy .env.template .env # Windows
# cp .env.template .env # Linux/Mac
# Edit .env and set your email address (required for the PubMed API)
# Check configuration
python main.py check
# Set up data pipeline (30-60 minutes first time)
python main.py setup
# Start API server
python main.py server
# Start web dashboard (in another terminal)
python main.py dashboard
Open your browser to http://localhost:8501 for the dashboard!
See SETUP_INSTRUCTIONS.md for the complete setup guide.
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  Data Sources    │    │  AI Processing   │    │  User Interface  │
├──────────────────┤    ├──────────────────┤    ├──────────────────┤
│ • PubMed API     │───▶│ • Text Cleaning  │───▶│ • Web Dashboard  │
│ • Research Papers│    │ • Embeddings     │    │ • REST API       │
│ • Metadata       │    │ • Summarization  │    │ • Visualizations │
│ • MeSH Terms     │    │ • Topic Modeling │    │ • Search Results │
└──────────────────┘    └──────────────────┘    └──────────────────┘
- Data Ingestion: Fetches papers from PubMed using the Entrez API
- Preprocessing: Cleans and structures text data
- Embeddings: Creates semantic vectors using Sentence Transformers
- Indexing: Builds FAISS index for fast similarity search
- Embeddings: sentence-transformers/all-mpnet-base-v2
- Summarization: facebook/bart-large-cnn
- Topic Modeling: BERTopic with biomedical optimizations
- API Server: FastAPI with auto-generated documentation
- Web Dashboard: Streamlit with interactive visualizations
- CLI Tools: Command-line interface for all operations
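The embedding and indexing steps listed above amount to nearest-neighbour lookup over vectors. The following is a minimal sketch of that idea, not the project's code: hand-written 3-dimensional vectors stand in for Sentence Transformers output, a brute-force cosine scan stands in for the FAISS index, and the names `normalize`, `build_index`, and `search` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_index(embeddings: Dict[str, Sequence[float]]) -> Dict[str, List[float]]:
    """Hypothetical stand-in for index construction: unit vectors keyed by paper ID."""
    return {paper_id: normalize(vec) for paper_id, vec in embeddings.items()}


def search(
    index: Dict[str, List[float]],
    query_vec: Sequence[float],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Brute-force scan: score every paper against the query, keep the best top_k."""
    q = normalize(query_vec)
    scored = [
        (paper_id, sum(a * b for a, b in zip(q, vec)))
        for paper_id, vec in index.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy vectors for illustration only; real embedding models emit hundreds of dimensions.
toy_index = build_index({
    "paper-a": [0.9, 0.1, 0.0],
    "paper-b": [0.1, 0.9, 0.0],
    "paper-c": [0.7, 0.3, 0.1],
})
results = search(toy_index, [1.0, 0.0, 0.0], top_k=2)
for paper_id, score in results:
    print(f"{paper_id}: {score:.3f}")
```

A dedicated vector index earns its keep once the collection is too large to scan on every query; for a few thousand papers, the brute-force version already shows what the similarity scores mean.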
- "COVID-19 vaccine efficacy clinical trials"
- "cancer immunotherapy checkpoint inhibitors"
- "alzheimer disease biomarkers tau protein"
- "diabetes treatment metformin mechanism"
- "machine learning medical imaging"
RESEARCH_DOMAIN=covid immunotherapy
MAX_PAPERS=5000
DATE_FROM=2020/01/01
DATE_TO=2024/12/31
# Standard (fast)
EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2
# Biomedical (accurate)
EMBEDDING_MODEL=microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract
# For testing
MAX_PAPERS=1000
TOP_K_RESULTS=10
# For production
MAX_PAPERS=10000
TOP_K_RESULTS=20
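For readers wiring these settings into their own scripts, here is one way the values could be read and validated. It is a sketch under assumptions: the `Settings` class, the `load_settings` helper, and the default values are hypothetical and may differ from how the project actually loads its configuration. Only the variable names and the `YYYY/MM/DD` date format come from the examples above.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    research_domain: str
    max_papers: int
    date_from: date
    date_to: date
    embedding_model: str
    top_k_results: int


def _parse_date(value: str) -> date:
    # The .env examples use the YYYY/MM/DD format.
    return datetime.strptime(value, "%Y/%m/%d").date()


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build Settings from an environment-like mapping.

    The fallback values below are assumptions chosen for illustration.
    """
    settings = Settings(
        research_domain=env.get("RESEARCH_DOMAIN", "covid immunotherapy"),
        max_papers=int(env.get("MAX_PAPERS", "1000")),
        date_from=_parse_date(env.get("DATE_FROM", "2020/01/01")),
        date_to=_parse_date(env.get("DATE_TO", "2024/12/31")),
        embedding_model=env.get(
            "EMBEDDING_MODEL", "sentence-transformers/all-mpnet-base-v2"
        ),
        top_k_results=int(env.get("TOP_K_RESULTS", "10")),
    )
    if settings.max_papers <= 0:
        raise ValueError("MAX_PAPERS must be positive")
    if settings.date_from > settings.date_to:
        raise ValueError("DATE_FROM must not be later than DATE_TO")
    return settings


example = load_settings({"MAX_PAPERS": "5000", "TOP_K_RESULTS": "20"})
print(example)
```

In practice the mapping would be `os.environ` after the `.env` file has been loaded. Failing early on a reversed date range or a non-numeric `MAX_PAPERS` is cheaper than discovering the problem partway through a long setup run.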
| Dataset Size | Setup Time | Search Speed | Memory Usage |
|---|---|---|---|
| 1K papers | 5-10 min | <100ms | 2-4 GB |
| 5K papers | 20-30 min | <200ms | 4-8 GB |
| 10K+ papers | 45-60 min | <300ms | 8-16 GB |
GET /search?q=covid+vaccine&top_k=10
POST /search
GET /summarize?q=covid+vaccine&top_k=5
GET /paper/{pmid}/summary
GET /topics/trending?top_k=10
GET /topics/{topic_id}
GET /trends/general
GET /paper/{pmid}
GET /paper/{pmid}/similar
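The endpoints above can be called from any HTTP client. The sketch below builds request URLs with the Python standard library. The base URL `http://localhost:8000` is an assumption (check where your API server actually listens), the helper names are hypothetical, and the response format is not documented here, so `fetch_json` simply returns whatever JSON the server sends.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: adjust to your API server address


def search_url(query: str, top_k: int = 10, base_url: str = BASE_URL) -> str:
    """Build the GET /search URL; urlencode turns spaces into '+'."""
    return f"{base_url}/search?{urlencode({'q': query, 'top_k': top_k})}"


def paper_url(pmid: str, suffix: str = "", base_url: str = BASE_URL) -> str:
    """Build /paper/{pmid}, optionally followed by 'summary' or 'similar'."""
    path = f"/paper/{quote(str(pmid), safe='')}"
    if suffix:
        path += f"/{suffix}"
    return base_url + path


def fetch_json(url: str, timeout: float = 10.0):
    """Fetch a URL and decode the body as JSON (requires the server to be running)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(search_url("covid vaccine"))
print(paper_url("12345", "similar"))  # "12345" is a placeholder ID
```

With the server running, `fetch_json(search_url("covid vaccine"))` would issue the first request in the list above.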
- Semantic search with similarity scores
- Multi-paper summarization
- Similar paper recommendations
- Interactive result filtering
- Real-time trending topic identification
- Growth rate analysis
- Topic evolution over time
- Representative paper extraction
- Publication trends by year/month
- Journal analysis and rankings
- Author collaboration patterns
- MeSH term frequency analysis
- Individual paper summaries
- Citation-style information
- Related paper discovery
- Metadata extraction
- No Personal Data: Only public research metadata is processed
- Medical Disclaimer: For research purposes only, not medical advice
- Rate Limiting: Respects PubMed API rate limits
- Open Source: Transparent algorithms and processing
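To illustrate the rate-limiting point: NCBI asks E-utilities clients to stay at or below 3 requests per second without an API key (10 with one). The sketch below shows one simple way to space out calls. The `RateLimiter` class is hypothetical and not taken from this project, and the injected clock and sleep functions exist only so the demo runs instantly.

```python
import time
from typing import Callable


class RateLimiter:
    """Spaces calls at least 1 / max_per_second seconds apart."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self._min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # the first call never waits

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds waited."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the following slot relative to when this call proceeds.
        self._next_allowed = max(now, self._next_allowed) + self._min_interval
        return delay


# Demo with a fake clock so nothing actually sleeps.
fake_now = [0.0]


def fake_sleep(seconds: float) -> None:
    fake_now[0] += seconds


limiter = RateLimiter(3, clock=lambda: fake_now[0], sleep=fake_sleep)
delays = [limiter.wait() for _ in range(3)]
print(delays)
```

In real use you would construct `RateLimiter(3)` with the default clock and call `wait()` immediately before each PubMed request.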
Minimum:
- Python 3.8+
- 8GB RAM
- 5GB storage
- Internet connection
Recommended:
- Python 3.9+
- 16GB+ RAM
- 20GB+ storage
- GPU (optional, for faster processing)
Data Source: All data comes from publicly available research papers via PubMed/NCBI APIs.
Updates: The system only covers papers within your configured date range. To track the most current research, update your dataset regularly.
- Issues: Report bugs and request features via GitHub issues
- Documentation: See setup guide and API documentation
- Community: Join discussions and share improvements
- Contributing: Pull requests welcome for new features and fixes
This project is open source. See the LICENSE file for details.
- NCBI/PubMed for providing access to biomedical literature
- Hugging Face for transformer models and libraries
- Streamlit & FastAPI for web framework components
- Scientific Community for open access research
Built with ❤️ for the research community
Empowering discovery through AI-driven literature analysis