20 commits
5675e97
Image reader and image support in `gather_evidence` (#1046)
jamesbraza Aug 6, 2025
d1cde22
Preventing Greek name from crashing `DocDetails` creation (#1048)
jamesbraza Aug 6, 2025
0caa926
Better invalid name logs (#1049)
jamesbraza Aug 6, 2025
f71d023
Multimodal PDF support (#1047)
jamesbraza Aug 7, 2025
d3760c9
Fix paperqa/configs link in README.md (#1051)
chrisranderson Aug 8, 2025
948423f
Fixing all broken links, `pymarkdown` (#1052)
jamesbraza Aug 9, 2025
d34543e
Documenting `DocMetadataTask`/`MetadataProvider` (#1050)
jamesbraza Aug 9, 2025
beb838e
chore(deps): lock file maintenance (#1053)
renovate[bot] Aug 9, 2025
81ea858
chore(deps): update actions/checkout action to v5 (#1058)
renovate[bot] Aug 11, 2025
193feb6
chore(deps): update actions/download-artifact action to v5 (#1059)
renovate[bot] Aug 11, 2025
72b7f26
Moved `mypy` to use `local` hook to unsilence it on missing dependenc…
jamesbraza Aug 11, 2025
6801155
Removed dead `patch` from `test_add_clinical_trials_to_docs` (#1056)
jamesbraza Aug 11, 2025
34ceb70
Restored `UnpaywallProvider` by updating expected response (#1057)
jamesbraza Aug 11, 2025
fab944a
Refreshed `test_crossref_journalquality_fields_filtering` cassette (#…
jamesbraza Aug 11, 2025
4862764
Updating `journal_quality.csv` from script (#1061)
jamesbraza Aug 11, 2025
2d58202
Better lookup failure message in `Settings.from_name` (#1064)
jamesbraza Aug 16, 2025
62afb8c
Lower bibtex logging to debug (#1067)
mskarlin Aug 19, 2025
687ce40
Fix: handle non-UTF-8 input in util.py
dmcgrath19 Aug 20, 2025
7a88d42
Add comments to utf-8 error handling
dmcgrath19 Aug 20, 2025
46e21d6
[pre-commit.ci lite] apply automatic fixes
pre-commit-ci-lite[bot] Aug 20, 2025
8 changes: 4 additions & 4 deletions .github/workflows/build.yml
@@ -9,14 +9,14 @@ jobs:
publish:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5
- id: build-paper-qa-pymupdf
uses: hynek/build-and-inspect-python-package@v2
with:
path: packages/paper-qa-pymupdf
upload-name-suffix: -paper-qa-pymupdf
- name: Download built paper-qa-pymupdf artifact to dist/
uses: actions/download-artifact@v4
uses: actions/download-artifact@v5
with:
name: ${{ steps.build-paper-qa-pymupdf.outputs.artifact-name }}
path: dist
@@ -28,7 +28,7 @@ jobs:
path: packages/paper-qa-pypdf
upload-name-suffix: -paper-qa-pypdf
- name: Download built paper-qa-pypdf artifact to dist/
uses: actions/download-artifact@v4
uses: actions/download-artifact@v5
with:
name: ${{ steps.build-paper-qa-pypdf.outputs.artifact-name }}
path: dist
@@ -39,7 +39,7 @@ jobs:
with:
upload-name-suffix: -paper-qa
- name: Download built paper-qa artifact to dist/
uses: actions/download-artifact@v4
uses: actions/download-artifact@v5
with:
name: ${{ steps.build-paper-qa.outputs.artifact-name }}
path: dist
13 changes: 10 additions & 3 deletions .github/workflows/tests.yml
@@ -14,12 +14,19 @@ jobs:
matrix:
python-version: [3.11, 3.13] # Our min and max supported Python versions
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5
with:
fetch-depth: 0 # For setuptools-scm, replace with fetch-tags after https://github.com/actions/checkout/issues/1471
- uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- uses: astral-sh/setup-uv@v6
with:
enable-cache: true
- run: echo "UV_PROJECT_ENVIRONMENT=$(python -c "import sysconfig; print(sysconfig.get_config_var('prefix'))")" >> $GITHUB_ENV
- run: uv python pin ${{ matrix.python-version }} # uv requires .python-version to match OS Python: https://github.com/astral-sh/uv/issues/11389
- run: uv sync --python-preference only-system
- run: git checkout .python-version # For clean git diff given `pre-commit run --show-diff-on-failure`
- uses: pre-commit/[email protected]
- uses: pre-commit-ci/[email protected]
if: always()
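
For illustration, the `UV_PROJECT_ENVIRONMENT` line that the new workflow step appends to `$GITHUB_ENV` can be reproduced in plain Python. This is only a sketch mirroring the shell one-liner above; the helper name `uv_project_environment_line` is hypothetical and not part of the repository.

```python
import sysconfig


def uv_project_environment_line() -> str:
    """Build the KEY=VALUE line that the workflow step writes to $GITHUB_ENV."""
    # Same lookup as the `python -c` one-liner in the workflow step
    prefix = sysconfig.get_config_var("prefix")
    return f"UV_PROJECT_ENVIRONMENT={prefix}"


line = uv_project_environment_line()
print(line)
```

The apparent intent is that `uv sync --python-preference only-system` installs into the interpreter provided by `actions/setup-python`, so that later steps such as the pre-commit action see the project's dependencies.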
@@ -29,7 +36,7 @@ jobs:
matrix:
python-version: [3.11] # Our min supported Python version
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5
- uses: astral-sh/setup-uv@v6
with:
enable-cache: true
@@ -72,7 +79,7 @@ jobs:
matrix:
python-version: [3.11, 3.13] # Our min and max supported Python versions
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5
- uses: astral-sh/setup-uv@v6
with:
enable-cache: true
43 changes: 11 additions & 32 deletions .pre-commit-config.yaml
@@ -23,7 +23,7 @@ repos:
- id: mixed-line-ending
- id: trailing-whitespace
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.12.2
rev: v0.12.8
hooks:
- id: ruff-check
args: [--fix, --exit-non-zero-on-fix]
@@ -54,11 +54,11 @@ repos:
hooks:
- id: check-mailmap
- repo: https://github.com/henryiii/validate-pyproject-schema-store
rev: 2025.06.23
rev: 2025.08.07
hooks:
- id: validate-pyproject
- repo: https://github.com/astral-sh/uv-pre-commit
rev: 0.7.19
rev: 0.8.6
hooks:
- id: uv-lock
- repo: https://github.com/adamchainz/blacken-docs
@@ -71,11 +71,10 @@ repos:
hooks:
- id: nb-clean
args: [--preserve-cell-outputs, --remove-empty-cells]
- repo: https://github.com/igorshubovych/markdownlint-cli
rev: v0.45.0
- repo: https://github.com/jackdewinter/pymarkdown
rev: v0.9.31
hooks:
- id: markdownlint
args: [--config=pyproject.toml, --configPointer=/tool/markdownlint]
- id: pymarkdown
exclude: docs/tutorials/
- repo: https://github.com/mwouts/jupytext
rev: v1.17.2
@@ -88,30 +87,10 @@ repos:
rev: 0.0.10
hooks:
- id: markdown-toc-creator
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.16.1
- repo: local # Use local so we can inspect paperqa.version
hooks:
- id: mypy
args: [--pretty, --ignore-missing-imports]
additional_dependencies:
- aiohttp>=3.10.6 # Match pyproject.toml
- PyMuPDF>=1.24.12
- anyio
- fhlmi>=0.28 # Match pyproject.toml
- fhaviary[llm]>=0.20 # Match pyproject.toml
- ldp>=0.25.0 # Match pyproject.toml
- html2text
- httpx
- pybtex
- numpy
- pydantic~=2.11 # Match pyproject.toml
- pydantic-settings
- qdrant-client
- rich
- tantivy>=0.22.2 # Match pyproject.toml
- tenacity
- tiktoken>=0.4.0 # Match pyproject.toml
- types-setuptools
- types-PyYAML
- sentence-transformers
- pyzotero
name: mypy
entry: mypy
language: system
types_or: [python, pyi]
33 changes: 30 additions & 3 deletions README.md
@@ -1,6 +1,8 @@
# PaperQA2

[![GitHub](https://img.shields.io/badge/github-%23121011.svg?logo=github&logoColor=white)](https://github.com/Future-House/paper-qa)
<!-- pyml disable-num-lines 6 line-length -->

[![GitHub](https://img.shields.io/badge/GitHub-black?logo=github&logoColor=white)](https://github.com/Future-House/paper-qa)
[![PyPI version](https://badge.fury.io/py/paper-qa.svg)](https://badge.fury.io/py/paper-qa)
[![tests](https://github.com/Future-House/paper-qa/actions/workflows/tests.yml/badge.svg)](https://github.com/Future-House/paper-qa)
![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)
@@ -35,6 +37,7 @@ question answering, summarization, and contradiction detection.
- [Local Embedding Models (Sentence Transformers)](#local-embedding-models-sentence-transformers)
- [Adjusting number of sources](#adjusting-number-of-sources)
- [Using Code or HTML](#using-code-or-html)
- [Multimodal Support](#multimodal-support)
- [Using External DB/Vector DB and Caching](#using-external-dbvector-db-and-caching)
- [Creating Index](#creating-index)
- [Manifest Files](#manifest-files)
@@ -287,7 +290,7 @@ pqa --settings <setting name> \

### Bundled Settings

Inside [`paperqa/configs`](paperqa/configs) we bundle known useful settings:
Inside [`src/paperqa/configs`](src/paperqa/configs) we bundle known useful settings:

| Setting Name | Description |
| ------------ | ---------------------------------------------------------------------------------------------------------------------------- |
@@ -307,7 +310,7 @@ For each OpenAI tier, a pre-built setting exists to limit usage.
pqa --settings 'tier1_limits' ask 'What is PaperQA2?'
```

This will limit your system to use the [tier1_limits](paperqa/configs/tier1_limits.json),
This will limit your system to use the [tier1_limits](src/paperqa/configs/tier1_limits.json),
and slow down your queries to accommodate.

You can also specify them manually with any rate limit string that matches the specification in
@@ -726,6 +729,28 @@ session = await docs.aquery("Where is the search bar in the header defined?")
print(session)
```

### Multimodal Support

Multimodal support centers on:

- Standalone images
- Images or tables in PDFs

The `Docs` object stores media via a `ParsedMedia` object.
When chunking a document, media are not split at chunk boundaries,
so two or more chunks can correspond to the same media.
This means that within PaperQA,
each `ParsedMedia` has a one-to-many relationship with chunks.

Depending on the source document, the same image can appear multiple times
(e.g. each page of a PDF has a logo in the margins).
Thus, clients should consider media databases
to have a many-to-many relationship with chunks.
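
The relationship described above can be sketched as follows. This is a minimal illustration, not PaperQA's implementation: `MediaStub`, `ChunkStub`, and `chunks_by_media` are hypothetical names, and the sketch assumes that media identity is decided by hashing the raw bytes.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MediaStub:
    """Hypothetical stand-in for a parsed image or table (not `ParsedMedia`)."""

    data: bytes

    @property
    def media_id(self) -> str:
        # Assumption: identical bytes are treated as the same media item
        return hashlib.sha256(self.data).hexdigest()


@dataclass
class ChunkStub:
    """Hypothetical stand-in for a text chunk with its associated media."""

    name: str
    text: str
    media: list[MediaStub] = field(default_factory=list)


def chunks_by_media(chunks: list[ChunkStub]) -> dict[str, list[str]]:
    """Invert the chunk -> media association into media ID -> chunk names."""
    index: dict[str, list[str]] = defaultdict(list)
    for chunk in chunks:
        for media in chunk.media:
            index[media.media_id].append(chunk.name)
    return dict(index)


logo = MediaStub(b"logo-bytes")  # Repeated in the margin of every page
figure = MediaStub(b"figure-bytes")  # Falls across a chunk boundary

chunks = [
    ChunkStub("page1-chunk1", "Intro text", [logo, figure]),
    ChunkStub("page1-chunk2", "More text", [logo, figure]),
    ChunkStub("page2-chunk1", "Methods text", [logo]),
]
index = chunks_by_media(chunks)
```

Here one chunk holds several media items and one media item maps back to several chunks, which is the many-to-many shape a client-side media store would need to accommodate.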

When creating contextual summaries on a given chunk (a `Text`),
the summary LLM is passed both the chunk's text and the chunk's associated media,
but the output contextual summary itself remains text-only.

### Using External DB/Vector DB and Caching

You may want to cache parsed texts and embeddings in an external database or file.
@@ -880,6 +905,7 @@ will return much faster than the first query and we'll be certain the authors ma
| `answer.evidence_retrieval` | `True` | Use retrieval vs processing all docs. |
| `answer.evidence_summary_length` | `"about 100 words"` | Length of evidence summary. |
| `answer.evidence_skip_summary` | `False` | Whether to skip summarization. |
| `answer.evidence_text_only_fallback` | `False` | Whether to allow context creation to retry without media present. |
| `answer.answer_max_sources` | `5` | Max number of sources for an answer. |
| `answer.max_answer_attempts` | `None` | Max attempts to generate an answer. |
| `answer.answer_length` | `"about 200 words, but can be longer"` | Length of final answer. |
@@ -894,6 +920,7 @@ will return much faster than the first query and we'll be certain the authors ma
| `parsing.pdfs_use_block_parsing` | `False` | Opt-in flag for block-based PDF parsing over text-based PDF parsing. |
| `parsing.use_doc_details` | `True` | Whether to get metadata details for docs. |
| `parsing.overlap` | `250` | Characters to overlap chunks. |
| `parsing.multimodal` | `True` | Flag to parse both text and images from applicable documents. |
| `parsing.defer_embedding` | `False` | Whether to defer embedding until summarization. |
| `parsing.parse_pdf` | `paperqa_pypdf.parse_pdf_to_pages` | Function to parse PDF files. |
| `parsing.configure_pdf_parser` | No-op | Callable to configure the PDF parser within `parse_pdf`, useful for behaviors such as enabling logging. |
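
As a rough sketch of the behavior that `answer.evidence_text_only_fallback` describes (retrying context creation without media), the following uses a hypothetical `create_context` helper, a hypothetical `summarize` callable, and a hypothetical `MediaRejectedError`. None of these are PaperQA's API, and the conditions that trigger a retry in the real code may differ.

```python
from collections.abc import Callable, Sequence


class MediaRejectedError(Exception):
    """Hypothetical error raised when a model cannot process supplied media."""


def create_context(
    summarize: Callable[[str, Sequence[bytes]], str],
    text: str,
    media: Sequence[bytes],
    text_only_fallback: bool = False,
) -> str:
    """Summarize a chunk with its media, optionally retrying with text alone."""
    try:
        return summarize(text, media)
    except MediaRejectedError:
        if not text_only_fallback or not media:
            raise  # Fallback disabled, or there was no media to drop
        return summarize(text, [])


def picky_summarizer(text: str, media: Sequence[bytes]) -> str:
    """Toy summarizer that rejects any media, to exercise the fallback path."""
    if media:
        raise MediaRejectedError("cannot read media")
    return f"summary of: {text}"


result = create_context(
    picky_summarizer, "chunk text", [b"img"], text_only_fallback=True
)
```

With the flag left at its default of `False`, the same call re-raises the error instead of silently dropping the media.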
4 changes: 2 additions & 2 deletions docs/tutorials/settings_tutorial.ipynb
@@ -96,9 +96,9 @@
"metadata": {},
"source": [
"The `Settings` class is used to configure the PaperQA settings.\n",
"Official documentation can be found [here](https://github.com/Future-House/paper-qa?tab=readme-ov-file#settings-cheatsheet) and the open source code can be found [here](https://github.com/Future-House/paper-qa/blob/main/paperqa/settings.py).\n",
"Official documentation can be found [here](https://github.com/Future-House/paper-qa?tab=readme-ov-file#settings-cheatsheet) and the open source code can be found [here](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/settings.py).\n",
"\n",
"Here is a basic example of how to use the `Settings` class. We will be unnecessarily verbose for the sake of clarity. Please notice that most of the settings are optional and the defaults are good for most cases. Refer to the [descriptions of each setting](https://github.com/Future-House/paper-qa/blob/main/paperqa/settings.py) for more information.\n",
"Here is a basic example of how to use the `Settings` class. We will be unnecessarily verbose for the sake of clarity. Please notice that most of the settings are optional and the defaults are good for most cases. Refer to the [descriptions of each setting](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/settings.py) for more information.\n",
"\n",
"Within this `Settings` object, I'd like to discuss specifically how the llms are configured and how `paperqa` looks for papers.\n",
"\n",
4 changes: 2 additions & 2 deletions docs/tutorials/settings_tutorial.md
@@ -75,9 +75,9 @@ async with aiohttp.ClientSession() as session, session.get(url, timeout=60) as r
```

The `Settings` class is used to configure the PaperQA settings.
Official documentation can be found [here](https://github.com/Future-House/paper-qa?tab=readme-ov-file#settings-cheatsheet) and the open source code can be found [here](https://github.com/Future-House/paper-qa/blob/main/paperqa/settings.py).
Official documentation can be found [here](https://github.com/Future-House/paper-qa?tab=readme-ov-file#settings-cheatsheet) and the open source code can be found [here](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/settings.py).

Here is a basic example of how to use the `Settings` class. We will be unnecessarily verbose for the sake of clarity. Please notice that most of the settings are optional and the defaults are good for most cases. Refer to the [descriptions of each setting](https://github.com/Future-House/paper-qa/blob/main/paperqa/settings.py) for more information.
Here is a basic example of how to use the `Settings` class. We will be unnecessarily verbose for the sake of clarity. Please notice that most of the settings are optional and the defaults are good for most cases. Refer to the [descriptions of each setting](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/settings.py) for more information.

Within this `Settings` object, I'd like to discuss specifically how the llms are configured and how `paperqa` looks for papers.

2 changes: 1 addition & 1 deletion docs/tutorials/where_do_I_get_papers.md
@@ -59,7 +59,7 @@ You can also manually drag-and-drop PDFs onto each reference.
To download papers, you need to get an API key for your account.

1. Get your library ID, and set it as the environment variable `ZOTERO_USER_ID`.
- For personal libraries, this ID is given [here](https://www.zotero.org/settings/keys) at the part "_Your userID for use in API calls is XXXXXX_".
- For personal libraries, this ID is given [here](https://www.zotero.org/settings/security#applications) at the part "_Your userID for use in API calls is XXXXXX_".
- For group libraries, go to your group page `https://www.zotero.org/groups/groupname`, and hover over the settings link. The ID is the integer after /groups/. (_h/t pyzotero!_)
2. Create a new API key [here](https://www.zotero.org/settings/keys/new) and set it as the environment variable `ZOTERO_API_KEY`.
- The key will need read access to the library.
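
Once both variables are set, a script can check for them before contacting Zotero. The helper below is a minimal sketch using only the standard library; `load_zotero_credentials` is a hypothetical name, and the values passed in are placeholders.

```python
import os
from collections.abc import Mapping


def load_zotero_credentials(
    env: Mapping[str, str] = os.environ,
) -> tuple[str, str]:
    """Return (library ID, API key) from the environment, failing if unset."""
    required = ("ZOTERO_USER_ID", "ZOTERO_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["ZOTERO_USER_ID"], env["ZOTERO_API_KEY"]


# Placeholder values for illustration only
user_id, api_key = load_zotero_credentials(
    {"ZOTERO_USER_ID": "123456", "ZOTERO_API_KEY": "example-key"}
)
```

Calling `load_zotero_credentials()` with no argument reads from the real process environment.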
4 changes: 3 additions & 1 deletion packages/paper-qa-pymupdf/README.md
@@ -1,6 +1,8 @@
# paper-qa-pymupdf

[![GitHub](https://img.shields.io/badge/github-%23121011.svg?logo=github&logoColor=white)](https://github.com/Future-House/paper-qa/tree/main/packages/paper-qa-pymupdf)
<!-- pyml disable-num-lines 6 line-length -->

[![GitHub](https://img.shields.io/badge/GitHub-black?logo=github&logoColor=white)](https://github.com/Future-House/paper-qa/tree/main/packages/paper-qa-pymupdf)
[![PyPI version](https://badge.fury.io/py/paper-qa-pymupdf.svg)](https://badge.fury.io/py/paper-qa-pymupdf)
[![tests](https://github.com/Future-House/paper-qa/actions/workflows/tests.yml/badge.svg)](https://github.com/Future-House/paper-qa)
![License](https://img.shields.io/badge/license-AGPLv3-blue.svg)