
Conversation

@dmcgrath19 commented Aug 20, 2025

Chemistry papers such as DOIs 10.26434/chemrxiv-2025-1x058-v2 and 10.26434/chemrxiv-2025-3lwn2 contain files whose parsing sometimes raises errors due to non-UTF-8 characters. This change updates the MD5 hashing line to replace problematic characters during UTF-8 encoding, making the parsing process more robust and preventing interruptions caused by encoding errors.

Closes #1068
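The failure mode can be reproduced with a plain string containing an unpaired surrogate (a hypothetical minimal example, not taken from the affected PDFs):

```python
# Minimal sketch of the encoding failure this PR works around.
# "\udcff" is an unpaired low surrogate, which cannot be encoded as UTF-8.
bad = "abc\udcff"

try:
    bad.encode("utf-8")  # the default errors="strict" raises
except UnicodeEncodeError as exc:
    print("strict encoding failed:", exc.reason)

# errors="replace" substitutes "?" for unencodable code points,
# so downstream hashing can proceed instead of crashing.
print(bad.encode("utf-8", errors="replace"))  # b'abc?'
```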

jamesbraza and others added 18 commits August 5, 2025 22:31
@Copilot Copilot AI review requested due to automatic review settings August 20, 2025 16:20
@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. bug Something isn't working labels Aug 20, 2025
@Copilot Copilot AI left a comment (Contributor)

Pull Request Overview

This PR fixes a Unicode encoding issue in the MD5 hashing function to handle non-UTF-8 characters in chemistry paper files. The change prevents parsing interruptions by replacing problematic characters during UTF-8 encoding instead of raising encoding errors.

  • Updated the hexdigest function to use error handling during UTF-8 encoding
  • Added errors="replace" parameter to gracefully handle non-UTF-8 characters


@@ -104,7 +104,7 @@ def strings_similarity(s1: str, s2: str, case_insensitive: bool = True) -> float

def hexdigest(data: str | bytes) -> str:
    if isinstance(data, str):
-        return hashlib.md5(data.encode("utf-8")).hexdigest()  # noqa: S324
+        return hashlib.md5(data.encode("utf-8", errors="replace")).hexdigest()  # noqa: S324
Copilot AI Aug 20, 2025

Using 'replace' error handling may silently mask data corruption issues. Consider logging when characters are replaced or using 'ignore' if replacement characters could affect hash consistency. This change could cause different hash values for the same logical content depending on encoding issues.

Suggested change
-    return hashlib.md5(data.encode("utf-8", errors="replace")).hexdigest()  # noqa: S324
+    return hashlib.md5(data.encode("utf-8", errors="strict")).hexdigest()  # noqa: S324


@dmcgrath19 dmcgrath19 mentioned this pull request Aug 20, 2025
@jamesbraza (Collaborator) left a comment

Thanks for the PR, just a quick comment

@@ -104,7 +104,7 @@ def strings_similarity(s1: str, s2: str, case_insensitive: bool = True) -> float

def hexdigest(data: str | bytes) -> str:
    if isinstance(data, str):
-        return hashlib.md5(data.encode("utf-8")).hexdigest()  # noqa: S324
+        return hashlib.md5(data.encode("utf-8", errors="replace")).hexdigest()  # noqa: S324
@jamesbraza (Collaborator)

Can you add a code comment above this stating why we have replace over strict or ignore?

@dmcgrath19 (Author)

Yeah sure thing!

Explain choice of 'replace' over 'strict' and 'ignore'
@whitead commented Aug 20, 2025

@dmcgrath19 - why would it be necessary to replace the characters? Shouldn't we just ignore? This code only computes the hash - it has no effect on the actual usage of the parsed paper.

@dmcgrath19 (Author)

@whitead My reasoning for replacing rather than ignoring invalid characters is that the hash serves as a unique identifier. If invalid characters are ignored and dropped silently, different inputs with subtle encoding differences could produce identical hashes, causing issues with deduplication and caching. By replacing invalid characters, we ensure the hash reflects all differences in the input data, preserving uniqueness and making encoding problems visible.

TLDR: it's more robust but could be seen as a bit messier.
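The deduplication argument can be illustrated with a hypothetical pair of inputs, identical except that one carries an unencodable surrogate:

```python
import hashlib

# Hypothetical inputs: identical except for one unencodable surrogate.
clean = "paper body"
dirty = "paper body\udcff"

def h(s: str, errors: str) -> str:
    return hashlib.md5(s.encode("utf-8", errors=errors)).hexdigest()  # noqa: S324

# With errors="ignore", the invalid character is silently dropped,
# so the two distinct inputs collide to the same hash:
print(h(clean, "ignore") == h(dirty, "ignore"))  # True

# With errors="replace", the surrogate becomes "?", so the hashes
# differ and the encoding problem stays visible in the identifier:
print(h(clean, "replace") == h(dirty, "replace"))  # False
```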

@jamesbraza commented Aug 21, 2025

Hi @dmcgrath19, I was trying to repro this issue just so I can make a unit test. The below is not hitting the UTF-8 errors; what am I missing?

import pytest

from paperqa import Docs
from paperqa.agents import SearchIndex


@pytest.mark.asyncio
async def test_pdf() -> None:
    docs = Docs()
    docname = await docs.aadd_url(
        # URL from the download PDF button from:
        # - 10.26434/chemrxiv-2025-3lwn2: https://chemrxiv.org/engage/chemrxiv/article-details/687f4677728bf9025ea3067a
        # - 10.26434/chemrxiv-2025-1x058-v2: https://chemrxiv.org/engage/chemrxiv/article-details/6835abf53ba0887c33410d6d
        "..."
    )
    (doc_details,) = (d for d in docs.docs.values() if d.docname == docname)
    index_doc = {
        "title": doc_details.title,
        "file_location": "all-atom",
        "body": "".join(t.text for t in docs.texts if t.doc == doc_details),
    }

    search_index = SearchIndex(fields=[*SearchIndex.REQUIRED_FIELDS, "title"])
    await search_index.add_document(index_doc, document=docs)

@dmcgrath19 (Author)

Thanks for digging into this.

In my case, I batch-downloaded some PDFs from ChemRxiv into a local folder and pointed paperqa at that directory. Not all of the files triggered the error. When I inspected the local copies, they didn’t appear corrupted or mis-encoded, so my guess is that the issue could be related to how some of the PDFs are encoded by ChemRxiv during batch-downloading.

@jamesbraza commented Aug 25, 2025

> In my case, I batch-downloaded some PDFs from ChemRxiv into a local folder and pointed paperqa at that directory. Not all of the files triggered the error. When I inspected the local copies, they didn’t appear corrupted or mis-encoded, so my guess is that the issue could be related to how some of the PDFs are encoded by ChemRxiv during batch-downloading.

Ah that makes sense, thanks for clarifying that.


Did you actually test this change out yet? I made the below unit test for a bad PDF in test_agents.py:

import shutil
import tempfile
from pathlib import Path

import pytest

from paperqa import DocDetails, Docs
from paperqa.agents import SearchIndex


@pytest.mark.asyncio
async def test_search_corrupt_pdf(stub_data_dir: Path) -> None:
    """Test that a slightly flawed PDF can still work."""
    with tempfile.TemporaryDirectory() as td:
        tmp_pdf = Path(td) / "paper.pdf"
        shutil.copy(stub_data_dir / "paper.pdf", tmp_pdf)

        # Create standard Docs object with the PDF we will shortly make corrupt
        docs = Docs()
        docname = await docs.aadd(tmp_pdf)
        (doc_details,) = (d for d in docs.docs.values() if d.docname == docname)
        assert isinstance(doc_details, DocDetails)
        assert doc_details.title, "This test's index doc requires a populated title"
        assert doc_details.year, "This test's index doc requires a populated year"
        # Now, let's make a corrupt PDF body
        body = "".join(t.text for t in docs.texts if t.doc == doc_details)
        # body += "\udcff"  # Inject unpaired low surrogate
        body = "\udcff"  # Keep exception short

        # Confirm we can both index and query this document
        search_index = SearchIndex(
            fields=[*SearchIndex.REQUIRED_FIELDS, "title", "year"]
        )
        index_doc = {
            "title": doc_details.title,
            "year": doc_details.year,
            "file_location": str(tmp_pdf.absolute()),
            "body": body,
        }
        await search_index.add_document(index_doc, document=docs)
        results = await search_index.query("XAI", keep_filenames=True)
        assert {(r[0].id, Path(r[1]).name) for r in results} == {(docs.id, "paper.pdf")}

Running that with this change, I am seeing:

  1. paperqa.utils.hexdigest(body) now succeeds (which makes sense due to errors="replace")
  2. Immediately after, tantivy.Document.from_dict(index_doc) fails with ValueError: Value unsupported '\udcff'

So this PR is an improvement (we get further), but I think one would still get blown up. Can you confirm with this change that you would still be unable to index your ChemRxiv folder, now due to tantivy.Document.from_dict?

And if the index build actually succeeds, can you try to improve this unit test and add it to this PR? Also, please get pre-commit passing and rebase atop the main branch.
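One possible follow-up for the tantivy.Document.from_dict failure (an assumption on my part, not something this PR implements) would be to sanitize the text itself before indexing, so downstream consumers never see the unpaired surrogate:

```python
# Hypothetical helper, not part of this PR: round-trip through UTF-8
# with errors="replace" so unencodable code points become "?" and the
# resulting string is always valid UTF-8.
def sanitize_utf8(text: str) -> str:
    return text.encode("utf-8", errors="replace").decode("utf-8")

print(sanitize_utf8("abc\udcff"))  # prints "abc?"
```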

@jamesbraza jamesbraza force-pushed the main branch 2 times, most recently from 097ccd2 to 27c88de Compare August 26, 2025 21:36
Development

Successfully merging this pull request may close these issues.

Error Parsing PDF
5 participants