
Conversation

@dmcgrath19 commented Aug 20, 2025

Chemistry papers such as DOIs 10.26434/chemrxiv-2025-1x058-v2 and 10.26434/chemrxiv-2025-3lwn2 contain files whose parsing sometimes raises errors due to non-UTF-8 characters. This change updates the MD5 hashing line to replace problematic characters during UTF-8 encoding, making the parsing process more robust and preventing interruptions caused by encoding errors.

Closes #1068
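The failure mode can be reproduced with a plain string containing an unpaired surrogate (a hypothetical minimal example, not taken from the affected PDFs):

```python
# Minimal sketch of the encoding failure this PR works around.
# "\udcff" is an unpaired low surrogate, which cannot be encoded as UTF-8.
bad = "abc\udcff"

try:
    bad.encode("utf-8")  # the default errors="strict" raises
except UnicodeEncodeError as exc:
    print("strict encoding failed:", exc.reason)

# errors="replace" substitutes "?" for unencodable code points,
# so downstream hashing can proceed instead of crashing.
print(bad.encode("utf-8", errors="replace"))  # b'abc?'
```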

jamesbraza and others added 18 commits August 5, 2025 22:31
@Copilot Copilot AI review requested due to automatic review settings August 20, 2025 16:20
@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. bug Something isn't working labels Aug 20, 2025
@Copilot Copilot AI left a comment (Contributor)

Pull Request Overview

This PR fixes a Unicode encoding issue in the MD5 hashing function to handle non-UTF-8 characters in chemistry paper files. The change prevents parsing interruptions by replacing problematic characters during UTF-8 encoding instead of raising encoding errors.

  • Updated the hexdigest function to use error handling during UTF-8 encoding
  • Added errors="replace" parameter to gracefully handle non-UTF-8 characters


@@ -104,7 +104,7 @@ def strings_similarity(s1: str, s2: str, case_insensitive: bool = True) -> float

def hexdigest(data: str | bytes) -> str:
    if isinstance(data, str):
-        return hashlib.md5(data.encode("utf-8")).hexdigest()  # noqa: S324
+        return hashlib.md5(data.encode("utf-8", errors="replace")).hexdigest()  # noqa: S324
Copilot AI Aug 20, 2025

Using 'replace' error handling may silently mask data corruption issues. Consider logging when characters are replaced or using 'ignore' if replacement characters could affect hash consistency. This change could cause different hash values for the same logical content depending on encoding issues.

Suggested change
-    return hashlib.md5(data.encode("utf-8", errors="replace")).hexdigest()  # noqa: S324
+    return hashlib.md5(data.encode("utf-8", errors="strict")).hexdigest()  # noqa: S324


@dmcgrath19 dmcgrath19 mentioned this pull request Aug 20, 2025
@jamesbraza (Collaborator) left a comment

Thanks for the PR, just a quick comment

@@ -104,7 +104,7 @@ def strings_similarity(s1: str, s2: str, case_insensitive: bool = True) -> float

def hexdigest(data: str | bytes) -> str:
    if isinstance(data, str):
-        return hashlib.md5(data.encode("utf-8")).hexdigest()  # noqa: S324
+        return hashlib.md5(data.encode("utf-8", errors="replace")).hexdigest()  # noqa: S324
@jamesbraza (Collaborator)

Can you add a code comment above this stating why we have replace over strict or ignore?

@dmcgrath19 (Author)

Yeah sure thing!

Explain choice of 'replace' over 'strict' and 'ignore'
@whitead commented Aug 20, 2025

@dmcgrath19 - why would it be necessary to replace the characters? Shouldn't we just ignore? This code only computes the hash - it has no effect on the actual usage of the parsed paper.

@dmcgrath19 (Author)

@whitead My reasoning for replacing rather than ignoring invalid characters is that the hash serves as a unique identifier. If invalid characters are ignored and dropped silently, different inputs with subtle encoding differences could produce identical hashes, causing issues with deduplication and caching. By replacing invalid characters, we ensure the hash reflects all differences in the input data, preserving uniqueness and making encoding problems visible.

TLDR: it's more robust but could be seen as a bit messier.
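The deduplication argument can be illustrated with a hypothetical pair of inputs, identical except that one carries an unencodable surrogate:

```python
import hashlib

# Hypothetical inputs: identical except for one unencodable surrogate.
clean = "paper body"
dirty = "paper body\udcff"

def h(s: str, errors: str) -> str:
    return hashlib.md5(s.encode("utf-8", errors=errors)).hexdigest()  # noqa: S324

# With errors="ignore", the invalid character is silently dropped,
# so the two distinct inputs collide to the same hash:
print(h(clean, "ignore") == h(dirty, "ignore"))  # True

# With errors="replace", the surrogate becomes "?", so the hashes
# differ and the encoding problem stays visible in the identifier:
print(h(clean, "replace") == h(dirty, "replace"))  # False
```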

@jamesbraza commented Aug 21, 2025

Hi @dmcgrath19, I was trying to repro this issue just so I can make a unit test. The below is not hitting the UTF-8 errors; what am I missing?

import pytest

from paperqa import Docs
from paperqa.agents import SearchIndex


@pytest.mark.asyncio
async def test_pdf() -> None:
    docs = Docs()
    docname = await docs.aadd_url(
        # URL from the download PDF button from:
        # - 10.26434/chemrxiv-2025-3lwn2: https://chemrxiv.org/engage/chemrxiv/article-details/687f4677728bf9025ea3067a
        # - 10.26434/chemrxiv-2025-1x058-v2: https://chemrxiv.org/engage/chemrxiv/article-details/6835abf53ba0887c33410d6d
        "..."
    )
    (doc_details,) = (d for d in docs.docs.values() if d.docname == docname)
    index_doc = {
        "title": doc_details.title,
        "file_location": "all-atom",
        "body": "".join(t.text for t in docs.texts if t.doc == doc_details),
    }

    search_index = SearchIndex(fields=[*SearchIndex.REQUIRED_FIELDS, "title"])
    await search_index.add_document(index_doc, document=docs)

@dmcgrath19 (Author)

Thanks for digging into this.

In my case, I batch-downloaded some PDFs from ChemRxiv into a local folder and pointed paperqa at that directory. Not all of the files triggered the error. When I inspected the local copies, they didn’t appear corrupted or mis-encoded, so my guess is that the issue could be related to how some of the PDFs are encoded by ChemRxiv during batch-downloading.

@jamesbraza commented Aug 25, 2025

> In my case, I batch-downloaded some PDFs from ChemRxiv into a local folder and pointed paperqa at that directory. Not all of the files triggered the error. When I inspected the local copies, they didn’t appear corrupted or mis-encoded, so my guess is that the issue could be related to how some of the PDFs are encoded by ChemRxiv during batch-downloading.

Ah that makes sense, thanks for clarifying that.


Did you actually test this change out yet? I made the below unit test for a bad PDF in test_agents.py:

import shutil
import tempfile
from pathlib import Path

import pytest

from paperqa import DocDetails, Docs
from paperqa.agents import SearchIndex


@pytest.mark.asyncio
async def test_search_corrupt_pdf(stub_data_dir: Path) -> None:
    """Test that a slightly flawed PDF can still work."""
    with tempfile.TemporaryDirectory() as td:
        tmp_pdf = Path(td) / "paper.pdf"
        shutil.copy(stub_data_dir / "paper.pdf", tmp_pdf)

        # Create standard Docs object with the PDF we will shortly make corrupt
        docs = Docs()
        docname = await docs.aadd(tmp_pdf)
        (doc_details,) = (d for d in docs.docs.values() if d.docname == docname)
        assert isinstance(doc_details, DocDetails)
        assert doc_details.title, "This test's index doc requires a populated title"
        assert doc_details.year, "This test's index doc requires a populated year"
        # Now, let's make a corrupt PDF body
        body = "".join(t.text for t in docs.texts if t.doc == doc_details)
        # body += "\udcff"  # Inject unpaired low surrogate
        body = "\udcff"  # Keep exception short

        # Confirm we can both index and query this document
        search_index = SearchIndex(
            fields=[*SearchIndex.REQUIRED_FIELDS, "title", "year"]
        )
        index_doc = {
            "title": doc_details.title,
            "year": doc_details.year,
            "file_location": str(tmp_pdf.absolute()),
            "body": body,
        }
        await search_index.add_document(index_doc, document=docs)
        results = await search_index.query("XAI", keep_filenames=True)
        assert {(r[0].id, Path(r[1]).name) for r in results} == {(docs.id, "paper.pdf")}

Running that with this change, I am seeing:

  1. paperqa.utils.hexdigest(body) now succeeds (which makes sense due to errors="replace")
  2. Immediately after, tantivy.Document.from_dict(index_doc) fails with ValueError: Value unsupported '\udcff'

So this PR is an improvement (we get further), but I think one would still get blown up. Can you confirm with this change that you would still be unable to index your ChemRxiv folder, now due to tantivy.Document.from_dict?

And if the index build actually succeeds, can you try to improve this unit test and add it to this PR? Also, please get pre-commit passing and rebase atop the main branch.
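One possible follow-up for the tantivy.Document.from_dict failure (an assumption on my part, not something this PR implements) would be to sanitize the text itself before indexing, so downstream consumers never see the unpaired surrogate:

```python
# Hypothetical helper, not part of this PR: round-trip through UTF-8
# with errors="replace" so unencodable code points become "?" and the
# resulting string is always valid UTF-8.
def sanitize_utf8(text: str) -> str:
    return text.encode("utf-8", errors="replace").decode("utf-8")

print(sanitize_utf8("abc\udcff"))  # prints "abc?"
```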

@jamesbraza jamesbraza force-pushed the main branch 2 times, most recently from 097ccd2 to 27c88de Compare August 26, 2025 21:36
Development

Successfully merging this pull request may close these issues.

Error Parsing PDF
5 participants