Fix: handle non-UTF-8 input in util.py #1069
Conversation
Chemistry papers such as DOIs 10.26434/chemrxiv-2025-1x058-v2 and 10.26434/chemrxiv-2025-3lwn2 contain files that sometimes throw errors due to non-UTF-8 characters during parsing. This change updates the MD5 hashing line to replace problematic characters during UTF-8 encoding, making the parsing process more robust and preventing interruptions caused by encoding errors.
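For context, this is the failure mode being fixed: a Python `str` containing an unpaired surrogate (as can appear in mis-decoded PDF text) cannot be encoded to UTF-8 under the default `strict` error handler. A minimal sketch, where `\udcff` is just an illustrative bad character:

```python
import hashlib

bad = "paper text \udcff"  # an unpaired low surrogate, illustrative of mis-decoded PDF text

# Default (strict) encoding raises, aborting the parse:
try:
    bad.encode("utf-8")
except UnicodeEncodeError as exc:
    print(f"strict raises: {exc.reason}")

# With errors="replace", the surrogate becomes "?" and hashing succeeds:
digest = hashlib.md5(bad.encode("utf-8", errors="replace")).hexdigest()  # noqa: S324
print(digest)
```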
Pull Request Overview

This PR fixes a Unicode encoding issue in the MD5 hashing function to handle non-UTF-8 characters in chemistry paper files. The change prevents parsing interruptions by replacing problematic characters during UTF-8 encoding instead of raising encoding errors.

- Updated the `hexdigest` function to use error handling during UTF-8 encoding
- Added the `errors="replace"` parameter to gracefully handle non-UTF-8 characters
src/paperqa/utils.py (Outdated)

```diff
@@ -104,7 +104,7 @@ def strings_similarity(s1: str, s2: str, case_insensitive: bool = True) -> float

 def hexdigest(data: str | bytes) -> str:
     if isinstance(data, str):
-        return hashlib.md5(data.encode("utf-8")).hexdigest()  # noqa: S324
+        return hashlib.md5(data.encode("utf-8", errors="replace")).hexdigest()  # noqa: S324
```
Using 'replace' error handling may silently mask data corruption issues. Consider logging when characters are replaced or using 'ignore' if replacement characters could affect hash consistency. This change could cause different hash values for the same logical content depending on encoding issues.
Suggested change:

```diff
-        return hashlib.md5(data.encode("utf-8", errors="replace")).hexdigest()  # noqa: S324
+        return hashlib.md5(data.encode("utf-8", errors="strict")).hexdigest()  # noqa: S324
```
Thanks for the PR, just a quick comment
src/paperqa/utils.py (Outdated)

```diff
@@ -104,7 +104,7 @@ def strings_similarity(s1: str, s2: str, case_insensitive: bool = True) -> float

 def hexdigest(data: str | bytes) -> str:
     if isinstance(data, str):
-        return hashlib.md5(data.encode("utf-8")).hexdigest()  # noqa: S324
+        return hashlib.md5(data.encode("utf-8", errors="replace")).hexdigest()  # noqa: S324
```
Can you add a code comment above this stating why we have `replace` over `strict` or `ignore`?
Yeah sure thing!
Explain choice of 'replace' over 'strict' and 'ignore'
@dmcgrath19 - why would it be necessary to replace the characters? Shouldn't we just ignore? This code only computes the hash - it has no effect on the actual usage of the parsed paper.
@whitead My reasoning for replacing rather than ignoring invalid characters is that the hash serves as a unique identifier. If invalid characters are ignored and dropped silently, different inputs with subtle encoding differences could produce identical hashes, causing issues with deduplication and caching. By replacing invalid characters, we ensure the hash reflects all differences in the input data, preserving uniqueness and making encoding problems visible. TLDR: it's more robust but could be seen as a bit messier.
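To illustrate the trade-off with a hypothetical minimal example: `errors="ignore"` silently drops the bad character, so a corrupted string and its clean counterpart collide on the same hash, while `errors="replace"` keeps them distinct:

```python
import hashlib

s_clean = "a"
s_bad = "a\udcff"  # same text plus an unpaired low surrogate

# errors="ignore" drops the surrogate, so both inputs hash identically:
h_ignore_clean = hashlib.md5(s_clean.encode("utf-8", errors="ignore")).hexdigest()  # noqa: S324
h_ignore_bad = hashlib.md5(s_bad.encode("utf-8", errors="ignore")).hexdigest()  # noqa: S324
assert h_ignore_clean == h_ignore_bad

# errors="replace" substitutes "?", preserving the difference:
h_replace_bad = hashlib.md5(s_bad.encode("utf-8", errors="replace")).hexdigest()  # noqa: S324
assert h_replace_bad != h_ignore_clean
```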
Hi @dmcgrath19 I was trying to repro this issue just so I can make a unit test. This is not hitting the UTF-8 errors, what am I missing?

```python
import pytest

from paperqa import Docs
from paperqa.agents import SearchIndex


@pytest.mark.asyncio
async def test_pdf() -> None:
    docs = Docs()
    docname = await docs.aadd_url(
        # URL from the download PDF button from:
        # - 10.26434/chemrxiv-2025-3lwn2: https://chemrxiv.org/engage/chemrxiv/article-details/687f4677728bf9025ea3067a
        # - 10.26434/chemrxiv-2025-1x058-v2: https://chemrxiv.org/engage/chemrxiv/article-details/6835abf53ba0887c33410d6d
        "..."
    )
    (doc_details,) = (d for d in docs.docs.values() if d.docname == docname)
    index_doc = {
        "title": doc_details.title,
        "file_location": "all-atom",
        "body": "".join(t.text for t in docs.texts if t.doc == doc_details),
    }
    search_index = SearchIndex(fields=[*SearchIndex.REQUIRED_FIELDS, "title"])
    await search_index.add_document(index_doc, document=docs)
```
Thanks for digging into this. In my case, I batch-downloaded some PDFs from ChemRxiv into a local folder and pointed paperqa at that directory. Not all of the files triggered the error. When I inspected the local copies, they didn't appear corrupted or mis-encoded, so my guess is that the issue could be related to how some of the PDFs are encoded by ChemRxiv during batch-downloading.
Ah that makes sense, thanks for clarifying that. Did you actually test this change out yet? I made the below unit test for a bad PDF:

```python
import shutil
import tempfile
from pathlib import Path

import pytest

from paperqa import DocDetails, Docs
from paperqa.agents import SearchIndex


@pytest.mark.asyncio
async def test_search_corrupt_pdf(stub_data_dir: Path) -> None:
    """Test that a slightly flawed PDF can still work."""
    with tempfile.TemporaryDirectory() as td:
        tmp_pdf = Path(td) / "paper.pdf"
        shutil.copy(stub_data_dir / "paper.pdf", tmp_pdf)

        # Create standard Docs object with the PDF we will shortly make corrupt
        docs = Docs()
        docname = await docs.aadd(tmp_pdf)
        (doc_details,) = (d for d in docs.docs.values() if d.docname == docname)
        assert isinstance(doc_details, DocDetails)
        assert doc_details.title, "This test's index doc requires a populated title"
        assert doc_details.year, "This test's index doc requires a populated year"

        # Now, let's make a corrupt PDF body
        body = "".join(t.text for t in docs.texts if t.doc == doc_details)
        # body += "\udcff"  # Inject unpaired low surrogate
        body = "\udcff"  # Keep exception short

        # Confirm we can both index and query this document
        search_index = SearchIndex(
            fields=[*SearchIndex.REQUIRED_FIELDS, "title", "year"]
        )
        index_doc = {
            "title": doc_details.title,
            "year": doc_details.year,
            "file_location": str(tmp_pdf.absolute()),
            "body": body,
        }
        await search_index.add_document(index_doc, document=docs)
        results = await search_index.query("XAI", keep_filenames=True)
        assert {(r[0].id, Path(r[1]).name) for r in results} == {
            (docs.id, "paper.pdf")
        }
```

Running that with this change, I am seeing:

So this PR is an improvement (we get further), but I think one would still get blown up. Can you confirm that with this change you would still be unable to index your ChemRxiv folder, now due to the error above? And if the index build actually succeeds, can you try to improve this unit test and add it to this PR?
Force-pushed from 097ccd2 to 27c88de (Compare)
Closes #1068