TypeError: invalid length: 8 #9

@giomagg

When I run the script on a CSV of roughly 1,600 lines exported from my Zotero library, it fails partway through with the following error:

  File "/opt/conda/lib/python3.12/multiprocessing/pool.py", line 774, in get
    raise self._value
TypeError: invalid length: 8

The error does not occur when I run the script on a CSV of 200 or 300 lines. Any idea what I could do to solve it?

This is the full traceback leading to the error:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.12/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
           ^^^^^^^^^^^^^^^^
  File "/home/onyxia/work/citation_map/analyze_papers.py", line 155, in article_worker
    pdf_result, text, pdf_log = process_pdf(metadata)
                                ^^^^^^^^^^^^^^^^^^^^^
  File "/home/onyxia/work/citation_map/analyze_papers.py", line 114, in process_pdf
    original_page_count, pages = pdf_to_text_list(first_pdf)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/onyxia/work/citation_map/analyze_papers.py", line 34, in pdf_to_text_list
    pages = layout_scanner.get_pages(file_loc, images_folder=None)  # you can try os.path.abspath("output/imgs")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/onyxia/work/citation_map/layout_scanner.py", line 212, in get_pages
    return with_pdf(pdf_doc, _parse_pages, pdf_pwd, *tuple([images_folder]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/onyxia/work/citation_map/layout_scanner.py", line 35, in with_pdf
    result = fn(doc, *args)
             ^^^^^^^^^^^^^^
  File "/home/onyxia/work/citation_map/layout_scanner.py", line 201, in _parse_pages
    for i, page in enumerate(PDFPage.create_pages(doc)):
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pdfminer/pdfpage.py", line 101, in create_pages
    yield klass(document, objid, tree)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pdfminer/pdfpage.py", line 56, in __init__
    self.mediabox = resolve1(self.attrs['MediaBox'])
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pdfminer/pdftypes.py", line 80, in resolve1
    x = x.resolve(default=default)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pdfminer/pdftypes.py", line 67, in resolve
    return self.doc.getobj(self.objid)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pdfminer/pdfdocument.py", line 668, in getobj
    (strmid, index, genno) = xref.get_pos(objid)
                             ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pdfminer/pdfdocument.py", line 277, in get_pos
    f2 = nunpack(ent[self.fl1:self.fl1+self.fl2])
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pdfminer/utils.py", line 183, in nunpack
    raise TypeError('invalid length: %d' % l)
TypeError: invalid length: 8
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/onyxia/work/citation_map/analyze_papers.py", line 241, in <module>
    result = pool.map(list_worker, list(titles_dict.items()), chunksize=5)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/multiprocessing/pool.py", line 774, in get
    raise self._value
TypeError: invalid length: 8
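
The traceback suggests that a single PDF triggers the exception inside pdfminer (`nunpack` appears to reject an 8-byte cross-reference field), and that `pool.map` then re-raises it in the parent process and aborts the whole run. That would also explain why smaller CSVs work: they may simply not include the offending file. One way to narrow this down is to catch the exception per item so that the run completes and the failing entries are reported. The following is only a sketch under assumptions: `safe_worker` and `run_all` are hypothetical names, the body of `list_worker` is a stand-in rather than the real function from `analyze_papers.py`, and the shape of each item (a `(title, pdf_path)` pair) is assumed.

```python
import multiprocessing
import traceback


def list_worker(item):
    """Stand-in for the real worker in analyze_papers.py (hypothetical body)."""
    title, pdf_path = item
    if pdf_path.endswith("bad.pdf"):
        # Simulates the failure seen in the traceback above.
        raise TypeError("invalid length: 8")
    return title.upper()


def safe_worker(item):
    """Run list_worker and return (key, result, error) instead of raising."""
    key = item[0]
    try:
        return key, list_worker(item), None
    except Exception:
        # Keep the formatted traceback so the failing PDF can be inspected later.
        return key, None, traceback.format_exc()


def run_all(items, processes=2):
    """Map safe_worker over items and split successes from failures."""
    with multiprocessing.Pool(processes) as pool:
        outcomes = pool.map(safe_worker, items, chunksize=5)
    done = {key: result for key, result, err in outcomes if err is None}
    failed = {key: err for key, result, err in outcomes if err is not None}
    return done, failed
```

Calling `run_all(list(titles_dict.items()))` from under an `if __name__ == "__main__":` guard would then return the successful results together with a dictionary of the entries that failed, which should point to the PDF (or PDFs) that pdfminer cannot parse.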
