Releases · pdfminer/pdfminer.six · GitHub

06 May 16:17

20250506 Latest

Added

Support for extracting images with TIFF predictor (#1058)

Fixed

Correct tightest fitting bounding boxes for rotated content (#1114)
TypeError when passing wrong number of arguments to safe_rgb (#1118)
OverflowError in safe_float when input is too large (#1121)
Saving colour spaces on the graphics stack (#1119)
Remove padding from AES-encrypted strings (#1123)
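
The OverflowError entry above concerns numeric conversion of values that are too large for a float. As a minimal sketch of that kind of defensive conversion (the name `safe_float_sketch` is hypothetical and this is not the library's own code), one could catch all three exceptions that `float()` can raise:

```python
from typing import Optional


def safe_float_sketch(value: object) -> Optional[float]:
    """Convert value to float, returning None if conversion fails.

    Hypothetical helper for illustration only.
    """
    try:
        return float(value)  # type: ignore[arg-type]
    except (TypeError, ValueError, OverflowError):
        # TypeError: unsupported type such as None
        # ValueError: unparseable string such as "abc"
        # OverflowError: int too large to be represented as a float
        return None
```

Returning `None` lets the caller decide on a fallback instead of aborting the whole parse.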

16 Apr 09:43

20250416

Fixed

TypeError when parsing font width with indirect object references (#1098)
ValueError when loading xref with invalid position or generation numbers that cannot be parsed as int (#1099)
Safely converting PDF stack objects to float or int in PDFInterpreter (#1100)
TypeError when parsing font bbox with incorrect values (#1103)
ValueError on incorrect stream lengths for ASCII85 data (#1112)
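
To illustrate the xref entry above: a classic cross-reference table entry holds a byte offset, a generation number and an in-use flag (`n` or `f`). The sketch below shows one way to parse such an entry without raising on corrupt input. The function name and return shape are assumptions made for this example, not pdfminer.six's API.

```python
from typing import Optional, Tuple


def parse_xref_entry_sketch(line: bytes) -> Optional[Tuple[int, int, bool]]:
    """Parse b'0000000017 00000 n' into (position, generation, in_use).

    Hypothetical helper; returns None instead of raising on corrupt entries.
    """
    parts = line.strip().split()
    if len(parts) != 3:
        return None
    pos_raw, gen_raw, flag = parts
    if flag not in (b"n", b"f"):
        return None
    try:
        pos = int(pos_raw)
        gen = int(gen_raw)
    except ValueError:
        # Position or generation number is not a valid integer.
        return None
    if pos < 0 or gen < 0:
        return None
    return pos, gen, flag == b"n"
```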

27 Mar 07:52

20250327

Added

Support for Python 3.13 (#1092)

Changed

Reduce memory overhead on runlength encoding by using lists (#1055)
Using pyproject.toml instead of setup.py (#1028)
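
As a rough illustration of the run-length change above, the sketch below decodes RunLengthDecode data (as defined in the PDF specification) by collecting chunks in a list and joining once at the end, rather than repeatedly concatenating bytes. The function name is hypothetical and the code is not taken from the library.

```python
def rldecode_sketch(data: bytes) -> bytes:
    """Decode PDF RunLengthDecode data, accumulating chunks in a list."""
    chunks = []
    i = 0
    while i < len(data):
        length = data[i]
        if length == 128:
            # 128 marks end of data.
            break
        if length < 128:
            # Copy the next length + 1 bytes literally.
            chunks.append(data[i + 1 : i + 2 + length])
            i += 2 + length
        else:
            # Repeat the next byte 257 - length times.
            chunks.append(data[i + 1 : i + 2] * (257 - length))
            i += 2
    # A single join avoids creating a new bytes object per chunk.
    return b"".join(chunks)
```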

Fixed

TypeError when CID character widths are not parseable as floats (#1001)
TypeError raised by extract_text method with compressed PDF file (#1029)
PSBaseParser can't handle tokens split across end of buffer (#1030)
TypeError when CropBox is an indirect object reference (#1004)
Remove redundant line to be able to recognize rectangles (#1066)
Support indirect objects for filters (#1062)
Make sure bytes is bytes where it counts (#1069)

Removed

Support for Python 3.8 (#1091)

24 Mar 07:31

20250324

Changed

Using absolute instead of relative imports (#995)

Deprecated

The third argument (generation number) to PDFObjRef (#972)

Fixed

TypeError when corrupt PDF object reference cannot be parsed as int (#972)
TypeError when corrupt PDF literal cannot be converted to str (#978)
ValueError when corrupt PDF specifies a negative xref location (#980)
ValueError when corrupt PDF specifies an invalid mediabox (#987)
RecursionError when corrupt PDF specifies a recursive /Pages object (#998)
TypeError when corrupt PDF specifies text-positioning operators with invalid values (#1000)
Inline image parsing fails when stream data contains "EI\n" (#1008)
TypeError when parsing object reference as mediabox (#1082)
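
The RecursionError entry above arises when a /Pages node lists itself, or one of its ancestors, among its kids. One common way to guard against this, sketched here under the assumption that the page tree is a plain dict keyed by object id (a stand-in for real PDF objects), is an iterative walk with a visited set:

```python
from typing import Dict, Iterator


def iter_pages_sketch(objects: Dict[int, dict], root_id: int) -> Iterator[int]:
    """Yield ids of Page objects in document order, skipping cycles.

    `objects` is a hypothetical mapping of object id to a dict with
    "Type" and optional "Kids" keys.
    """
    visited = set()
    stack = [root_id]
    while stack:
        obj_id = stack.pop()
        if obj_id in visited:
            # Already seen: a corrupt file refers back to an earlier node.
            continue
        visited.add(obj_id)
        node = objects.get(obj_id)
        if node is None:
            continue
        if node.get("Type") == "Page":
            yield obj_id
        else:
            # Push kids in reverse so they are popped in document order.
            stack.extend(reversed(node.get("Kids", [])))
```

Using an explicit stack also keeps deeply nested but valid page trees from hitting Python's recursion limit.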

Removed

Deprecated tools, functions and classes (#974)

06 Jul 13:48

20240706

Added

Support for zipped JPEGs (#938)
Fuzzing harnesses for integration into Google's OSS-Fuzz (#949)
Support for setuptools-git-versioning version 2.0.0 (#957)

Fixed

Resolving mediabox and pdffont (#834)
Keywords that aren't terminated by the pattern END_KEYWORD before end-of-stream are parsed (#885)
ValueError wrong error message when specifying codec for text output (#902)
Resolve stream filter parameters (#906)
Reading cmaps with whitespace in the name (#935)
Optimize apply_png_predictor by using lists (#912)

Changed

Updated Python 3.7 syntax to 3.8 (#956)
Updated all Python version specifications to a minimum of 3.8 (#969)

28 Dec 21:25

20231228

Added

Output converter for the hOCR format (#651)
Font name aliases for Arial, Courier New and Times New Roman (#790)
Documentation on why special characters can sometimes not be extracted (#829)
Storing Bezier path and dashing style of line in LTCurve (#801)

Fixed

Broken CI/CD pipeline by setting upper version limit for black, mypy, pip and setuptools (#921)
flake8 failures (#921)
ValueError when bmp images with 1 bit channel are decoded (#773)
ValueError when trying to decrypt empty metadata values (#766)
Sphinx errors during building of documentation (#760)
TypeError when getting default width of font (#720)
Installing typing-extensions on Python 3.6 and 3.7 (#775)
TypeError in cmapdb.py when parsing null characters (#768)
Color "convenience operators" now (per spec) also set color space (#794)
ValueError when extracting images, due to breaking changes in Pillow (#827)
Small typos and issues in the documentation (#828)
Ignore non-Unicode cmaps in TrueType fonts (#806)
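
On the colour "convenience operators" entry above: in the PDF specification, the fill operators `g`, `rg` and `k` set both the colour and the colour space (DeviceGray, DeviceRGB and DeviceCMYK respectively). The sketch below models that behaviour with a hypothetical `GraphicsStateSketch` class; it is not the library's graphics state.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Operator -> (colour space it selects, number of operands it takes).
_FILL_OPERATORS = {
    "g": ("DeviceGray", 1),
    "rg": ("DeviceRGB", 3),
    "k": ("DeviceCMYK", 4),
}


@dataclass
class GraphicsStateSketch:
    """Hypothetical, minimal stand-in for a PDF graphics state."""

    fill_space: str = "DeviceGray"
    fill_color: Tuple[float, ...] = (0.0,)


def apply_fill_operator(
    state: GraphicsStateSketch, op: str, operands: Sequence[float]
) -> None:
    """Apply a fill convenience operator, updating colour AND colour space."""
    space, count = _FILL_OPERATORS[op]
    if len(operands) != count:
        raise ValueError(f"{op} expects {count} operands, got {len(operands)}")
    state.fill_space = space
    state.fill_color = tuple(float(v) for v in operands)
```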

Changed

Using non-hardcoded version string and setuptools-git-versioning to enable installation from source and building on Python 3.12 (#922)

Deprecated

Usage of if __name__ == "__main__" where it was only intended for testing purposes (#756)

Removed

Support for Python 3.6 and 3.7 because they are end-of-life (#923)

05 Nov 16:33

20221105

Added

Output converter for the hOCR format (#651)
Font name aliases for Arial, Courier New and Times New Roman (#790)
Documentation on why special characters can sometimes not be extracted (#829)

Fixed

ValueError when bmp images with 1 bit channel are decoded (#773)
ValueError when trying to decrypt empty metadata values (#766)
Sphinx errors during building of documentation (#760)
TypeError when getting default width of font (#720)
Installing typing-extensions on Python 3.6 and 3.7 (#775)
TypeError in cmapdb.py when parsing null characters (#768)
Color "convenience operators" now (per spec) also set color space (#794)
ValueError when extracting images, due to breaking changes in Pillow (#827)
Small typos and issues in the documentation (#828)

Deprecated

Usage of if __name__ == "__main__" where it was only intended for testing purposes (#756)

24 May 17:44

20220524

Fixed

Ignoring (invalid) path constructors that do not begin with m (#749)
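
For context on the entry above: a PDF path is expected to start with a moveto (`m`) operation. A minimal sketch of filtering out paths that do not, assuming each path is represented as a list of tuples whose first element is the operator name (a representation chosen for this example only):

```python
from typing import List, Sequence, Tuple

PathSketch = List[Tuple]  # e.g. [("m", 0, 0), ("l", 10, 0), ("h",)]


def is_valid_path(path: Sequence[Tuple]) -> bool:
    """Return True if the path is non-empty and starts with a moveto."""
    return bool(path) and path[0][0] == "m"


def drop_invalid_paths(paths: Sequence[PathSketch]) -> List[PathSketch]:
    """Keep only paths that begin with an "m" operation."""
    return [p for p in paths if is_valid_path(p)]
```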

Changed

Removed upper version bounds (#755)

06 May 20:04

20220506

Fixed

IndexError when handling invalid bfrange code map in CMap (#731)
TypeError in lzw.py when self.table is not set (#732)
TypeError in encodingdb.py when name of unicode is not str (#733)
TypeError in HTMLConverter when using a bytes fontname (#734)

Added

Exporting images without any specific encoding (#737)

Changed

Using charset-normalizer instead of chardet for less restrictive license (#744)

19 Mar 20:13

20220319

Added

Export type annotations from PyPI package per PEP 561 (#679)
Support for identity cmaps (#626)
Add support for PDF page labels (#680)
Installation of Pillow as an optional extra dependency (#714)

Fixed

Handle decompression error due to CRC checksum error (#637)
Regression (since 20191107) in LTLayoutContainer.group_textboxes that returned some text lines out of order (#659)
Add handling of JPXDecode filter to enable extraction of images for some pdfs (#645)
Fix extraction of jbig2 files, which was producing invalid files (#652)
Crash in pdf2txt.py --boxes-flow=disabled (#682)
Only use xref fallback if PDFNoValidXRef is raised and fallback is True (#684)
Ignore empty characters when analyzing layout (#499)

Changed

Replace warnings.warn with logging.Logger.warning in line with recommended use (#673)
Switched from nose to pytest, from tox to nox and from Travis CI to GitHub Actions (#704)
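
The switch from `warnings.warn` to `logging.Logger.warning` noted above follows the usual guidance that libraries should report recoverable runtime problems through a module-level logger, leaving it to the application to configure handlers. A generic sketch of the pattern (the helper `int_or_default` is hypothetical and unrelated to the library's code):

```python
import logging

# Module-level logger; the application decides where messages go.
logger = logging.getLogger(__name__)


def int_or_default(raw: object, default: int = 0) -> int:
    """Parse raw as int, logging a warning and falling back on failure."""
    try:
        return int(raw)  # type: ignore[call-overload]
    except (TypeError, ValueError):
        # Lazy %-style arguments: the message is only formatted if emitted.
        logger.warning("Could not parse %r as int; using %r", raw, default)
        return default
```

An application can then silence or redirect these messages, for example with `logging.getLogger("pdfminer").setLevel(logging.ERROR)`.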

Removed

Unnecessary return statements without argument at the end of functions (#707)
