Added DOC file support to MarkItDown #1316

Open: dizzydroid wants to merge 3 commits into main

@dizzydroid dizzydroid commented Jul 8, 2025

Summary

Adds support for legacy Microsoft Word DOC files (.doc) to MarkItDown.

Implementation Details

I could not find an off-the-shelf library that converts DOC directly to Markdown, so I went with a two-step approach: convert the .doc file to .docx, then convert the .docx to Markdown with the existing DOCX converter. The one drawback is dependencies: every library I looked at requires an external tool, usually LibreOffice. To keep extra installs to a minimum, the DOC-to-DOCX step is OS-specific: on Linux it uses the LibreOffice CLI, and on Windows it uses Microsoft Word's COM interface.
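
As a rough illustration of the OS-specific step described above, the sketch below picks a backend from the platform name and builds a LibreOffice command line. The names `pick_backend`, `find_soffice`, and `build_soffice_command` are hypothetical and are not taken from this PR's code; the `--headless`, `--convert-to`, and `--outdir` flags are standard LibreOffice command-line options.

```python
import platform
import shutil
from pathlib import Path
from typing import List, Optional


def pick_backend(system: Optional[str] = None) -> str:
    """Return which DOC -> DOCX backend to use for the given OS name.

    Hypothetical helper: the backend labels are illustrative only.
    """
    system = system or platform.system()
    # Windows: drive Microsoft Word through COM (needs pywin32 and Word).
    if system == "Windows":
        return "word_com"
    # Any other OS: fall back to the LibreOffice command-line tool.
    return "libreoffice"


def find_soffice() -> Optional[str]:
    """Locate a LibreOffice executable on PATH, or return None."""
    return shutil.which("soffice") or shutil.which("libreoffice")


def build_soffice_command(executable: str, doc_path: Path, out_dir: Path) -> List[str]:
    """Build the argument list that asks LibreOffice to write a .docx copy."""
    return [
        executable,
        "--headless",            # run without a GUI
        "--convert-to", "docx",  # target format
        "--outdir", str(out_dir),
        str(doc_path),
    ]
```

The command list would typically be passed to `subprocess.run(cmd, check=True)`; keeping the command construction separate from its execution makes the selection logic testable on machines without LibreOffice or Word installed.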

Testing

  • All existing tests pass
  • DocConverter is registered, accepts DOC files, and parses their content correctly
    (tested on Linux and Windows)

Fixes #23, #1220

- Add DocConverter for legacy Microsoft Word DOC files
- Uses pure Python approach with olefile (existing dependency)
- Handles .doc files and application/msword mimetype
- Adds doc optional dependency group in pyproject.toml
- Updates converter registration in main MarkItDown class
- Adds test vector for DOC file conversion
- No external system dependencies required
@dizzydroid (Author)

@microsoft-github-policy-service agree

@BetterAndBetterII (Contributor)

really need it

dizzydroid and others added 2 commits July 15, 2025 00:13
This commit replaces the old implementation with a robust, two-step conversion process that significantly improves reliability and accuracy:

1.  The `_doc_converter` now first converts the input `.doc` file to a `.docx` file using OS-dependent tools:
    - **Windows**: Microsoft Word's COM interface via `pywin32`.
    - **Linux/macOS**: LibreOffice/Soffice command-line interface.

2.  The `_docx_converter` is then used to convert the `.docx` file into Markdown.
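
The two steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the code in this PR: `convert_doc_to_markdown` and its two callable parameters are hypothetical stand-ins for the OS-dependent DOC-to-DOCX step and the existing DOCX-to-Markdown step.

```python
import tempfile
from pathlib import Path
from typing import Callable


def convert_doc_to_markdown(
    doc_path: Path,
    doc_to_docx: Callable[[Path, Path], Path],
    docx_to_markdown: Callable[[Path], str],
) -> str:
    """Chain the two conversion steps through a temporary directory.

    doc_to_docx(src, out_dir) is assumed to write a .docx into out_dir and
    return its path; docx_to_markdown(path) is assumed to return Markdown text.
    Both are hypothetical interfaces chosen for this sketch.
    """
    # The intermediate .docx lives in a temp dir that is removed on exit.
    with tempfile.TemporaryDirectory() as tmp:
        docx_path = doc_to_docx(doc_path, Path(tmp))
        if not docx_path.exists():
            raise RuntimeError(f"DOC to DOCX conversion produced no file for {doc_path}")
        # Step 2 must finish reading before the temp dir is cleaned up.
        return docx_to_markdown(docx_path)
```

Passing the two steps in as callables keeps the pipeline independent of whether Word's COM interface or LibreOffice performs the first conversion.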