AlcheMark.

Your files ready for Gen AI ✨🚀

AlcheMark is a lightweight, alchemy-inspired PDF-to-Markdown toolkit. It transmutes PDF documents into structured Markdown pages, complete with rich metadata and Markdown element annotations, so you can work through a document page by page.

Installation

# Install from PyPI
pip install alchemark-ai

# Or install from source
git clone https://github.com/matthsena/AlcheMark-ai.git
cd AlcheMark-ai
pip install -e .

Usage

from alchemark_ai import pdf2md

# Convert PDF to markdown
# pdf_file_path: Path to the PDF document to be converted
# process_images: When True, extracts images from the PDF (default: False)
# keep_images_inline: When True, keeps base64 images inline in markdown; when False, 
#                     replaces with references to image hashes (default: False)
results = pdf2md("path/to/document.pdf", process_images=True, keep_images_inline=True)

# Each result is a FormattedResult object with the structure:
# {
#   "metadata": {
#     "file_path": str,       # Path to the PDF file
#     "page": int,            # Page number
#     "page_count": int,      # Total number of pages
#     "text_length": int,     # Length of the extracted text
#     "processed_timestamp": float  # Processing timestamp
#   },
#   "elements": {
#     "tables": List[Table],  # Tables extracted from the page
#     "images": List[Image],  # Images extracted from the page (with optional base64 data)
#     "titles": List[str],    # Titles/headers detected
#     "lists": List[str],     # List items detected
#     "links": List[Link]     # Links with text and URL
#   },
#   "text": str,              # Markdown text content
#   "tokens": int,            # Number of tokens in the text
#   "language": str           # Detected language
# }

# Access the markdown text of the first page
markdown_text = results[0].text

# Get metadata for the first page
page_number = results[0].metadata.page
total_pages = results[0].metadata.page_count

# Check elements detected in the page
tables_count = len(results[0].elements.tables)
images_count = len(results[0].elements.images)
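
# The per-page access pattern above extends naturally to a whole document.
# The sketch below is illustrative only: summarize_pages is a hypothetical
# helper (not part of alchemark_ai), and demo_page is a stand-in object shaped
# like the documented FormattedResult so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def summarize_pages(results):
    """Return one small dict per page, using only the documented fields."""
    return [
        {
            "page": r.metadata.page,
            "tokens": r.tokens,
            "tables": len(r.elements.tables),
            "images": len(r.elements.images),
        }
        for r in results
    ]


# Stand-in for one pdf2md() result (assumption: same attribute layout).
demo_page = SimpleNamespace(
    metadata=SimpleNamespace(page=1, page_count=1),
    elements=SimpleNamespace(tables=[], images=[]),
    text="# Title",
    tokens=3,
)

summary = summarize_pages([demo_page])
print(summary)
```

# With real output, pass the list returned by pdf2md(...) in place of [demo_page].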

Google Colab Example

Try AlcheMark AI directly in your browser with our interactive Google Colab notebook!

Overview

AlcheMark AI converts PDF documents into well-structured Markdown. Beyond extracting the text, it analyzes and catalogs elements such as tables, images, headings, lists, and links, and it tracks a token count for each page so you can plan LLM context usage.
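
One way the per-page token counts might be used is to pack consecutive pages into groups that fit a context budget. The following is a minimal sketch, not part of the library: group_pages_by_token_budget is a hypothetical helper, the 1000-token budget is arbitrary, and the demo pages are stand-ins that only carry the documented metadata.page and tokens attributes.

```python
from types import SimpleNamespace


def group_pages_by_token_budget(results, max_tokens):
    """Greedily pack consecutive pages into groups of at most max_tokens.

    A single page that exceeds the budget ends up in a group of its own.
    """
    groups, current, used = [], [], 0
    for r in results:
        # Start a new group when adding this page would exceed the budget.
        if current and used + r.tokens > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(r)
        used += r.tokens
    if current:
        groups.append(current)
    return groups


# Stand-ins for pdf2md() results (assumption: same attribute layout).
demo_pages = [
    SimpleNamespace(metadata=SimpleNamespace(page=i + 1), tokens=t)
    for i, t in enumerate([400, 500, 300, 1200, 100])
]

groups = group_pages_by_token_budget(demo_pages, max_tokens=1000)
page_groups = [[r.metadata.page for r in g] for g in groups]
print(page_groups)
```

Greedy packing keeps pages in reading order, which is usually what you want when the groups are fed to a model sequentially.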

Key Features

PDF to Markdown Conversion: Transform PDF documents into clean, organized Markdown
Rich Metadata Extraction: Preserve document metadata including title, author, creation date
Element Analysis: Automatic detection and counting of markdown elements (headings, lists, links)
Table & Image Support: Extract and format tables and images from PDFs
Inline Image Handling: Option to keep images inline as base64 or replace with image references
Token Counting: Built-in token counting using tiktoken for LLM integration
Structured Output: Get page-by-page results with detailed metadata

Extracted Data Fields

Field	Type	Description
metadata.file_path	`str`	Path to the original PDF file
metadata.page	`int`	Current page number
metadata.page_count	`int`	Total number of pages in the document
metadata.text_length	`int`	Character count of the extracted text
metadata.processed_timestamp	`float`	Unix timestamp when the page was processed
elements.tables	`List[Table]`	Tables extracted from the page with their structure preserved
elements.images	`List[Image]`	Images extracted from the page with their metadata, including optional base64 content and hash
elements.titles	`List[str]`	Headings and titles detected in the page
elements.lists	`List[str]`	List items (ordered and unordered) found in the page
elements.links	`List[Link]`	Hyperlinks with their display text and target URLs
text	`str`	The complete markdown text content of the page
tokens	`int`	Token count for the page (useful for LLM context planning)
language	`str`	Detected language of the page content
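
As an illustration of how these fields combine, the sketch below builds a simple document outline from elements.titles and metadata.page. build_outline is a hypothetical helper rather than a library function, and the demo pages are stand-ins shaped like the documented result objects; the exact text stored in titles (for example, whether it keeps leading # markers) is not specified here, so titles are passed through unchanged.

```python
from types import SimpleNamespace


def build_outline(results):
    """Return (page_number, title) pairs in page order."""
    return [
        (r.metadata.page, title)
        for r in results
        for title in r.elements.titles
    ]


# Stand-ins for pdf2md() results (assumption: same attribute layout).
demo_pages = [
    SimpleNamespace(
        metadata=SimpleNamespace(page=1),
        elements=SimpleNamespace(titles=["Introduction"]),
    ),
    SimpleNamespace(
        metadata=SimpleNamespace(page=2),
        elements=SimpleNamespace(titles=["Methods", "Results"]),
    ),
]

outline = build_outline(demo_pages)
for page, title in outline:
    print(f"p.{page}: {title}")
```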

Configuration Options

Option	Default	Description
process_images	`False`	Enable extraction and processing of images from the PDF
keep_images_inline	`False`	Keep images inline as base64 in the markdown text. When set to `False`, images are replaced with references (`[IMAGE](hash)`)
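
When keep_images_inline is False, the page text contains references of the form [IMAGE](hash). A small sketch of locating those references follows. It assumes the hash contains no whitespace or closing parenthesis, which is not confirmed by this README, and find_image_hashes is a hypothetical helper, not part of alchemark_ai.

```python
import re

# Matches the documented reference form [IMAGE](hash) and captures the hash.
# Assumption: the hash has no whitespace and no ")" character.
IMAGE_REF = re.compile(r"\[IMAGE\]\(([^)\s]+)\)")


def find_image_hashes(markdown_text):
    """Return the hashes referenced in a page's markdown, in order."""
    return IMAGE_REF.findall(markdown_text)


sample = "Intro\n\n[IMAGE](abc123)\n\nSee [docs](https://example.com) and [IMAGE](def456)."
hashes = find_image_hashes(sample)
print(hashes)
```

The collected hashes could then be matched against the entries in elements.images, which the field table above describes as carrying a hash.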

Test Coverage

AlcheMark AI maintains high test coverage to ensure reliability:

Name                                     Stmts   Miss  Cover
------------------------------------------------------------
alchemark_ai/configs/logger.py               2      0   100%
alchemark_ai/formatter/formatter_md.py      80      2    98%
alchemark_ai/models/FormattedResult.py      25      0   100%
alchemark_ai/models/PDFResult.py            56      0   100%
alchemark_ai/pdf2md/pdf2md.py               30      0   100%
------------------------------------------------------------
TOTAL                                      193      2    99%

The current test suite includes 37 tests covering all major functionality, with overall coverage of 99%.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
