markitdown-pro

markitdown-pro is an open-source Python library that converts diverse document formats into clean Markdown text. It enhances Microsoft's MarkItDown with Azure Document Intelligence, Unstructured.io support, and OCR capabilities, providing reliable conversion for 30+ file types.

Published by Kinetic Solutions Group LLC under the MIT License.

Installation

pip install markitdown-pro

Or with uv:

uv add markitdown-pro

Requires Python >= 3.12.2.

Conversion Pipeline

markitdown-pro uses a cascading approach, trying each conversion method in sequence until sufficient content is extracted:

MarkItDown — Microsoft's base converter handles common formats (DOCX, PPTX, XLSX, HTML, CSV, and more) natively with no external API calls required.
Unstructured.io — Advanced document parsing for complex layouts, tables, and multi-column documents. Handles PDFs with embedded tables and forms that MarkItDown may struggle with.
Azure Document Intelligence — OCR and structured extraction for scanned documents, images, and PDFs that contain non-selectable text. Extracts tables, key-value pairs, and layout information.
GPT-4o-mini — Vision-based OCR as the final fallback. Sends page images to GPT-4o-mini for text extraction. Handles edge cases like handwritten notes, unusual fonts, and degraded scans that other methods miss.

If a method produces no output or insufficient content, the pipeline automatically falls through to the next method. The cascade lets clean digital files convert in the first step, with no external API calls, while scanned or degraded files still reach the OCR-based methods.
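The fall-through behavior described above can be illustrated with a minimal sketch. It is not the library's implementation: `run_cascade`, `CascadeResult`, the stand-in converters, and the `MIN_CHARS` threshold are all hypothetical, and the source does not state how "sufficient content" is actually measured.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical threshold; the real criterion for "sufficient" is not documented here.
MIN_CHARS = 20


@dataclass
class CascadeResult:
    markdown: str
    method: Optional[str]  # name of the converter that succeeded, or None


def run_cascade(
    path: str,
    converters: List[Tuple[str, Callable[[str], str]]],
    min_chars: int = MIN_CHARS,
) -> CascadeResult:
    """Try each (name, converter) pair in order and return the first sufficient output."""
    for name, convert in converters:
        try:
            text = convert(path)
        except Exception:
            # A failing method must not abort the cascade; move on to the next one.
            continue
        if text and len(text.strip()) >= min_chars:
            return CascadeResult(markdown=text, method=name)
    # Every method failed or produced too little text.
    return CascadeResult(markdown="", method=None)


# Stand-in converters, for illustration only.
def empty_converter(path: str) -> str:
    return ""


def failing_converter(path: str) -> str:
    raise RuntimeError("service unavailable")


def working_converter(path: str) -> str:
    return "# Title\n\nRecovered body text from " + path


result = run_cascade(
    "scan.pdf",
    [
        ("markitdown", empty_converter),
        ("unstructured", failing_converter),
        ("document_intelligence", working_converter),
    ],
)
print(result.method)  # document_intelligence
```

The first converter returns nothing and the second raises, so the result comes from the third. Catching exceptions per method is what keeps one failing service from stopping the whole conversion.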

Supported File Types

Category	Formats	Notes
Documents	PDF, DOCX, PPTX, XLSX, ODT, EPUB, RTF, HTML	Office formats use MarkItDown; scanned PDFs fall through to OCR
Images	PNG, JPG, JPEG, GIF, BMP, WEBP, HEIC, TIFF, SVG	Uses Azure Document Intelligence for OCR extraction
Audio	MP3, WAV, OGG, FLAC, M4A, AAC, WMA, WEBM, OPUS	Transcription via Azure Speech Services + OpenAI Whisper
Email	EML, PST	Full email parsing with attachment extraction and recursive processing
Archives	ZIP	Recursively extracts and converts all supported files within the archive
Text	CSV, JSON, XML, TXT, MD, RST, TSV	Direct pass-through with formatting normalization

How RAG DB Uses It

The index processor calls markitdown_pro.ConversionPipeline as step 4 of its processing pipeline. The conversion pipeline selects its handlers based on the file type:

Standard documents (PDF, DOCX, PPTX, etc.) go through the full cascade: MarkItDown first, then Unstructured.io, then Azure Document Intelligence if earlier methods produce insufficient content.
Images (PNG, JPG, TIFF, etc.) are routed directly to Azure Document Intelligence for OCR text extraction. If Document Intelligence returns no text, GPT-4o-mini vision is used as a fallback.
Audio files (MP3, WAV, OGG, etc.) are transcribed using Azure Speech Services combined with OpenAI Whisper. The transcription output is formatted as Markdown text for chunking and embedding.
Archives (ZIP) are extracted recursively, and each contained file is processed through the appropriate conversion handler.
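The routing in the list above can be sketched as a lookup from file extension to handler family. The extension sets are taken from the Supported File Types table. The handler names (`cascade`, `image_ocr`, `transcription`, `archive`) and the `unknown` fallback are hypothetical labels, not identifiers from markitdown-pro.

```python
from pathlib import Path

# Extension groups copied from the Supported File Types table.
DOCUMENTS = {"pdf", "docx", "pptx", "xlsx", "odt", "epub", "rtf", "html"}
IMAGES = {"png", "jpg", "jpeg", "gif", "bmp", "webp", "heic", "tiff", "svg"}
AUDIO = {"mp3", "wav", "ogg", "flac", "m4a", "aac", "wma", "webm", "opus"}
ARCHIVES = {"zip"}


def route(path: str) -> str:
    """Return a hypothetical handler label for the given file path."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext in DOCUMENTS:
        return "cascade"        # MarkItDown, then Unstructured.io, then Document Intelligence
    if ext in IMAGES:
        return "image_ocr"      # Document Intelligence, then GPT-4o-mini vision
    if ext in AUDIO:
        return "transcription"  # Azure Speech Services combined with Whisper
    if ext in ARCHIVES:
        return "archive"        # extract, then route each contained file
    return "unknown"            # assumption: the source does not describe this case


for name in ["Report.PDF", "photo.tiff", "talk.mp3", "bundle.zip"]:
    print(name, "->", route(name))
```

Matching on the lowercased suffix keeps the lookup case-insensitive. A real implementation might also inspect file contents, since extensions can be wrong.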

The resulting Markdown is then passed to the chunking step, where it is split using the appropriate LangChain text splitter for the content type.

CLI Usage

Convert a single file from the command line:

python main.py /path/to/document.pdf

Convert with a specific output directory:

python main.py /path/to/document.pdf --output /path/to/output/

Process all files in a directory:

python main.py /path/to/documents/ --recursive

Programmatic Usage

Basic conversion:

from markitdown_pro import ConversionPipeline

pipeline = ConversionPipeline()
result = pipeline.convert("document.pdf")
print(result.markdown)

With configuration for Azure services:

from markitdown_pro import ConversionPipeline

pipeline = ConversionPipeline(
    azure_doc_intelligence_endpoint="https://your-instance.cognitiveservices.azure.com/",
    azure_doc_intelligence_key="your-key",
    openai_api_key="your-openai-key",
)

result = pipeline.convert("scanned-document.pdf")
print(result.markdown)
print(f"Method used: {result.method}")
print(f"Pages processed: {result.page_count}")
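For several files, the same `convert` call can be wrapped in a small batch helper. This sketch relies only on `pipeline.convert(path)` and `result.markdown` as shown above. The `convert_directory` function, its parameters, and the shape of its report are hypothetical and not part of markitdown-pro.

```python
from pathlib import Path


def convert_directory(pipeline, src_dir, out_dir, pattern="*.pdf"):
    """Convert every file matching `pattern` in src_dir and write one .md file per input.

    Returns a dict mapping each input file name to None on success,
    or to the error message if that file's conversion raised.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    report = {}
    for path in sorted(Path(src_dir).glob(pattern)):
        try:
            result = pipeline.convert(str(path))
        except Exception as exc:
            # Record the failure and keep going with the remaining files.
            report[path.name] = str(exc)
            continue
        (out / (path.stem + ".md")).write_text(result.markdown, encoding="utf-8")
        report[path.name] = None
    return report


# Intended use (requires markitdown-pro to be installed):
#   from markitdown_pro import ConversionPipeline
#   report = convert_directory(ConversionPipeline(), "documents", "output")
```

The pipeline is passed in as an argument so that one configured instance is reused for every file.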

Key Features

Cascading conversion — Tries multiple methods automatically, ensuring the best possible output for every file type
OCR support — Azure Document Intelligence and GPT-4o-mini for scanned documents and images
Audio transcription — Azure Speech Services + OpenAI Whisper for speech-to-text conversion
PST extraction — Full email and attachment parsing from Outlook PST archives
Concurrent PDF OCR — Page-by-page parallel OCR for large PDFs, significantly reducing processing time
ZIP handling — Recursive content extraction and conversion of all supported files within archives
Graceful fallbacks — Each conversion method fails gracefully, allowing the pipeline to try the next method without crashing
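The page-by-page parallel OCR listed above could look roughly like the following. This is a sketch of the general technique using a standard-library thread pool, not the library's code. `ocr_pages_concurrently` and `fake_ocr` are hypothetical names, and a real OCR function would call a remote service instead of returning a fixed string.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def ocr_pages_concurrently(
    pages: Sequence,
    ocr_page: Callable[[object], str],
    max_workers: int = 4,
) -> str:
    """Run ocr_page on every page in parallel and join the results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of which
        # page finishes first, so the document order is preserved.
        texts = list(pool.map(ocr_page, pages))
    return "\n\n".join(texts)


def fake_ocr(page) -> str:
    # Stand-in for a network-bound OCR request.
    return f"Text of page {page}"


print(ocr_pages_concurrently([1, 2, 3], fake_ocr))
```

Threads suit this workload because each OCR request spends most of its time waiting on the network. `max_workers` would need to respect the rate limits of whichever service is called.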

Requirements

Requirement	Value
Python	>= 3.12.2
License	MIT
Publisher	Kinetic Solutions Group LLC
PyPI	markitdown-pro

Optional dependencies for full functionality:

Azure Document Intelligence SDK — Required for OCR-based conversion
Azure Speech SDK — Required for audio transcription
OpenAI Python SDK — Required for GPT-4o-mini vision fallback and Whisper transcription
Unstructured — Required for advanced document parsing
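To see which optional dependencies are present in an environment, a quick check with the standard library is possible. The import names below are assumptions: they are the usual module names for these SDKs, but the source does not say which modules markitdown-pro actually imports.

```python
import importlib.util

# Assumed import names for the optional dependencies listed above.
OPTIONAL_MODULES = {
    "Azure Document Intelligence SDK": "azure.ai.documentintelligence",
    "Azure Speech SDK": "azure.cognitiveservices.speech",
    "OpenAI Python SDK": "openai",
    "Unstructured": "unstructured",
}


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises if a parent package of a dotted name is missing.
        return False


for label, module_name in OPTIONAL_MODULES.items():
    status = "available" if is_importable(module_name) else "missing"
    print(f"{label}: {status}")
```

A missing entry only means the corresponding conversion method would be unavailable. The remaining methods in the cascade are unaffected.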
