markitdown-pro
Open-source document conversion library supporting 30+ file types
markitdown-pro is an open-source Python library that converts diverse document formats into clean Markdown text. It enhances Microsoft's MarkItDown with Azure Document Intelligence, Unstructured.io support, and OCR capabilities, providing reliable conversion for 30+ file types.
Published by Kinetic Solutions Group LLC under the MIT License.
Installation
pip install markitdown-pro
Or with uv:
uv add markitdown-pro
Requires Python >= 3.12.2.
Conversion Pipeline
markitdown-pro uses a cascading approach, trying each conversion method in sequence until sufficient content is extracted:
- MarkItDown: Microsoft's base converter handles common formats (DOCX, PPTX, XLSX, HTML, CSV, and more) natively with no external API calls required.
- Unstructured.io: Advanced document parsing for complex layouts, tables, and multi-column documents. Handles PDFs with embedded tables and forms that MarkItDown may struggle with.
- Azure Document Intelligence: OCR and structured extraction for scanned documents, images, and PDFs that contain non-selectable text. Extracts tables, key-value pairs, and layout information.
- GPT-4o-mini: Vision-based OCR as the final fallback. Sends page images to GPT-4o-mini for text extraction. Handles edge cases like handwritten notes, unusual fonts, and degraded scans that other methods miss.
If a method produces no output or insufficient content, the pipeline automatically falls through to the next method, so a single call can handle both clean digital files and degraded scans.
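The fall-through behavior described above can be sketched in a few lines. This is a minimal illustration, not markitdown-pro's actual implementation: `run_cascade`, `Conversion`, the stand-in converters, and the `MIN_CHARS` threshold are all hypothetical, and the source does not say how "sufficient content" is measured.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical threshold: the source does not define "sufficient content".
MIN_CHARS = 50


@dataclass
class Conversion:
    markdown: str
    method: str


def run_cascade(
    path: str,
    methods: Sequence[tuple[str, Callable[[str], str]]],
    min_chars: int = MIN_CHARS,
) -> Optional[Conversion]:
    """Try each (name, converter) pair in order; return the first sufficient result."""
    for name, convert in methods:
        try:
            text = convert(path)
        except Exception:
            # A failing method should not stop the cascade.
            continue
        if text and len(text.strip()) >= min_chars:
            return Conversion(markdown=text, method=name)
    return None


# Stand-in converters used only for illustration.
def empty_converter(path: str) -> str:
    return ""


def failing_converter(path: str) -> str:
    raise RuntimeError("service unavailable")


def ocr_converter(path: str) -> str:
    return "# Scanned report\n\n" + "Recovered text from the page image. " * 3


result = run_cascade(
    "scan.pdf",
    [
        ("markitdown", empty_converter),
        ("unstructured", failing_converter),
        ("ocr", ocr_converter),
    ],
)
print(result.method if result else "no method produced enough content")
```

Here the first converter returns nothing and the second raises, so the result comes from the third. Returning `None` when every method falls short leaves the decision about how to report an unconvertible file to the caller.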
Supported File Types
| Category | Formats | Notes |
|---|---|---|
| Documents | PDF, DOCX, PPTX, XLSX, ODT, EPUB, RTF, HTML | Office formats use MarkItDown; scanned PDFs fall through to OCR |
| Images | PNG, JPG, JPEG, GIF, BMP, WEBP, HEIC, TIFF, SVG | Uses Azure Document Intelligence for OCR extraction |
| Audio | MP3, WAV, OGG, FLAC, M4A, AAC, WMA, WEBM, OPUS | Transcription via Azure Speech Services + OpenAI Whisper |
| Email | EML, PST | Full email parsing with attachment extraction and recursive processing |
| Archives | ZIP | Recursively extracts and converts all supported files within the archive |
| Text | CSV, JSON, XML, TXT, MD, RST, TSV | Direct pass-through with formatting normalization |
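The recursive archive handling in the Archives row can be illustrated with the standard library alone. The sketch below is an assumption about the general shape of such a walk, not the library's code: `iter_supported`, `make_zip`, and the `SUPPORTED` subset are hypothetical names chosen for the example.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import Iterator

# A small subset of the extensions from the table, for illustration only.
SUPPORTED = {".pdf", ".docx", ".csv", ".txt", ".md", ".png", ".mp3"}


def iter_supported(data: bytes, prefix: str = "") -> Iterator[tuple[str, bytes]]:
    """Yield (path, content) for supported files, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            name = prefix + info.filename
            suffix = PurePosixPath(info.filename).suffix.lower()
            if suffix == ".zip":
                # Recurse into the nested archive, keeping its name as a path prefix.
                yield from iter_supported(archive.read(info), prefix=name + "/")
            elif suffix in SUPPORTED:
                yield name, archive.read(info)


def make_zip(files: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP so the example needs no files on disk."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"notes.txt": b"inner notes"})
outer = make_zip({"report.md": b"# Report", "skip.exe": b"\x00", "nested.zip": inner})

found = [path for path, _ in iter_supported(outer)]
print(found)
```

Each yielded item could then be handed to whichever converter matches its extension. A production version would likely also cap nesting depth and total extracted size to guard against malicious archives.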
How RAG DB Uses It
The index processor calls markitdown_pro.ConversionPipeline as step 4 of its processing pipeline. The conversion pipeline tries handlers in sequence based on the file type:
- Standard documents (PDF, DOCX, PPTX, etc.) go through the full cascade: MarkItDown first, then Unstructured.io, then Azure Document Intelligence if earlier methods produce insufficient content.
- Images (PNG, JPG, TIFF, etc.) are routed directly to Azure Document Intelligence for OCR text extraction. If Document Intelligence returns no text, GPT-4o-mini vision is used as a fallback.
- Audio files (MP3, WAV, OGG, etc.) are transcribed using Azure Speech Services combined with OpenAI Whisper. The transcription output is formatted as markdown text for chunking and embedding.
- Archives (ZIP) are extracted recursively, and each contained file is processed through the appropriate conversion handler.
The resulting markdown is then passed to the chunking step, where it is split using the appropriate LangChain text splitter for the content type.
CLI Usage
Convert a single file from the command line:
python main.py /path/to/document.pdf
Convert with a specific output directory:
python main.py /path/to/document.pdf --output /path/to/output/
Process all files in a directory:
python main.py /path/to/documents/ --recursive
Programmatic Usage
Basic conversion:
from markitdown_pro import ConversionPipeline
pipeline = ConversionPipeline()
result = pipeline.convert("document.pdf")
print(result.markdown)
With configuration for Azure services:
from markitdown_pro import ConversionPipeline
pipeline = ConversionPipeline(
    azure_doc_intelligence_endpoint="https://your-instance.cognitiveservices.azure.com/",
    azure_doc_intelligence_key="your-key",
    openai_api_key="your-openai-key",
)
result = pipeline.convert("scanned-document.pdf")
print(result.markdown)
print(f"Method used: {result.method}")
print(f"Pages processed: {result.page_count}")
Key Features
- Cascading conversion — Tries multiple methods automatically, ensuring the best possible output for every file type
- OCR support — Azure Document Intelligence and GPT-4o-mini for scanned documents and images
- Audio transcription — Azure Speech Services + OpenAI Whisper for speech-to-text conversion
- PST extraction — Full email and attachment parsing from Outlook PST archives
- Concurrent PDF OCR — Page-by-page parallel OCR for large PDFs, significantly reducing processing time
- ZIP handling — Recursive content extraction and conversion of all supported files within archives
- Graceful fallbacks — Each conversion method fails gracefully, allowing the pipeline to try the next method without crashing
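Page-by-page parallel OCR, combined with graceful per-page failure, might look roughly like the sketch below. It uses a thread pool because real OCR calls are typically network-bound. `ocr_pages` and `fake_ocr` are hypothetical stand-ins, and how markitdown-pro actually schedules pages is not stated in the source.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def ocr_pages(
    pages: Sequence[bytes],
    ocr_page: Callable[[bytes], str],
    max_workers: int = 4,
) -> str:
    """OCR pages concurrently and join the results in page order."""

    def safe_ocr(page: bytes) -> str:
        try:
            return ocr_page(page)
        except Exception:
            # One bad page yields an empty string instead of failing the document.
            return ""

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, regardless of completion order.
        texts = list(pool.map(safe_ocr, pages))
    return "\n\n".join(text for text in texts if text)


# Stand-in for a real OCR call (hypothetical).
def fake_ocr(page: bytes) -> str:
    if page == b"corrupt":
        raise ValueError("unreadable page")
    return page.decode("utf-8").upper()


document = [b"page one", b"corrupt", b"page three"]
print(ocr_pages(document, fake_ocr))
```

The unreadable middle page is dropped while the other two are returned in their original order. A real pipeline would probably record which pages failed so they can be retried with the next method in the cascade.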
Requirements
| Requirement | Value |
|---|---|
| Python | >= 3.12.2 |
| License | MIT |
| Publisher | Kinetic Solutions Group LLC |
| PyPI | markitdown-pro |
Optional dependencies for full functionality:
- Azure Document Intelligence SDK — Required for OCR-based conversion
- Azure Speech SDK — Required for audio transcription
- OpenAI Python SDK — Required for GPT-4o-mini vision fallback and Whisper transcription
- Unstructured — Required for advanced document parsing
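Before enabling a given conversion path, it can be useful to check which optional dependencies are present in the environment. The sketch below is one way to do that with the standard library. The import names are assumptions based on commonly published packages for these SDKs; markitdown-pro may depend on different packages, so treat the mapping as illustrative.

```python
from importlib.util import find_spec

# Import names are assumptions based on commonly published packages;
# markitdown-pro may rely on different SDK packages or versions.
OPTIONAL_MODULES = {
    "Azure Document Intelligence SDK": "azure.ai.documentintelligence",
    "Azure Speech SDK": "azure.cognitiveservices.speech",
    "OpenAI Python SDK": "openai",
    "Unstructured": "unstructured",
}


def is_available(module_name: str) -> bool:
    """Return True if the module can be located on the current import path."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises ModuleNotFoundError when a parent package is missing.
        return False


for label, module_name in OPTIONAL_MODULES.items():
    status = "installed" if is_available(module_name) else "missing"
    print(f"{label}: {status}")
```

A missing entry only means the corresponding conversion path cannot be used; the base MarkItDown converter is described above as needing no external API calls.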