markitdown-pro

Open-source document conversion library supporting 30+ file types

markitdown-pro is an open-source Python library that converts diverse document formats into clean Markdown text. It enhances Microsoft's MarkItDown with Azure Document Intelligence, Unstructured.io support, and OCR capabilities, providing reliable conversion for 30+ file types.

Published by Kinetic Solutions Group LLC under the MIT License.

Installation

pip install markitdown-pro

Or with uv:

uv add markitdown-pro

Requires Python >= 3.12.2.

Conversion Pipeline

markitdown-pro uses a cascading approach, trying each conversion method in sequence until sufficient content is extracted:

  1. MarkItDown — Microsoft's base converter handles common formats (DOCX, PPTX, XLSX, HTML, CSV, and more) natively with no external API calls required.

  2. Unstructured.io — Advanced document parsing for complex layouts, tables, and multi-column documents. Handles PDFs with embedded tables and forms that MarkItDown may struggle with.

  3. Azure Document Intelligence — OCR and structured extraction for scanned documents, images, and PDFs that contain non-selectable text. Extracts tables, key-value pairs, and layout information.

  4. GPT-4o-mini — Vision-based OCR as the final fallback. Sends page images to GPT-4o-mini for text extraction. Handles edge cases like handwritten notes, unusual fonts, and degraded scans that other methods miss.

If a method produces no output or insufficient content, the pipeline automatically falls through to the next method. Clean, text-based files are therefore handled by the first converter without external API calls, while scanned or degraded documents reach the OCR and vision stages.
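
The fall-through behavior described above can be sketched as a loop over converters. This is a minimal illustration, not the library's implementation: `run_cascade`, `CascadeResult`, and the `MIN_CHARS` threshold are hypothetical names, and the source does not state how "sufficient content" is actually measured.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Assumption: "sufficient content" is approximated by a minimum character count.
MIN_CHARS = 50

Converter = Callable[[str], str]


@dataclass
class CascadeResult:
    markdown: str
    method: str  # name of the converter that produced the text


def run_cascade(
    path: str,
    converters: Sequence[Tuple[str, Converter]],
    min_chars: int = MIN_CHARS,
) -> Optional[CascadeResult]:
    """Try each converter in order and return the first sufficiently long result."""
    for name, convert in converters:
        try:
            text = convert(path)
        except Exception:
            # A failing method must not stop the cascade; move on to the next one.
            continue
        if text and len(text.strip()) >= min_chars:
            return CascadeResult(markdown=text, method=name)
    return None  # every method failed or produced too little text
```

In this sketch the order of the `converters` list encodes the priority, so cheaper local methods can be placed before API-backed ones.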

Supported File Types

| Category | Formats | Notes |
| --- | --- | --- |
| Documents | PDF, DOCX, PPTX, XLSX, ODT, EPUB, RTF, HTML | Office formats use MarkItDown; scanned PDFs fall through to OCR |
| Images | PNG, JPG, JPEG, GIF, BMP, WEBP, HEIC, TIFF, SVG | Uses Azure Document Intelligence for OCR extraction |
| Audio | MP3, WAV, OGG, FLAC, M4A, AAC, WMA, WEBM, OPUS | Transcription via Azure Speech Services + OpenAI Whisper |
| Email | EML, PST | Full email parsing with attachment extraction and recursive processing |
| Archives | ZIP | Recursively extracts and converts all supported files within the archive |
| Text | CSV, JSON, XML, TXT, MD, RST, TSV | Direct pass-through with formatting normalization |
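
As an illustration, the categories listed above can be expressed as an extension lookup. The extensions are copied from the list; the `CATEGORIES` mapping and the `categorize` helper are hypothetical and are not part of the markitdown-pro API.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the supported file types above; category keys are illustrative.
CATEGORIES = {
    "documents": {"pdf", "docx", "pptx", "xlsx", "odt", "epub", "rtf", "html"},
    "images": {"png", "jpg", "jpeg", "gif", "bmp", "webp", "heic", "tiff", "svg"},
    "audio": {"mp3", "wav", "ogg", "flac", "m4a", "aac", "wma", "webm", "opus"},
    "email": {"eml", "pst"},
    "archives": {"zip"},
    "text": {"csv", "json", "xml", "txt", "md", "rst", "tsv"},
}


def categorize(path: str) -> Optional[str]:
    """Return the category for a file path, or None if the extension is not listed."""
    ext = Path(path).suffix.lower().lstrip(".")
    for category, extensions in CATEGORIES.items():
        if ext in extensions:
            return category
    return None
```

A lookup like this only inspects the file name; it does not verify that the file content matches its extension.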

How RAG DB Uses It

The index processor calls markitdown_pro.ConversionPipeline as step 4 of its processing pipeline. The conversion pipeline selects its handlers based on the file type:

  • Standard documents (PDF, DOCX, PPTX, etc.) go through the full cascade: MarkItDown first, then Unstructured.io, then Azure Document Intelligence, and finally GPT-4o-mini vision, each tried only if the earlier methods produce insufficient content.
  • Images (PNG, JPG, TIFF, etc.) are routed directly to Azure Document Intelligence for OCR text extraction. If Document Intelligence returns no text, GPT-4o-mini vision is used as a fallback.
  • Audio files (MP3, WAV, OGG, etc.) are transcribed using Azure Speech Services combined with OpenAI Whisper. The transcription output is formatted as markdown text for chunking and embedding.
  • Archives (ZIP) are extracted recursively, and each contained file is processed through the appropriate conversion handler.
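
Recursive archive handling, as described in the last item above, might look roughly like the following. This is a sketch using only the standard library; `convert_zip` and the `convert_file` callback are hypothetical names, and a production version would also want limits on nesting depth and extracted size.

```python
import io
import zipfile
from typing import Callable, Dict


def convert_zip(
    data: bytes,
    convert_file: Callable[[str, bytes], str],
    prefix: str = "",
) -> Dict[str, str]:
    """Return {member path: markdown} for every file, descending into nested ZIPs."""
    results: Dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content of their own
            name = prefix + info.filename
            payload = archive.read(info)
            if info.filename.lower().endswith(".zip"):
                # Nested archive: recurse and keep the outer path as a prefix.
                results.update(convert_zip(payload, convert_file, prefix=name + "/"))
            else:
                results[name] = convert_file(name, payload)
    return results
```

Here `convert_file` stands in for whichever handler is appropriate for the contained file's type.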

The resulting markdown is then passed to the chunking step, where it is split using the appropriate LangChain text splitter for the content type.
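
To show what the chunking step operates on, here is a small standard-library stand-in that splits Markdown at headings and caps the size of each chunk. It is not the LangChain splitter that RAG DB uses; `split_markdown_by_heading` and its `max_chars` parameter are hypothetical and only illustrate the idea.

```python
import re
from typing import List


def split_markdown_by_heading(markdown: str, max_chars: int = 1000) -> List[str]:
    """Split at Markdown headings, then cap each section at max_chars characters."""
    # Zero-width split before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: List[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Oversized sections are cut into fixed-size windows.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks
```

A real splitter would typically also overlap adjacent chunks and avoid cutting inside words or code blocks, which this sketch does not attempt.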

CLI Usage

Convert a single file from the command line:

python main.py /path/to/document.pdf

Convert with a specific output directory:

python main.py /path/to/document.pdf --output /path/to/output/

Process all files in a directory:

python main.py /path/to/documents/ --recursive
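
For readers who want the same directory behavior from Python, the file selection can be approximated as below. This is an assumption about what recursive processing means, not the implementation of main.py; `iter_input_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import Iterator


def iter_input_files(root: str, recursive: bool = False) -> Iterator[Path]:
    """Yield the files to convert: the path itself, or the files under a directory."""
    base = Path(root)
    if base.is_file():
        yield base
        return
    # "**/*" descends into subdirectories; "*" stays at the top level.
    pattern = "**/*" if recursive else "*"
    for path in sorted(base.glob(pattern)):
        if path.is_file():
            yield path
```

Each yielded path could then be passed to a converter one at a time.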

Programmatic Usage

Basic conversion:

from markitdown_pro import ConversionPipeline

pipeline = ConversionPipeline()
result = pipeline.convert("document.pdf")
print(result.markdown)

With configuration for Azure services:

from markitdown_pro import ConversionPipeline

# Placeholder credentials for the OCR and vision fallbacks; replace with real values.
pipeline = ConversionPipeline(
    azure_doc_intelligence_endpoint="https://your-instance.cognitiveservices.azure.com/",
    azure_doc_intelligence_key="your-key",
    openai_api_key="your-openai-key",
)

result = pipeline.convert("scanned-document.pdf")
print(result.markdown)
# Report which conversion method produced the output and how many pages it covered.
print(f"Method used: {result.method}")
print(f"Pages processed: {result.page_count}")
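
Rather than hard-coding credentials, the keyword arguments shown above can be assembled from environment variables. The keyword names come from the example; the environment variable names and the `pipeline_kwargs` helper are assumptions made for this sketch.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical environment variable names mapped to the keyword arguments above.
ENV_TO_KWARG = {
    "AZURE_DOC_INTELLIGENCE_ENDPOINT": "azure_doc_intelligence_endpoint",
    "AZURE_DOC_INTELLIGENCE_KEY": "azure_doc_intelligence_key",
    "OPENAI_API_KEY": "openai_api_key",
}


def pipeline_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect only the settings that are actually present and non-empty."""
    env = os.environ if env is None else env
    return {kwarg: env[name] for name, kwarg in ENV_TO_KWARG.items() if env.get(name)}


# Usage (requires markitdown-pro to be installed):
# from markitdown_pro import ConversionPipeline
# pipeline = ConversionPipeline(**pipeline_kwargs())
```

Unset variables are simply omitted, so the pipeline is constructed with whatever subset of services is configured.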

Key Features

  • Cascading conversion — Tries multiple methods automatically, ensuring the best possible output for every file type
  • OCR support — Azure Document Intelligence and GPT-4o-mini for scanned documents and images
  • Audio transcription — Azure Speech Services + OpenAI Whisper for speech-to-text conversion
  • PST extraction — Full email and attachment parsing from Outlook PST archives
  • Concurrent PDF OCR — Page-by-page parallel OCR for large PDFs, significantly reducing processing time
  • ZIP handling — Recursive content extraction and conversion of all supported files within archives
  • Graceful fallbacks — Each conversion method fails gracefully, allowing the pipeline to try the next method without crashing
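
The concurrent PDF OCR feature listed above follows a common pattern: run OCR on each page in parallel and join the results in page order. The sketch below shows that pattern with a thread pool; `ocr_pages` and the `ocr_page` callback are hypothetical, and the library's actual concurrency model is not described in the source.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def ocr_pages(
    pages: Sequence[bytes],
    ocr_page: Callable[[bytes], str],
    max_workers: int = 4,
) -> str:
    """Run OCR on every page concurrently and join the text in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order even if pages finish out of order.
        texts = list(pool.map(ocr_page, pages))
    return "\n\n".join(texts)
```

Threads suit this case because each page is typically an I/O-bound request to a remote OCR service.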

Requirements

| Requirement | Value |
| --- | --- |
| Python | >= 3.12.2 |
| License | MIT |
| Publisher | Kinetic Solutions Group LLC |
| PyPI | markitdown-pro |

Optional dependencies for full functionality:

  • Azure Document Intelligence SDK — Required for OCR-based conversion
  • Azure Speech SDK — Required for audio transcription
  • OpenAI Python SDK — Required for GPT-4o-mini vision fallback and Whisper transcription
  • Unstructured — Required for advanced document parsing
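
To check which of these optional dependencies are installed, one option is to test whether their modules can be found. The module names below are the usual import names for these SDKs, but which ones markitdown-pro actually imports is an assumption; `is_available` and `missing_optional` are hypothetical helpers.

```python
from importlib.util import find_spec
from typing import List

# Assumed import names for the optional dependencies listed above.
OPTIONAL_MODULES = {
    "Azure Document Intelligence SDK": "azure.ai.documentintelligence",
    "Azure Speech SDK": "azure.cognitiveservices.speech",
    "OpenAI Python SDK": "openai",
    "Unstructured": "unstructured",
}


def is_available(module: str) -> bool:
    """Return True if the module can be located without fully importing it."""
    try:
        return find_spec(module) is not None
    except (ImportError, ValueError):
        # find_spec raises if a parent package of a dotted name is missing.
        return False


def missing_optional() -> List[str]:
    """List the optional dependencies that are not installed."""
    return [label for label, module in OPTIONAL_MODULES.items() if not is_available(module)]
```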
