# Index Processor

*Data-plane worker that transforms source files into searchable chunks.*
The index processor is the data-plane worker that turns source files into searchable chunks. It is event-driven via Dapr and Azure Service Bus sessions, load-aware of Cosmos DB capacity, and self-cleaning of stale or missing content. One index processor instance is deployed per index.
## Overview
Each index processor subscribes to a dedicated Service Bus topic for its index. When files are added, renamed, or deleted from the connected blob storage, the control plane dispatches work messages to the topic. The processor picks up those messages, converts files to markdown, chunks them, generates embeddings, and stores the resulting vectors in a per-index Cosmos DB container.
Session-based ordering on the Service Bus topic guarantees that messages for the same file are processed sequentially, preventing race conditions during rapid updates.
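The subscription flow above can be sketched as a message handler that maps outcomes to Dapr pub/sub statuses. This is an illustrative sketch, not the actual code: the payload field names (`eventType`, `blobUrl`) and the `TransientError` type are assumptions.

```python
class TransientError(Exception):
    """Recoverable downstream failure (throttling, timeout)."""


def process_file(event_type: str, blob_url: str) -> None:
    """Placeholder for the real pipeline: download, convert, chunk, embed, upsert."""


def handle_work_message(cloud_event: dict) -> str:
    """Handle one Dapr pub/sub delivery and return a pub/sub status.

    SUCCESS acks the message, RETRY asks Service Bus to redeliver (within
    the same session, preserving per-file ordering), and DROP discards it.
    """
    data = cloud_event.get("data") or {}
    event_type = data.get("eventType")
    blob_url = data.get("blobUrl")
    if not event_type or not blob_url:
        return "DROP"  # malformed message: redelivery would not help
    try:
        process_file(event_type, blob_url)
        return "SUCCESS"
    except TransientError:
        return "RETRY"
```

Returning `RETRY` rather than raising keeps redelivery inside the same session, so ordering per file is preserved even across failures.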
## Processing Pipeline
The full pipeline for indexing a single file proceeds through these steps:
1. **Receive work message** — The processor receives a message from the per-index Service Bus topic via a Dapr subscription. Messages use session-based ordering to ensure sequential processing per file.
2. **Initialize run state** — The index status is set to `RUNNING` and an `IndexRun` record is created to track progress, counters, and timing for this processing batch.
3. **Download source file** — The file is downloaded from Azure Blob Storage using the blob URL included in the work message.
4. **Convert to Markdown** — The file is passed through `markitdown_pro.ConversionPipeline`, which tries multiple conversion methods in sequence (MarkItDown, Unstructured.io, Azure Document Intelligence, GPT-4o-mini) until sufficient content is extracted. See markitdown-pro for details.
5. **Split into chunks** — The markdown output is split into chunks using LangChain splitters selected by file type:
   - Markdown files — `RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN)`
   - HTML files — `HTMLHeaderTextSplitter` (splits on header hierarchy)
   - JSON files — `RecursiveJsonSplitter` (preserves JSON structure)
   - Embedding-based splitting — `SemanticChunker` (splits on semantic similarity boundaries using embeddings)
   - All other files — `RecursiveCharacterTextSplitter` with markdown language hints
6. **Generate embeddings** — Chunks are sent to Azure OpenAI in batches. The batch size is controlled by `EMBEDDING_BATCH_SIZE`. A random 0–5 second delay is injected between batch calls to avoid spiking the Azure OpenAI endpoint and triggering rate limits.
7. **Upsert chunks** — Old chunks for the file are deleted from the per-index Cosmos DB container, then new chunks are upserted. This delete-then-insert approach ensures clean state without orphaned chunks.
8. **Finalize file status** — The file is marked as `INDEXED`, and the `IndexRun` counters (files processed, chunks created, errors) are updated.
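Steps 6 and 7 can be sketched as follows. This is a simplified illustration under stated assumptions: the batch size of 16 is a placeholder (the real value comes from `EMBEDDING_BATCH_SIZE`), and `embed_batch`/`store` stand in for the Azure OpenAI client and the Cosmos DB container.

```python
import random
import time

EMBEDDING_BATCH_SIZE = 16  # illustrative; the real default comes from config


def embed_in_batches(chunks, embed_batch, batch_size=EMBEDDING_BATCH_SIZE, jitter=True):
    """Embed chunks batch by batch, sleeping 0-5 s between calls to avoid
    spiking the embedding endpoint (step 6)."""
    vectors = []
    for i in range(0, len(chunks), batch_size):
        if jitter and i > 0:
            time.sleep(random.uniform(0, 5))
        vectors.extend(embed_batch(chunks[i:i + batch_size]))
    return vectors


def upsert_file_chunks(store, file_id, chunks, vectors):
    """Delete-then-insert (step 7): old chunks go first so no orphans remain."""
    store.delete_chunks(file_id)
    for n, (text, vec) in enumerate(zip(chunks, vectors)):
        store.upsert({"id": f"{file_id}:{n}", "fileId": file_id,
                      "text": text, "embedding": vec})
```

Deleting before inserting means a file that shrinks from ten chunks to three never leaves seven stale chunks behind in the container.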
## Scaling
Index processors run as Azure Container Apps with a Dapr sidecar. They scale automatically based on Service Bus topic message count.
| Parameter | Value |
|---|---|
| Min replicas | 0 (scale to zero when idle) |
| Max replicas | Configurable per index (default 5) |
| Scale trigger | `INDEX_PROCESSOR_SCALE_OUT_MESSAGE_COUNT` (default 10) |
| Polling interval | 5 seconds |
| Cooldown period | 600 seconds |
| CPU per replica | 2 vCPU |
| Memory per replica | 4 Gi |
When no messages are pending, the processor scales to zero replicas. As messages arrive, Container Apps polls the topic every 5 seconds and adds replicas when the message count exceeds the configured threshold. After the queue drains, the 600-second cooldown prevents premature scale-down during bursty workloads.
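The parameters above roughly correspond to a Container Apps custom scale rule backed by the KEDA `azure-servicebus` scaler. The fragment below is a sketch, not the deployed template: topic, subscription, and secret names are placeholders, and `pollingInterval`/`cooldownPeriod` support depends on the Container Apps API version in use.

```yaml
scale:
  minReplicas: 0
  maxReplicas: 5            # configurable per index
  pollingInterval: 5        # seconds (assumed configurable in this environment)
  cooldownPeriod: 600       # seconds
  rules:
    - name: servicebus-messages
      custom:
        type: azure-servicebus
        metadata:
          topicName: index-topic-placeholder        # per-index topic (placeholder)
          subscriptionName: index-processor         # placeholder
          messageCount: "10"                        # INDEX_PROCESSOR_SCALE_OUT_MESSAGE_COUNT
        auth:
          - secretRef: servicebus-connection        # placeholder secret name
            triggerParameter: connection
```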
## File Lifecycle Events
The index processor handles all blob lifecycle events dispatched by the control plane:
| Event | Action |
|---|---|
| `BlobCreated` | Index the new file — convert, chunk, embed, and store vectors |
| `BlobRenamed` | Re-index under the new name and clean up all chunks associated with the old name |
| `BlobDeleted` | Remove all chunks for that file from the Cosmos container |
| `DirectoryCreated` | Index all files within the new directory |
| `DirectoryDeleted` | Remove all chunks for every file in the deleted directory |
| `DirectoryRenamed` | Re-index all files under the new directory path and clean up old chunks |
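The event-to-action mapping in the table can be sketched as a dispatcher. The handler and event field names here (`index_file`, `delete_chunks`, `list_files`, `oldPath`, `newPath`) are illustrative stand-ins, not the actual code.

```python
def handle_event(event, index_file, delete_chunks, list_files):
    """Dispatch a blob lifecycle event to pipeline actions.

    index_file / delete_chunks / list_files are injected stand-ins for the
    real indexing pipeline, chunk store, and path listing.
    """
    etype = event["type"]
    if etype == "BlobCreated":
        index_file(event["path"])
    elif etype == "BlobDeleted":
        delete_chunks(event["path"])
    elif etype == "BlobRenamed":
        delete_chunks(event["oldPath"])   # clean up chunks under the old name
        index_file(event["newPath"])      # re-index under the new name
    elif etype == "DirectoryCreated":
        for path in list_files(event["path"]):
            index_file(path)
    elif etype == "DirectoryDeleted":
        for path in list_files(event["path"]):
            delete_chunks(path)
    elif etype == "DirectoryRenamed":
        for path in list_files(event["oldPath"]):
            delete_chunks(path)
        for path in list_files(event["newPath"]):
            index_file(path)
    else:
        raise ValueError(f"unhandled event type: {etype}")
```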
## Unsupported File Types
The following file types are blocked at dispatch and never reach the processor:
| Extension | Category | Reason |
|---|---|---|
| `.mp4`, `.mov`, `.wmv`, `.avi` | Video | No video-to-text conversion available |
| `.mp3` | Audio | Blocked at dispatch, even though markitdown-pro natively supports MP3 transcription via Azure Speech Services and OpenAI Whisper |
| `.doc` | Legacy Word | Use `.docx` instead |
| `.xls` | Legacy Excel | Use `.xlsx` instead |
| `.ppt` | Legacy PowerPoint | Use `.pptx` instead |
Legacy Office formats without the `x` suffix use the older binary format, which is not reliably parseable. Users should convert these to their modern equivalents before uploading.
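A minimal sketch of the dispatch-time filter implied by the table. The function name and return shape are illustrative assumptions; only the extension list mirrors the documented behavior.

```python
# Extensions blocked at dispatch, per the table above.
BLOCKED_EXTENSIONS = {
    ".mp4", ".mov", ".wmv", ".avi",  # video: no video-to-text conversion
    ".mp3",                          # audio: blocked at dispatch
    ".doc", ".xls", ".ppt",          # legacy Office binary formats
}

MODERN_EQUIVALENT = {".doc": ".docx", ".xls": ".xlsx", ".ppt": ".pptx"}


def check_supported(filename: str):
    """Return (supported, hint); hint suggests a conversion when one exists."""
    ext = "." + filename.lower().rsplit(".", 1)[-1] if "." in filename else ""
    if ext not in BLOCKED_EXTENSIONS:
        return True, None
    hint = MODERN_EQUIVALENT.get(ext)
    return False, (f"convert to {hint}" if hint else "unsupported media type")
```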
## Error Handling
The processor uses smart retry logic with exponential backoff for transient failures:
- **Max retries per file** — Controlled by `FILE_INDEXING_MAX_RETRIES`. After exceeding the retry limit, the file is marked as `FAILED` and skipped in subsequent runs.
- **Failure count tracking** — Each file tracks its cumulative failure count. This persists across runs so that consistently failing files do not block processing indefinitely.
- **Stale file detection** — If a file remains in `PROCESSING` state for longer than `INDEXER_STALE_FILE_TIMEOUT_MINUTES` (default 15 minutes), it is considered stale. The processor resets its status to `PENDING` so it can be retried on the next run.
- **Exponential backoff** — Retry delays increase exponentially to avoid hammering downstream services (Azure OpenAI, Cosmos DB) during outages.
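The retry rules above can be sketched as three small helpers. The max-retry default of 5, the backoff base/cap, and the use of full jitter are assumptions for illustration; only the 15-minute stale timeout comes from the text.

```python
import random

FILE_INDEXING_MAX_RETRIES = 5           # illustrative default
INDEXER_STALE_FILE_TIMEOUT_MINUTES = 15  # documented default


def backoff_seconds(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter, capped (assumed strategy)."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def next_status(failure_count: int, max_retries: int = FILE_INDEXING_MAX_RETRIES) -> str:
    """A file past the retry limit is FAILED and skipped in later runs."""
    return "FAILED" if failure_count >= max_retries else "PENDING"


def is_stale(minutes_in_processing: float,
             timeout: float = INDEXER_STALE_FILE_TIMEOUT_MINUTES) -> bool:
    """Files stuck in PROCESSING beyond the timeout are reset to PENDING."""
    return minutes_in_processing > timeout
```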
## Telemetry
The processor emits custom histogram metrics for monitoring pipeline performance:
| Metric | Description |
|---|---|
| `fileMarkdownTime` | Time to convert a file to markdown |
| `embeddingTime` | Time to generate embeddings (tagged with text length) |
| `fileIndexingTime` | Total end-to-end time for indexing a single file |
| `fileDeletionTime` | Time to delete a file and its chunks |
| `chunksUpsertTime` | Time to upsert chunks to Cosmos (tagged with chunk count) |
| `chunksDeleteTime` | Time to delete old chunks from Cosmos |
| `vectorQueryRUs` | Cosmos DB request units consumed by vector queries |
These metrics integrate with Azure Monitor and can be used to alert on degraded performance or to inform capacity planning.
## Manual Triggers
In addition to event-driven processing, you can trigger indexing runs manually via the API:
- `POST /indexes/{indexId}/run` — Processes only files with `PENDING` status. Use this to retry files that previously failed or to kick off processing after bulk uploads.
- `POST /indexes/{indexId}/run?full=true` — Forces a full reindex of all files in the index, regardless of their current status. Every file is re-downloaded, re-converted, re-chunked, and re-embedded. Use this after changing chunking strategies or embedding models, or when you suspect data corruption.
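For example, a client helper might build the trigger URL like this. The base URL is deployment-specific and the helper name is hypothetical; authentication is omitted.

```python
from urllib.parse import urlencode


def build_run_url(base: str, index_id: str, full: bool = False) -> str:
    """Build the manual-trigger URL; pass full=True to force a full reindex."""
    url = f"{base}/indexes/{index_id}/run"
    return url + ("?" + urlencode({"full": "true"}) if full else "")
```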