# Index Processor

*Data-plane worker that transforms source files into searchable chunks.*
The index processor is the data-plane worker that turns source files into searchable chunks. It is event-driven via Dapr and Azure Service Bus sessions, load-aware of Cosmos DB capacity, and self-cleaning of stale or missing content. One index processor instance is deployed per index.
## Overview
Each index processor subscribes to a dedicated Service Bus topic for its index. When files are added, renamed, or deleted from the connected blob storage, the control plane dispatches work messages to the topic. The processor picks up those messages, converts files to markdown, chunks them, generates embeddings, and stores the resulting vectors in a per-index Cosmos DB container.
Session-based ordering on the Service Bus topic guarantees that messages for the same file are processed sequentially, preventing race conditions during rapid updates.
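The subscription flow above can be sketched as a message handler that maps outcomes to Dapr pub/sub statuses. This is an illustrative sketch, not the actual code: the payload field names (`eventType`, `blobUrl`) and the `TransientError` type are assumptions.

```python
class TransientError(Exception):
    """Recoverable downstream failure (throttling, timeout)."""


def process_file(event_type: str, blob_url: str) -> None:
    """Placeholder for the real pipeline: download, convert, chunk, embed, upsert."""


def handle_work_message(cloud_event: dict) -> str:
    """Handle one Dapr pub/sub delivery and return a pub/sub status.

    SUCCESS acks the message, RETRY asks Service Bus to redeliver (within
    the same session, preserving per-file ordering), and DROP discards it.
    """
    data = cloud_event.get("data") or {}
    event_type = data.get("eventType")
    blob_url = data.get("blobUrl")
    if not event_type or not blob_url:
        return "DROP"  # malformed message: redelivery would not help
    try:
        process_file(event_type, blob_url)
        return "SUCCESS"
    except TransientError:
        return "RETRY"
```

Returning `RETRY` rather than raising keeps redelivery inside the same session, so ordering per file is preserved even across failures.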
## Processing Pipeline
The full pipeline for indexing a single file proceeds through these steps:
1. **Receive work message** — The processor receives a message from the per-index Service Bus topic via a Dapr subscription. Messages use session-based ordering to ensure sequential processing per file.
2. **Initialize run state** — The index status is set to `RUNNING` and an `IndexRun` record is created to track progress, counters, and timing for this processing batch.
3. **Download source file** — The file is downloaded from Azure Blob Storage using the blob URL included in the work message.
4. **Convert to Markdown** — The file is passed through `markitdown_pro.ConversionPipeline`, which tries multiple conversion methods in sequence (MarkItDown, Unstructured.io, Azure Document Intelligence, GPT-4o-mini) until sufficient content is extracted. See markitdown-pro for details.
5. **Split into chunks** — The markdown output is split into chunks using LangChain splitters selected by file type:
   - Markdown files — `RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN)`
   - HTML files — `HTMLHeaderTextSplitter` (splits on header hierarchy)
   - JSON files — `RecursiveJsonSplitter` (preserves JSON structure)
   - Embedding-based splitting — `SemanticChunker` (splits on semantic similarity boundaries using embeddings)
   - All other files — `RecursiveCharacterTextSplitter` with markdown language hints
6. **Generate embeddings** — Chunks are sent to Azure OpenAI in batches. The batch size is controlled by `EMBEDDING_BATCH_SIZE`. A random 0–5 second delay is injected between batch calls to avoid spiking the Azure OpenAI endpoint and triggering rate limits.
7. **Upsert chunks** — Old chunks for the file are deleted from the per-index Cosmos DB container, then new chunks are upserted. This delete-then-insert approach ensures clean state without orphaned chunks.
8. **Finalize file status** — The file is marked as `INDEXED`, and the `IndexRun` counters (files processed, chunks created, errors) are updated.
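Steps 6 and 7 can be sketched as follows. This is a simplified illustration under stated assumptions: the batch size of 16 is a placeholder (the real value comes from `EMBEDDING_BATCH_SIZE`), and `embed_batch`/`store` stand in for the Azure OpenAI client and the Cosmos DB container.

```python
import random
import time

EMBEDDING_BATCH_SIZE = 16  # illustrative; the real default comes from config


def embed_in_batches(chunks, embed_batch, batch_size=EMBEDDING_BATCH_SIZE, jitter=True):
    """Embed chunks batch by batch, sleeping 0-5 s between calls to avoid
    spiking the embedding endpoint (step 6)."""
    vectors = []
    for i in range(0, len(chunks), batch_size):
        if jitter and i > 0:
            time.sleep(random.uniform(0, 5))
        vectors.extend(embed_batch(chunks[i:i + batch_size]))
    return vectors


def upsert_file_chunks(store, file_id, chunks, vectors):
    """Delete-then-insert (step 7): old chunks go first so no orphans remain."""
    store.delete_chunks(file_id)
    for n, (text, vec) in enumerate(zip(chunks, vectors)):
        store.upsert({"id": f"{file_id}:{n}", "fileId": file_id,
                      "text": text, "embedding": vec})
```

Deleting before inserting means a file that shrinks from ten chunks to three never leaves seven stale chunks behind in the container.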
## Scaling
Index processors run as Azure Container Apps with a Dapr sidecar. They scale automatically based on Service Bus topic message count.
| Parameter | Value |
|---|---|
| Min replicas | 0 (scale to zero when idle) |
| Max replicas | Configurable per index (default 5) |
| Scale trigger | `INDEX_PROCESSOR_SCALE_OUT_MESSAGE_COUNT` (default 10) |
| Polling interval | 5 seconds |
| Cooldown period | 600 seconds |
| CPU per replica | 2 vCPU |
| Memory per replica | 4 Gi |
When no messages are pending, the processor scales to zero replicas. As messages arrive, Container Apps polls the topic every 5 seconds and adds replicas when the message count exceeds the configured threshold. After the queue drains, the 600-second cooldown prevents premature scale-down during bursty workloads.
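The parameters above roughly correspond to a Container Apps custom scale rule backed by the KEDA `azure-servicebus` scaler. The fragment below is a sketch, not the deployed template: topic, subscription, and secret names are placeholders, and `pollingInterval`/`cooldownPeriod` support depends on the Container Apps API version in use.

```yaml
scale:
  minReplicas: 0
  maxReplicas: 5            # configurable per index
  pollingInterval: 5        # seconds (assumed configurable in this environment)
  cooldownPeriod: 600       # seconds
  rules:
    - name: servicebus-messages
      custom:
        type: azure-servicebus
        metadata:
          topicName: index-topic-placeholder        # per-index topic (placeholder)
          subscriptionName: index-processor         # placeholder
          messageCount: "10"                        # INDEX_PROCESSOR_SCALE_OUT_MESSAGE_COUNT
        auth:
          - secretRef: servicebus-connection        # placeholder secret name
            triggerParameter: connection
```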
## File Lifecycle Events
The index processor handles all blob lifecycle events dispatched by the control plane:
| Event | Action |
|---|---|
| `BlobCreated` | Index the new file — convert, chunk, embed, and store vectors |
| `BlobRenamed` | Re-index under the new name and clean up all chunks associated with the old name |
| `BlobDeleted` | Remove all chunks for that file from the Cosmos container |
| `DirectoryCreated` | Index all files within the new directory |
| `DirectoryDeleted` | Remove all chunks for every file in the deleted directory |
| `DirectoryRenamed` | Re-index all files under the new directory path and clean up old chunks |
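The event-to-action mapping in the table can be sketched as a dispatcher. The handler and event field names here (`index_file`, `delete_chunks`, `list_files`, `oldPath`, `newPath`) are illustrative stand-ins, not the actual code.

```python
def handle_event(event, index_file, delete_chunks, list_files):
    """Dispatch a blob lifecycle event to pipeline actions.

    index_file / delete_chunks / list_files are injected stand-ins for the
    real indexing pipeline, chunk store, and path listing.
    """
    etype = event["type"]
    if etype == "BlobCreated":
        index_file(event["path"])
    elif etype == "BlobDeleted":
        delete_chunks(event["path"])
    elif etype == "BlobRenamed":
        delete_chunks(event["oldPath"])   # clean up chunks under the old name
        index_file(event["newPath"])      # re-index under the new name
    elif etype == "DirectoryCreated":
        for path in list_files(event["path"]):
            index_file(path)
    elif etype == "DirectoryDeleted":
        for path in list_files(event["path"]):
            delete_chunks(path)
    elif etype == "DirectoryRenamed":
        for path in list_files(event["oldPath"]):
            delete_chunks(path)
        for path in list_files(event["newPath"]):
            index_file(path)
    else:
        raise ValueError(f"unhandled event type: {etype}")
```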
## Unsupported File Types
The following file types are blocked at dispatch and never reach the processor:
| Extension | Category | Reason |
|---|---|---|
| `.mp4`, `.mov`, `.wmv`, `.avi` | Video | No video-to-text conversion available |
| `.mp3` | Audio | Blocked at dispatch, even though markitdown-pro natively supports MP3 transcription via Azure Speech Services and OpenAI Whisper |
| `.doc` | Legacy Word | Use `.docx` instead |
| `.xls` | Legacy Excel | Use `.xlsx` instead |
| `.ppt` | Legacy PowerPoint | Use `.pptx` instead |
Legacy Office formats without the `x` suffix use the older binary format, which is not reliably parseable. Users should convert these to their modern equivalents before uploading.
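A minimal sketch of the dispatch-time filter implied by the table. The function name and return shape are illustrative assumptions; only the extension list mirrors the documented behavior.

```python
# Extensions blocked at dispatch, per the table above.
BLOCKED_EXTENSIONS = {
    ".mp4", ".mov", ".wmv", ".avi",  # video: no video-to-text conversion
    ".mp3",                          # audio: blocked at dispatch
    ".doc", ".xls", ".ppt",          # legacy Office binary formats
}

MODERN_EQUIVALENT = {".doc": ".docx", ".xls": ".xlsx", ".ppt": ".pptx"}


def check_supported(filename: str):
    """Return (supported, hint); hint suggests a conversion when one exists."""
    ext = "." + filename.lower().rsplit(".", 1)[-1] if "." in filename else ""
    if ext not in BLOCKED_EXTENSIONS:
        return True, None
    hint = MODERN_EQUIVALENT.get(ext)
    return False, (f"convert to {hint}" if hint else "unsupported media type")
```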
## Error Handling
The processor uses smart retry logic with exponential backoff for transient failures:
- **Max retries per file** — Controlled by `FILE_INDEXING_MAX_RETRIES`. After exceeding the retry limit, the file is marked as `FAILED` and skipped in subsequent runs.
- **Failure count tracking** — Each file tracks its cumulative failure count. This persists across runs so that consistently failing files do not block processing indefinitely.
- **Stale file detection** — If a file remains in `PROCESSING` state for longer than `INDEXER_STALE_FILE_TIMEOUT_MINUTES` (default 15 minutes), it is considered stale. The processor resets its status to `PENDING` so it can be retried on the next run.
- **Exponential backoff** — Retry delays increase exponentially to avoid hammering downstream services (Azure OpenAI, Cosmos DB) during outages.
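The retry rules above can be sketched as three small helpers. The max-retry default of 5, the backoff base/cap, and the use of full jitter are assumptions for illustration; only the 15-minute stale timeout comes from the text.

```python
import random

FILE_INDEXING_MAX_RETRIES = 5           # illustrative default
INDEXER_STALE_FILE_TIMEOUT_MINUTES = 15  # documented default


def backoff_seconds(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter, capped (assumed strategy)."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def next_status(failure_count: int, max_retries: int = FILE_INDEXING_MAX_RETRIES) -> str:
    """A file past the retry limit is FAILED and skipped in later runs."""
    return "FAILED" if failure_count >= max_retries else "PENDING"


def is_stale(minutes_in_processing: float,
             timeout: float = INDEXER_STALE_FILE_TIMEOUT_MINUTES) -> bool:
    """Files stuck in PROCESSING beyond the timeout are reset to PENDING."""
    return minutes_in_processing > timeout
```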
## Telemetry
The processor emits custom histogram metrics for monitoring pipeline performance:
| Metric | Description |
|---|---|
| `fileMarkdownTime` | Time to convert a file to markdown |
| `embeddingTime` | Time to generate embeddings (tagged with text length) |
| `fileIndexingTime` | Total end-to-end time for indexing a single file |
| `fileDeletionTime` | Time to delete a file and its chunks |
| `chunksUpsertTime` | Time to upsert chunks to Cosmos (tagged with chunk count) |
| `chunksDeleteTime` | Time to delete old chunks from Cosmos |
| `vectorQueryRUs` | Cosmos DB request units consumed by vector queries |
These metrics integrate with Azure Monitor and can be used to alert on degraded performance or to inform capacity planning.
## Manual Triggers
In addition to event-driven processing, you can trigger indexing runs manually via the API:
- `POST /indexes/{indexId}/run` — Processes only files with `PENDING` status. Use this to retry files that previously failed or to kick off processing after bulk uploads.
- `POST /indexes/{indexId}/run?full=true` — Forces a full reindex of all files in the index, regardless of their current status. Every file is re-downloaded, re-converted, re-chunked, and re-embedded. Use this after changing chunking strategies or embedding models, or when you suspect data corruption.
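For example, a client helper might build the trigger URL like this. The base URL is deployment-specific and the helper name is hypothetical; authentication is omitted.

```python
from urllib.parse import urlencode


def build_run_url(base: str, index_id: str, full: bool = False) -> str:
    """Build the manual-trigger URL; pass full=True to force a full reindex."""
    url = f"{base}/indexes/{index_id}/run"
    return url + ("?" + urlencode({"full": "true"}) if full else "")
```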