Index Processor

Data-plane worker that transforms source files into searchable chunks

The index processor is the data-plane worker that turns source files into searchable chunks. It is event-driven via Dapr and Azure Service Bus sessions, aware of Cosmos DB capacity limits, and automatically cleans up stale or missing content. One index processor instance is deployed per index.

Overview

Each index processor subscribes to a dedicated Service Bus topic for its index. When files are added, renamed, or deleted from the connected blob storage, the control plane dispatches work messages to the topic. The processor picks up those messages, converts files to markdown, chunks them, generates embeddings, and stores the resulting vectors in a per-index Cosmos DB container.

Session-based ordering on the Service Bus topic guarantees that messages for the same file are processed sequentially, preventing race conditions during rapid updates.

Processing Pipeline

The full pipeline for indexing a single file proceeds through these steps:

  1. Receive work message — The processor receives a message from the per-index Service Bus topic via a Dapr subscription. Messages use session-based ordering to ensure sequential processing per file.

  2. Initialize run state — The index status is set to RUNNING and an IndexRun record is created to track progress, counters, and timing for this processing batch.

  3. Download source file — The file is downloaded from Azure Blob Storage using the blob URL included in the work message.

  4. Convert to Markdown — The file is passed through markitdown_pro.ConversionPipeline, which tries multiple conversion methods in sequence (MarkItDown, Unstructured.io, Azure Document Intelligence, GPT-4o-mini) until sufficient content is extracted. See markitdown-pro for details.

  5. Split into chunks — The markdown output is split into chunks using LangChain splitters selected by file type:

    • Markdown files — RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN)
    • HTML files — HTMLHeaderTextSplitter (splits on header hierarchy)
    • JSON files — RecursiveJsonSplitter (preserves JSON structure)
    • Embedding-based splitting — SemanticChunker (splits on semantic similarity boundaries using embeddings)
    • All other files — RecursiveCharacterTextSplitter with markdown language hints
  6. Generate embeddings — Chunks are sent to Azure OpenAI in batches. The batch size is controlled by EMBEDDING_BATCH_SIZE. A random 0–5 second delay is injected between batch calls to avoid spiking the Azure OpenAI endpoint and triggering rate limits.

  7. Upsert chunks — Old chunks for the file are deleted from the per-index Cosmos DB container, then new chunks are upserted. This delete-then-insert approach ensures clean state without orphaned chunks.

  8. Finalize file status — The file is marked as INDEXED, and the IndexRun counters (files processed, chunks created, errors) are updated.
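The splitter selection in step 5 is essentially a dispatch on file extension. The sketch below illustrates that dispatch; select_splitter is a hypothetical helper name, the extension mapping is an assumption inferred from the list above, and real code would instantiate the LangChain splitter classes rather than return their names as strings.

```python
import os

# Assumed extension-to-splitter mapping; mirrors the list in step 5.
SPLITTER_BY_EXTENSION = {
    ".md":   "RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN)",
    ".html": "HTMLHeaderTextSplitter",
    ".json": "RecursiveJsonSplitter",
}

def select_splitter(filename: str) -> str:
    """Return the splitter used for a given source file, keyed by extension."""
    ext = os.path.splitext(filename)[1].lower()
    # Anything without a dedicated splitter falls back to the recursive
    # character splitter with markdown language hints.
    return SPLITTER_BY_EXTENSION.get(
        ext, "RecursiveCharacterTextSplitter (markdown hints)"
    )
```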
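Step 6 amounts to batching plus jitter. A minimal sketch, assuming an abstract embed_fn for the Azure OpenAI call and an illustrative batch size of 16 (the real value comes from EMBEDDING_BATCH_SIZE):

```python
import random
import time

EMBEDDING_BATCH_SIZE = 16  # illustrative; configured via EMBEDDING_BATCH_SIZE

def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_chunks(chunks, embed_fn, sleep=time.sleep):
    """Embed chunks batch by batch, sleeping 0-5 s between calls
    to avoid spiking the embedding endpoint and triggering rate limits."""
    vectors = []
    for batch in batched(chunks, EMBEDDING_BATCH_SIZE):
        vectors.extend(embed_fn(batch))
        sleep(random.uniform(0, 5))  # random jitter between batch calls
    return vectors
```

The injectable `sleep` parameter is a testability convenience, not part of the described design.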

Scaling

Index processors run as Azure Container Apps with a Dapr sidecar. They scale automatically based on Service Bus topic message count.

  • Min replicas — 0 (scale to zero when idle)
  • Max replicas — Configurable per index (default 5)
  • Scale trigger — INDEX_PROCESSOR_SCALE_OUT_MESSAGE_COUNT (default 10)
  • Polling interval — 5 seconds
  • Cooldown period — 600 seconds
  • CPU per replica — 2 vCPU
  • Memory per replica — 4 Gi

When no messages are pending, the processor scales to zero replicas. As messages arrive, Container Apps polls the topic every 5 seconds and adds replicas when the message count exceeds the configured threshold. After the queue drains, the 600-second cooldown prevents premature scale-down during bursty workloads.
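Container Apps scaling is driven by KEDA, whose Service Bus scaler targets roughly ceil(pending / threshold) replicas, clamped to the configured maximum. This sketch illustrates that arithmetic with the defaults above hardcoded (in the real deployment they come from configuration):

```python
import math

def desired_replicas(pending_messages: int,
                     message_count_threshold: int = 10,
                     max_replicas: int = 5) -> int:
    """KEDA-style target replica count: ceil(pending / threshold),
    clamped to [0, max_replicas]; zero replicas when the topic is empty."""
    if pending_messages == 0:
        return 0  # scale to zero when idle
    return min(max_replicas,
               math.ceil(pending_messages / message_count_threshold))
```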

File Lifecycle Events

The index processor handles all blob lifecycle events dispatched by the control plane:

  • BlobCreated — Index the new file: convert, chunk, embed, and store vectors
  • BlobRenamed — Re-index under the new name and clean up all chunks associated with the old name
  • BlobDeleted — Remove all chunks for that file from the Cosmos container
  • DirectoryCreated — Index all files within the new directory
  • DirectoryDeleted — Remove all chunks for every file in the deleted directory
  • DirectoryRenamed — Re-index all files under the new directory path, clean up old chunks
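The event handling above is a straightforward dispatch on event type. A minimal sketch, with the action labels being illustrative shorthand rather than names from the codebase:

```python
# Illustrative mapping of lifecycle events to high-level processor actions.
ACTIONS = {
    "BlobCreated":      "index",
    "BlobRenamed":      "reindex + cleanup old chunks",
    "BlobDeleted":      "delete chunks",
    "DirectoryCreated": "index all files",
    "DirectoryDeleted": "delete chunks for all files",
    "DirectoryRenamed": "reindex all files + cleanup old chunks",
}

def action_for(event_type: str) -> str:
    """Map a blob lifecycle event to the processor's action for it."""
    try:
        return ACTIONS[event_type]
    except KeyError:
        raise ValueError(f"unhandled lifecycle event: {event_type}")
```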

Unsupported File Types

The following file types are blocked at dispatch and never reach the processor:

  • .mp4, .mov, .wmv, .avi (video) — No video-to-text conversion available
  • .mp3 (audio) — Blocked at dispatch, although MP3 is natively supported by markitdown-pro for transcription via Azure Speech Services and OpenAI Whisper
  • .doc (legacy Word) — Use .docx instead
  • .xls (legacy Excel) — Use .xlsx instead
  • .ppt (legacy PowerPoint) — Use .pptx instead

Legacy Office formats without the x suffix use the older binary format that is not reliably parseable. Users should convert these to their modern equivalents before uploading.
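The dispatch-time check reduces to a case-insensitive extension blocklist. A sketch, assuming the set above is the complete list and is_dispatchable is a hypothetical helper name:

```python
import os

# Extensions blocked at dispatch (assumed complete per the table above).
BLOCKED_EXTENSIONS = {
    ".mp4", ".mov", ".wmv", ".avi",  # video: no video-to-text conversion
    ".mp3",                          # audio: blocked at dispatch
    ".doc", ".xls", ".ppt",          # legacy binary Office formats
}

def is_dispatchable(filename: str) -> bool:
    """True if a work message for this file may reach the index processor."""
    return os.path.splitext(filename)[1].lower() not in BLOCKED_EXTENSIONS
```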

Error Handling

The processor uses smart retry logic with exponential backoff for transient failures:

  • Max retries per file — Controlled by FILE_INDEXING_MAX_RETRIES. After exceeding the retry limit, the file is marked as FAILED and skipped in subsequent runs.
  • Failure count tracking — Each file tracks its cumulative failure count. This persists across runs so that consistently failing files do not block processing indefinitely.
  • Stale file detection — If a file remains in PROCESSING state for longer than INDEXER_STALE_FILE_TIMEOUT_MINUTES (default 15 minutes), it is considered stale. The processor resets its status to PENDING so it can be retried on the next run.
  • Exponential backoff — Retry delays increase exponentially to avoid hammering downstream services (Azure OpenAI, Cosmos DB) during outages.
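The source describes exponential backoff without specifying the exact schedule; a common variant is full jitter, sketched below with illustrative base and cap values (real values would be configuration-driven):

```python
import random

def retry_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: a random delay in
    [0, min(cap, base * 2**attempt)] seconds. base/cap are illustrative."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Randomizing the full interval (rather than sleeping exactly base * 2**attempt) spreads retries out, so many failing files do not all hammer Azure OpenAI or Cosmos DB at the same instant.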

Telemetry

The processor emits custom histogram metrics for monitoring pipeline performance:

  • fileMarkdownTime — Time to convert a file to markdown
  • embeddingTime — Time to generate embeddings (tagged with text length)
  • fileIndexingTime — Total end-to-end time for indexing a single file
  • fileDeletionTime — Time to delete a file and its chunks
  • chunksUpsertTime — Time to upsert chunks to Cosmos (tagged with chunk count)
  • chunksDeleteTime — Time to delete old chunks from Cosmos
  • vectorQueryRUs — Cosmos DB request units consumed by vector queries

These metrics integrate with Azure Monitor and can be used to set up alerts for degraded performance or capacity planning.
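Duration metrics like these are typically captured with a timing decorator around each pipeline step. The sketch below records into an in-memory dict as a stand-in for the real exporter (Azure Monitor); the decorator name and storage are hypothetical:

```python
import time
from collections import defaultdict

# In-memory stand-in for the real histogram exporter.
HISTOGRAMS = defaultdict(list)

def timed(metric_name, **tags):
    """Decorator that records a function's wall-clock duration under metric_name."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                HISTOGRAMS[metric_name].append(
                    {"seconds": time.perf_counter() - start, "tags": tags}
                )
        return inner
    return wrap

@timed("fileMarkdownTime")
def convert_to_markdown(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")  # placeholder conversion
```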

Manual Triggers

In addition to event-driven processing, you can trigger indexing runs manually via the API:

POST /indexes/{indexId}/run

Processes only files with PENDING status. Use this to retry files that previously failed or to kick off processing after bulk uploads.

POST /indexes/{indexId}/run?full=true

Forces a full reindex of all files in the index, regardless of their current status. Every file is re-downloaded, re-converted, re-chunked, and re-embedded. Use this after changing chunking strategies, embedding models, or when you suspect data corruption.
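A client calling these endpoints only needs to build the path and choose whether to set the full flag. A trivial, hypothetical helper:

```python
def run_endpoint(index_id: str, full: bool = False) -> str:
    """Build the manual-trigger path; full=True forces a complete reindex."""
    path = f"/indexes/{index_id}/run"
    return path + "?full=true" if full else path
```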
