How to Use Docling for OCR and Document Parsing in OpenRAG
OpenRAG delegates OCR and document parsing to docling-serve, a local HTTP service managed by the DoclingManager and consumed via the DoclingClient utility.
OpenRAG does not bundle its own OCR engine. Instead, it integrates with Docling through a lightweight service architecture that handles document conversion and text extraction. This guide explains how to configure, start, and use Docling for OCR and document parsing in OpenRAG based on the actual implementation in the langflow-ai/openrag repository.
Architecture Overview
The integration relies on three core components that communicate with a running docling-serve instance:
- DoclingClient (src/utils/docling_client.py) – Async HTTP wrapper that sends files to the /v1/convert/file endpoint and returns parsed JSON.
- DoclingManager (src/tui/managers/docling_manager.py) – Singleton process manager that starts, monitors, and persists the docling-serve subprocess across TUI sessions.
- Docling API Proxy (src/api/docling.py) – FastAPI health check endpoint that forwards requests to the running service.
Starting and Managing the Docling Service
Using the CLI Controller
OpenRAG provides a command-line interface for managing the Docling service lifecycle through scripts/docling_ctl.py:
# Start docling-serve on default port 5001
python -m scripts.docling_ctl start
# Start with multiple workers and UI enabled
python -m scripts.docling_ctl start --workers 2 --enable-ui
# Check service status
python -m scripts.docling_ctl status
# Stop the service
python -m scripts.docling_ctl stop
Process Persistence and Lifecycle Management
The DoclingManager class handles process persistence by writing the PID to ~/.openrag/tui/.docling.pid. When starting, it checks for an existing PID and reattaches to running processes rather than spawning duplicates. This allows the OCR service to survive across multiple TUI sessions without reinstalling heavy dependencies.
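The reattach check described above can be sketched roughly as follows. This is a minimal illustration and not the DoclingManager code: only the PID file path comes from this guide, while the helper names read_live_pid and save_pid are hypothetical, and the liveness probe relies on POSIX signal semantics.

```python
import os
from pathlib import Path
from typing import Optional

# Path taken from this guide; everything else in this sketch is illustrative.
PID_FILE = Path.home() / ".openrag" / "tui" / ".docling.pid"


def read_live_pid(pid_file: Path = PID_FILE) -> Optional[int]:
    """Return the PID stored in pid_file if that process is still alive, else None."""
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None  # no PID file, or its contents are not a number
    if pid <= 0:
        return None  # never probe process groups or "all processes"
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence (POSIX only)
    except ProcessLookupError:
        return None  # stale PID file: the process has exited
    except PermissionError:
        return pid  # the process exists but belongs to another user
    return pid


def save_pid(pid: int, pid_file: Path = PID_FILE) -> None:
    """Persist a PID so that a later session can find the running service."""
    pid_file.parent.mkdir(parents=True, exist_ok=True)
    pid_file.write_text(str(pid))
```

A manager built this way would call read_live_pid() first and spawn a new subprocess only when it returns None. A production version would probably also confirm that the PID really belongs to docling-serve, since PIDs can be reused by unrelated processes.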
For CI pipelines, use the warmup script to block until healthy:
python warm_up_docling.py
This script polls the health endpoint until docling-serve responds or the timeout (controlled by DOCLING_WARMUP_TIMEOUT) expires.
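The polling behavior can be approximated with a small loop like the one below. It is a sketch under stated assumptions rather than the contents of warm_up_docling.py: the function name wait_until_healthy, the 2-second interval, and the injectable sleep and clock parameters are all hypothetical, and only the 120-second default mirrors the DOCLING_WARMUP_TIMEOUT default listed in this guide.

```python
import time
from typing import Callable


def wait_until_healthy(
    is_healthy: Callable[[], bool],
    timeout: float = 120.0,   # mirrors the documented DOCLING_WARMUP_TIMEOUT default
    interval: float = 2.0,    # assumed polling interval, not taken from the repo
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_healthy() until it returns True or timeout seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        try:
            if is_healthy():
                return True
        except OSError:
            pass  # connection refused and similar errors mean "not up yet"
        if clock() >= deadline:
            return False
        sleep(interval)
```

The probe is passed in as a callable so the loop stays independent of any HTTP library; in practice it would issue a GET against the service's health endpoint and return True on a 200 response. A CI step can then exit non-zero when the function returns False.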
Converting Documents with DoclingClient
Converting Local Files
The convert_file() function in src/utils/docling_client.py handles async HTTP POST requests to the Docling service:
import httpx
from utils.docling_client import convert_file, DoclingServeError

async def parse_pdf(file_path: str):
    # One AsyncClient is opened for the call and closed when the block exits
    async with httpx.AsyncClient() as client:
        try:
            result = await convert_file(
                file_path,
                httpx_client=client
            )
            # result contains the parsed DoclingDocument JSON
            return result
        except DoclingServeError as e:
            print(f"Conversion failed: {e}")
            return None
Processing In-Memory Bytes
For streams or uploaded files already in memory, use convert_bytes():
import httpx
from utils.docling_client import convert_bytes

async def parse_bytes(content: bytes, filename: str):
    async with httpx.AsyncClient() as client:
        document = await convert_bytes(
            content,
            filename,
            httpx_client=client
        )
        return document
Both functions post to {DOCLING_SERVICE_URL}/v1/convert/file and return the parsed document JSON produced by docling-serve.
Health Monitoring and API Proxy
The OpenRAG frontend checks service availability through a proxy endpoint rather than contacting docling-serve directly:
// Frontend health check
fetch("/api/docling/health")
  .then(r => r.json())
  .then(data => console.log("Service status:", data));
The proxy in src/api/docling.py forwards this to the underlying service and handles timeouts gracefully, returning HTTP 503 with {status: "unhealthy"} if the service is unreachable.
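The failure mapping described here can be illustrated without any web framework. The sketch below is not the code in src/api/docling.py: the function name proxy_health and the injected probe callable are hypothetical, and the shape of the success body is an assumption, since this guide only documents the 503 response.

```python
from typing import Callable, Dict, Tuple


def proxy_health(probe: Callable[[], Dict]) -> Tuple[int, Dict]:
    """Translate the outcome of an upstream health probe into (HTTP status, JSON body)."""
    try:
        upstream = probe()
    except OSError:
        # TimeoutError and ConnectionError are both OSError subclasses,
        # so an unreachable or slow service ends up here.
        return 503, {"status": "unhealthy"}
    # Assumed success shape: pass the upstream payload through unchanged.
    return 200, upstream
```

Keeping the translation in one small function means the frontend always receives a JSON body, whether or not docling-serve is reachable.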
Configuration and Environment Variables
Control the Docling integration through these environment variables:
| Variable | Description | Default |
|---|---|---|
| DOCLING_SERVE_URL | Base URL for existing docling-serve instance | Auto-detected |
| DOCLING_OCR_ENGINE | OCR engine selection (tesseract, easyocr, etc.) | None (OCR disabled) |
| DOCLING_WORKERS | Concurrent worker processes | 1 |
| DOCLING_BIND_HOST | Network interface binding | 0.0.0.0 |
| DOCLING_WARMUP_TIMEOUT | Health check wait duration (seconds) | 120 |
Setting DOCLING_SERVE_URL bypasses the local process management and connects to an external service, useful for containerized deployments.
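One way to read these variables in a script of your own is shown below. This is a hedged sketch and not how OpenRAG loads its configuration: the DoclingSettings dataclass and load_settings function are hypothetical names, while the variable names and defaults are the ones listed in the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DoclingSettings:  # hypothetical container, not a class from the repository
    serve_url: Optional[str]   # None means no external service is configured
    ocr_engine: Optional[str]  # None means OCR is disabled
    workers: int
    bind_host: str
    warmup_timeout: float


def load_settings(env: Mapping[str, str] = os.environ) -> DoclingSettings:
    """Read the documented variables, falling back to the documented defaults."""
    return DoclingSettings(
        serve_url=env.get("DOCLING_SERVE_URL") or None,
        ocr_engine=env.get("DOCLING_OCR_ENGINE") or None,
        workers=int(env.get("DOCLING_WORKERS", "1")),
        bind_host=env.get("DOCLING_BIND_HOST", "0.0.0.0"),
        warmup_timeout=float(env.get("DOCLING_WARMUP_TIMEOUT", "120")),
    )
```

Accepting the environment as a mapping keeps the function easy to test, and a non-None serve_url is the signal to skip local process management.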
Summary
- OpenRAG uses docling-serve as an external HTTP service rather than embedding OCR directly.
- The DoclingManager (src/tui/managers/docling_manager.py) handles process lifecycle and PID persistence across sessions.
- The DoclingClient (src/utils/docling_client.py) provides convert_file() and convert_bytes() for async document conversion.
- Environment variables like DOCLING_SERVE_URL and DOCLING_OCR_ENGINE control service location and OCR behavior.
- Health monitoring flows through the API proxy (src/api/docling.py) to provide frontend visibility into service status.
Frequently Asked Questions
How do I enable OCR when starting the Docling service?
Set the DOCLING_OCR_ENGINE environment variable to your preferred engine before starting the service. For example, export DOCLING_OCR_ENGINE=tesseract enables Tesseract OCR. If this variable is unset, docling-serve runs without OCR capabilities and extracts only embedded text.
Can I use an existing Docling server instead of letting OpenRAG manage the process?
Yes. Set the DOCLING_SERVE_URL environment variable to the base URL of your running instance (e.g., http://docling-service:5001). When this variable is present, OpenRAG skips the auto-start logic in DoclingManager and connects directly to the specified endpoint for all conversion requests.
What happens if the Docling service crashes during a conversion?
The DoclingClient raises a DoclingServeError exception when it encounters connection failures, timeouts, or non-200 HTTP responses. Your application code should catch this exception and handle retries or fallback logic. Before retrying a conversion, run python -m scripts.docling_ctl status to verify that the docling-serve process is still running.
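A generic retry helper is one way to implement that handling. The sketch below is an assumption-laden illustration and not part of OpenRAG: with_retries is a hypothetical name, and the attempt count and backoff delays are arbitrary choices. The exception types to retry on are passed in, so the helper itself does not depend on the Docling client.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def with_retries(
    operation: Callable[[], Awaitable[T]],
    retry_on: Tuple[Type[Exception], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Await operation(), retrying with exponential backoff on the given exceptions."""
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            # Wait base_delay, then 2x, then 4x, and so on between attempts
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

With the client from this guide, a call might look like await with_retries(lambda: convert_file(path, httpx_client=client), (DoclingServeError,)). Retrying only helps with transient failures; if the service process has exited, it needs to be restarted first.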
How do I process documents already loaded in memory rather than files on disk?
Use the convert_bytes() function from src/utils/docling_client.py instead of convert_file(). Pass the byte content and a filename string (used for content-type detection) along with your httpx.AsyncClient instance. This approach works for uploaded files, generated reports, or any binary stream without writing to the filesystem.