
How the ONNX Embeddings Model Runs Locally and Keeps Data Private in MCP Memory Service

February 28, 2026 doobidoo/mcp-memory-service ↗

MCP Memory Service uses an ONNX Runtime-based embedding pipeline that downloads the all-MiniLM-L6-v2 model once, caches it locally, and then performs inference entirely on-device with no further network calls, so the text being embedded never leaves the machine.

The doobidoo/mcp-memory-service repository provides an optional ONNX-based embedding pipeline that generates vector representations of text on the host machine. After a one-time model download, the pipeline works offline: sensitive data never leaves the host, and semantic search does not depend on an external embedding API.

Local ONNX Runtime Architecture

The embedding system is built around lightweight inference components that need only ONNX Runtime and the tokenizers library and that make no external API calls.

PyTorch-Free Inference Engine

The pipeline uses ONNX Runtime (ort) as its inference engine, eliminating the need for heavy PyTorch dependencies. In src/mcp_memory_service/embeddings/onnx_embeddings.py, the implementation imports onnxruntime and the tokenizers library to handle model execution and text processing. This design allows the service to run on CPU, CUDA, DirectML, or CoreML providers without installing full machine learning frameworks.
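
The optional-dependency pattern described above can be sketched roughly as follows. This is an illustration rather than the repository's code: the flag name ONNX_AVAILABLE and the helper available_providers() are hypothetical, while ort.get_available_providers() is a standard ONNX Runtime call.

```python
import logging

logger = logging.getLogger(__name__)

# Import the two runtime dependencies; degrade gracefully if either is missing.
try:
    import onnxruntime as ort
    from tokenizers import Tokenizer
    ONNX_AVAILABLE = True  # hypothetical flag name, not taken from the repository
except ImportError:
    ort = None
    Tokenizer = None
    ONNX_AVAILABLE = False
    logger.warning("onnxruntime or tokenizers not installed; ONNX embeddings disabled")


def available_providers() -> list[str]:
    """Return the execution providers ONNX Runtime reports, or [] if it is not installed."""
    if not ONNX_AVAILABLE:
        return []
    return list(ort.get_available_providers())
```

Callers can then branch on the flag instead of failing at import time, which is what lets the ONNX pipeline stay optional.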

Self-Contained Model Management

The system manages the all-MiniLM-L6-v2 model as a self-contained artifact. On initialization, the code checks for the presence of model.onnx in the user's cache directory at ~/.cache/mcp_memory/onnx_models/all-MiniLM-L6-v2/onnx. If absent, the _download_model_if_needed() function fetches the model archive from a public S3 bucket, verifies it against a hardcoded SHA-256 hash (_MODEL_SHA256), and extracts it using _safe_tar_extract() with tar-slip protection.
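
As a rough sketch of the checksum step, the following hashes a downloaded archive and compares it with a pinned value. The helper names sha256_of_file() and verify_archive() are hypothetical, and the real _download_model_if_needed() may structure this differently; only the idea of comparing against a hardcoded SHA-256 comes from the source.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large archive is never loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the archive does not match the pinned hash."""
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"SHA-256 mismatch for {path.name}: got {actual}")
```

Pinning the hash in code means a tampered or truncated download fails before anything is extracted.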

How the Model Stays Private: The Encoding Pipeline

Data privacy follows from the design of the encoding path: once the model is cached, nothing in that path makes a network request.

Secure Download and Verification

The model is downloaded once per installation. The _download_model_if_needed() method in onnx_embeddings.py verifies the archive's SHA-256 checksum before extraction to confirm model integrity. The _safe_tar_extract() function then validates that each archive member resolves within the target directory, which prevents path traversal attacks during extraction.
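
Tar-slip protection of the kind described above can be sketched as below. This is not the repository's _safe_tar_extract(); the function name safe_extract() is hypothetical, and the sketch only checks member paths (it relies on the standard library's "data" extraction filter, where available, for link handling). It assumes Python 3.9+ for Path.is_relative_to().

```python
import tarfile
from pathlib import Path


def safe_extract(archive: tarfile.TarFile, target_dir: Path) -> None:
    """Extract the archive only if every member path resolves inside target_dir."""
    root = target_dir.resolve()
    for member in archive.getmembers():
        resolved = (root / member.name).resolve()
        # Reject absolute paths and "../" sequences that escape the target directory.
        if resolved != root and not resolved.is_relative_to(root):
            raise ValueError(f"Blocked path traversal attempt: {member.name}")
    # Newer Python versions ship extraction filters; use the "data" filter when present.
    if hasattr(tarfile, "data_filter"):
        archive.extractall(root, filter="data")
    else:
        archive.extractall(root)
```

Validating every member before extracting anything means a single malicious entry aborts the whole operation instead of leaving a partially extracted archive.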

Zero-Network Inference

After caching, the ONNXEmbeddingModel.encode() method processes all text locally. The method tokenizes input using the cached tokenizer.json, pads tensors to the longest sequence, and runs inference via ort.InferenceSession. The token outputs are then mean-pooled using the attention mask and L2-normalized, and the result is returned as a NumPy array of embeddings. No HTTP calls occur during this process; the data remains in host memory throughout the computation.
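
The padding, mean-pooling, and normalization steps can be illustrated with plain NumPy. This is a sketch under assumptions, not the body of encode(): the function names are hypothetical, and the padding id of 0 is an assumption about the tokenizer's vocabulary.

```python
import numpy as np


def pad_to_longest(token_ids: list[list[int]], pad_id: int = 0):
    """Pad each sequence to the longest one and build the matching attention mask."""
    max_len = max(len(ids) for ids in token_ids)
    input_ids = np.full((len(token_ids), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(token_ids), max_len), dtype=np.int64)
    for row, ids in enumerate(token_ids):
        input_ids[row, : len(ids)] = ids
        attention_mask[row, : len(ids)] = 1
    return input_ids, attention_mask


def mean_pool_and_normalize(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """
    token_embeddings: (batch, seq_len, hidden) per-token model output
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)      # padding contributes nothing
    counts = np.clip(mask.sum(axis=1), 1e-9, None)      # avoid division by zero
    pooled = summed / counts
    norms = np.clip(np.linalg.norm(pooled, axis=1, keepdims=True), 1e-12, None)
    return pooled / norms                               # unit-length rows
```

Masking before averaging matters because padded positions would otherwise pull short sentences toward whatever the model emits for padding tokens.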

Implementation Details and Code Paths

The privacy guarantees rely on specific implementation choices across the codebase:

Optional Import Handling: Lines 17-30 in onnx_embeddings.py import ONNX Runtime and tokenizers with graceful degradation if dependencies are missing.
Provider Selection: The get_onnx_embedding_model() function (lines 38-53) builds an execution provider list prioritizing CUDA, DirectML, CoreML, and falling back to CPU.
Device Reporting: The device property returns "cpu" for API compatibility, though ONNX Runtime internally selects the optimal provider.
Environment Control: The service checks MCP_MEMORY_USE_ONNX to toggle the ONNX pipeline, documented in docs/mastery/configuration-guide.md.
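
The provider ordering and the environment toggle listed above might look roughly like this. The function names choose_providers() and onnx_enabled() are hypothetical, the set of accepted truthy values is an assumption, and so is treating an unset variable as disabled; the provider strings are ONNX Runtime's own identifiers.

```python
import os

# Preference order from the list above, using ONNX Runtime's provider identifiers.
PREFERRED_PROVIDERS = [
    "CUDAExecutionProvider",
    "DmlExecutionProvider",      # DirectML
    "CoreMLExecutionProvider",
    "CPUExecutionProvider",
]


def choose_providers(available: list[str]) -> list[str]:
    """Keep the preferred providers that are actually available, in priority order."""
    chosen = [p for p in PREFERRED_PROVIDERS if p in available]
    return chosen or ["CPUExecutionProvider"]


def onnx_enabled(env=None) -> bool:
    """Read MCP_MEMORY_USE_ONNX; the accepted values here are an assumption."""
    env = os.environ if env is None else env
    return env.get("MCP_MEMORY_USE_ONNX", "").strip().lower() in {"1", "true", "yes"}
```

A list built this way can be passed as the providers argument of ort.InferenceSession, which tries the entries in order.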

Practical Usage Example

Enable and use the local ONNX embedding model with the following pattern:

from mcp_memory_service.embeddings.onnx_embeddings import get_onnx_embedding_model

# Ensure ONNX is enabled via environment or configuration:
#   export MCP_MEMORY_USE_ONNX=true

# Initialize the model (downloads on first run, caches thereafter)
model = get_onnx_embedding_model()
if model is None:
    raise RuntimeError("ONNX embedding model unavailable – verify onnxruntime and tokenizers are installed.")

# Encode text locally – no network calls made
texts = [
    "How does ONNX keep my data private?",
    "MCP Memory Service runs entirely offline."
]
embeddings = model.encode(texts)  # Returns shape (2, 384) for MiniLM-L6-v2

print(f"Embeddings shape: {embeddings.shape}")
print(f"First 5 dimensions: {embeddings[0][:5]}")

This example demonstrates the complete local workflow: initialization checks the cache at ~/.cache/mcp_memory/onnx_models/ and downloads the model only if necessary, after which ONNXEmbeddingModel.encode() produces all embeddings without external API calls.
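
Because the embeddings are L2-normalized, the dot product of two vectors equals their cosine similarity, so a local semantic ranking needs nothing beyond NumPy. The sketch below uses small hand-written vectors in place of real model output, and rank_by_similarity() is a hypothetical helper, not part of the service, which performs its search through its own storage backend.

```python
import numpy as np


def rank_by_similarity(query_vec: np.ndarray, doc_vecs: np.ndarray) -> list[tuple[int, float]]:
    """Rank documents by cosine similarity, assuming all vectors are unit length."""
    scores = doc_vecs @ query_vec          # dot product == cosine for unit vectors
    order = np.argsort(-scores)            # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Stand-in unit vectors; in practice these would come from model.encode(...)
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query = np.array([1.0, 0.0])
ranked = rank_by_similarity(query, docs)
```

With real embeddings, doc_vecs would have shape (n, 384) and query_vec shape (384,).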

Summary

Local Execution: The ONNX embedding pipeline uses ONNX Runtime and the tokenizers library to run all-MiniLM-L6-v2 entirely on-device, with no PyTorch dependencies.
Secure Caching: The model downloads once to ~/.cache/mcp_memory/onnx_models/, verified via SHA-256 checksum and extracted with tar-slip protection via _safe_tar_extract().
Zero-Network Inference: After caching, ONNXEmbeddingModel.encode() tokenizes, pads, and runs inference locally using ort.InferenceSession, ensuring raw text never leaves the host machine.
Flexible Hardware: The implementation automatically selects available execution providers (CUDA, DirectML, CoreML, CPU) while maintaining privacy across all configurations.

Frequently Asked Questions

Does the MCP Memory Service send my text to OpenAI or any external API when using ONNX embeddings?

No. When using the ONNX embedding pipeline, all text processing happens locally on your machine. The ONNXEmbeddingModel.encode() method in src/mcp_memory_service/embeddings/onnx_embeddings.py runs tokenization and inference via ONNX Runtime without making any HTTP calls. Your raw text never leaves the host.

What happens if the ONNX model isn't downloaded yet? Will my data be sent somewhere else?

If the model is not present in ~/.cache/mcp_memory/onnx_models/, the service downloads the model archive from a public S3 bucket once during initialization. This download fetches only the model weights, not your data. The _download_model_if_needed() function verifies the download using a SHA-256 hash before extraction. Once cached, no further network activity occurs during encoding.

Can I use GPU acceleration with the local ONNX embeddings while keeping data private?

Yes. The get_onnx_embedding_model() function automatically configures ONNX Runtime to use available GPU providers such as CUDA, DirectML, or CoreML before falling back to CPU. Because ONNX Runtime runs locally on your machine, using GPU acceleration does not compromise privacy: your data remains in local memory and never traverses the network, regardless of which execution provider is active.

How do I verify that my MCP Memory Service instance is actually using the local ONNX model and not a cloud-based alternative?

Check that the MCP_MEMORY_USE_ONNX environment variable is set to true and that the get_onnx_embedding_model() function returns a valid ONNXEmbeddingModel instance rather than None. You can also verify the presence of cached files in ~/.cache/mcp_memory/onnx_models/all-MiniLM-L6-v2/onnx/. When active, the service logs will indicate local ONNX Runtime initialization rather than external API client initialization.
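
A quick way to script the cache check is sketched below. The helper missing_cache_files() is hypothetical, and the assumption that tokenizer.json sits next to model.onnx in the same directory should be confirmed against your own cache.

```python
from pathlib import Path


def missing_cache_files(cache_dir: Path, required=("model.onnx", "tokenizer.json")) -> list[str]:
    """Return the names of expected files that are not present in cache_dir."""
    return [name for name in required if not (cache_dir / name).is_file()]


# Cache location named in this article
default_dir = Path.home() / ".cache" / "mcp_memory" / "onnx_models" / "all-MiniLM-L6-v2" / "onnx"
missing = missing_cache_files(default_dir)
print("cache complete" if not missing else f"missing from cache: {missing}")
```

An empty result only shows that the files exist; it does not by itself prove the running service loaded them.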
