# How the ONNX Embeddings Model Runs Locally and Keeps Data Private in MCP Memory Service

> Learn how the ONNX embeddings model runs locally in MCP Memory Service using cached models for private, on-device inference. After a one-time model download, encoding makes no network calls.

- Repository: [doobidoo/mcp-memory-service](https://github.com/doobidoo/mcp-memory-service)
- Tags: internals
- Published: 2026-02-28

---

**MCP Memory Service uses an ONNX Runtime-based embedding pipeline that downloads the all-MiniLM-L6-v2 model once, caches it locally, and then performs inference entirely on-device without network calls, so the text being embedded never leaves the host machine.**

The `doobidoo/mcp-memory-service` repository provides an optional ONNX-based embedding pipeline that generates vector representations of text completely offline. This architecture ensures that sensitive data never leaves the host machine while maintaining high-performance semantic search capabilities.

## Local ONNX Runtime Architecture

The embedding system is built around lightweight inference components (ONNX Runtime and the `tokenizers` library) that operate without external API calls.

### Lightweight Inference Engine

The pipeline uses **ONNX Runtime** (`ort`) as its inference engine, eliminating the need for heavy PyTorch dependencies. In [`src/mcp_memory_service/embeddings/onnx_embeddings.py`](https://github.com/doobidoo/mcp-memory-service/blob/main/src/mcp_memory_service/embeddings/onnx_embeddings.py), the implementation imports `onnxruntime` and the `tokenizers` library to handle model execution and text processing. This design allows the service to run on CPU, CUDA, DirectML, or CoreML providers without installing full machine learning frameworks.

### Self-Contained Model Management

The system manages the *all-MiniLM-L6-v2* model as a self-contained artifact. On initialization, the code checks for the presence of `model.onnx` in the user's cache directory at `~/.cache/mcp_memory/onnx_models/all-MiniLM-L6-v2/onnx`. If absent, the `_download_model_if_needed()` function fetches the model archive from a public S3 bucket, verifies it against a hardcoded SHA-256 hash (`_MODEL_SHA256`), and extracts it using `_safe_tar_extract()` with tar-slip protection.
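The checksum step described above can be sketched as follows. This is a minimal illustration, not the repository's code: `sha256_of_file` and `verify_archive` are hypothetical names, and the expected digest stands in for whatever value `_MODEL_SHA256` holds.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large archive is never loaded fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: Path, expected_sha256: str) -> bool:
    """Return True only when the file's digest matches the pinned hex digest."""
    # Hypothetical helper: the pinned value would play the role of _MODEL_SHA256.
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.lower())
```

Verifying before extraction means a corrupted or tampered archive is rejected before any of its contents touch the cache directory.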

## How the Model Stays Private: The Encoding Pipeline

Data privacy is enforced through architectural guarantees that prevent network egress during the encoding process.

### Secure Download and Verification

The initial model acquisition happens once per installation. The `_download_model_if_needed()` method in [`onnx_embeddings.py`](https://github.com/doobidoo/mcp-memory-service/blob/main/src/mcp_memory_service/embeddings/onnx_embeddings.py) verifies the archive's SHA-256 checksum before extraction to confirm the model's integrity. The `_safe_tar_extract()` function validates that each archive member resolves within the target directory, preventing path traversal attacks during extraction.
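A tar-slip check of the kind described above might look like the sketch below. `safe_extract` is a hypothetical stand-in for illustration and is not the repository's `_safe_tar_extract()`; it assumes Python 3.9+ for `Path.is_relative_to`.

```python
import tarfile
from pathlib import Path


def safe_extract(archive: Path, target_dir: Path) -> None:
    """Extract a tar archive, refusing members that would land outside target_dir."""
    root = target_dir.resolve()
    with tarfile.open(archive) as tar:
        # Validate every member before writing anything to disk.
        for member in tar.getmembers():
            destination = (root / member.name).resolve()
            if not destination.is_relative_to(root):
                raise ValueError(f"Unsafe path in archive: {member.name}")
        # Use the stdlib "data" extraction filter where this Python version offers it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(root, **kwargs)
```

Checking all members first means a malicious archive is rejected as a whole rather than partially extracted.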

### Zero-Network Inference

After caching, the `ONNXEmbeddingModel.encode()` method processes all text locally. The method tokenizes input using the cached `tokenizer.json`, pads tensors to the longest sequence in the batch, and runs inference via `ort.InferenceSession`. The token outputs are mean-pooled using the attention mask and then L2-normalized, and the result is returned as a NumPy array of embeddings. No HTTP calls occur during this process; the data remains in host memory throughout the computation.
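The pooling and normalization step can be illustrated in plain NumPy. This is a sketch of the general technique under assumed tensor shapes, not the repository's code; `mean_pool_and_normalize` is a hypothetical name.

```python
import numpy as np


def mean_pool_and_normalize(token_embeddings: np.ndarray,
                            attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over real (non-padding) tokens, then L2-normalize.

    token_embeddings: shape (batch, seq_len, hidden)
    attention_mask:   shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)    # padding contributes zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)       # unit-length rows
```

Masking before averaging keeps padded positions from diluting the sentence vector, and unit-length outputs let cosine similarity reduce to a dot product.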

## Implementation Details and Code Paths

The privacy guarantees rely on specific implementation choices across the codebase:

- **Optional Import Handling**: Lines 17-30 in [`onnx_embeddings.py`](https://github.com/doobidoo/mcp-memory-service/blob/main/src/mcp_memory_service/embeddings/onnx_embeddings.py) import ONNX Runtime and tokenizers with graceful degradation if dependencies are missing.

- **Provider Selection**: The `get_onnx_embedding_model()` function (lines 38-53) builds an execution provider list prioritizing CUDA, DirectML, CoreML, and falling back to CPU.

- **Device Reporting**: The `device` property returns `"cpu"` for API compatibility, though ONNX Runtime internally selects the optimal provider.

- **Environment Control**: The service checks `MCP_MEMORY_USE_ONNX` to toggle the ONNX pipeline, documented in [`docs/mastery/configuration-guide.md`](https://github.com/doobidoo/mcp-memory-service/blob/main/docs/mastery/configuration-guide.md).
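The optional import and provider prioritization from the list above could be sketched roughly as follows. The helper names `pick_providers` and `available_providers` are hypothetical; the provider strings and `onnxruntime.get_available_providers()` are standard ONNX Runtime identifiers, but the exact ordering logic in the repository may differ.

```python
# Optional import: degrade gracefully when onnxruntime is not installed.
try:
    import onnxruntime as ort
except ImportError:
    ort = None

# Preference order described in the text: GPU-style providers first, CPU last.
PREFERRED_PROVIDERS = [
    "CUDAExecutionProvider",
    "DmlExecutionProvider",      # DirectML
    "CoreMLExecutionProvider",
    "CPUExecutionProvider",
]


def available_providers() -> list:
    """Ask ONNX Runtime which providers exist on this machine (empty if not installed)."""
    return ort.get_available_providers() if ort is not None else []


def pick_providers(available: list) -> list:
    """Keep preferred providers that are actually available, in priority order."""
    chosen = [p for p in PREFERRED_PROVIDERS if p in available]
    return chosen or ["CPUExecutionProvider"]
```

A list built this way could be passed as the `providers` argument when creating an `ort.InferenceSession`, so the session falls back to CPU when no accelerator is present.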

## Practical Usage Example

Enable and use the local ONNX embedding model with the following pattern:

```python
from mcp_memory_service.embeddings.onnx_embeddings import get_onnx_embedding_model

# Ensure ONNX is enabled via environment or configuration:
#   export MCP_MEMORY_USE_ONNX=true

# Initialize the model (downloads on first run, caches thereafter)
model = get_onnx_embedding_model()
if model is None:
    raise RuntimeError("ONNX embedding model unavailable – verify onnxruntime and tokenizers are installed.")

# Encode text locally – no network calls made
texts = [
    "How does ONNX keep my data private?",
    "MCP Memory Service runs entirely offline."
]
embeddings = model.encode(texts)  # Returns shape (2, 384) for MiniLM-L6-v2

print(f"Embeddings shape: {embeddings.shape}")
print(f"First 5 dimensions: {embeddings[0][:5]}")
```

This example demonstrates the complete local workflow: initialization checks the cache at `~/.cache/mcp_memory/onnx_models/`, downloads only if necessary, and processes all embeddings through `ONNXEmbeddingModel.encode()` without external API calls.

## Summary

- **Local Execution**: The ONNX embedding pipeline uses ONNX Runtime and the `tokenizers` library to run *all-MiniLM-L6-v2* entirely on-device, with no PyTorch dependencies.

- **Secure Caching**: The model downloads once to `~/.cache/mcp_memory/onnx_models/`, verified via SHA-256 checksum and extracted with tar-slip protection via `_safe_tar_extract()`.

- **Zero-Network Inference**: After caching, `ONNXEmbeddingModel.encode()` tokenizes, pads, and runs inference locally using `ort.InferenceSession`, ensuring raw text never leaves the host machine.

- **Flexible Hardware**: The implementation automatically selects available execution providers (CUDA, DirectML, CoreML, CPU) while maintaining privacy across all configurations.

## Frequently Asked Questions

### Does the MCP Memory Service send my text to OpenAI or any external API when using ONNX embeddings?

No. When using the ONNX embedding pipeline, all text processing happens locally on your machine. The `ONNXEmbeddingModel.encode()` method in [`src/mcp_memory_service/embeddings/onnx_embeddings.py`](https://github.com/doobidoo/mcp-memory-service/blob/main/src/mcp_memory_service/embeddings/onnx_embeddings.py) runs tokenization and inference via ONNX Runtime without making any HTTP calls. Your raw text never leaves the host.

### What happens if the ONNX model isn't downloaded yet? Will my data be sent somewhere else?

If the model is not present in `~/.cache/mcp_memory/onnx_models/`, the service downloads the model archive from a public S3 bucket once during initialization. This download fetches only the model weights, not your data. The `_download_model_if_needed()` function verifies the download using a SHA-256 hash before extraction. Once cached, no further network activity occurs during encoding.

### Can I use GPU acceleration with the local ONNX embeddings while keeping data private?

Yes. The `get_onnx_embedding_model()` function automatically configures ONNX Runtime to use available GPU providers such as CUDA, DirectML, or CoreML before falling back to CPU. Because ONNX Runtime runs locally on your machine, using GPU acceleration does not compromise privacy: your data remains in local memory and never traverses the network regardless of which execution provider is active.

### How do I verify that my MCP Memory Service instance is actually using the local ONNX model and not a cloud-based alternative?

Check that the `MCP_MEMORY_USE_ONNX` environment variable is set to `true` and that the `get_onnx_embedding_model()` function returns a valid `ONNXEmbeddingModel` instance rather than `None`. You can also verify the presence of cached files in `~/.cache/mcp_memory/onnx_models/all-MiniLM-L6-v2/onnx/`. When active, the service logs will indicate local ONNX Runtime initialization rather than external API client initialization.
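A quick way to check the cache directory mentioned above is a small script like the one below. It is a hedged sketch: `cache_status` is a hypothetical helper, and the assumption that `tokenizer.json` sits next to `model.onnx` in the same directory is not confirmed by the source.

```python
from pathlib import Path

# Cache location quoted in this article; adjust if your installation differs.
DEFAULT_CACHE_DIR = (
    Path.home() / ".cache" / "mcp_memory" / "onnx_models" / "all-MiniLM-L6-v2" / "onnx"
)


def cache_status(cache_dir: Path = DEFAULT_CACHE_DIR,
                 names: tuple = ("model.onnx", "tokenizer.json")) -> dict:
    """Map each expected file name to whether it exists in the cache directory."""
    return {name: (cache_dir / name).is_file() for name in names}


if __name__ == "__main__":
    for name, present in cache_status().items():
        print(f"{name}: {'found' if present else 'missing'}")
```

If both files are reported as found, initialization should not need to download anything.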