How the ONNX Embeddings Model Runs Locally and Keeps Data Private in MCP Memory Service
MCP Memory Service uses an ONNX Runtime-based embedding pipeline that downloads the all-MiniLM-L6-v2 model once, caches it locally, and performs inference entirely on-device without network calls, ensuring complete data privacy.
The doobidoo/mcp-memory-service repository provides an optional ONNX-based embedding pipeline that generates vector representations of text completely offline. This architecture ensures that sensitive data never leaves the host machine while maintaining high-performance semantic search capabilities.
Local ONNX Runtime Architecture
The embedding system is built around lightweight inference components that need no PyTorch installation and make no external API calls.
PyTorch-Free Inference Engine
The pipeline uses ONNX Runtime (ort) as its inference engine, eliminating the need for heavy PyTorch dependencies. In src/mcp_memory_service/embeddings/onnx_embeddings.py, the implementation imports onnxruntime and the tokenizers library to handle model execution and text processing. This design allows the service to run on CPU, CUDA, DirectML, or CoreML providers without installing full machine learning frameworks.
Self-Contained Model Management
The system manages the all-MiniLM-L6-v2 model as a self-contained artifact. On initialization, the code checks for the presence of model.onnx in the user's cache directory at ~/.cache/mcp_memory/onnx_models/all-MiniLM-L6-v2/onnx. If absent, the _download_model_if_needed() function fetches the model archive from a public S3 bucket, verifies it against a hardcoded SHA-256 hash (_MODEL_SHA256), and extracts it using _safe_tar_extract() with tar-slip protection.
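The cache check and checksum step described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the repository's code: the helper names (sha256_of_file, needs_download, verify_archive) are hypothetical, and the expected digest is passed in by the caller rather than copied from _MODEL_SHA256.

```python
import hashlib
from pathlib import Path

# Cache location as given in the article; everything else here is illustrative.
MODEL_DIR = (
    Path.home() / ".cache" / "mcp_memory" / "onnx_models" / "all-MiniLM-L6-v2" / "onnx"
)


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large archive is never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(model_dir: Path = MODEL_DIR) -> bool:
    """Return True when model.onnx is missing from the cache directory."""
    return not (model_dir / "model.onnx").is_file()


def verify_archive(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the archive digest differs from the pinned value."""
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"SHA-256 mismatch: expected {expected_sha256}, got {actual}")
```

Pinning the digest in source means a tampered or truncated download fails before anything is unpacked, which is why verification comes ahead of extraction.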
How the Model Stays Private: The Encoding Pipeline
Data privacy is enforced through architectural guarantees that prevent network egress during the encoding process.
Secure Download and Verification
The initial model acquisition happens once per installation. The _download_model_if_needed() method in onnx_embeddings.py verifies the archive's SHA-256 checksum before extraction, so a corrupted or tampered download is rejected before any file is unpacked. The _safe_tar_extract() function then validates that each archive member resolves within the target directory, preventing path traversal attacks during extraction.
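To make the tar-slip check concrete, here is a minimal sketch of the general technique using Python's tarfile module. It is not the body of _safe_tar_extract(); the function name safe_extract is hypothetical, and rejecting link members is an extra precaution that the article does not describe.

```python
import tarfile
from pathlib import Path


def safe_extract(archive: Path, target_dir: Path) -> None:
    """Extract an archive only if every member stays inside target_dir."""
    target = target_dir.resolve()
    with tarfile.open(archive, "r:*") as tar:
        for member in tar.getmembers():
            # Resolve the final destination, collapsing any ".." segments.
            dest = (target / member.name).resolve()
            if dest != target and target not in dest.parents:
                raise ValueError(f"Blocked path traversal: {member.name}")
            # Assumption: links are refused outright, since a link can
            # point outside the target even when its own name looks safe.
            if member.issym() or member.islnk():
                raise ValueError(f"Blocked link member: {member.name}")
        # Nothing is written until every member has passed the checks.
        tar.extractall(target)
```

Validating all members before calling extractall() means a malicious entry late in the archive cannot leave a partially extracted tree behind.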
Zero-Network Inference
After caching, the ONNXEmbeddingModel.encode() method processes all text locally. The method tokenizes input using the cached tokenizer.json, pads each batch to the length of its longest sequence, and runs inference via ort.InferenceSession. The token-level outputs are mean-pooled using the attention mask, so padding positions do not contribute, and then L2-normalized. The result is a NumPy array of embeddings. No HTTP calls occur during this process; the data remains in host memory throughout the computation.
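The padding, masked mean-pooling, and normalization steps can be sketched in plain NumPy. This is an illustration of the standard technique rather than the repository's code: the function names are hypothetical, the padding ID of 0 is an assumption, and the token embeddings are passed in directly where the service would use the ONNX session output.

```python
import numpy as np


def pad_batch(token_ids: list[list[int]], pad_id: int = 0):
    """Pad variable-length ID lists to the longest one and build the mask."""
    max_len = max(len(ids) for ids in token_ids)
    input_ids = np.full((len(token_ids), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(token_ids), max_len), dtype=np.int64)
    for row, ids in enumerate(token_ids):
        input_ids[row, : len(ids)] = ids
        attention_mask[row, : len(ids)] = 1  # 1 marks a real token
    return input_ids, attention_mask


def mean_pool_and_normalize(token_embeddings: np.ndarray, attention_mask: np.ndarray):
    """Average real-token vectors per sequence, then scale to unit L2 norm.

    token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len).
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)   # padding contributes zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # guard against empty rows
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)
```

Because the output vectors have unit length, cosine similarity between two embeddings reduces to a dot product, which keeps downstream semantic search cheap.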
Implementation Details and Code Paths
The privacy guarantees rely on specific implementation choices across the codebase:
- Optional Import Handling: Lines 17-30 in onnx_embeddings.py import ONNX Runtime and tokenizers with graceful degradation if dependencies are missing.
- Provider Selection: The get_onnx_embedding_model() function (lines 38-53) builds an execution provider list prioritizing CUDA, DirectML, CoreML, and falling back to CPU.
- Device Reporting: The device property returns "cpu" for API compatibility, though ONNX Runtime internally selects the optimal provider.
- Environment Control: The service checks MCP_MEMORY_USE_ONNX to toggle the ONNX pipeline, documented in docs/mastery/configuration-guide.md.
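The optional-import and provider-selection points above might look roughly like the following. This is a sketch under assumptions, not the contents of get_onnx_embedding_model(): the names PREFERRED_PROVIDERS and build_provider_list are hypothetical, while the provider strings and ort.get_available_providers() are standard ONNX Runtime identifiers.

```python
# Optional import: degrade gracefully when onnxruntime is not installed.
try:
    import onnxruntime as ort
    AVAILABLE_PROVIDERS = ort.get_available_providers()
except ImportError:
    ort = None
    AVAILABLE_PROVIDERS = []

# Accelerated providers in the priority order the article lists
# (CUDA, then DirectML, then CoreML); CPU is the fallback.
PREFERRED_PROVIDERS = (
    "CUDAExecutionProvider",
    "DmlExecutionProvider",
    "CoreMLExecutionProvider",
)


def build_provider_list(available):
    """Keep the preferred providers that are present, then append CPU last."""
    providers = [p for p in PREFERRED_PROVIDERS if p in available]
    providers.append("CPUExecutionProvider")
    return providers


if __name__ == "__main__":
    print(build_provider_list(AVAILABLE_PROVIDERS))
```

A list built this way can be passed as the providers argument of ort.InferenceSession, which tries each entry in order. Every provider executes on the local machine, so the choice affects speed rather than where the data goes.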
Practical Usage Example
Enable and use the local ONNX embedding model with the following pattern:
from mcp_memory_service.embeddings.onnx_embeddings import get_onnx_embedding_model

# Ensure ONNX is enabled via environment or configuration
# export MCP_MEMORY_USE_ONNX=true

# Initialize the model (downloads on first run, caches thereafter)
model = get_onnx_embedding_model()
if model is None:
    raise RuntimeError("ONNX embedding model unavailable – verify onnxruntime and tokenizers are installed.")

# Encode text locally – no network calls made
texts = [
    "How does ONNX keep my data private?",
    "MCP Memory Service runs entirely offline.",
]
embeddings = model.encode(texts)  # Returns shape (2, 384) for MiniLM-L6-v2
print(f"Embeddings shape: {embeddings.shape}")
print(f"First 5 dimensions: {embeddings[0][:5]}")
This example demonstrates the complete local workflow: initialization checks the cache at ~/.cache/mcp_memory/onnx_models/, downloads only if necessary, and processes all embeddings through ONNXEmbeddingModel.encode() without external API calls.
Summary
- Local Execution: The ONNX embedding pipeline uses ONNX Runtime and the tokenizers library to run all-MiniLM-L6-v2 entirely on-device, with no PyTorch dependencies.
- Secure Caching: The model downloads once to ~/.cache/mcp_memory/onnx_models/, verified via SHA-256 checksum and extracted with tar-slip protection via _safe_tar_extract().
- Zero-Network Inference: After caching, ONNXEmbeddingModel.encode() tokenizes, pads, and runs inference locally using ort.InferenceSession, ensuring raw text never leaves the host machine.
- Flexible Hardware: The implementation automatically selects available execution providers (CUDA, DirectML, CoreML, CPU) while maintaining privacy across all configurations.
Frequently Asked Questions
Does the MCP Memory Service send my text to OpenAI or any external API when using ONNX embeddings?
No. When using the ONNX embedding pipeline, all text processing happens locally on your machine. The ONNXEmbeddingModel.encode() method in src/mcp_memory_service/embeddings/onnx_embeddings.py runs tokenization and inference via ONNX Runtime without making any HTTP calls. Your raw text never leaves the host.
What happens if the ONNX model isn't downloaded yet? Will my data be sent somewhere else?
If the model is not present in ~/.cache/mcp_memory/onnx_models/, the service downloads the model archive from a public S3 bucket once during initialization. This download fetches only the model weights, not your data. The _download_model_if_needed() function verifies the download using a SHA-256 hash before extraction. Once cached, no further network activity occurs during encoding.
Can I use GPU acceleration with the local ONNX embeddings while keeping data private?
Yes. The get_onnx_embedding_model() function automatically configures ONNX Runtime to use available GPU providers such as CUDA, DirectML, or CoreML before falling back to CPU. Because ONNX Runtime runs locally on your machine, using GPU acceleration does not compromise privacy: your data stays on the local machine and never traverses the network regardless of which execution provider is active.
How do I verify that my MCP Memory Service instance is actually using the local ONNX model and not a cloud-based alternative?
Check that the MCP_MEMORY_USE_ONNX environment variable is set to true and that the get_onnx_embedding_model() function returns a valid ONNXEmbeddingModel instance rather than None. You can also verify the presence of cached files in ~/.cache/mcp_memory/onnx_models/all-MiniLM-L6-v2/onnx/. When active, the service logs will indicate local ONNX Runtime initialization rather than external API client initialization.
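The first two checks can be scripted. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and treating "1", "true", and "yes" as enabled values is a guess, since the article only states that the variable should be set to true.

```python
import os
from pathlib import Path

# Cache path as given in the article.
CACHE_DIR = (
    Path.home() / ".cache" / "mcp_memory" / "onnx_models" / "all-MiniLM-L6-v2" / "onnx"
)


def onnx_enabled(env=None) -> bool:
    """Check the toggle; the accepted spellings are an assumption."""
    env = os.environ if env is None else env
    return env.get("MCP_MEMORY_USE_ONNX", "").strip().lower() in {"1", "true", "yes"}


def local_onnx_report(env=None, cache_dir: Path = CACHE_DIR) -> dict:
    """Summarize whether the local ONNX path looks active on this host."""
    files = sorted(p.name for p in cache_dir.iterdir()) if cache_dir.is_dir() else []
    return {
        "enabled": onnx_enabled(env),
        "model_cached": "model.onnx" in files,
        "cached_files": files,
    }


if __name__ == "__main__":
    print(local_onnx_report())
```

A report showing the toggle enabled and model.onnx cached is consistent with local inference, but the authoritative check remains confirming that get_onnx_embedding_model() returns a model instance rather than None.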