
What Happens When a Node Fails or Becomes Unavailable During a Signing Operation

March 2, 2026 fystack/mpcium

When a node fails during an MPC signing operation, the fystack/mpcium stack immediately aborts the session, returns an ErrorCodePeerUnavailable error to the client, updates the cluster readiness registry, and securely wipes all sensitive session data.

Distributed Multi-Party Computation (MPC) signing requires continuous participation from all nodes to prevent partial signature leakage and ensure cryptographic integrity. When a node becomes unavailable during a signing operation in the fystack/mpcium repository, the system implements a deterministic five-stage failure path that prioritizes immediate termination over optimistic completion.

Message Delivery Failure Detection

The signing protocol detects node unavailability at the network layer through point-to-point message delivery failures. During an active session, the signing party sends TSS protocol messages via the DirectMessaging interface implemented in pkg/mpc/session.go.

When the destination node is unreachable or the NATS server reports no responders, the SendToOther call returns an error. The session wrapper immediately pushes this error onto the internal error channel:

s.ErrCh <- fmt.Errorf("failed to send direct message to %s", topic)

Detection happens synchronously, inside the s.direct.SendToOther call itself, so a transport-level failure triggers an abort before the MPC protocol advances to the next round.
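The detection step can be sketched as follows. This is a minimal illustration and not the repository's code: the directMessenger interface, the downPeer transport, and the sendDirect helper are hypothetical stand-ins, and only the error message format is taken from the line quoted above.

```go
package main

import (
	"errors"
	"fmt"
)

// directMessenger is a hypothetical stand-in for the DirectMessaging interface.
type directMessenger interface {
	SendToOther(topic string, payload []byte) error
}

// downPeer simulates a transport whose destination has no responders.
type downPeer struct{}

func (downPeer) SendToOther(topic string, payload []byte) error {
	return errors.New("no responders")
}

// session holds only the fields this sketch needs.
type session struct {
	direct directMessenger
	ErrCh  chan error
}

// sendDirect forwards one protocol message and reports whether delivery
// succeeded. On failure it pushes an error onto ErrCh so the round stops.
func (s *session) sendDirect(topic string, payload []byte) bool {
	if err := s.direct.SendToOther(topic, payload); err != nil {
		s.ErrCh <- fmt.Errorf("failed to send direct message to %s", topic)
		return false
	}
	return true
}

func main() {
	// A buffered channel keeps the sender from blocking in this sketch.
	s := &session{direct: downPeer{}, ErrCh: make(chan error, 1)}
	if !s.sendDirect("sign:ecdsa:direct:nodeA:nodeB:tx-abc", []byte("round-1")) {
		fmt.Println(<-s.ErrCh)
	}
}
```

The channel is buffered here only so that the example runs in a single goroutine; a real session would normally have a reader draining the channel concurrently.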

Error Propagation from Session to Consumer

Once an error enters the session’s ErrCh, the eventConsumer component takes over. In pkg/eventconsumer/event_consumer.go (lines 71‑90), a dedicated watcher goroutine reads from session.ErrChan() and routes failures to handleSigningSessionError.

This function bridges low-level transport errors and application-level event handling. It captures the error context (message sending, protocol violation, or timeout) and prepares it for client-facing translation.

Error Code Translation and Result Publishing

The consumer translates internal errors into standardized response codes using the mapping logic in pkg/event/types.go. Because the error string contains the word “send”, the GetErrorCodeFromError function classifies the failure as ErrorCodePeerUnavailable.

The handler then constructs a SigningResultEvent with ResultType: ResultTypeError and enqueues it onto the signing-result queue:

resultQueue.Enqueue(SigningResultEvent{
    ResultType: ResultTypeError,
    ErrorCode:  ErrorCodePeerUnavailable,
    // ... session metadata
})

Downstream services, including the API layer, consume this queue to return deterministic failure responses to clients, ensuring that callers receive a consistent error code regardless of which specific transport layer exception occurred.

Cluster-Wide Quorum Protection

Beyond the individual session, node failures impact cluster-wide signing availability. The registry component in pkg/mpc/registry.go continuously monitors peer health through periodic “ready” keys stored in Consul via WatchPeersReady.

When a node disappears, its ready key expires and the registry marks the peer as unavailable (readyMap[peerID] = false). Before any new signing operation begins, the signingConsumer.handleSigningEvent function (lines 61‑66 in pkg/eventconsumer/sign_consumer.go) validates the cluster state:

if !peerRegistry.AreMajorityReady() {
    return ErrorCodePeerUnavailable
}

This early-stage guard prevents the system from initiating new signing sessions when fewer than t+1 peers (the quorum threshold) are available, avoiding unnecessary resource allocation and client timeouts.
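The readiness guard can be illustrated with a small in-memory registry. This sketch assumes a simple majority rule over a readyMap keyed by peer ID; the registry type and its methods are hypothetical, and the real implementation may instead compare the ready count against the configured threshold.

```go
package main

import (
	"fmt"
	"sync"
)

// registry tracks which peers currently report ready (hypothetical type).
type registry struct {
	mu       sync.RWMutex
	readyMap map[string]bool
}

// setReady records the latest readiness state for a peer.
func (r *registry) setReady(peerID string, ready bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.readyMap[peerID] = ready
}

// areMajorityReady reports whether strictly more than half of the
// known peers are ready.
func (r *registry) areMajorityReady() bool {
	r.mu.RLock()
	defer r.mu.RUnlock()
	ready := 0
	for _, ok := range r.readyMap {
		if ok {
			ready++
		}
	}
	return ready > len(r.readyMap)/2
}

func main() {
	r := &registry{readyMap: map[string]bool{"nodeA": true, "nodeB": true, "nodeC": true}}
	fmt.Println(r.areMajorityReady()) // all three peers ready

	r.setReady("nodeB", false)
	r.setReady("nodeC", false)
	fmt.Println(r.areMajorityReady()) // only one of three ready
}
```

Running the guard before a session is created means a request fails fast with a clear error code rather than timing out partway through the protocol.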

Secure Session Cleanup After Abort

When a session fails, the Close() method in pkg/mpc/session.go tears down NATS subscriptions and releases network resources. The security package in pkg/security/zeroize.go then overwrites sensitive session data, including private key shares, transaction data (s.tx), and derived keys, before the memory is garbage collected.

This zeroization ensures that partial signature material or ephemeral protocol state is not retained in memory on the remaining nodes after a peer failure, mitigating risks associated with memory inspection or cold boot attacks.

Code Example: Observing Node Failure Errors

The following example demonstrates how to create a signing session and observe the error channel when a peer becomes unreachable:

// 1. Create a signing session (normally done by the eventConsumer)
sess, err := node.CreateSigningSession(
    mpc.SessionTypeECDSA,     // or SessionTypeEDDSA
    "wallet-123",             // wallet ID
    "tx-abc",                 // transaction ID
    "net-internal-01",        // network-internal code
    signingResultQueue,       // queue where the final result is posted
    []uint32{44, 0, 0},       // optional derivation path
    "signing-idempotent-key", // idempotency key
)
if err != nil {
    log.Fatalf("cannot create session: %v", err)
}

// 2. Initialise the session with the transaction data
tx := new(big.Int).SetBytes([]byte{0x01, 0x02, 0x03})
if err = sess.Init(tx); err != nil {
    log.Fatalf("cannot init signing: %v", err)
}

// 3. Listen for internal errors while the protocol runs
go func() {
    for err := range sess.ErrChan() {
        // The error string will contain "send ..." if a peer cannot be reached.
        fmt.Printf("Signing error observed: %v\n", err)
    }
}()

// 4. Kick off the signing process
sess.Sign(func(sig []byte) {
    fmt.Printf("Signature produced: %x\n", sig)
})

// Output when nodeB fails:
// Signing error observed: failed to send direct message to sign:ecdsa:direct:nodeA:nodeB:tx-abc
When a node fails, the goroutine reading from sess.ErrChan() receives the transport error, which the eventConsumer subsequently converts to ErrorCodePeerUnavailable and publishes as a SigningResultEvent with ResultTypeError.

Summary

Immediate Detection: The SendToOther method in pkg/mpc/session.go detects transport failures and pushes errors to ErrCh when nodes are unreachable.
Standardized Errors: The eventConsumer maps send failures to ErrorCodePeerUnavailable via GetErrorCodeFromError in pkg/event/types.go.
Quorum Enforcement: The registry’s AreMajorityReady() check in pkg/eventconsumer/sign_consumer.go blocks new sessions until t+1 peers are available.
Secure Termination: Failed sessions trigger Close() and memory zeroization via pkg/security/zeroize.go to prevent key material leakage.
Deterministic Reporting: All failures propagate through SigningResultEvent structures, ensuring API clients receive consistent error codes.

Frequently Asked Questions

Can an MPC signing operation complete if one participant node fails mid-protocol?

No. The fystack/mpcium architecture enforces an all-or-nothing approach to signing. If any node becomes unreachable during the multi-round protocol, the SendToOther call fails, the error propagates through session.ErrCh, and the session aborts immediately. The design prevents partial signature generation that could leak information about the private key shares held by remaining nodes.

What specific error code indicates a node is unavailable during signing?

The system returns ErrorCodePeerUnavailable. This code is assigned in pkg/event/types.go when GetErrorCodeFromError detects the substring “send” in the error message originating from pkg/mpc/session.go. Clients receiving this code should treat the failure as transient and retry only after confirming the missing node has rejoined the cluster.

How does the system prevent new signing requests when nodes are down?

Before processing any signing event, the signingConsumer.handleSigningEvent function verifies cluster health by calling peerRegistry.AreMajorityReady(). This check ensures at least t+1 peers (the cryptographic threshold) are reporting ready status in Consul. If the quorum is not met, the request is rejected immediately with ErrorCodePeerUnavailable without allocating MPC session resources.

Is sensitive cryptographic material exposed when a signing session fails?

No. The session.Close() method triggers the zeroization routines in pkg/security/zeroize.go to overwrite memory regions containing private key shares, transaction data, and intermediate protocol state. This secure cleanup occurs on all remaining nodes before the failed session is garbage collected, ensuring that node outages do not leave exploitable artifacts in system memory.
