What Happens When a Node Fails or Becomes Unavailable During a Signing Operation
When a node fails during an MPC signing operation, the fystack/mpcium stack immediately aborts the session, returns an ErrorCodePeerUnavailable error to the client, updates the cluster readiness registry, and securely wipes all sensitive session data.
Distributed Multi-Party Computation (MPC) signing requires continuous participation from all nodes to prevent partial signature leakage and ensure cryptographic integrity. When a node becomes unavailable during a signing operation in the fystack/mpcium repository, the system implements a deterministic five-stage failure path that prioritizes immediate termination over optimistic completion.
Message Delivery Failure Detection
The signing protocol detects node unavailability at the network layer through point-to-point message delivery failures. During an active session, the signing party sends TSS protocol messages via the DirectMessaging interface implemented in pkg/mpc/session.go.
When the destination node is unreachable or the NATS server reports no responders, the SendToOther call returns an error. The session wrapper immediately pushes this error onto the internal error channel:
s.ErrCh <- fmt.Errorf("failed to send direct message to %s", topic)
This detection happens synchronously, inside the s.direct.SendToOther call itself, so a transport-level failure triggers an abort before the MPC protocol advances to the next round.
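The detection pattern described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the `directSender` interface, the `unreachablePeer` type, the `sendDirect` helper, and the `SendToOther` signature are assumptions made for the example. Only the `ErrCh` name and the error message format come from the text.

```go
package main

import (
	"errors"
	"fmt"
)

// directSender is a hypothetical stand-in for the DirectMessaging interface.
// The method signature is an assumption for this sketch.
type directSender interface {
	SendToOther(topic string, data []byte) error
}

// unreachablePeer simulates a transport whose destination has no responders.
type unreachablePeer struct{}

func (unreachablePeer) SendToOther(topic string, data []byte) error {
	return errors.New("no responders available")
}

// session keeps only the fields this sketch needs.
type session struct {
	direct directSender
	ErrCh  chan error
}

// sendDirect forwards one protocol message and reports a transport failure
// on ErrCh instead of letting the protocol continue.
func (s *session) sendDirect(topic string, data []byte) {
	if err := s.direct.SendToOther(topic, data); err != nil {
		s.ErrCh <- fmt.Errorf("failed to send direct message to %s", topic)
	}
}

func main() {
	// A buffered channel keeps the failing send from blocking in this demo.
	s := &session{direct: unreachablePeer{}, ErrCh: make(chan error, 1)}
	s.sendDirect("sign:ecdsa:direct:nodeA:nodeB:tx-abc", []byte("round-1"))
	fmt.Println(<-s.ErrCh)
}
```

The error pushed onto the channel names the topic rather than wrapping the transport error, which matches the message format shown earlier.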
Error Propagation from Session to Consumer
Once an error enters the session’s ErrCh, the eventConsumer component takes over. In pkg/eventconsumer/event_consumer.go (lines 71‑90), a dedicated watcher goroutine reads from session.ErrChan() and routes failures to handleSigningSessionError.
This function acts as the bridge between low-level transport errors and application-level event handling. It captures the error context—whether it originated from message sending, protocol violation, or timeout—and prepares it for client-facing translation.
Error Code Translation and Result Publishing
The consumer translates internal errors into standardized response codes using the mapping logic in pkg/event/types.go. Because the error string contains the word “send”, the GetErrorCodeFromError function classifies the failure as ErrorCodePeerUnavailable.
The handler then constructs a SigningResultEvent with ResultType: ResultTypeError and enqueues it onto the signing-result queue:
resultQueue.Enqueue(SigningResultEvent{
    ResultType: ResultTypeError,
    ErrorCode:  ErrorCodePeerUnavailable,
    // ... session metadata
})
Downstream services, including the API layer, consume this queue to return deterministic failure responses to clients, so callers receive a consistent error code regardless of which specific transport-layer error occurred.
Cluster-Wide Quorum Protection
Beyond the individual session, node failures impact cluster-wide signing availability. The registry component in pkg/mpc/registry.go continuously monitors peer health through periodic “ready” keys stored in Consul via WatchPeersReady.
When a node disappears, its ready key expires and the registry marks the peer as unavailable (readyMap[peerID] = false). Before any new signing operation begins, the signingConsumer.handleSigningEvent function (lines 61‑66 in pkg/eventconsumer/sign_consumer.go) validates the cluster state:
if !peerRegistry.AreMajorityReady() {
    return ErrorCodePeerUnavailable
}
This early-stage guard prevents the system from initiating new signing sessions when fewer than t+1 peers (the quorum threshold) are available, avoiding unnecessary resource allocation and client timeouts.
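To make the guard concrete, here is a small in-memory sketch of a readiness map checked against a t+1 quorum. It is not the Consul-backed registry: the `peerRegistry` type, `newPeerRegistry`, `setReady`, and `hasQuorum` are hypothetical names, and the sketch assumes the map includes every node in the cluster (the local node too) and that the check compares the ready count against t+1 as the text describes.

```go
package main

import (
	"fmt"
	"sync"
)

// peerRegistry is a hypothetical in-memory stand-in for the real registry.
type peerRegistry struct {
	mu        sync.RWMutex
	readyMap  map[string]bool
	threshold int // t: signing needs at least t+1 ready peers
}

func newPeerRegistry(peers []string, threshold int) *peerRegistry {
	m := make(map[string]bool, len(peers))
	for _, p := range peers {
		m[p] = false
	}
	return &peerRegistry{readyMap: m, threshold: threshold}
}

// setReady records a peer's state, e.g. when its ready key appears or expires.
func (r *peerRegistry) setReady(peerID string, ready bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.readyMap[peerID] = ready
}

// hasQuorum reports whether at least t+1 peers are currently marked ready.
func (r *peerRegistry) hasQuorum() bool {
	r.mu.RLock()
	defer r.mu.RUnlock()
	ready := 0
	for _, ok := range r.readyMap {
		if ok {
			ready++
		}
	}
	return ready >= r.threshold+1
}

func main() {
	peers := []string{"nodeA", "nodeB", "nodeC"}
	reg := newPeerRegistry(peers, 1)
	for _, p := range peers {
		reg.setReady(p, true)
	}
	fmt.Println("all peers ready:", reg.hasQuorum())

	reg.setReady("nodeB", false)
	reg.setReady("nodeC", false)
	fmt.Println("two peers down:", reg.hasQuorum())
}
```

With three nodes and t = 1, losing one peer still leaves two ready peers and the guard passes. Losing two drops the count below t+1 and new sessions would be rejected.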
Secure Session Cleanup After Abort
When a session fails, the Close() method in pkg/mpc/session.go tears down NATS subscriptions and releases network resources. Critically, the security package in pkg/security/zeroize.go overwrites sensitive session data—including private key shares, transaction data (s.tx), and derived keys—before garbage collection.
This zeroization ensures that partial signature material or ephemeral protocol state is not retained in memory on the remaining nodes after a peer failure, mitigating risks associated with memory inspection or cold boot attacks.
Code Example: Observing Node Failure Errors
The following example demonstrates how to create a signing session and observe the error channel when a peer becomes unreachable:
// 1. Create a signing session (normally done by the eventConsumer)
sess, err := node.CreateSigningSession(
    mpc.SessionTypeECDSA,     // or SessionTypeEDDSA
    "wallet-123",             // wallet ID
    "tx-abc",                 // transaction ID
    "net-internal-01",        // network-internal code
    signingResultQueue,       // queue where the final result is posted
    []uint32{44, 0, 0},       // optional derivation path
    "signing-idempotent-key",
)
if err != nil {
    log.Fatalf("cannot create session: %v", err)
}

// 2. Initialise the session with the transaction data
tx := new(big.Int).SetBytes([]byte{0x01, 0x02, 0x03})
if err = sess.Init(tx); err != nil {
    log.Fatalf("cannot init signing: %v", err)
}

// 3. Listen for internal errors while the protocol runs
go func() {
    for err := range sess.ErrChan() {
        // The error string will contain "send ..." if a peer cannot be reached.
        fmt.Printf("Signing error observed: %v\n", err)
    }
}()

// 4. Kick off the signing process
sess.Sign(func(sig []byte) {
    fmt.Printf("Signature produced: %x\n", sig)
})

// Output when nodeB fails:
// Signing error observed: failed to send direct message to sign:ecdsa:direct:nodeA:nodeB:tx-abc
When a node fails, the goroutine reading from sess.ErrChan() receives the transport error, which the eventConsumer subsequently converts to ErrorCodePeerUnavailable and publishes as a SigningResultEvent with ResultTypeError.
Summary
- Immediate Detection: The SendToOther method in pkg/mpc/session.go detects transport failures and pushes errors to ErrCh when nodes are unreachable.
- Standardized Errors: The eventConsumer maps send failures to ErrorCodePeerUnavailable via GetErrorCodeFromError in pkg/event/types.go.
- Quorum Enforcement: The registry’s AreMajorityReady() check in pkg/eventconsumer/sign_consumer.go blocks new sessions until t+1 peers are available.
- Secure Termination: Failed sessions trigger Close() and memory zeroization via pkg/security/zeroize.go to prevent key material leakage.
- Deterministic Reporting: All failures propagate through SigningResultEvent structures, ensuring API clients receive consistent error codes.
Frequently Asked Questions
Can an MPC signing operation complete if one participant node fails mid-protocol?
No. The fystack/mpcium architecture enforces an all-or-nothing approach to signing. If any node becomes unreachable during the multi-round protocol, the SendToOther call fails, the error propagates through session.ErrCh, and the session aborts immediately. The design prevents partial signature generation that could leak information about the private key shares held by remaining nodes.
What specific error code indicates a node is unavailable during signing?
The system returns ErrorCodePeerUnavailable. This code is assigned in pkg/event/types.go when GetErrorCodeFromError detects the substring “send” in the error message originating from pkg/mpc/session.go. Clients receiving this code should treat the failure as transient and retry only after confirming the missing node has rejoined the cluster.
How does the system prevent new signing requests when nodes are down?
Before processing any signing event, the signingConsumer.handleSigningEvent function verifies cluster health by calling peerRegistry.AreMajorityReady(). This check ensures at least t+1 peers (the cryptographic threshold) are reporting ready status in Consul. If the quorum is not met, the request is rejected immediately with ErrorCodePeerUnavailable without allocating MPC session resources.
Is sensitive cryptographic material exposed when a signing session fails?
No. The session.Close() method triggers the zeroization routines in pkg/security/zeroize.go to overwrite memory regions containing private key shares, transaction data, and intermediate protocol state. This secure cleanup occurs on all remaining nodes before the failed session is garbage collected, ensuring that node outages do not leave exploitable artifacts in system memory.