Recovering from Partial Failures in Enterprise MCP Tools

Image of an agent calling multiple tools, where one tool encounters a failure mid-execution

Distributed transactions fail partway through. Payment succeeds, then Salesforce times out. The guest is charged, but three Salesforce records hold stale state.

In production, this happens constantly: a system times out, a connection drops mid-request, a user submits unexpected input. In distributed systems, these failures often mean a transaction completes in one system and fails in another—requiring reconciliation to restore consistency across all systems.

What does this look like in composed MCP tools? When an LLM orchestrates multi-step workflows—potentially retrying, potentially calling the same tool multiple times—each tool represents a surface area for partial failure. Who enforces state reconciliation? How does that fit with the separation of concerns we’ve discussed in previous posts?

Idempotency handles retries. Error handling catches failures. Neither addresses partial success—when some operations complete before the failure point. Recovery requires knowing what succeeded and how to reverse it.

Previous posts covered composable skill design and serverless execution—how to structure tools and run them reliably. This post covers what happens when reliable execution still produces inconsistent state.

The Reference Architecture: Dewy Resort

Throughout this series, we use Dewy Resort—a hotel management system—as our reference implementation. The architecture spans multiple systems:

  • Salesforce: Guest records, bookings, room inventory, sales opportunities
  • Stripe: Payment processing, refunds
  • MCP Tools: Orchestration layer connecting these systems via composed workflows

A single guest action like “check out” triggers operations across both systems: charge payment in Stripe, update booking status in Salesforce, mark room for cleaning, close the sales opportunity. Each system commits on its own timeline. The orchestrator sequences operations but can’t provide atomic rollback across system boundaries.
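
As a rough sketch of that sequencing (hypothetical function names, not the actual Dewy Resort code), the orchestrator runs each step in order; nothing wraps the steps in a transaction, so earlier steps stay committed when a later one fails:

```python
# Hypothetical checkout orchestration sketch (not the Dewy Resort API).
# Each step commits independently; there is no cross-system transaction.

def process_guest_checkout(guest_email, charge, update_booking,
                           update_room, close_opportunity):
    """Run checkout steps in order, recording which steps committed."""
    completed = []
    steps = [
        ("charge_payment", charge),          # Stripe commits here
        ("update_booking", update_booking),  # Salesforce commits here
        ("update_room", update_room),
        ("close_opportunity", close_opportunity),
    ]
    for name, step in steps:
        try:
            step(guest_email)
        except Exception as exc:
            # Later steps are skipped; earlier steps have already committed.
            return {"status": "partial_failure", "failed_step": name,
                    "completed": completed, "error": str(exc)}
        completed.append(name)
    return {"status": "success", "completed": completed}
```

If `update_booking` times out here, `charge_payment` has already committed in Stripe, which is exactly the partial state this post is about.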

The complete implementation is open source.


The Problem: Consistency Without Transactions

Enterprise workflows need ACID-like guarantees—either everything succeeds or everything rolls back. But there’s no shared transaction boundary spanning systems. Stripe commits. Salesforce commits. Each has its own state, its own failure modes.

What This Looks Like: Multi-Object State

In Dewy Resort, a single checkout action updates three related Salesforce objects:

```
Booking__c
  ├─> Hotel_Room__c (lookup)
  └─> Opportunity (lookup)
```

State dependencies:

```
Booking.Status = "Checked Out"
  → Room.Status__c = "Cleaning"
  → Opportunity.StageName = "Closed Won"
```

When Booking transitions to “Checked Out,” Room and Opportunity must also transition. If any update fails, all three objects may be in inconsistent states—plus the Stripe charge has already succeeded.
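
One way to make those dependencies checkable (the field names follow the Dewy Resort objects; the checker itself is an illustrative sketch, not the reference implementation) is to express the expected post-checkout state as data that a reconciler can diff against:

```python
# Sketch: the dependent transitions as data, so a reconciler can
# report which records were left inconsistent. Illustrative only.

EXPECTED_AFTER_CHECKOUT = {
    "Booking.Status": "Checked Out",
    "Room.Status__c": "Cleaning",
    "Opportunity.StageName": "Closed Won",
}

def inconsistent_fields(actual):
    """Return fields whose actual value differs from the expected state."""
    return [field for field, want in EXPECTED_AFTER_CHECKOUT.items()
            if actual.get(field) != want]
```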

Infographic showing expected state changes for Salesforce object updates during guest checkout, in Dewy Resort sample app implementation.

A Checkout Failure in Practice

Workflow: process_guest_checkout

| Step | Operation | Result |
|------|-----------|--------|
| 1 | Search guest in Salesforce | ✓ |
| 2 | Create Stripe customer | ✓ |
| 3 | Create payment intent | ✓ |
| 4 | Confirm payment | ✓ ($250 charged) |
| 5 | Update Salesforce booking | ✗ Timeout |
| 6 | Update room status | — Skipped |
| 7 | Update opportunity | — Skipped |

Result: HTTP 500 returned to caller.

Actual state across systems:

| System | Actual | Expected | Match |
|--------|--------|----------|-------|
| Stripe | $250 charged | $250 charged | ✓ |
| Booking | Checked In | Checked Out | ✗ |
| Room | Occupied | Cleaning | ✗ |
| Opportunity | Negotiation | Closed Won | ✗ |

Guest paid, checkout incomplete. Room can’t be reassigned. Sales reports are wrong. Manual reconciliation: 30+ minutes.

Infographic describing "partial failure" object states in Dewy Resort during a fictional guest checkout failure. Stripe charged $250, but all downstream Salesforce object updates failed.

Without a transaction boundary spanning systems, you have to build consistency yourself.

Why Try/Catch Isn’t Enough

A catch block logs the error and returns 500. It can’t distinguish between:

  • Failed before payment → safe to retry
  • Failed after payment → need refund or idempotent retry
  • Failed during Salesforce update → need reconciliation to determine state

Traditional error handling is binary: success or failure. Distributed workflows have partial success. Recovery requires knowing what succeeded and how to reverse it.
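
One way to make that distinction explicit (a sketch with invented step names, not the reference implementation) is to derive the recovery action from which steps completed relative to the payment step:

```python
# Sketch: map the failure point to a recovery action.
# Step names and the action vocabulary are illustrative assumptions.

PAYMENT_STEP = "confirm_payment"

def recovery_action(completed_steps, failed_step):
    """Decide recovery based on whether the payment step committed."""
    if failed_step == PAYMENT_STEP:
        # Payment itself failed: nothing charged, safe to retry.
        return "retry"
    if PAYMENT_STEP in completed_steps:
        # Payment succeeded but a later step failed: compensate.
        return "compensate"
    # Failed before payment was attempted: no money moved, safe to retry.
    return "retry"
```

A real implementation would also distinguish ambiguous cases (a timeout where the write may or may not have landed), which need reconciliation before choosing retry or compensation.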


A Perspective: Decision Placement

This series has argued for a clear separation of concerns between LLMs and backend systems. That principle applies directly to recovery logic: who should decide when to compensate, and who should execute the compensation?

This isn’t established industry practice—it’s a perspective we’re advocating based on our experience building MCP tools for enterprise contexts.

The Principle

These aren’t fully autonomous systems. They’re agents assisting humans—so the real question is where human judgment belongs versus where deterministic execution belongs.

Humans + LLMs decide WHEN to act and WHICH workflow.
Backend workflows decide HOW to act and WHAT state transitions.
Financial and security decisions always go to backend.

Human + LLM Decisions

Non-deterministic, context-dependent—where judgment matters:

  • Understanding user intent (“I want to check out” → tool selection)
  • Selecting the right workflow (checkout vs compensation)
  • Extracting parameters from natural language
  • Confirming actions with users before execution
  • Explaining results to users

Backend-Appropriate Decisions

Deterministic, rule-based—where consistency matters:

  • Payment status routing (succeeded/requires_action/failed)
  • State validation (can this transition happen?)
  • Business rule enforcement (check-in window, eligibility)
  • ID resolution (email → Contact ID)
  • Error categorization (retryable vs permanent)
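
Two of these backend decisions can be sketched directly; the mappings below are illustrative assumptions, not the Dewy Resort rule set:

```python
# Sketch of two deterministic backend decisions: payment status routing
# and error categorization. Mappings are illustrative assumptions.

PAYMENT_ROUTES = {
    "succeeded": "continue_workflow",
    "requires_action": "return_action_to_caller",
    "failed": "abort_before_salesforce_updates",
}

RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}

def route_payment_status(status):
    """Route on a Stripe-style payment status; unknown statuses abort."""
    return PAYMENT_ROUTES.get(status, "abort_before_salesforce_updates")

def categorize_error(http_status):
    """Transient server/timeout errors are retryable; the rest permanent."""
    return "retryable" if http_status in RETRYABLE_STATUS else "permanent"
```

Because these are pure lookups, they are trivially testable and never subject to prompt variation.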

Applied to Recovery: Who Decides to Compensate?

We built support for both paths:

  • LLM-driven: Staff member receives guest complaint (“I got charged but checkout failed”). Staff-facing agent verifies the problem, calls compensation, explains outcome.
  • Automated: Scheduled job queries for orphaned payments, triggers compensation automation based on business rules.

Same compensation tool, different trigger mechanisms. Conversational UX for reported issues; automation catches unreported failures.

On tool exposure: The compensation tool exists only on the staff MCP server—there’s no guest-facing version. Some tools simply don’t make sense for certain audiences. There’s no “refund on behalf of guest” capability because allowing guests to trigger unmediated refunds isn’t sound business logic. Tool exposure is itself a layer of authorization, complementing the approach to building authorization into tool design discussed earlier in this series.

Infographic showing process flow of "guest checkout", with decision rationale or control points for Human + LLM (WHEN/WHICH tool) stages of process vs Backend (HOW/WHAT gets executed) tactical steps.

Why This Matters

| Principle | Implementation | Rationale |
|-----------|----------------|-----------|
| Strategic vs tactical split | LLM selects workflow; backend executes it | Clear separation enables independent testing |
| Financial logic in backend | Refund amounts, payment routing, charge validation | Deterministic, auditable, not subject to prompt variation |
| Multiple trigger mechanisms | Same tool callable by LLM or cron job | Flexibility without duplicating logic |

Two Patterns for Recovery

Two established patterns address partial failure recovery directly. Both are implemented in Dewy Resort.

Pattern 1: Compensating Transactions (Saga Pattern)

The saga pattern treats multi-step workflows as a sequence of operations, each with a corresponding compensating transaction that reverses its effect.[1]

Use this pattern when:

  • Multi-system workflows can partially succeed
  • Financial operations are involved
  • State consistency affects user experience
  • Manual reconciliation cost exceeds automation cost

Infographic with mirrored state pairs for "reverse transactions" in the Dewy Resort saga-pattern implementation of the checkout compensation orchestrator.

Execution flow for the Dewy Resort compensation orchestrator: initial actions check refund availability in Stripe, then check for necessary Salesforce state changes across related objects. The design of state checks across Stripe and Salesforce ensures idempotency across multiple invocations of the compensator tool.

How It Works

When checkout fails after payment succeeds, the compensation tool:

  1. Validates payment state (is it refundable?)
  2. Issues refund (financial operations first)
  3. Checks Salesforce state and reverts if needed
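
A minimal sketch of that flow (in-memory dicts stand in for the Stripe and Salesforce clients; names are illustrative, not the real tool): refunds run first, and every action is guarded by a state check so the tool is safe to re-invoke:

```python
# Compensation sketch: financial reversal first, state checks throughout.
# payment/booking dicts stand in for real Stripe/Salesforce records.

def compensate_checkout_failure(payment, booking, seen_tokens, token):
    """Idempotently reverse a partially failed checkout."""
    if token in seen_tokens:
        return {"status": "already_compensated"}   # idempotency guard
    actions = []
    # 1. Refund first: reverse guest harm before data cleanup.
    if payment["status"] == "succeeded" and not payment.get("refunded"):
        payment["refunded"] = True
        actions.append("refund_issued")
    # 2. Check Salesforce state and revert only if the update landed.
    if booking["status"] == "Checked Out":
        booking["status"] = "Checked In"
        actions.append("booking_reverted")
    seen_tokens.add(token)
    return {"status": "compensated", "actions": actions}
```

In the checkout failure above, the booking never reached "Checked Out", so only the refund fires; a second invocation with the same token is a no-op.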

The tool accepts what callers naturally have:

```json
{
  "tool": "compensate_checkout_failure",
  "parameters": {
    "payment_intent_id": "pi_3ABC...",
    "guest_email": "beth.gibbs@email.com",
    "idempotency_token": "comp-pi_3ABC...",
    "reason": "Salesforce timeout during checkout"
  }
}
```

Design Principles

| Principle | Implementation | Rationale |
|-----------|----------------|-----------|
| Check state before reversing | Query current state, only update if needed | Makes compensation idempotent—safe to retry |
| Financial operations first | Issue refund before Salesforce cleanup | Guest harm reversed immediately; data fixable async |
| Business identifiers in, system IDs hidden | Accept guest_email, resolve to Contact ID internally | LLM has email from conversation; logs stay readable |
| Idempotency at every layer | Client token → Backend check → Stripe Idempotency-Key | Safe to automate; no double-refunds |

Pattern 2: Fail-Fast Validation

Validate assumptions before expensive operations. Preventing failures is cheaper than compensating for them.

Use this pattern when:

  • Operations have prerequisites that could be violated
  • Downstream operations are non-idempotent (payments, external API calls)
  • Clear error messages can guide callers to fix input

Example: The Multiple-Bookings Bug

Original code assumed one booking per guest:

```
bookings = search_booking(guest_email)
update_booking(bookings[0].id, status: "Checked Out")
```

Problem: What if bookings has 0 or 2+ elements?

  • length == 0: Accesses undefined → crash
  • length > 1: Updates first booking (might be wrong one)

The fix—explicit validation before array access:

```
bookings = search_booking(guest_email)

IF bookings.length == 0:
  → Return 404 "No checked-in booking found for this guest today"

IF bookings.length > 1:
  → Return 400 "Multiple bookings found. Provide room_number or booking_number to disambiguate"

IF bookings.length == 1:
  → Proceed with checkout
```

This stops execution before charging payment. If validation happened after payment, you’d need compensation.
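
The same validation reads naturally as runnable code (a sketch: the `(status, payload)` return shape is an assumption, not the real tool contract):

```python
# Runnable sketch of the fail-fast validation above.
# The (status, payload) tuple is an assumed error shape.

def validate_bookings(bookings):
    """Validate the search result before any non-idempotent operation."""
    if len(bookings) == 0:
        return (404, "No checked-in booking found for this guest today")
    if len(bookings) > 1:
        return (400, "Multiple bookings found. Provide room_number or "
                     "booking_number to disambiguate")
    return (200, bookings[0])   # exactly one booking: safe to proceed
```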

Infographic showing 3 approaches to transaction execution, left to right: fail mid-execution (expensive: no validation, requires rollback and reverting transactions), fail during validation (cheap: returns error before the expensive transaction is carried out), pass validation (returns success: all required data validated before expensive transactions are attempted).

Design Principles

| Principle | Implementation | Rationale |
|-----------|----------------|-----------|
| Validate before non-idempotent operations | Check prerequisites before payments, external calls | Failures prevented > failures compensated |
| Validate array lengths | Check length before accessing elements | Prevents crashes and wrong-record updates |
| Return actionable errors | Specific codes (400/404/409) with guidance | Callers can fix input without guessing |

Recovery Strategy Quick Reference

| Strategy | When to Use | Example |
|----------|-------------|---------|
| Compensating transaction | Financial operations, state consistency critical | Payment succeeded, Salesforce failed → Issue refund |
| Acceptable orphan | Resource has low/zero cost, will be reused | Stripe customer created, checkout failed → Customer reused on retry |
| Fail-fast validation | Preventing failure cheaper than recovering | Multiple bookings found → Return 400 before payment |
| Retry with idempotency | Transient failure, operation is idempotent | Salesforce timeout → Retry with same token |

Is Your Tool Recovery-Ready?

Saga Pattern

  • Compensation orchestrator exists for financial operations
  • Compensation checks state before reversing (idempotent)
  • Financial compensation prioritized over data cleanup
  • All compensation actions logged for audit trail
  • Idempotency tokens flow through entire compensation flow

Fail-Fast Validation

  • Array lengths validated before element access
  • Prerequisites checked before non-idempotent operations
  • Error codes are specific with actionable guidance

Decision Placement

  • Strategic decisions (when, which workflow) → LLM
  • Tactical decisions (how, what transitions) → Backend
  • Financial/security decisions → Backend (always)
  • Multiple trigger mechanisms supported where needed

Conclusion

MCP standardizes how LLMs discover and invoke tools. What those tools do—and how they handle partial failures, state consistency, and recovery—is architecture you build into the tools themselves.

The saga pattern provides compensating transactions when multi-system workflows fail partway through. Fail-fast validation prevents failures by checking assumptions before expensive operations. Decision placement—where recovery logic lives—determines whether your system is testable, auditable, and flexible.

You can see these approaches in the Dewy Resort sample application. We’ve built a checkout orchestration tool that handles Stripe and Salesforce coordination, financial operations, state consistency across Booking/Room/Opportunity, automatic compensation, and idempotency at every layer.


Implementation: The complete Dewy Resort Hotel example is open source: github.com/workato-devs/dewy-resort


This post builds on Designing Composable Skills for MCP Tools and Serverless MCP Execution. For more on composable architecture patterns, see the complete series.


  1. The saga pattern was introduced by Hector Garcia-Molina and Kenneth Salem in their 1987 ACM SIGMOD paper “Sagas.” For a modern treatment, see Chris Richardson’s Saga pattern documentation.