@lilyjma
Last active June 9, 2026 18:39
Long-running MCP tools using Durable Functions

Status: workaround. This sample demonstrates a pragmatic way to run long-running MCP tools today, before the MCP Task extension (SEP-2663) is available in the Azure Functions MCP trigger. Once tasks land, the protocol handles async natively (server returns resultType: "task", client polls tasks/get via the SDK) and this two-tool workaround becomes unnecessary.

The problem

An MCP tools/call is request/response. If a tool kicks off work that takes minutes, the client's request timeout fires long before the work finishes, and the agent sees a failed tool call — even though the work may still be running. Client tool-call timeouts are not standardized by the MCP spec; in practice they're often in the ~30–60s range and vary per client. So we must not block a single tool call for the full duration of a long workflow.

The approach: budgeted single call + poll fallback

We expose two MCP tools:

  1. start_research — starts a Durable orchestration (which fans out to multiple data sources in parallel and aggregates), then awaits completion up to a short budget (~20s, configurable).

    • If the workflow finishes within budget → return the result inline. The second tool is never needed. This is the common case and it removes any "did the agent remember to poll?" risk.
    • If the budget expires → return a handle (workflow_id) plus an explicit instruction to poll. The orchestration keeps running in Durable storage regardless of the client connection.
  2. get_research_result — takes the workflow_id (a required parameter) and returns the current state: completed (with result), failed (with error), or running (poll again).

Ordering is made robust by design, not by hoping the model behaves: workflow_id is a required parameter of the poll tool (so the agent can't poll without first starting), the "running" response carries poll_after_seconds and a next instruction, and the budgeted wait means fast workflows never hit the second tool at all.

Known weakness. Even with all of the above, the poll path still relies on the LLM correctly remembering — and not hallucinating — the orchestration instance ID it was handed. If the model garbles or invents a workflow_id, the poll lands on the wrong instance or none at all (which is why get_research_result returns not_found rather than guessing). The budgeted wait mitigates this by resolving most calls without a second hop, but it's the core reason the MCP Task extension (SEP-2663) — where the SDK, not the model, carries the handle — is the better long-term answer.

start_research — start, then await up to a budget

[Function(nameof(StartResearch))]
public async Task<string> StartResearch(
    [McpToolTrigger("start_research",
        "Researches a topic by gathering information from multiple sources in parallel. "
        + "Returns the result directly if quick; otherwise returns a workflow_id to poll.")]
        ToolInvocationContext context,
    [McpToolProperty("topic", "The subject to research.", isRequired: true)] string topic,
    [DurableClient] DurableTaskClient durableClient)
{
    string instanceId = await durableClient.ScheduleNewOrchestrationInstanceAsync(
        nameof(ResearchOrchestrator.RunOrchestrator), topic);

    // Budgeted wait: blocks until the orchestration is terminal OR the budget cancels the token.
    using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(_waitBudgetSeconds));
    try
    {
        OrchestrationMetadata metadata = await durableClient.WaitForInstanceCompletionAsync(
            instanceId, getInputsAndOutputs: true, cts.Token);

        // Finished within budget -> return the terminal result (completed or failed) inline.
        return Serialize(ToResult(metadata));
    }
    catch (OperationCanceledException)
    {
        // Budget expired. The orchestration is NOT lost — it keeps running in Durable storage.
        // Hand back a poll handle plus explicit next-step guidance for the agent.
        return Serialize(new ResearchResult(
            Status: "running",
            WorkflowId: instanceId,
            PollAfterSeconds: 5,
            Next: $"Call get_research_result with workflow_id \"{instanceId}\" in about 5 seconds."));
    }
}

get_research_result — poll by workflow_id

[Function(nameof(GetResearchResult))]
public async Task<string> GetResearchResult(
    [McpToolTrigger("get_research_result",
        "Gets the status/result of a workflow started by start_research. "
        + "If status is 'running', wait poll_after_seconds and call again.")]
        ToolInvocationContext context,
    // Required -> the schema-level dependency that forces start_research to have been called first.
    [McpToolProperty("workflow_id", "The workflow_id returned by start_research.", isRequired: true)]
        string workflowId,
    [DurableClient] DurableTaskClient durableClient)
{
    OrchestrationMetadata? metadata =
        await durableClient.GetInstanceAsync(workflowId, getInputsAndOutputs: true);

    if (metadata is null)
        return Serialize(new ResearchResult("not_found", workflowId,
            Error: $"No workflow found with id \"{workflowId}\"."));

    // Same status mapping as the budgeted-wait path (see ToResult) so a workflow that fails
    // *during polling* is reported exactly like one that fails during the initial wait.
    return Serialize(ToResult(metadata));
}

ToResult — one closed contract for both tools

// "status" is a BEHAVIORAL signal (use result / poll again / stop). Failed/Terminated/Canceled all
// drive the same action, so they share status "failed"; the precise cause is preserved in "reason"
// so no info is lost. "running" means ONLY non-terminal states, so the agent never polls a finished
// workflow. (A missing workflow_id is reported separately as "not_found" — see get_research_result.)
private static ResearchResult ToResult(OrchestrationMetadata metadata) =>
    metadata.RuntimeStatus switch
    {
        OrchestrationRuntimeStatus.Completed => new ResearchResult(
            "completed", metadata.InstanceId, Result: metadata.ReadOutputAs<string>()),

        OrchestrationRuntimeStatus.Failed
        or OrchestrationRuntimeStatus.Terminated
        or OrchestrationRuntimeStatus.Canceled => new ResearchResult(
            "failed", metadata.InstanceId,
            Reason: metadata.RuntimeStatus switch
            {
                OrchestrationRuntimeStatus.Failed => "error",
                OrchestrationRuntimeStatus.Terminated => "terminated",
                _ => "canceled"
            },
            Error: DescribeFailure(metadata)),   // pulls detail from the right place per state

        _ => new ResearchResult("running", metadata.InstanceId, PollAfterSeconds: 5,
            Next: $"Call get_research_result with workflow_id \"{metadata.InstanceId}\" again shortly.")
    };

See the complete, runnable code (constructor/config, JSON serialization, and the Durable fan-out orchestration) in ResearchTools.cs and ResearchOrchestrator.cs below.

Q&A

Q: Will agents be smart enough to call start then poll in the right order?

Not reliably on their own, so the design enforces the order instead. Make workflow_id a required parameter of the poll tool (the schema requires a value that only start_research produces, so the agent can't poll without starting first), put next/poll_after_seconds instructions in the result payload, and make the poll tool self-correcting via its status field. The budgeted wait then removes the second tool entirely for fast workflows.

Q: Why is the wait budget ~20s, and what bounds it?

The client tool-call timeout, not the Functions host timeout. The host timeout on Flex/Premium is generous (minutes), but the client may give up in ~30s, and that's non-standard and varies per client. So default the budget conservatively (~20s, as an app setting) to stay under the most aggressive clients, and rely on the poll fallback for anything longer. notifications/progress could extend the window on clients that honor it, but it's optional and client-dependent, so it's left out here for clarity.
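One way to express "stay under the client timeout" in code is to clamp the configured budget against an assumed client timeout minus a safety margin. The following is only a sketch under stated assumptions: the `WaitBudget` helper, the 30s assumed client timeout, and the 10s margin are hypothetical and not part of the sample.

```csharp
using System;

// Hypothetical helper (not part of the original sample): clamps a configured
// wait budget so it stays below an ASSUMED client tool-call timeout.
public static class WaitBudget
{
    public static TimeSpan Resolve(
        int configuredSeconds,
        int assumedClientTimeoutSeconds = 30,  // assumption: the most aggressive client
        int safetyMarginSeconds = 10)          // assumption: headroom before the client gives up
    {
        // Upper bound: the assumed client timeout minus the margin, never negative.
        int ceiling = Math.Max(0, assumedClientTimeoutSeconds - safetyMarginSeconds);

        // Values below zero are treated as zero; values above the ceiling are capped.
        int clamped = Math.Clamp(configuredSeconds, 0, ceiling);
        return TimeSpan.FromSeconds(clamped);
    }
}
```

With these defaults, a configured value of 20 is kept as is, while 45 is capped to 20 seconds. A result of zero would mean no inline wait, leaving every call to the poll fallback.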

Q: What happens if the orchestration fails, and why is status not split into failed/terminated/canceled?

Because status exists to direct the agent's behavior, not to mirror Durable's enum. Failed (unhandled exception), Terminated (killed via the management API), and Canceled (graceful stop) all lead to the same next action: stop polling, surface what happened, and likely start a fresh orchestration. So they share status: "failed". Keeping the status set small and action-oriented keeps even weaker models reliable on the part that controls the loop. To avoid losing information, the precise terminal state is preserved in a separate reason field (error | terminated | canceled) plus a human-readable error message. This also fixes a subtle inaccuracy: a terminated or canceled workflow isn't really an "error," so it shouldn't be labeled one at the headline; it's a reason, not the status.

A missing workflow_id is deliberately not failed either — it gets its own not_found status, because the work didn't error; the handle is just unknown (bad id, or the instance history was purged after its retention window). The right recovery is to start over, not to keep polling.
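The status contract described in this Q&A can be read as a small decision table on the consuming side. The sketch below illustrates that reading only; the `AgentAction` enum and `StatusContract` class are hypothetical names, and a real agent makes this decision from the tool output rather than from compiled code.

```csharp
using System;

// Hypothetical names, for illustration only.
public enum AgentAction { UseResult, PollAgain, StartOver, StopAndReport }

public static class StatusContract
{
    // Maps the "status" field of a tool response to the action it is meant to drive.
    public static AgentAction NextAction(string status) => status switch
    {
        "completed" => AgentAction.UseResult,      // result is in the payload
        "running"   => AgentAction.PollAgain,      // wait poll_after_seconds, then poll
        "not_found" => AgentAction.StartOver,      // unknown handle: polling will not help
        "failed"    => AgentAction.StopAndReport,  // reason/error explain what happened
        _           => AgentAction.StopAndReport   // unknown status: do not loop on it
    };
}
```

Only one status, "running", leads back into the polling loop, so every other value ends it.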

ResearchOrchestrator.cs

using System.Text;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;
using Microsoft.Extensions.Logging;

namespace MyMcpApp;

/// <summary>
/// The Durable orchestration that does the actual long-running work.
///
/// This is fan-out/fan-in, the one thing a single stateless MCP tool function can't do well.
/// It dispatches several "data source" activities IN PARALLEL (Task.WhenAll) and aggregates the
/// results. Each source is slow on its own; sequentially they'd blow the client timeout, but the
/// orchestration runs them concurrently with Durable's retry/reliability on top.
///
/// In a real sample each activity would use a binding/integration that MCP authors underuse, e.g.
/// Azure AI Search, Cosmos DB, Blob storage, a web search API, or Azure OpenAI for summarization.
/// Here they are simulated with delays so the sample runs without external dependencies.
/// </summary>
public static class ResearchOrchestrator
{
    [Function(nameof(RunOrchestrator))]
    public static async Task<string> RunOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context,
        string topic)
    {
        ILogger logger = context.CreateReplaySafeLogger(nameof(ResearchOrchestrator));
        logger.LogInformation("Orchestrating research for '{Topic}'", topic);

        // Fan out: kick off every source in parallel.
        var tasks = new List<Task<string>>
        {
            context.CallActivityAsync<string>(nameof(SearchInternalDocs), topic),
            context.CallActivityAsync<string>(nameof(SearchWeb), topic),
            context.CallActivityAsync<string>(nameof(LookupFinancials), topic),
            context.CallActivityAsync<string>(nameof(LookupCrmHistory), topic),
        };

        // Fan in: wait for all of them.
        string[] findings = await Task.WhenAll(tasks);

        // Aggregate (in a real sample, summarize via Azure OpenAI here).
        var report = new StringBuilder();
        report.AppendLine($"# Research report: {topic}");
        report.AppendLine();
        foreach (string finding in findings)
        {
            report.AppendLine($"- {finding}");
        }
        return report.ToString();
    }

    [Function(nameof(SearchInternalDocs))]
    public static async Task<string> SearchInternalDocs([ActivityTrigger] string topic)
    {
        await Task.Delay(TimeSpan.FromSeconds(3)); // simulate Azure AI Search
        return $"Internal docs: 4 documents referencing '{topic}'.";
    }

    [Function(nameof(SearchWeb))]
    public static async Task<string> SearchWeb([ActivityTrigger] string topic)
    {
        await Task.Delay(TimeSpan.FromSeconds(4)); // simulate web search API
        return $"Web: recent news and articles about '{topic}'.";
    }

    [Function(nameof(LookupFinancials))]
    public static async Task<string> LookupFinancials([ActivityTrigger] string topic)
    {
        await Task.Delay(TimeSpan.FromSeconds(2)); // simulate financial data API
        return $"Financials: latest figures related to '{topic}'.";
    }

    [Function(nameof(LookupCrmHistory))]
    public static async Task<string> LookupCrmHistory([ActivityTrigger] string topic)
    {
        await Task.Delay(TimeSpan.FromSeconds(3)); // simulate Cosmos DB CRM lookup
        return $"CRM: existing relationship history for '{topic}'.";
    }
}

ResearchTools.cs

using System.Text.Json;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Extensions.Mcp;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.Logging;

namespace MyMcpApp;

/// <summary>
/// Two MCP tools that implement the "budgeted single call + poll fallback" pattern for
/// long-running work, backed by a Durable Functions orchestration.
///
/// This is a WORKAROUND until the MCP Task extension (SEP-2663) is supported by the Functions
/// MCP trigger. Once tasks are native, the server can return a task handle and the client polls
/// tasks/get via the SDK, making this two-tool pattern unnecessary.
/// </summary>
public class ResearchTools
{
    private readonly ILogger<ResearchTools> _logger;
    private readonly int _waitBudgetSeconds;

    public ResearchTools(ILogger<ResearchTools> logger, IConfiguration config)
    {
        _logger = logger;
        // The wait budget is configurable, NOT a hard-coded constant. The right value is
        // bounded by the *client* tool-call timeout (non-standard, often ~30-60s), not the
        // Functions host timeout. Default conservatively to stay under aggressive clients.
        _waitBudgetSeconds = config.GetValue("ResearchWaitBudgetSeconds", 20);
    }

    /// <summary>
    /// Starts the research orchestration and awaits it up to a short budget.
    /// Fast workflows return their result inline (the poll tool is never needed);
    /// slow workflows return a workflow_id handle and an instruction to poll.
    /// </summary>
    [Function(nameof(StartResearch))]
    public async Task<string> StartResearch(
        [McpToolTrigger("start_research",
            "Starts a research workflow that gathers and aggregates information about a topic "
            + "from multiple sources in parallel. Returns the result directly if it finishes "
            + "quickly; otherwise returns a workflow_id to poll with get_research_result.")]
            ToolInvocationContext context,
        [McpToolProperty("topic", "The subject to research.", isRequired: true)] string topic,
        [DurableClient] DurableTaskClient durableClient)
    {
        string instanceId = await durableClient.ScheduleNewOrchestrationInstanceAsync(
            nameof(ResearchOrchestrator.RunOrchestrator), topic);
        _logger.LogInformation("Started research orchestration {InstanceId} for topic '{Topic}'",
            instanceId, topic);

        // Budgeted wait: WaitForInstanceCompletionAsync blocks until the orchestration reaches
        // a terminal state OR the CancellationToken fires. The budget is imposed by constructing
        // the CancellationTokenSource with a timeout, which cancels the token when it elapses.
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(_waitBudgetSeconds));
        try
        {
            OrchestrationMetadata metadata = await durableClient.WaitForInstanceCompletionAsync(
                instanceId, getInputsAndOutputs: true, cts.Token);

            // Finished within budget -> return the terminal result (completed or failed) inline.
            return Serialize(ToResult(metadata));
        }
        catch (OperationCanceledException)
        {
            // Budget expired. The orchestration is still running in Durable storage and is
            // NOT lost. Hand back a poll handle plus explicit next-step guidance for the agent.
            _logger.LogInformation(
                "Research {InstanceId} exceeded {Budget}s budget; returning poll handle.",
                instanceId, _waitBudgetSeconds);
            return Serialize(new ResearchResult(
                Status: "running",
                WorkflowId: instanceId,
                PollAfterSeconds: 5,
                Next: $"Call get_research_result with workflow_id \"{instanceId}\" in about 5 seconds."));
        }
    }

    /// <summary>
    /// Polls a previously started research workflow by its workflow_id.
    /// </summary>
    [Function(nameof(GetResearchResult))]
    public async Task<string> GetResearchResult(
        [McpToolTrigger("get_research_result",
            "Gets the status/result of a research workflow started by start_research. "
            + "If status is 'running', wait poll_after_seconds and call again.")]
            ToolInvocationContext context,
        // workflow_id is REQUIRED -> the schema-level dependency that forces start_research to
        // have been called first. This is what makes tool ordering robust without relying on the
        // agent's judgement.
        [McpToolProperty("workflow_id", "The workflow_id returned by start_research.", isRequired: true)]
            string workflowId,
        [DurableClient] DurableTaskClient durableClient)
    {
        // Single-instance lookup by id; returns null when no such instance exists.
        OrchestrationMetadata? metadata =
            await durableClient.GetInstanceAsync(workflowId, getInputsAndOutputs: true);

        if (metadata is null)
        {
            // Distinct from "failed": the work didn't error, the handle is unknown (bad id, or the
            // instance history was purged after its retention window). The agent's right move is to
            // start a fresh orchestration, not to keep polling, so it gets its own status.
            return Serialize(new ResearchResult(
                Status: "not_found",
                WorkflowId: workflowId,
                Error: $"No workflow found with id \"{workflowId}\"."));
        }

        // Identical status mapping as the budgeted-wait path. A workflow that fails AFTER the
        // budget (i.e. during polling) is reported exactly like one that fails during the wait.
        return Serialize(ToResult(metadata));
    }

    /// <summary>
    /// The single source of truth for mapping a Durable runtime status to the closed
    /// { completed | failed | running } contract, shared by the wait path and the poll path.
    ///
    /// "status" is a BEHAVIORAL signal: it tells the agent what to do (use the result / poll again /
    /// stop and likely start over). Failed, Terminated, and Canceled all drive the same action, so
    /// they share status "failed". To avoid losing information, the precise terminal state is
    /// preserved in "reason", and a human-readable detail in "error". "running" means ONLY the
    /// non-terminal states, so the agent never polls a workflow that is already finished.
    /// </summary>
    private static ResearchResult ToResult(OrchestrationMetadata metadata) =>
        metadata.RuntimeStatus switch
        {
            OrchestrationRuntimeStatus.Completed => new ResearchResult(
                "completed", metadata.InstanceId, Result: metadata.ReadOutputAs<string>()),

            OrchestrationRuntimeStatus.Failed
            or OrchestrationRuntimeStatus.Terminated
            or OrchestrationRuntimeStatus.Canceled => new ResearchResult(
                "failed", metadata.InstanceId,
                Reason: metadata.RuntimeStatus switch
                {
                    OrchestrationRuntimeStatus.Failed => "error",
                    OrchestrationRuntimeStatus.Terminated => "terminated",
                    _ => "canceled"
                },
                Error: DescribeFailure(metadata)),

            // Running / Pending / Suspended / ContinuedAsNew -> still in flight.
            _ => new ResearchResult(
                Status: "running",
                WorkflowId: metadata.InstanceId,
                PollAfterSeconds: 5,
                Next: $"Call get_research_result with workflow_id \"{metadata.InstanceId}\" in about 5 seconds.")
        };

    /// <summary>
    /// Pulls a human-readable detail per terminal state. The reason lives in a DIFFERENT place
    /// depending on how the orchestration ended, and none is guaranteed to be populated:
    ///   - Failed     -> FailureDetails.ErrorMessage
    ///   - Terminated -> the terminate reason is stored as the instance output
    ///   - Canceled   -> usually no detail
    /// Each branch falls back to a generic message so we never assume a property is set.
    /// </summary>
    private static string DescribeFailure(OrchestrationMetadata metadata) =>
        metadata.RuntimeStatus switch
        {
            OrchestrationRuntimeStatus.Failed =>
                metadata.FailureDetails?.ErrorMessage ?? "Orchestration failed.",
            OrchestrationRuntimeStatus.Terminated =>
                SafeOutput(metadata) ?? "Orchestration was terminated.",
            OrchestrationRuntimeStatus.Canceled =>
                SafeOutput(metadata) ?? "Orchestration was canceled.",
            _ => metadata.RuntimeStatus.ToString()
        };

    // Reads the output as a string when possible; falls back to the raw serialized payload.
    private static string? SafeOutput(OrchestrationMetadata metadata)
    {
        if (string.IsNullOrEmpty(metadata.SerializedOutput))
            return null;
        try { return metadata.ReadOutputAs<string>(); }
        catch { return metadata.SerializedOutput; }
    }

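    private static readonly JsonSerializerOptions JsonOpts = new()
    {
        // Emit snake_case names (status, workflow_id, poll_after_seconds, next) so the payload
        // matches the field names the tool descriptions tell the agent to look for. Without a
        // naming policy the record would serialize as Status, WorkflowId, PollAfterSeconds.
        // JsonNamingPolicy.SnakeCaseLower requires System.Text.Json 8.0 or later.
        PropertyNamingPolicy = JsonNamingPolicy.SnakeCaseLower,
        DefaultIgnoreCondition = System.Text.Json.Serialization.JsonIgnoreCondition.WhenWritingNull
    };

    private static string Serialize(ResearchResult result) => JsonSerializer.Serialize(result, JsonOpts);
}
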
/// <summary>
/// The closed contract returned to the MCP client. Null fields are omitted from the JSON.
/// "status" drives agent behavior; "reason" preserves the precise terminal cause when status is
/// "failed" (error | terminated | canceled) so no information is lost.
/// </summary>
public record ResearchResult(
    string Status,
    string WorkflowId,
    string? Result = null,
    string? Reason = null,
    string? Error = null,
    int? PollAfterSeconds = null,
    string? Next = null);

@liliankasem

A few recommendations that would tighten it up before pointing customers at it:

Correctness

  • get_research_result should call GetInstanceAsync(workflowId, getInputsAndOutputs: true), not GetInstancesAsync (the latter is the multi-instance query API and doesn't take a bare id).
  • metadata.ReadOutputAs<string>() assumes the orchestrator returns a string. A fan-out/aggregate orchestrator almost always returns a structured result, in which case a successful run will throw at deserialization and the agent will see failed. Reading it as JsonElement (or object?) keeps the contract honest.
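The second point could be sketched roughly as follows, using only System.Text.Json. This is an illustration under assumptions, not the sample's code: `StructuredResearchResult` and `OutputReader` are hypothetical names, and the `serializedOutput` argument stands in for `OrchestrationMetadata.SerializedOutput`.

```csharp
using System.Text.Json;

// Hypothetical variant of the result contract: Result is a JsonElement, so a
// structured orchestrator output passes through without assuming it is a string.
public record StructuredResearchResult(
    string Status,
    string WorkflowId,
    JsonElement? Result = null);

public static class OutputReader
{
    // serializedOutput stands in for OrchestrationMetadata.SerializedOutput.
    // Returns null when there is no output or the payload is not valid JSON.
    public static JsonElement? TryParse(string? serializedOutput)
    {
        if (string.IsNullOrEmpty(serializedOutput))
            return null;
        try
        {
            using JsonDocument doc = JsonDocument.Parse(serializedOutput);
            return doc.RootElement.Clone(); // Clone so the element outlives the document
        }
        catch (JsonException)
        {
            return null;
        }
    }
}
```

A JSON string, object, or array all parse the same way here, so the mapping code no longer depends on the orchestrator's return type.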

Agent ergonomics

  • A flat poll_after_seconds: 5 is rough on long jobs, which is exactly the scenario this pattern targets. A 4-minute workflow becomes ~48 tool calls and ~48 LLM turns. Exponential backoff with a cap (e.g. 5s → 10s → 20s, max 30s) costs nothing and removes most of that overhead. Easiest way to thread it is an optional attempt parameter the agent increments, but even a server-side jitter is better than flat.
  • Consider letting the orchestrator publish a custom status (SetCustomStatus on the orchestration context) and surfacing it on the running response. This gives the agent something to say to the user during long waits instead of just "still running."
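The capped exponential backoff from the first bullet could look roughly like this. It is a minimal sketch: `PollBackoff` is a hypothetical helper, and `attempt` is the optional, agent-incremented parameter suggested above, which does not exist in the current tools.

```csharp
using System;

// Hypothetical helper: computes poll_after_seconds with capped exponential backoff.
public static class PollBackoff
{
    // attempt is zero-based. With the defaults: 0 -> 5s, 1 -> 10s, 2 -> 20s, 3 and later -> 30s.
    public static int PollAfterSeconds(int attempt, int baseSeconds = 5, int capSeconds = 30)
    {
        // Bound the exponent so the shift below cannot overflow.
        int safeAttempt = Math.Clamp(attempt, 0, 20);

        // Double the base delay once per attempt, then apply the cap.
        long delay = (long)baseSeconds << safeAttempt;
        return (int)Math.Min(delay, (long)capSeconds);
    }
}
```

With these defaults, the same 4-minute workflow needs about 10 polls (5 + 10 + 20, then 30s each) instead of ~48.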

Production hardening

  • The workflow_id is currently an unauthenticated bearer-style handle: anyone who sees it can read the result via get_research_result. In stateless MCP, this extension already attaches the caller's principal per request, so stamping the orchestration input with the user identity and checking it in the poll tool is a small change worth calling out before customers ship this.
  • Durable purges instance history per its retention policy, so a slow agent that polls hours later will legitimately get not_found for a workflow that actually ran. Worth mentioning in the Q&A so it isn't read as "bad id."
  • The budgeted wait parks a Functions worker for the budget duration per call. Not a bug, but worth flagging the trade-off so readers tune the budget for their concurrency, or set it to 0 on hot paths and skip the inline-fast-path entirely.
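The ownership check from the first bullet could be sketched as below. Everything here is an assumption for illustration: `ResearchInput` and `WorkflowAccess` are hypothetical names, and how the caller's identity is read from the request is deliberately not shown.

```csharp
using System;

// Hypothetical orchestration input that records who started the workflow.
public record ResearchInput(string Topic, string OwnerId);

public static class WorkflowAccess
{
    // True only when both ids are present and match exactly (ordinal comparison).
    public static bool CanRead(string? ownerId, string? callerId) =>
        !string.IsNullOrEmpty(ownerId)
        && string.Equals(ownerId, callerId, StringComparison.Ordinal);
}
```

The poll tool would run this check before mapping the status. One option is to answer a mismatch with the same not_found payload as an unknown id, so the response does not confirm that the workflow exists.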

Small nit

  • The notifications/progress aside reads as if it's available today. The Functions MCP extension doesn't currently surface a way for a tool to emit progress mid-call (that's part of the streaming-output gap in Azure/azure-functions-mcp-extension#251), so I'd either drop it or tag it as "once #251 lands."
