Status: workaround. This sample demonstrates a pragmatic way to run long-running MCP tools today, before the MCP Task extension (SEP-2663) is available in the Azure Functions MCP trigger. Once tasks land, the protocol handles async natively (server returns resultType: "task", client polls tasks/get via the SDK) and this two-tool workaround becomes unnecessary.
An MCP tools/call is request/response. If a tool kicks off work that takes minutes, the client's
request timeout fires long before the work finishes, and the agent sees a failed tool call — even
though the work may still be running. Client tool-call timeouts are not standardized by the MCP
spec; in practice they're often in the ~30–60s range and vary per client. So we must not block a
single tool call for the full duration of a long workflow.
We expose two MCP tools:
- start_research: starts a Durable orchestration (which fans out to multiple data sources in parallel and aggregates), then awaits completion up to a short budget (~20s, configurable).
  - If the workflow finishes within budget → return the result inline. The second tool is never needed. This is the common case and it removes any "did the agent remember to poll?" risk.
  - If the budget expires → return a handle (workflow_id) plus an explicit instruction to poll. The orchestration keeps running in Durable storage regardless of the client connection.
- get_research_result: takes the workflow_id (a required parameter) and returns the current state: completed (with result), failed (with error), running (poll again), or not_found (unknown id).
Ordering is made robust by design, not by hoping the model behaves: workflow_id is a required
parameter of the poll tool (so the agent can't poll without first starting), the "running" response
carries poll_after_seconds and a next instruction, and the budgeted wait means fast workflows
never hit the second tool at all.
Known weakness. Even with all of the above, the poll path still relies on the LLM correctly remembering (and not hallucinating) the orchestration instance ID it was handed. If the model garbles or invents a workflow_id, the poll lands on the wrong instance or none at all (which is why get_research_result returns not_found rather than guessing). The budgeted wait mitigates this by resolving most calls without a second hop, but it's the core reason the MCP Task extension (SEP-2663), where the SDK rather than the model carries the handle, is the better long-term answer.

    [Function(nameof(StartResearch))]
    public async Task<string> StartResearch(
        [McpToolTrigger("start_research",
            "Researches a topic by gathering information from multiple sources in parallel. "
            + "Returns the result directly if quick; otherwise returns a workflow_id to poll.")]
            ToolInvocationContext context,
        [McpToolProperty("topic", "The subject to research.", isRequired: true)] string topic,
        [DurableClient] DurableTaskClient durableClient)
    {
        string instanceId = await durableClient.ScheduleNewOrchestrationInstanceAsync(
            nameof(ResearchOrchestrator.RunOrchestrator), topic);

        // Budgeted wait: blocks until the orchestration is terminal OR the budget cancels the token.
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(_waitBudgetSeconds));
        try
        {
            OrchestrationMetadata metadata = await durableClient.WaitForInstanceCompletionAsync(
                instanceId, getInputsAndOutputs: true, cts.Token);

            // Finished within budget -> return the terminal result (completed or failed) inline.
            return Serialize(ToResult(metadata));
        }
        catch (OperationCanceledException)
        {
            // Budget expired. The orchestration is NOT lost; it keeps running in Durable storage.
            // Hand back a poll handle plus explicit next-step guidance for the agent.
            return Serialize(new ResearchResult(
                Status: "running",
                WorkflowId: instanceId,
                PollAfterSeconds: 5,
                Next: $"Call get_research_result with workflow_id \"{instanceId}\" in about 5 seconds."));
        }
    }

    [Function(nameof(GetResearchResult))]
    public async Task<string> GetResearchResult(
        [McpToolTrigger("get_research_result",
            "Gets the status/result of a workflow started by start_research. "
            + "If status is 'running', wait poll_after_seconds and call again.")]
            ToolInvocationContext context,
        // Required -> the schema-level dependency that forces start_research to have been called first.
        [McpToolProperty("workflow_id", "The workflow_id returned by start_research.", isRequired: true)]
            string workflowId,
        [DurableClient] DurableTaskClient durableClient)
    {
        OrchestrationMetadata? metadata =
            await durableClient.GetInstancesAsync(workflowId, getInputsAndOutputs: true);

        if (metadata is null)
            return Serialize(new ResearchResult("not_found", workflowId,
                Error: $"No workflow found with id \"{workflowId}\"."));

        // Same status mapping as the budgeted-wait path (see ToResult) so a workflow that fails
        // *during polling* is reported exactly like one that fails during the initial wait.
        return Serialize(ToResult(metadata));
    }

    // "status" is a BEHAVIORAL signal (use result / poll again / stop). Failed/Terminated/Canceled all
    // drive the same action, so they share status "failed"; the precise cause is preserved in "reason"
    // so no info is lost. "running" means ONLY non-terminal states, so the agent never polls a finished
    // workflow. (A missing workflow_id is reported separately as "not_found"; see get_research_result.)
    private static ResearchResult ToResult(OrchestrationMetadata metadata) =>
        metadata.RuntimeStatus switch
        {
            OrchestrationRuntimeStatus.Completed => new ResearchResult(
                "completed", metadata.InstanceId, Result: metadata.ReadOutputAs<string>()),

            OrchestrationRuntimeStatus.Failed
                or OrchestrationRuntimeStatus.Terminated
                or OrchestrationRuntimeStatus.Canceled => new ResearchResult(
                    "failed", metadata.InstanceId,
                    Reason: metadata.RuntimeStatus switch
                    {
                        OrchestrationRuntimeStatus.Failed => "error",
                        OrchestrationRuntimeStatus.Terminated => "terminated",
                        _ => "canceled"
                    },
                    Error: DescribeFailure(metadata)), // pulls detail from the right place per state

            _ => new ResearchResult("running", metadata.InstanceId, PollAfterSeconds: 5,
                Next: $"Call get_research_result with workflow_id \"{metadata.InstanceId}\" again shortly.")
        };

See the complete, runnable code (constructor/config, JSON serialization, and the Durable fan-out orchestration) in ResearchTools.cs and ResearchOrchestrator.cs below.
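
The handlers above rely on a ResearchResult type and a Serialize helper that this excerpt does not show. As a rough sketch only, the following is one shape that would satisfy the call sites above. The parameter order, the ResearchJson class name, the snake_case naming policy, and the choice to omit null fields are all assumptions inferred from the calls and from the workflow_id / poll_after_seconds names in the prose. The real definitions live in ResearchTools.cs and may differ.

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical shape inferred from the call sites; not copied from ResearchTools.cs.
// Only Status and WorkflowId are mandatory; everything else is set per status.
public sealed record ResearchResult(
    string Status,
    string WorkflowId,
    string? Result = null,
    string? Reason = null,
    string? Error = null,
    int? PollAfterSeconds = null,
    string? Next = null);

public static class ResearchJson
{
    // Assumption: snake_case keys and omitted nulls, so a "running" payload
    // carries only status, workflow_id, poll_after_seconds, and next.
    private static readonly JsonSerializerOptions Options = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.SnakeCaseLower, // available from .NET 8
        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull
    };

    public static string Serialize(ResearchResult result) =>
        JsonSerializer.Serialize(result, Options);
}
```

Under these assumptions, `ResearchJson.Serialize(new ResearchResult("running", "abc123", PollAfterSeconds: 5))` yields `{"status":"running","workflow_id":"abc123","poll_after_seconds":5}`. Leaving out null fields keeps each payload limited to what the agent needs for its next step.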
Q: Will agents be smart enough to call start then poll in the right order?
Make workflow_id a required parameter of the poll
tool (the schema enforces ordering — the agent can't poll without a value only start_research
produces), put next/poll_after_seconds instructions in the result payload, and make the poll
tool self-correcting via its status field. The budgeted wait then removes the second tool
entirely for fast workflows.
Q: Why is the wait budget ~20s — what bounds it?
The client tool-call timeout, not the Functions host timeout. The host timeout on Flex/Premium is
generous (minutes), but the client may give up in ~30s, and that's non-standard and varies per client.
So default the budget conservatively (~20s, as an app setting) to stay under the most aggressive
clients, and rely on the poll fallback for anything longer. notifications/progress could extend the
window on clients that honor it, but it's optional and client-dependent, so it's left out here for
clarity.
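
One way to make the budget an app setting is sketched below. Azure Functions exposes app settings to the process as environment variables. The setting name RESEARCH_WAIT_BUDGET_SECONDS, the WaitBudget class, and the 25-second upper clamp are hypothetical choices for illustration, not values taken from the sample.

```csharp
using System;

public static class WaitBudget
{
    // Hypothetical names and bounds; adjust to whatever the sample actually uses.
    public const string SettingName = "RESEARCH_WAIT_BUDGET_SECONDS";
    public const int DefaultSeconds = 20; // conservative default from the text above
    public const int MaxSeconds = 25;     // assumed clamp to stay under ~30s client timeouts

    // Falls back to the default on missing, non-numeric, or non-positive input,
    // and clamps oversized values so a misconfiguration cannot exceed the ceiling.
    public static int Parse(string? raw) =>
        int.TryParse(raw, out int seconds) && seconds > 0
            ? Math.Min(seconds, MaxSeconds)
            : DefaultSeconds;

    public static int FromEnvironment() =>
        Parse(Environment.GetEnvironmentVariable(SettingName));
}
```

A constructor could then assign `_waitBudgetSeconds = WaitBudget.FromEnvironment();`. The clamp is a design choice: a typo such as 200 would otherwise reintroduce the client timeout that the budget exists to avoid.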
Q: What happens if the orchestration fails — and why is status not split into failed/terminated/canceled?
Because status exists to direct the agent's behavior, not to mirror Durable's enum. A Failed
(unhandled exception), Terminated (killed via the management API), and Canceled (graceful stop)
all lead to the same next action — stop polling, surface what happened, and likely start a fresh
orchestration — so they share status: "failed". Keeping the status set small and action-oriented
keeps even weaker models reliable on the part that controls the loop. To avoid losing information,
the precise terminal state is preserved in a separate reason field (error | terminated | canceled) plus a human-readable error message. This also fixes a subtle inaccuracy: a
terminated or canceled workflow isn't really an "error," so it shouldn't be labeled one at the
headline — it's a reason, not the status.
A missing workflow_id is deliberately not failed either — it gets its own not_found status,
because the work didn't error; the handle is just unknown (bad id, or the instance history was purged
after its retention window). The right recovery is to start over, not to keep polling.
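
The status contract described above can be read as a small decision table. The sketch below writes it out for a programmatic caller (for example, a test harness or a non-LLM client). The AgentAction enum and the StatusPolicy class are hypothetical names introduced here, not part of the sample.

```csharp
using System;

// Hypothetical names; they only illustrate the four-way contract from the text.
public enum AgentAction { UseResult, PollAgain, StopAndReport, StartOver }

public static class StatusPolicy
{
    // Maps the "status" field of a tool response to the caller's next step.
    // "reason" and "error" are deliberately not consulted: they describe what
    // happened, while status alone decides what to do next.
    public static AgentAction NextAction(string status) => status switch
    {
        "completed" => AgentAction.UseResult,      // result is in the payload
        "running"   => AgentAction.PollAgain,      // wait poll_after_seconds first
        "failed"    => AgentAction.StopAndReport,  // error, terminated, or canceled
        "not_found" => AgentAction.StartOver,      // unknown or purged workflow_id
        _ => throw new ArgumentOutOfRangeException(nameof(status), status, "Unknown status")
    };
}
```

Because only four values drive the loop, adding a new Durable runtime state does not change the caller's logic; it only changes which of the four buckets ToResult assigns it to.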