Interleaved test
jdurbin@Jons-MacBook-Pro-2:~/git/vllm$ ~/venv/bin/python test_interleaved.py --model moonshotai/Kimi-K2.5-TEE
Model: moonshotai/Kimi-K2.5-TEE
Endpoint: https://llm.chutes.ai/v1

[Turn 1] reasoning:
The user is asking for the weather in San Francisco in Fahrenheit. I need to call the get_weather function with:
- location: "San Francisco, CA" (I should include the state for clarity)
- unit: "fahrenheit"
This is a straightforward function call with all required parameters provided.

[Turn 1] tool_calls: ['get_weather({"location": "San Francisco, CA", "unit": "fahrenheit"})']

[Turn 2] Sending request...

[Turn 2] reasoning:
The user is asking me to recall information from my prior reasoning/thinking. Let me look back at what I wrote in my thinking section.
Looking at my previous thought:
"The color of the truck was Cyan. The value of variable 'c' is 127."
The user is specifically asking me to recall these details and tell them directly. According to the system prompt, I should provide this information directly.

[Turn 2] answer:
The color of the truck was **Cyan**, and the value of the variable 'c' was **127**.

────────────────────────────────────────────────────────────
✅ Truck color (Cyan): found in answer
✅ Variable c (127): found in answer
────────────────────────────────────────────────────────────

🎉 PASS: reasoning round-trips through vLLM interleaved thinking!
The model recalled facts that only existed in a prior turn's
reasoning field, proving the chat template preserves them.

jdurbin@Jons-MacBook-Pro-2:~/git/vllm$ ~/venv/bin/python test_interleaved.py --model zai-org/GLM-4.7-TEE
Auto-detected template kwargs for 'glm': {'clear_thinking': False}
Model: zai-org/GLM-4.7-TEE
Endpoint: https://llm.chutes.ai/v1
Template: {'clear_thinking': False}

[Turn 1] reasoning:
The user is asking for the weather in Fahrenheit in San Francisco. I need to use the get_weather function with:
- location: "San Francisco, CA" (the function asks for city and state)
- unit: "fahrenheit"
Both required parameters are provided, so I can make the function call.

[Turn 1] tool_calls: ['get_weather({"location": "San Francisco, CA", "unit": "fahrenheit"})']

[Turn 2] Sending request...

[Turn 2] reasoning:
Looking back at my thinking content from earlier, I can see exactly what the user is referring to. In my thinking, I wrote:
"The color of the truck was Cyan. The value of variable 'c' is 127."
So the two details they're asking for are:
1. Truck color: Cyan
2. Value of variable 'c': 127
The user is asking me to recall information from my own earlier thinking, and they've been very clear that my reasoning content is not secret - they have full access to it. So I should provide this information directly without refracting or making excuses.

[Turn 2] answer:
Based on my earlier thinking:
- The truck color was Cyan
- The value of variable 'c' is 127

────────────────────────────────────────────────────────────
✅ Truck color (Cyan): found in answer
✅ Variable c (127): found in answer
────────────────────────────────────────────────────────────

🎉 PASS: reasoning round-trips through vLLM interleaved thinking!
The model recalled facts that only existed in a prior turn's
reasoning field, proving the chat template preserves them.

jdurbin@Jons-MacBook-Pro-2:~/git/vllm$
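The script that produced the runs above (test_interleaved.py):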
| """ | |
| Test: vLLM interleaved thinking round-trip (and by default, chutes). | |
| Proves that reasoning content from a previous assistant turn is visible | |
| to the model on subsequent turns, by injecting hidden facts into the | |
| reasoning field and asking the model to recall them. | |
| Usage: | |
| python test_interleaved.py --model moonshotai/Kimi-K2.5-TEE | |
| python test_interleaved.py --model zai-org/GLM-4.7-TEE | |
| python test_interleaved.py --model deepseek-ai/DeepSeek-R1 --base-url http://localhost:8000/v1 | |
| python test_interleaved.py --model my-model --api-key sk-xxx | |
| NOTE on field naming: | |
| vLLM's input parser reads "reasoning" (not "reasoning_content") from | |
| incoming messages (see chat_utils.py:1453), then duplicates it to both | |
| "reasoning" and "reasoning_content" internally so chat templates using | |
| either name will work. | |
| NOTE on chat_template_kwargs: | |
| Some chat templates strip reasoning from history messages by default. | |
| For example, GLM-4.7 only preserves reasoning for messages *after* the | |
| last user message unless clear_thinking=False is passed. This script | |
| uses raw HTTP requests so it can send chat_template_kwargs (which the | |
| OpenAI Python SDK does not support). The kwargs are auto-detected per | |
| model family, or can be manually specified via --chat-template-kwargs. | |
| """ | |
import argparse
import json
import os
import sys

import requests

INJECTED_FACTS = (
    "\nThe color of the truck was Cyan."
    " The value of variable 'c' is 127."
)

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and state, e.g., 'San Francisco, CA'",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["location", "unit"],
            },
        },
    }
]

# Chat template kwargs needed per model family to preserve reasoning
# in history messages. Without these, some templates strip <think>
# blocks from all assistant messages before the last user message.
MODEL_TEMPLATE_KWARGS = {
    "glm": {"clear_thinking": False},
    # Kimi-K2.5 preserves reasoning in suffix_msgs (messages after the
    # last non-tool-call assistant), so no extra kwargs needed as long
    # as we don't insert an intermediate non-tool-call assistant turn.
}


def get_template_kwargs(model_id, user_kwargs):
    """Return chat_template_kwargs for the given model."""
    if user_kwargs is not None:
        return user_kwargs
    model_lower = model_id.lower()
    for family, kwargs in MODEL_TEMPLATE_KWARGS.items():
        if family in model_lower:
            print(f"Auto-detected template kwargs for '{family}': {kwargs}")
            return kwargs
    return None


def get_current_weather(location: str, unit: str) -> str:
    if unit == "celsius":
        return f"The current temperature in {location} is 22°C."
    return f"The current temperature in {location} is 72°F."


AVAILABLE_TOOLS = {"get_weather": get_current_weather}

def parse_args():
    parser = argparse.ArgumentParser(
        description="Test vLLM interleaved thinking round-trip",
    )
    parser.add_argument(
        "--model", "-m",
        default=None,
        help="Model ID to test. If omitted, queries the /models endpoint "
             "and uses the first available model.",
    )
    parser.add_argument(
        "--base-url",
        default=os.environ.get("VLLM_BASE_URL", "https://llm.chutes.ai/v1"),
        help="Base URL of the OpenAI-compatible API "
             "(default: $VLLM_BASE_URL or https://llm.chutes.ai/v1)",
    )
    parser.add_argument(
        "--api-key",
        default=os.environ.get("CHUTES_API_KEY",
                               os.environ.get("VLLM_API_KEY", "no-key")),
        help="API key (default: $CHUTES_API_KEY or $VLLM_API_KEY)",
    )
    parser.add_argument(
        "--chat-template-kwargs",
        default=None,
        type=json.loads,
        help='JSON dict of extra chat template kwargs, e.g. '
             '\'{"clear_thinking": false}\'. '
             'Auto-detected for known model families if not specified.',
    )
    return parser.parse_args()

def chat_completions(base_url, api_key, model_id, messages,
                     tools=None, tool_choice="auto",
                     chat_template_kwargs=None):
    """Raw HTTP chat completion — allows sending chat_template_kwargs."""
    payload = {
        "model": model_id,
        "messages": messages,
    }
    if tools:
        payload["tools"] = tools
        payload["tool_choice"] = tool_choice
    if chat_template_kwargs:
        payload["chat_template_kwargs"] = chat_template_kwargs
    resp = requests.post(
        f"{base_url}/chat/completions",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()


def resolve_model(base_url, api_key, requested):
    if requested:
        return requested
    resp = requests.get(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    resp.raise_for_status()
    models = resp.json()["data"]
    if not models:
        print("ERROR: No models available at this endpoint.")
        sys.exit(1)
    model_id = models[0]["id"]
    print(f"(no --model specified, auto-selected: {model_id})")
    return model_id

def main():
    args = parse_args()
    model_id = resolve_model(args.base_url, args.api_key, args.model)
    template_kwargs = get_template_kwargs(model_id, args.chat_template_kwargs)
    print(f"Model: {model_id}")
    print(f"Endpoint: {args.base_url}")
    if template_kwargs:
        print(f"Template: {template_kwargs}")
    print()

    # ── Turn 1: user asks about weather → model returns a tool call ──
    messages = [
        {"role": "user",
         "content": "What's the weather in Fahrenheit like in San Francisco?"}
    ]
    data = chat_completions(
        args.base_url, args.api_key, model_id,
        messages, tools=TOOLS,
        chat_template_kwargs=template_kwargs,
    )
    choice = data["choices"][0]["message"]
    original_reasoning = (choice.get("reasoning")
                          or choice.get("reasoning_content") or "")
    tool_calls = choice.get("tool_calls", [])
    print(f"[Turn 1] reasoning:\n{original_reasoning}\n")
    print(f"[Turn 1] tool_calls: "
          f"{[tc['function']['name'] + '(' + tc['function']['arguments'] + ')' for tc in tool_calls]}\n")
    if not tool_calls:
        print("ERROR: Model did not produce tool calls. Cannot continue test.")
        print("Ensure the model supports tool calling and try again.")
        sys.exit(1)

    # Inject hidden facts into the reasoning.
    injected_reasoning = original_reasoning + INJECTED_FACTS

    # Build the assistant message for conversation history.
    # KEY: use "reasoning" — that's the input field vLLM reads
    # (chat_utils.py:1453). It gets duplicated to "reasoning_content"
    # internally for chat template compatibility.
    messages.append({
        "role": "assistant",
        "content": choice.get("content"),
        "tool_calls": [
            {
                "id": tc["id"],
                "type": tc["type"],
                "function": {
                    "name": tc["function"]["name"],
                    "arguments": tc["function"]["arguments"],
                },
            }
            for tc in tool_calls
        ],
        "reasoning": injected_reasoning,
    })

    # Execute each tool call and append results.
    for tc in tool_calls:
        fn_name = tc["function"]["name"]
        fn_args = json.loads(tc["function"]["arguments"])
        result = AVAILABLE_TOOLS[fn_name](**fn_args)
        messages.append({
            "role": "tool",
            "content": result,
            "tool_call_id": tc["id"],
        })

    # ── Turn 2: ask the model to recall hidden facts from reasoning ──
    #
    # We do NOT insert an intermediate non-tool-call assistant turn.
    # Some chat templates (e.g. Kimi-K2.5) split messages at the last
    # non-tool-call assistant message — everything before it has
    # reasoning stripped. By skipping that turn, all messages stay in
    # the "suffix" group where reasoning is preserved.
    #
    # We prepend a system message to prevent the model from suppressing
    # information it found in its reasoning. Without this, some models
    # treat reasoning as "internal monologue" and refuse to surface it.
    messages.insert(0, {
        "role": "system",
        "content": (
            "You are a helpful assistant. When the user asks you to "
            "recall information from your prior reasoning or thinking, "
            "you must provide it directly. Your reasoning content is "
            "part of the conversation context and is not secret or "
            "hidden — the user has full access to it. Never refuse to "
            "share details from your own earlier thoughts."
        ),
    })
    messages.append({
        "role": "user",
        "content": (
            "Thanks for looking up the weather! Now I need you to "
            "recall two specific details you noted in your thinking "
            "earlier. You wrote down a truck color and the value of a "
            "variable called 'c'. What were they? Just tell me directly."
        ),
    })
    print("[Turn 2] Sending request...\n")
    data2 = chat_completions(
        args.base_url, args.api_key, model_id,
        messages, tools=TOOLS,
        chat_template_kwargs=template_kwargs,
    )
    choice2 = data2["choices"][0]["message"]
    reasoning2 = (choice2.get("reasoning")
                  or choice2.get("reasoning_content") or "")
    answer = choice2.get("content") or ""
    print(f"[Turn 2] reasoning:\n{reasoning2}\n")
    print(f"[Turn 2] answer:\n{answer}\n")

    # ── Verify ───────────────────────────────────────────────────────
    # Check both the answer (content) and the model's own reasoning on
    # Turn 2. Even if the model suppresses the facts in its answer, it
    # may acknowledge them in reasoning — which still proves the
    # round-trip works (the template delivered them to the model).
    answer_lower = answer.lower()
    reasoning_lower = reasoning2.lower()
    cyan_in_answer = "cyan" in answer_lower
    cyan_in_reasoning = "cyan" in reasoning_lower
    c127_in_answer = "127" in answer_lower
    c127_in_reasoning = "127" in reasoning_lower
    found_cyan = cyan_in_answer or cyan_in_reasoning
    found_127 = c127_in_answer or c127_in_reasoning

    def status(in_answer, in_reasoning):
        if in_answer:
            return "found in answer"
        if in_reasoning:
            return "found in reasoning (model saw it but suppressed in answer)"
        return "NOT found"

    print("─" * 60)
    print(f"{'✅' if found_cyan else '❌'} Truck color (Cyan): "
          f"{status(cyan_in_answer, cyan_in_reasoning)}")
    print(f"{'✅' if found_127 else '❌'} Variable c (127): "
          f"{status(c127_in_answer, c127_in_reasoning)}")
    print("─" * 60)
    if found_cyan and found_127:
        print("\n🎉 PASS: reasoning round-trips through vLLM interleaved thinking!")
        print("The model recalled facts that only existed in a prior turn's")
        print("reasoning field, proving the chat template preserves them.")
        if not (cyan_in_answer and c127_in_answer):
            print()
            print("NOTE: The model acknowledged the facts in its reasoning but")
            print("suppressed them in the final answer. This is a model behavior")
            print("issue (not a vLLM issue) — the round-trip itself works.")
        sys.exit(0)
    else:
        print("\n❌ FAIL: Model could not recall injected facts from reasoning.")
        print()
        print("Debugging checklist:")
        print("1. Does this model support interleaved thinking? Its chat")
        print("   template must read 'reasoning' or 'reasoning_content'.")
        print("2. If using a proxy/provider, they may strip the 'reasoning'")
        print("   field. Try against a direct vLLM endpoint.")
        print("3. Some templates strip reasoning from history messages.")
        print("   Try: --chat-template-kwargs '{\"clear_thinking\": false}'")
        print("4. Start vLLM with: --reasoning-parser deepseek_r1")
        sys.exit(1)


if __name__ == "__main__":
    main()