Interleaved test
jdurbin@Jons-MacBook-Pro-2:~/git/vllm$ ~/venv/bin/python test_interleaved.py --model moonshotai/Kimi-K2.5-TEE
Model: moonshotai/Kimi-K2.5-TEE
Endpoint: https://llm.chutes.ai/v1

[Turn 1] reasoning:
The user is asking for the weather in San Francisco in Fahrenheit. I need to call the get_weather function with:
- location: "San Francisco, CA" (I should include the state for clarity)
- unit: "fahrenheit"
This is a straightforward function call with all required parameters provided.

[Turn 1] tool_calls: ['get_weather({"location": "San Francisco, CA", "unit": "fahrenheit"})']

[Turn 2] Sending request...

[Turn 2] reasoning:
The user is asking me to recall information from my prior reasoning/thinking. Let me look back at what I wrote in my thinking section.
Looking at my previous thought:
"The color of the truck was Cyan. The value of variable 'c' is 127."
The user is specifically asking me to recall these details and tell them directly. According to the system prompt, I should provide this information directly.

[Turn 2] answer:
The color of the truck was **Cyan**, and the value of the variable 'c' was **127**.

────────────────────────────────────────────────────────────
✅ Truck color (Cyan): found in answer
✅ Variable c (127): found in answer
────────────────────────────────────────────────────────────

🎉 PASS: reasoning round-trips through vLLM interleaved thinking!
The model recalled facts that only existed in a prior turn's
reasoning field, proving the chat template preserves them.

jdurbin@Jons-MacBook-Pro-2:~/git/vllm$ ~/venv/bin/python test_interleaved.py --model zai-org/GLM-4.7-TEE
Auto-detected template kwargs for 'glm': {'clear_thinking': False}
Model: zai-org/GLM-4.7-TEE
Endpoint: https://llm.chutes.ai/v1
Template: {'clear_thinking': False}

[Turn 1] reasoning:
The user is asking for the weather in Fahrenheit in San Francisco. I need to use the get_weather function with:
- location: "San Francisco, CA" (the function asks for city and state)
- unit: "fahrenheit"
Both required parameters are provided, so I can make the function call.

[Turn 1] tool_calls: ['get_weather({"location": "San Francisco, CA", "unit": "fahrenheit"})']

[Turn 2] Sending request...

[Turn 2] reasoning:
Looking back at my thinking content from earlier, I can see exactly what the user is referring to. In my thinking, I wrote:
"The color of the truck was Cyan. The value of variable 'c' is 127."
So the two details they're asking for are:
1. Truck color: Cyan
2. Value of variable 'c': 127
The user is asking me to recall information from my own earlier thinking, and they've been very clear that my reasoning content is not secret - they have full access to it. So I should provide this information directly without refracting or making excuses.

[Turn 2] answer:
Based on my earlier thinking:
- The truck color was Cyan
- The value of variable 'c' is 127

────────────────────────────────────────────────────────────
✅ Truck color (Cyan): found in answer
✅ Variable c (127): found in answer
────────────────────────────────────────────────────────────

🎉 PASS: reasoning round-trips through vLLM interleaved thinking!
The model recalled facts that only existed in a prior turn's
reasoning field, proving the chat template preserves them.

jdurbin@Jons-MacBook-Pro-2:~/git/vllm$
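The script that produced the runs above (test_interleaved.py):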
| """ | |
| Test: vLLM interleaved thinking round-trip (and by default, chutes). | |
| Proves that reasoning content from a previous assistant turn is visible | |
| to the model on subsequent turns, by injecting hidden facts into the | |
| reasoning field and asking the model to recall them. | |
| Usage: | |
| python test_interleaved.py --model moonshotai/Kimi-K2.5-TEE | |
| python test_interleaved.py --model zai-org/GLM-4.7-TEE | |
| python test_interleaved.py --model deepseek-ai/DeepSeek-R1 --base-url http://localhost:8000/v1 | |
| python test_interleaved.py --model my-model --api-key sk-xxx | |
| NOTE on field naming: | |
| vLLM's input parser reads "reasoning" (not "reasoning_content") from | |
| incoming messages (see chat_utils.py:1453), then duplicates it to both | |
| "reasoning" and "reasoning_content" internally so chat templates using | |
| either name will work. | |
| NOTE on chat_template_kwargs: | |
| Some chat templates strip reasoning from history messages by default. | |
| For example, GLM-4.7 only preserves reasoning for messages *after* the | |
| last user message unless clear_thinking=False is passed. This script | |
| uses raw HTTP requests so it can send chat_template_kwargs (which the | |
| OpenAI Python SDK does not support). The kwargs are auto-detected per | |
| model family, or can be manually specified via --chat-template-kwargs. | |
| """ | |
import argparse
import json
import os
import sys

import requests

INJECTED_FACTS = (
    "\nThe color of the truck was Cyan."
    " The value of variable 'c' is 127."
)

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and state, e.g., 'San Francisco, CA'",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["location", "unit"],
            },
        },
    }
]

# Chat template kwargs needed per model family to preserve reasoning
# in history messages. Without these, some templates strip <think>
# blocks from all assistant messages before the last user message.
MODEL_TEMPLATE_KWARGS = {
    "glm": {"clear_thinking": False},
    # Kimi-K2.5 preserves reasoning in suffix_msgs (messages after the
    # last non-tool-call assistant), so no extra kwargs needed as long
    # as we don't insert an intermediate non-tool-call assistant turn.
}


def get_template_kwargs(model_id, user_kwargs):
    """Return chat_template_kwargs for the given model."""
    if user_kwargs is not None:
        return user_kwargs
    model_lower = model_id.lower()
    for family, kwargs in MODEL_TEMPLATE_KWARGS.items():
        if family in model_lower:
            print(f"Auto-detected template kwargs for '{family}': {kwargs}")
            return kwargs
    return None


def get_current_weather(location: str, unit: str) -> str:
    if unit == "celsius":
        return f"The current temperature in {location} is 22°C."
    return f"The current temperature in {location} is 72°F."


AVAILABLE_TOOLS = {"get_weather": get_current_weather}

def parse_args():
    parser = argparse.ArgumentParser(
        description="Test vLLM interleaved thinking round-trip",
    )
    parser.add_argument(
        "--model", "-m",
        default=None,
        help="Model ID to test. If omitted, queries the /models endpoint "
             "and uses the first available model.",
    )
    parser.add_argument(
        "--base-url",
        default=os.environ.get("VLLM_BASE_URL", "https://llm.chutes.ai/v1"),
        help="Base URL of the OpenAI-compatible API "
             "(default: $VLLM_BASE_URL or https://llm.chutes.ai/v1)",
    )
    parser.add_argument(
        "--api-key",
        default=os.environ.get("CHUTES_API_KEY",
                               os.environ.get("VLLM_API_KEY", "no-key")),
        help="API key (default: $CHUTES_API_KEY or $VLLM_API_KEY)",
    )
    parser.add_argument(
        "--chat-template-kwargs",
        default=None,
        type=json.loads,
        help='JSON dict of extra chat template kwargs, e.g. '
             '\'{"clear_thinking": false}\'. '
             'Auto-detected for known model families if not specified.',
    )
    return parser.parse_args()

def chat_completions(base_url, api_key, model_id, messages,
                     tools=None, tool_choice="auto",
                     chat_template_kwargs=None):
    """Raw HTTP chat completion — allows sending chat_template_kwargs."""
    payload = {
        "model": model_id,
        "messages": messages,
    }
    if tools:
        payload["tools"] = tools
        payload["tool_choice"] = tool_choice
    if chat_template_kwargs:
        payload["chat_template_kwargs"] = chat_template_kwargs
    resp = requests.post(
        f"{base_url}/chat/completions",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()


def resolve_model(base_url, api_key, requested):
    if requested:
        return requested
    resp = requests.get(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    resp.raise_for_status()
    models = resp.json()["data"]
    if not models:
        print("ERROR: No models available at this endpoint.")
        sys.exit(1)
    model_id = models[0]["id"]
    print(f"(no --model specified, auto-selected: {model_id})")
    return model_id

def main():
    args = parse_args()
    model_id = resolve_model(args.base_url, args.api_key, args.model)
    template_kwargs = get_template_kwargs(model_id, args.chat_template_kwargs)
    print(f"Model: {model_id}")
    print(f"Endpoint: {args.base_url}")
    if template_kwargs:
        print(f"Template: {template_kwargs}")
    print()

    # ── Turn 1: user asks about weather → model returns a tool call ──
    messages = [
        {"role": "user",
         "content": "What's the weather in Fahrenheit like in San Francisco?"}
    ]
    data = chat_completions(
        args.base_url, args.api_key, model_id,
        messages, tools=TOOLS,
        chat_template_kwargs=template_kwargs,
    )
    choice = data["choices"][0]["message"]
    original_reasoning = (choice.get("reasoning")
                          or choice.get("reasoning_content") or "")
    tool_calls = choice.get("tool_calls", [])
    print(f"[Turn 1] reasoning:\n{original_reasoning}\n")
    print(f"[Turn 1] tool_calls: "
          f"{[tc['function']['name'] + '(' + tc['function']['arguments'] + ')' for tc in tool_calls]}\n")
    if not tool_calls:
        print("ERROR: Model did not produce tool calls. Cannot continue test.")
        print("Ensure the model supports tool calling and try again.")
        sys.exit(1)

    # Inject hidden facts into the reasoning.
    injected_reasoning = original_reasoning + INJECTED_FACTS

    # Build the assistant message for conversation history.
    # KEY: use "reasoning" — that's the input field vLLM reads
    # (chat_utils.py:1453). It gets duplicated to "reasoning_content"
    # internally for chat template compatibility.
    messages.append({
        "role": "assistant",
        "content": choice.get("content"),
        "tool_calls": [
            {
                "id": tc["id"],
                "type": tc["type"],
                "function": {
                    "name": tc["function"]["name"],
                    "arguments": tc["function"]["arguments"],
                },
            }
            for tc in tool_calls
        ],
        "reasoning": injected_reasoning,
    })

    # Execute each tool call and append results.
    for tc in tool_calls:
        fn_name = tc["function"]["name"]
        fn_args = json.loads(tc["function"]["arguments"])
        result = AVAILABLE_TOOLS[fn_name](**fn_args)
        messages.append({
            "role": "tool",
            "content": result,
            "tool_call_id": tc["id"],
        })

    # ── Turn 2: ask the model to recall hidden facts from reasoning ──
    #
    # We do NOT insert an intermediate non-tool-call assistant turn.
    # Some chat templates (e.g. Kimi-K2.5) split messages at the last
    # non-tool-call assistant message — everything before it has
    # reasoning stripped. By skipping that turn, all messages stay in
    # the "suffix" group where reasoning is preserved.
    #
    # We prepend a system message to prevent the model from suppressing
    # information it found in its reasoning. Without this, some models
    # treat reasoning as "internal monologue" and refuse to surface it.
    messages.insert(0, {
        "role": "system",
        "content": (
            "You are a helpful assistant. When the user asks you to "
            "recall information from your prior reasoning or thinking, "
            "you must provide it directly. Your reasoning content is "
            "part of the conversation context and is not secret or "
            "hidden — the user has full access to it. Never refuse to "
            "share details from your own earlier thoughts."
        ),
    })
    messages.append({
        "role": "user",
        "content": (
            "Thanks for looking up the weather! Now I need you to "
            "recall two specific details you noted in your thinking "
            "earlier. You wrote down a truck color and the value of a "
            "variable called 'c'. What were they? Just tell me directly."
        ),
    })
    print("[Turn 2] Sending request...\n")
    data2 = chat_completions(
        args.base_url, args.api_key, model_id,
        messages, tools=TOOLS,
        chat_template_kwargs=template_kwargs,
    )
    choice2 = data2["choices"][0]["message"]
    reasoning2 = (choice2.get("reasoning")
                  or choice2.get("reasoning_content") or "")
    answer = choice2.get("content") or ""
    print(f"[Turn 2] reasoning:\n{reasoning2}\n")
    print(f"[Turn 2] answer:\n{answer}\n")

    # ── Verify ───────────────────────────────────────────────────────
    # Check both the answer (content) and the model's own reasoning on
    # Turn 2. Even if the model suppresses the facts in its answer, it
    # may acknowledge them in reasoning — which still proves the
    # round-trip works (the template delivered them to the model).
    answer_lower = answer.lower()
    reasoning_lower = reasoning2.lower()
    cyan_in_answer = "cyan" in answer_lower
    cyan_in_reasoning = "cyan" in reasoning_lower
    c127_in_answer = "127" in answer_lower
    c127_in_reasoning = "127" in reasoning_lower
    found_cyan = cyan_in_answer or cyan_in_reasoning
    found_127 = c127_in_answer or c127_in_reasoning

    def status(in_answer, in_reasoning):
        if in_answer:
            return "found in answer"
        if in_reasoning:
            return "found in reasoning (model saw it but suppressed in answer)"
        return "NOT found"

    print("─" * 60)
    print(f"{'✅' if found_cyan else '❌'} Truck color (Cyan): "
          f"{status(cyan_in_answer, cyan_in_reasoning)}")
    print(f"{'✅' if found_127 else '❌'} Variable c (127): "
          f"{status(c127_in_answer, c127_in_reasoning)}")
    print("─" * 60)
    if found_cyan and found_127:
        print("\n🎉 PASS: reasoning round-trips through vLLM interleaved thinking!")
        print("The model recalled facts that only existed in a prior turn's")
        print("reasoning field, proving the chat template preserves them.")
        if not (cyan_in_answer and c127_in_answer):
            print()
            print("NOTE: The model acknowledged the facts in its reasoning but")
            print("suppressed them in the final answer. This is a model behavior")
            print("issue (not a vLLM issue) — the round-trip itself works.")
        sys.exit(0)
    else:
        print("\n❌ FAIL: Model could not recall injected facts from reasoning.")
        print()
        print("Debugging checklist:")
        print("1. Does this model support interleaved thinking? Its chat")
        print("   template must read 'reasoning' or 'reasoning_content'.")
        print("2. If using a proxy/provider, they may strip the 'reasoning'")
        print("   field. Try against a direct vLLM endpoint.")
        print("3. Some templates strip reasoning from history messages.")
        print("   Try: --chat-template-kwargs '{\"clear_thinking\": false}'")
        print("4. Start vLLM with: --reasoning-parser deepseek_r1")
        sys.exit(1)


if __name__ == "__main__":
    main()