nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5post1 - Eagle3 speculative decoding with multiple MPI processes: OK
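The /app/tensorrt_llm# prompt below is inside the release container. A minimal sketch of how such a container might be launched (the flags, port mapping, and HF cache mount are assumptions, not taken from the original session; a gated Llama checkpoint would also need an HF token in the environment):

# Assumed container launch; flags and mounts are illustrative, not from the original log.
docker run --rm -it --gpus all --ipc=host \
  -p 8080:8080 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5post1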
/app/tensorrt_llm# cat > config.yaml << 'EOF'
speculative_config:
  decoding_type: Eagle3
  max_draft_len: 4
  speculative_model: yuhuili/EAGLE3-LLaMA3.1-Instruct-8B
kv_cache_config:
  free_gpu_memory_fraction: 0.70
  dtype: fp8
  enable_block_reuse: false
trust_remote_code: true
EOF
trtllm-serve meta-llama/Llama-3.1-8B-Instruct --tp_size 2 --config config.yaml --port 8080
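While the log below scrolls, the server can be polled from a second terminal until weight loading and CUDA graph warmup finish; a minimal sketch, assuming this build of trtllm-serve exposes the usual /health route on the chosen port (the route is an assumption, it does not appear in the log):

# Assumed readiness check; /health is an assumption, not shown in the log below.
until curl -sf http://localhost:8080/health > /dev/null; do
  echo "waiting for trtllm-serve on :8080 ..."
  sleep 5
done
echo "trtllm-serve is ready"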
/usr/local/lib/python3.12/dist-packages/torch/library.py:356: UserWarning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: flash_attn::_flash_attn_backward(Tensor dout, Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor(a6!)? dq, Tensor(a7!)? dk, Tensor(a8!)? dv, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left, SymInt window_size_right, float softcap, Tensor? alibi_slopes, bool deterministic, Tensor? rng_state=None) -> Tensor
registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922
dispatch key: ADInplaceOrView
previous kernel: no debug info
new kernel: registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922 (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.)
self.m.impl(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
[TensorRT-LLM] TensorRT LLM version: 1.3.0rc5.post1
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/serve/openai_protocol.py:108: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
class ResponseFormat(OpenAIBaseModel):
[2026-03-16 22:15:48] INFO utils.py:35: Downloading model_index.json from HF Hub for meta-llama/Llama-3.1-8B-Instruct...
[2026-03-16 22:15:48] INFO utils.py:43: Downloaded to /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659
[03/16/2026-22:15:48] [TRT-LLM] [W] Overriding speculative_config
[03/16/2026-22:15:48] [TRT-LLM] [W] Overriding kv_cache_config
[03/16/2026-22:15:48] [TRT-LLM] [I] Using LLM with PyTorch backend
[03/16/2026-22:15:48] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
[03/16/2026-22:15:48] [TRT-LLM] [I] start MpiSession with 2 workers
/usr/local/lib/python3.12/dist-packages/torch/library.py:356: UserWarning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: flash_attn::_flash_attn_backward(Tensor dout, Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor(a6!)? dq, Tensor(a7!)? dk, Tensor(a8!)? dv, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left, SymInt window_size_right, float softcap, Tensor? alibi_slopes, bool deterministic, Tensor? rng_state=None) -> Tensor
registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922
dispatch key: ADInplaceOrView
previous kernel: no debug info
new kernel: registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922 (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.)
self.m.impl(
/usr/local/lib/python3.12/dist-packages/torch/library.py:356: UserWarning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: flash_attn::_flash_attn_backward(Tensor dout, Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor(a6!)? dq, Tensor(a7!)? dk, Tensor(a8!)? dv, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left, SymInt window_size_right, float softcap, Tensor? alibi_slopes, bool deterministic, Tensor? rng_state=None) -> Tensor
registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922
dispatch key: ADInplaceOrView
previous kernel: no debug info
new kernel: registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922 (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.)
self.m.impl(
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package modelopt. Picked distribution: nvidia-modelopt
Multiple distributions found for package modelopt. Picked distribution: nvidia-modelopt
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
[TensorRT-LLM] TensorRT LLM version: 1.3.0rc5.post1
[TensorRT-LLM] TensorRT LLM version: 1.3.0rc5.post1
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/serve/openai_protocol.py:108: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
class ResponseFormat(OpenAIBaseModel):
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/serve/openai_protocol.py:108: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
class ResponseFormat(OpenAIBaseModel):
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
`torch_dtype` is deprecated! Use `dtype` instead!
[03/16/2026-22:15:59] [TRT-LLM] [W] Orchestrator is creating IPC executor
rank 0 using MpiPoolSession to spawn MPI processes
[03/16/2026-22:15:59] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[03/16/2026-22:15:59] [TRT-LLM] [I] Generating a new HMAC key for server worker_init_status_queue
[03/16/2026-22:15:59] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [W] Worker process 944 is affined to run on the following CPUs: [2, 194] (subset of all logical CPUs). This may harm performance if set incorrectly.
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [W] Worker process 944 has constrained CPU affinity but `TLLM_NUMA_AWARE_WORKER_AFFINITY` is not set. Removing CPU affinity constraints.
`torch_dtype` is deprecated! Use `dtype` instead!
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [W] Falling back to greedy decoding for Eagle3. If you want to use non-greedy sampling, please set allow_advanced_sampling=True.
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [I] ATTENTION RUNTIME FEATURES: AttentionRuntimeFeatures(chunked_prefill=False, cache_reuse=False, has_speculative_draft_tokens=False, chunk_size=8192, chunked_prefill_buffer_batch_size=4)
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [I] Validating KV Cache config against kv_cache_dtype="fp8"
`torch_dtype` is deprecated! Use `dtype` instead!
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [I] Use 7.92 GB for model weights.
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [I] Prefetching 14.96GB checkpoint files.
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [I] Prefetching /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00003-of-00004.safetensors to memory...
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [I] Prefetching /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00002-of-00004.safetensors to memory...
[03/16/2026-22:16:04] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00003-of-00004.safetensors.
[03/16/2026-22:16:04] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00002-of-00004.safetensors.
Loading safetensors weights in parallel: 0%| | 0/4 [00:00<?, ?it/s]
[03/16/2026-22:16:04] [TRT-LLM] [RANK 0] [I] Start to load safetensor file /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00002-of-00004.safetensors
[03/16/2026-22:16:04] [TRT-LLM] [RANK 0] [I] Start to load safetensor file /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00001-of-00004.safetensors
[03/16/2026-22:16:04] [TRT-LLM] [RANK 0] [I] Start to load safetensor file /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00003-of-00004.safetensors
[03/16/2026-22:16:04] [TRT-LLM] [RANK 0] [I] Start to load safetensor file /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00004-of-00004.safetensors
Loading safetensors weights in parallel: 100%|██████████| 4/4 [00:00<00:00, 318.82it/s]
Loading safetensors weights in parallel: 100%|██████████| 4/4 [00:00<00:00, 301.69it/s]
Loading weights concurrently: 100%|██████████| 709/709 [00:01<00:00, 517.67it/s]
Loading bin weights in parallel: 100%|██████████| 1/1 [00:00<00:00, 249.65it/s]
Loading weights concurrently: 100%|██████████| 709/709 [00:01<00:00, 499.12it/s]
Loading bin weights in parallel: 100%|██████████| 1/1 [00:00<00:00, 339.34it/s]
Loading weights concurrently: 100%|██████████| 28/28 [00:00<00:00, 328.35it/s]
Model init total -- 5.05s
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=131083], tokens per block=32, primary blocks=4223, secondary blocks=0, max sequence length=131083
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.12 GiB for max tokens in paged KV cache (135136).
Loading weights concurrently: 100%|██████████| 28/28 [00:00<00:00, 334.54it/s]
Model init total -- 5.15s
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] max_seq_len is not specified, using inferred value 131072
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=131083], tokens per block=32, primary blocks=4223, secondary blocks=0, max sequence length=131083
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.52 GiB for max tokens in paged KV cache (135136).
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] Using Sampler: Eagle3OneModelSampler
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] [create_py_executor] Created execution_stream: <torch.cuda.Stream device=cuda:0 cuda_stream=0x18305f50>
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] [kv cache manager] Primary/secondary blocks for window sizes set to {131083: (4223, 0)} for estimation dry run
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] [KVCacheManager] execution_stream: <torch.cuda.Stream device=cuda:0 cuda_stream=0x18305f50>
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=131083], tokens per block=32, primary blocks=4223, secondary blocks=0, max sequence length=131083
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.12 GiB for max tokens in paged KV cache (135136).
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] [kv cache manager] Primary/secondary blocks for window sizes set to {131083: (4223, 0)} for estimation dry run
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] [KVCacheManager] execution_stream: <torch.cuda.Stream device=cuda:0 cuda_stream=0x18305f50>
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=131083], tokens per block=32, primary blocks=4223, secondary blocks=0, max sequence length=131083
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.52 GiB for max tokens in paged KV cache (135136).
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] max_seq_len=131083, max_num_requests=2048, max_num_tokens=8192, max_batch_size=2048
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] cache_transceiver is disabled
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] [PyExecutor] execution_stream initialized: <torch.cuda.Stream device=cuda:0 cuda_stream=0x18305f50>.
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] Running autotuner warmup...
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[TensorRT-LLM][INFO] Successfully loaded NCCL library: libnccl.so
[TensorRT-LLM][INFO] Successfully loaded NCCL library: libnccl.so
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[03/16/2026-22:16:07] [TRT-LLM] [RANK 0] [I] PDL enabled
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 34169856 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 34169856 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 34169856 bytes to 86507520 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 34169856 bytes to 86507520 bytes
[03/16/2026-22:16:09] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[03/16/2026-22:16:09] [TRT-LLM] [RANK 0] [I] [Autotuner] Cache size after warmup is 14
[03/16/2026-22:16:09] [TRT-LLM] [RANK 0] [I] Creating CUDA graph instances for 34 batch sizes.
[03/16/2026-22:16:09] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=128, draft_len=4, max_seq_len=131072
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 86507520 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 86507520 bytes
[03/16/2026-22:16:09] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=64, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:09] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=32, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:09] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=31, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:10] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=30, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:10] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=29, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:10] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=28, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:10] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=27, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:10] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=26, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:10] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=25, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:11] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=24, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:11] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=23, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:11] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=22, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:11] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=21, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:11] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=20, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:11] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=19, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:12] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=18, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:12] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=17, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:12] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=16, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:12] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=15, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:12] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=14, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:12] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=13, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:12] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=12, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:13] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=11, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:13] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=10, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:13] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=9, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:13] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=8, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:13] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=7, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:13] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=6, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=5, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=4, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=3, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=2, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=1, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [W] `torch.isnan` or `torch.isinf` is not implemented for current kv cache dtype, related checks are skipped
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] global_steady_clock_offset at each rank: [0.0, 8.00006091594696e-06]
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] Setting global_steady_clock_offset: 0.0 seconds for rank 0
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] Memory used after loading model weights (inside torch) in memory usage profiling: 15.04 GiB
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] Memory used after loading model weights (outside torch) in memory usage profiling: 9.53 GiB
[TensorRT-LLM][WARNING] [kv cache manager] storeContextBlocks: Can not find sequence for request 2048
[TensorRT-LLM][WARNING] [kv cache manager] storeContextBlocks: Can not find sequence for request 2048
[TensorRT-LLM][WARNING] [kv cache manager] storeContextBlocks: Can not find sequence for request 2048
[TensorRT-LLM][WARNING] [kv cache manager] storeContextBlocks: Can not find sequence for request 2048
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Memory dynamically allocated during inference (inside torch) in memory usage profiling: 0.77 GiB
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Memory used outside torch (e.g., NCCL and CUDA graphs) in memory usage profiling: 9.59 GiB
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Peak memory during memory usage profiling (torch + non-torch): 25.40 GiB, available KV cache memory when calculating max tokens: 110.31 GiB, fraction is set 0.7, kv size is 33792. device total memory 178.35 GiB, , tmp kv_mem 4.64 GiB
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Estimated max memory in KV cache : 110.31 GiB
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [W] Both free_gpu_memory_fraction and max_tokens are set (to 0.7 and 3505206 with free memory 158.35711669921875GiB of total memory 178.35107421875GiB, respectively). The smaller value will be used.
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] [KVCacheManager] execution_stream: <torch.cuda.Stream device=cuda:0 cuda_stream=0x18305f50>
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=131083], tokens per block=32, primary blocks=109537, secondary blocks=0, max sequence length=131083
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=131083], tokens per block=32, primary blocks=109537, secondary blocks=0, max sequence length=131083
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 106.97 GiB for max tokens in paged KV cache (3505184).
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 106.97 GiB for max tokens in paged KV cache (3505184).
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [W] Both free_gpu_memory_fraction and max_tokens are set (to 0.7 and 3505206 with free memory 51.38641357421875GiB of total memory 178.35107421875GiB, respectively). The smaller value will be used.
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] [KVCacheManager] execution_stream: <torch.cuda.Stream device=cuda:0 cuda_stream=0x18305f50>
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=131083], tokens per block=32, primary blocks=109537, secondary blocks=0, max sequence length=131083
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 13.37 GiB for max tokens in paged KV cache (3505184).
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=131083], tokens per block=32, primary blocks=109537, secondary blocks=0, max sequence length=131083
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 13.37 GiB for max tokens in paged KV cache (3505184).
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 34169856 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 34169856 bytes to 86507520 bytes
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] max_seq_len=131083, max_num_requests=2048, max_num_tokens=8192, max_batch_size=2048
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] cache_transceiver is disabled
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] [PyExecutor] execution_stream initialized: <torch.cuda.Stream device=cuda:0 cuda_stream=0x18305f50>.
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Running autotuner warmup...
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 34169856 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 34169856 bytes to 86507520 bytes
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] [Autotuner] Cache size after warmup is 14
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Creating CUDA graph instances for 34 batch sizes.
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=128, draft_len=4, max_seq_len=131072
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 86507520 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 86507520 bytes
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=64, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=32, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=31, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=30, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=29, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=28, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=27, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=26, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=25, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=24, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=23, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=22, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=21, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=20, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=19, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=18, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=17, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=16, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=15, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=14, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=13, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=12, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=11, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=10, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=9, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=8, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=7, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=6, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=5, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=4, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=3, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=2, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=1, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [W] `torch.isnan` or `torch.isinf` is not implemented for current kv cache dtype, related checks are skipped
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [I] global_steady_clock_offset at each rank: [0.0, -3.00002284348011e-06]
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [I] Setting global_steady_clock_offset: 0.0 seconds for rank 0
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [I] Setting PyTorch memory fraction to 0.3028266618847242 (54.00946044921875 GiB)
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [I] LLM Args:
model='meta-llama/Llama-3.1-8B-Instruct' tokenizer=None tokenizer_mode='auto' custom_tokenizer=None skip_tokenizer_init=False trust_remote_code=True tensor_parallel_size=2 dtype='auto' revision=None tokenizer_revision=None model_kwargs=None pipeline_parallel_size=1 context_parallel_size=1 gpus_per_node=2 moe_cluster_parallel_size=-1 moe_tensor_parallel_size=-1 moe_expert_parallel_size=-1 enable_attention_dp=False enable_lm_head_tp_in_adp=False pp_partition=None cp_config={} load_format=<LoadFormat.AUTO: 0> enable_lora=False lora_config=None kv_cache_config=KvCacheConfig(enable_block_reuse=False, max_tokens=3505206, max_attention_window=None, sink_token_length=None, free_gpu_memory_fraction=0.7, host_cache_size=None, onboard_blocks=True, cross_kv_cache_fraction=None, secondary_offload_min_priority=None, event_buffer_max_size=0, attention_dp_events_gather_period_ms=5, enable_partial_reuse=True, copy_on_partial_reuse=True, use_uvm=False, max_gpu_total_bytes=118447954329, dtype='fp8', mamba_ssm_cache_dtype='auto', tokens_per_block=32, use_kv_cache_manager_v2=False, max_util_for_resume=0.95) enable_chunked_prefill=False guided_decoding_backend=None batched_logits_processor=None iter_stats_max_iterations=None request_stats_max_iterations=None peft_cache_config=None scheduler_config=SchedulerConfig(capacity_scheduler_policy=<CapacitySchedulerPolicy.GUARANTEED_NO_EVICT: 'GUARANTEED_NO_EVICT'>, context_chunking_policy=None, dynamic_batch_config=None, waiting_queue_policy=<WaitingQueuePolicy.FCFS: 'fcfs'>) cache_transceiver_config=None sparse_attention_config=None speculative_config=Eagle3DecodingConfig(max_draft_len=4, max_total_draft_tokens=4, speculative_model=PosixPath('/root/.cache/huggingface/hub/models--yuhuili--EAGLE3-LLaMA3.1-Instruct-8B/snapshots/ada412b672e293d682423de84a095447bf38a637'), max_concurrency=None, draft_len_schedule=None, load_format=None, acceptance_window=None, acceptance_length_threshold=None, allow_advanced_sampling=False, eagle_choices=None, greedy_sampling=True, posterior_threshold=None, use_dynamic_tree=False, dynamic_tree_max_topK=None, num_eagle_layers=4, max_non_leaves_per_layer=None, eagle3_one_model=True, eagle3_layers_to_capture=None, eagle3_model_arch='llama3') max_batch_size=2048 max_input_len=1024 max_seq_len=None max_beam_width=1 max_num_tokens=8192 gather_generation_logits=False num_postprocess_workers=0 postprocess_tokenizer_dir='meta-llama/Llama-3.1-8B-Instruct' reasoning_parser=None decoding_config=None mpi_session=None otlp_traces_endpoint=None backend='pytorch' return_perf_metrics=False perf_metrics_max_requests=0 orchestrator_type=None env_overrides=None garbage_collection_gen0_threshold=20000 cuda_graph_config=CudaGraphConfig(batch_sizes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 64, 128], max_batch_size=128, enable_padding=False) attention_dp_config=None disable_overlap_scheduler=False moe_config=MoeConfig(backend='AUTO', max_num_tokens=None, load_balancer=None, disable_finalize_fusion=False, use_low_precision_moe_combine=False) nvfp4_gemm_config=Nvfp4GemmConfig(allowed_backends=['cutlass', 'cublaslt', 'cuda_core']) attn_backend='TRTLLM' sampler_type=<SamplerType.auto: 'auto'> sampler_force_async_worker=False enable_iter_perf_stats=False enable_iter_req_stats=False print_iter_log=False batch_wait_timeout_ms=0 batch_wait_timeout_iters=0 batch_wait_max_tokens_ratio=0 torch_compile_config=None enable_autotuner=True enable_layerwise_nvtx_marker=False enable_min_latency=False 
stream_interval=1 force_dynamic_quantization=False allreduce_strategy='AUTO' checkpoint_loader=None checkpoint_format='HF' kv_connector_config=None mm_encoder_only=False ray_worker_extension_cls=None ray_placement_config=None enable_sleep=False use_cute_dsl_blockscaling_mm=False use_cute_dsl_blockscaling_bmm=False disable_flashinfer_sampling=False max_stats_len=1000 layer_wise_benchmarks_config=LayerwiseBenchmarksConfig(calibration_mode='NONE', calibration_file_path=None, calibration_layer_indices=None)
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [I] RPCServer is bound to ipc:///tmp/rpc_test_85a24b81-0bf2-436d-ac92-6749282081d9
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [I] RPC Server started and listening on ipc:///tmp/rpc_test_85a24b81-0bf2-436d-ac92-6749282081d9
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [I] RPC Server has started.
[03/16/2026-22:16:19] [TRT-LLM] [I] get signal from executor worker
INFO: Started server process [782]
INFO: Waiting for application startup.
INFO: Application startup complete.
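With startup complete, a quick smoke test confirms the Eagle3-accelerated server answers requests; a minimal sketch, assuming the standard OpenAI-compatible /v1/chat/completions route served by trtllm-serve and that the served model name matches the HF id passed on the command line:

# Assumed smoke test against the OpenAI-compatible endpoint started above.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'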