Eagle3 speculative decoding with multiple MPI processes: OK in the nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5post1 container.

Startup log from trtllm-serve serving meta-llama/Llama-3.1-8B-Instruct with the yuhuili/EAGLE3-LLaMA3.1-Instruct-8B draft model at tensor parallel size 2 (two MPI worker ranks); the server comes up cleanly.
/app/tensorrt_llm# cat > config.yaml << 'EOF'
speculative_config:
  decoding_type: Eagle3
  max_draft_len: 4
  speculative_model: yuhuili/EAGLE3-LLaMA3.1-Instruct-8B
kv_cache_config:
  free_gpu_memory_fraction: 0.70
  dtype: fp8
  enable_block_reuse: false
trust_remote_code: true
EOF
trtllm-serve meta-llama/Llama-3.1-8B-Instruct --tp_size 2 --config config.yaml --port 8080
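For reference, the same settings can be expressed through the TensorRT-LLM Python LLM API instead of a YAML file. The sketch below is illustrative only: the class and field names (Eagle3DecodingConfig, KvCacheConfig) are copied from the parsed-config dump further down in this log, but the exact import paths and constructor signatures in 1.3.0rc5.post1 are assumptions, not verified against this release.

# Hypothetical LLM API equivalent of config.yaml above. Class/field names
# mirror the Eagle3DecodingConfig/KvCacheConfig reprs printed later in this
# log; the import paths are assumptions for this release.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import Eagle3DecodingConfig, KvCacheConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    trust_remote_code=True,
    speculative_config=Eagle3DecodingConfig(
        max_draft_len=4,
        speculative_model="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
    ),
    kv_cache_config=KvCacheConfig(
        free_gpu_memory_fraction=0.70,
        dtype="fp8",
        enable_block_reuse=False,
    ),
)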
/usr/local/lib/python3.12/dist-packages/torch/library.py:356: UserWarning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: flash_attn::_flash_attn_backward(Tensor dout, Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor(a6!)? dq, Tensor(a7!)? dk, Tensor(a8!)? dv, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left, SymInt window_size_right, float softcap, Tensor? alibi_slopes, bool deterministic, Tensor? rng_state=None) -> Tensor
registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922
dispatch key: ADInplaceOrView
previous kernel: no debug info
new kernel: registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922 (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.)
self.m.impl(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
[TensorRT-LLM] TensorRT LLM version: 1.3.0rc5.post1
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/serve/openai_protocol.py:108: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
class ResponseFormat(OpenAIBaseModel):
[2026-03-16 22:15:48] INFO utils.py:35: Downloading model_index.json from HF Hub for meta-llama/Llama-3.1-8B-Instruct...
[2026-03-16 22:15:48] INFO utils.py:43: Downloaded to /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659
[03/16/2026-22:15:48] [TRT-LLM] [W] Overriding speculative_config
[03/16/2026-22:15:48] [TRT-LLM] [W] Overriding kv_cache_config
[03/16/2026-22:15:48] [TRT-LLM] [I] Using LLM with PyTorch backend
[03/16/2026-22:15:48] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
[03/16/2026-22:15:48] [TRT-LLM] [I] start MpiSession with 2 workers
/usr/local/lib/python3.12/dist-packages/torch/library.py:356: UserWarning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: flash_attn::_flash_attn_backward(Tensor dout, Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor(a6!)? dq, Tensor(a7!)? dk, Tensor(a8!)? dv, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left, SymInt window_size_right, float softcap, Tensor? alibi_slopes, bool deterministic, Tensor? rng_state=None) -> Tensor
registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922
dispatch key: ADInplaceOrView
previous kernel: no debug info
new kernel: registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922 (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.)
self.m.impl(
/usr/local/lib/python3.12/dist-packages/torch/library.py:356: UserWarning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: flash_attn::_flash_attn_backward(Tensor dout, Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor(a6!)? dq, Tensor(a7!)? dk, Tensor(a8!)? dv, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left, SymInt window_size_right, float softcap, Tensor? alibi_slopes, bool deterministic, Tensor? rng_state=None) -> Tensor
registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922
dispatch key: ADInplaceOrView
previous kernel: no debug info
new kernel: registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922 (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.)
self.m.impl(
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package modelopt. Picked distribution: nvidia-modelopt
Multiple distributions found for package modelopt. Picked distribution: nvidia-modelopt
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
[TensorRT-LLM] TensorRT LLM version: 1.3.0rc5.post1
[TensorRT-LLM] TensorRT LLM version: 1.3.0rc5.post1
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/serve/openai_protocol.py:108: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
class ResponseFormat(OpenAIBaseModel):
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/serve/openai_protocol.py:108: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
class ResponseFormat(OpenAIBaseModel):
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
`torch_dtype` is deprecated! Use `dtype` instead!
[03/16/2026-22:15:59] [TRT-LLM] [W] Orchestrator is creating IPC executor
rank 0 using MpiPoolSession to spawn MPI processes
[03/16/2026-22:15:59] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[03/16/2026-22:15:59] [TRT-LLM] [I] Generating a new HMAC key for server worker_init_status_queue
[03/16/2026-22:15:59] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [W] Worker process 944 is affined to run on the following CPUs: [2, 194] (subset of all logical CPUs). This may harm performance if set incorrectly.
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [W] Worker process 944 has constrained CPU affinity but `TLLM_NUMA_AWARE_WORKER_AFFINITY` is not set. Removing CPU affinity constraints.
`torch_dtype` is deprecated! Use `dtype` instead!
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [W] Falling back to greedy decoding for Eagle3. If you want to use non-greedy sampling, please set allow_advanced_sampling=True.
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [I] ATTENTION RUNTIME FEATURES: AttentionRuntimeFeatures(chunked_prefill=False, cache_reuse=False, has_speculative_draft_tokens=False, chunk_size=8192, chunked_prefill_buffer_batch_size=4)
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [I] Validating KV Cache config against kv_cache_dtype="fp8"
`torch_dtype` is deprecated! Use `dtype` instead!
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [I] Use 7.92 GB for model weights.
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [I] Prefetching 14.96GB checkpoint files.
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [I] Prefetching /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00003-of-00004.safetensors to memory...
[03/16/2026-22:16:00] [TRT-LLM] [RANK 0] [I] Prefetching /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00002-of-00004.safetensors to memory...
[03/16/2026-22:16:04] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00003-of-00004.safetensors.
[03/16/2026-22:16:04] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00002-of-00004.safetensors.
Loading safetensors weights in parallel: 0%| | 0/4 [00:00<?, ?it/s][03/16/2026-22:16:04] [TRT-LLM] [RANK 0] [I] Start to load safetensor file /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00002-of-00004.safetensors
[03/16/2026-22:16:04] [TRT-LLM] [RANK 0] [I] Start to load safetensor file /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00001-of-00004.safetensors
[03/16/2026-22:16:04] [TRT-LLM] [RANK 0] [I] Start to load safetensor file /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00003-of-00004.safetensors
[03/16/2026-22:16:04] [TRT-LLM] [RANK 0] [I] Start to load safetensor file /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00004-of-00004.safetensors
Loading safetensors weights in parallel: 100%|██████████| 4/4 [00:00<00:00, 318.82it/s]
Loading safetensors weights in parallel: 100%|██████████| 4/4 [00:00<00:00, 301.69it/s]
Loading weights concurrently: 100%|██████████| 709/709 [00:01<00:00, 517.67it/s]
Loading bin weights in parallel: 100%|██████████| 1/1 [00:00<00:00, 249.65it/s]
Loading weights concurrently: 100%|██████████| 709/709 [00:01<00:00, 499.12it/s]
Loading bin weights in parallel: 100%|██████████| 1/1 [00:00<00:00, 339.34it/s]
Loading weights concurrently: 100%|██████████| 28/28 [00:00<00:00, 328.35it/s]
Model init total -- 5.05s
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=131083], tokens per block=32, primary blocks=4223, secondary blocks=0, max sequence length=131083
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.12 GiB for max tokens in paged KV cache (135136).
Loading weights concurrently: 100%|██████████| 28/28 [00:00<00:00, 334.54it/s]
Model init total -- 5.15s
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] max_seq_len is not specified, using inferred value 131072
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=131083], tokens per block=32, primary blocks=4223, secondary blocks=0, max sequence length=131083
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.52 GiB for max tokens in paged KV cache (135136).
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] Using Sampler: Eagle3OneModelSampler
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] [create_py_executor] Created execution_stream: <torch.cuda.Stream device=cuda:0 cuda_stream=0x18305f50>
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] [kv cache manager] Primary/secondary blocks for window sizes set to {131083: (4223, 0)} for estimation dry run
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] [KVCacheManager] execution_stream: <torch.cuda.Stream device=cuda:0 cuda_stream=0x18305f50>
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=131083], tokens per block=32, primary blocks=4223, secondary blocks=0, max sequence length=131083
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.12 GiB for max tokens in paged KV cache (135136).
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] [kv cache manager] Primary/secondary blocks for window sizes set to {131083: (4223, 0)} for estimation dry run
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] [KVCacheManager] execution_stream: <torch.cuda.Stream device=cuda:0 cuda_stream=0x18305f50>
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=131083], tokens per block=32, primary blocks=4223, secondary blocks=0, max sequence length=131083
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.52 GiB for max tokens in paged KV cache (135136).
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] max_seq_len=131083, max_num_requests=2048, max_num_tokens=8192, max_batch_size=2048
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] cache_transceiver is disabled
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] [PyExecutor] execution_stream initialized: <torch.cuda.Stream device=cuda:0 cuda_stream=0x18305f50>.
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] Running autotuner warmup...
[03/16/2026-22:16:05] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[TensorRT-LLM][INFO] Successfully loaded NCCL library: libnccl.so
[TensorRT-LLM][INFO] Successfully loaded NCCL library: libnccl.so
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[03/16/2026-22:16:07] [TRT-LLM] [RANK 0] [I] PDL enabled
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 34169856 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 34169856 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 34169856 bytes to 86507520 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 34169856 bytes to 86507520 bytes
[03/16/2026-22:16:09] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[03/16/2026-22:16:09] [TRT-LLM] [RANK 0] [I] [Autotuner] Cache size after warmup is 14
[03/16/2026-22:16:09] [TRT-LLM] [RANK 0] [I] Creating CUDA graph instances for 34 batch sizes.
[03/16/2026-22:16:09] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=128, draft_len=4, max_seq_len=131072
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 86507520 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 86507520 bytes
[03/16/2026-22:16:09] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=64, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:09] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=32, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:09] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=31, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:10] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=30, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:10] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=29, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:10] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=28, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:10] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=27, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:10] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=26, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:10] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=25, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:11] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=24, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:11] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=23, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:11] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=22, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:11] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=21, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:11] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=20, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:11] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=19, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:12] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=18, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:12] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=17, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:12] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=16, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:12] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=15, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:12] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=14, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:12] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=13, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:12] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=12, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:13] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=11, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:13] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=10, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:13] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=9, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:13] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=8, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:13] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=7, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:13] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=6, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=5, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=4, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=3, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=2, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=1, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [W] `torch.isnan` or `torch.isinf` is not implemented for current kv cache dtype, related checks are skipped
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] global_steady_clock_offset at each rank: [0.0, 8.00006091594696e-06]
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] Setting global_steady_clock_offset: 0.0 seconds for rank 0
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] Memory used after loading model weights (inside torch) in memory usage profiling: 15.04 GiB
[03/16/2026-22:16:14] [TRT-LLM] [RANK 0] [I] Memory used after loading model weights (outside torch) in memory usage profiling: 9.53 GiB
[TensorRT-LLM][WARNING] [kv cache manager] storeContextBlocks: Can not find sequence for request 2048
[TensorRT-LLM][WARNING] [kv cache manager] storeContextBlocks: Can not find sequence for request 2048
[TensorRT-LLM][WARNING] [kv cache manager] storeContextBlocks: Can not find sequence for request 2048
[TensorRT-LLM][WARNING] [kv cache manager] storeContextBlocks: Can not find sequence for request 2048
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Memory dynamically allocated during inference (inside torch) in memory usage profiling: 0.77 GiB
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Memory used outside torch (e.g., NCCL and CUDA graphs) in memory usage profiling: 9.59 GiB
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Peak memory during memory usage profiling (torch + non-torch): 25.40 GiB, available KV cache memory when calculating max tokens: 110.31 GiB, fraction is set 0.7, kv size is 33792. device total memory 178.35 GiB, , tmp kv_mem 4.64 GiB
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Estimated max memory in KV cache : 110.31 GiB
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [W] Both free_gpu_memory_fraction and max_tokens are set (to 0.7 and 3505206 with free memory 158.35711669921875GiB of total memory 178.35107421875GiB, respectively). The smaller value will be used.
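The max_tokens figure in that warning ties out against the profiling line above and the LLM args dump further down; a quick arithmetic check:

# KV cache sizing arithmetic, reproduced from figures elsewhere in this log.
kv_budget_bytes = 118_447_954_329  # max_gpu_total_bytes (~110.31 GiB) from the LLM args dump
kv_bytes_per_token = 33_792        # "kv size is 33792" in the profiling line
tokens_per_block = 32              # "Number of tokens per block: 32"

max_tokens = kv_budget_bytes // kv_bytes_per_token
assert max_tokens == 3_505_206     # matches "max_tokens are set (to 0.7 and 3505206 ...)"

primary_blocks = max_tokens // tokens_per_block
assert primary_blocks == 109_537   # matches "primary blocks=109537" below
assert primary_blocks * tokens_per_block == 3_505_184  # matches the "(3505184)" allocations below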
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] [KVCacheManager] execution_stream: <torch.cuda.Stream device=cuda:0 cuda_stream=0x18305f50>
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=131083], tokens per block=32, primary blocks=109537, secondary blocks=0, max sequence length=131083
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=131083], tokens per block=32, primary blocks=109537, secondary blocks=0, max sequence length=131083
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 106.97 GiB for max tokens in paged KV cache (3505184).
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 106.97 GiB for max tokens in paged KV cache (3505184).
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [W] Both free_gpu_memory_fraction and max_tokens are set (to 0.7 and 3505206 with free memory 51.38641357421875GiB of total memory 178.35107421875GiB, respectively). The smaller value will be used.
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] [KVCacheManager] execution_stream: <torch.cuda.Stream device=cuda:0 cuda_stream=0x18305f50>
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=131083], tokens per block=32, primary blocks=109537, secondary blocks=0, max sequence length=131083
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 13.37 GiB for max tokens in paged KV cache (3505184).
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=131083], tokens per block=32, primary blocks=109537, secondary blocks=0, max sequence length=131083
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 13.37 GiB for max tokens in paged KV cache (3505184).
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 34169856 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 34169856 bytes to 86507520 bytes
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] max_seq_len=131083, max_num_requests=2048, max_num_tokens=8192, max_batch_size=2048
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] cache_transceiver is disabled
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] [PyExecutor] execution_stream initialized: <torch.cuda.Stream device=cuda:0 cuda_stream=0x18305f50>.
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Running autotuner warmup...
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 34169856 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 34169856 bytes to 86507520 bytes
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] [Autotuner] Cache size after warmup is 14
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Creating CUDA graph instances for 34 batch sizes.
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=128, draft_len=4, max_seq_len=131072
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 86507520 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 86507520 bytes
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=64, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:15] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=32, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=31, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=30, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=29, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=28, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=27, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=26, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=25, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=24, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:16] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=23, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=22, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=21, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=20, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=19, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=18, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=17, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=16, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=15, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=14, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:17] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=13, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=12, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=11, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=10, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=9, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=8, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=7, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=6, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=5, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=4, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=3, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:18] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=2, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [I] Run generation-only CUDA graph warmup for batch size=1, draft_len=4, max_seq_len=131072
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [W] `torch.isnan` or `torch.isinf` is not implemented for current kv cache dtype, related checks are skipped
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [I] global_steady_clock_offset at each rank: [0.0, -3.00002284348011e-06]
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [I] Setting global_steady_clock_offset: 0.0 seconds for rank 0
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [I] Setting PyTorch memory fraction to 0.3028266618847242 (54.00946044921875 GiB)
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [I] LLM Args:
model='meta-llama/Llama-3.1-8B-Instruct' tokenizer=None tokenizer_mode='auto' custom_tokenizer=None skip_tokenizer_init=False trust_remote_code=True tensor_parallel_size=2 dtype='auto' revision=None tokenizer_revision=None model_kwargs=None pipeline_parallel_size=1 context_parallel_size=1 gpus_per_node=2 moe_cluster_parallel_size=-1 moe_tensor_parallel_size=-1 moe_expert_parallel_size=-1 enable_attention_dp=False enable_lm_head_tp_in_adp=False pp_partition=None cp_config={} load_format=<LoadFormat.AUTO: 0> enable_lora=False lora_config=None kv_cache_config=KvCacheConfig(enable_block_reuse=False, max_tokens=3505206, max_attention_window=None, sink_token_length=None, free_gpu_memory_fraction=0.7, host_cache_size=None, onboard_blocks=True, cross_kv_cache_fraction=None, secondary_offload_min_priority=None, event_buffer_max_size=0, attention_dp_events_gather_period_ms=5, enable_partial_reuse=True, copy_on_partial_reuse=True, use_uvm=False, max_gpu_total_bytes=118447954329, dtype='fp8', mamba_ssm_cache_dtype='auto', tokens_per_block=32, use_kv_cache_manager_v2=False, max_util_for_resume=0.95) enable_chunked_prefill=False guided_decoding_backend=None batched_logits_processor=None iter_stats_max_iterations=None request_stats_max_iterations=None peft_cache_config=None scheduler_config=SchedulerConfig(capacity_scheduler_policy=<CapacitySchedulerPolicy.GUARANTEED_NO_EVICT: 'GUARANTEED_NO_EVICT'>, context_chunking_policy=None, dynamic_batch_config=None, waiting_queue_policy=<WaitingQueuePolicy.FCFS: 'fcfs'>) cache_transceiver_config=None sparse_attention_config=None speculative_config=Eagle3DecodingConfig(max_draft_len=4, max_total_draft_tokens=4, speculative_model=PosixPath('/root/.cache/huggingface/hub/models--yuhuili--EAGLE3-LLaMA3.1-Instruct-8B/snapshots/ada412b672e293d682423de84a095447bf38a637'), max_concurrency=None, draft_len_schedule=None, load_format=None, acceptance_window=None, acceptance_length_threshold=None, allow_advanced_sampling=False, eagle_choices=None, greedy_sampling=True, posterior_threshold=None, use_dynamic_tree=False, dynamic_tree_max_topK=None, num_eagle_layers=4, max_non_leaves_per_layer=None, eagle3_one_model=True, eagle3_layers_to_capture=None, eagle3_model_arch='llama3') max_batch_size=2048 max_input_len=1024 max_seq_len=None max_beam_width=1 max_num_tokens=8192 gather_generation_logits=False num_postprocess_workers=0 postprocess_tokenizer_dir='meta-llama/Llama-3.1-8B-Instruct' reasoning_parser=None decoding_config=None mpi_session=None otlp_traces_endpoint=None backend='pytorch' return_perf_metrics=False perf_metrics_max_requests=0 orchestrator_type=None env_overrides=None garbage_collection_gen0_threshold=20000 cuda_graph_config=CudaGraphConfig(batch_sizes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 64, 128], max_batch_size=128, enable_padding=False) attention_dp_config=None disable_overlap_scheduler=False moe_config=MoeConfig(backend='AUTO', max_num_tokens=None, load_balancer=None, disable_finalize_fusion=False, use_low_precision_moe_combine=False) nvfp4_gemm_config=Nvfp4GemmConfig(allowed_backends=['cutlass', 'cublaslt', 'cuda_core']) attn_backend='TRTLLM' sampler_type=<SamplerType.auto: 'auto'> sampler_force_async_worker=False enable_iter_perf_stats=False enable_iter_req_stats=False print_iter_log=False batch_wait_timeout_ms=0 batch_wait_timeout_iters=0 batch_wait_max_tokens_ratio=0 torch_compile_config=None enable_autotuner=True enable_layerwise_nvtx_marker=False enable_min_latency=False stream_interval=1 force_dynamic_quantization=False allreduce_strategy='AUTO' checkpoint_loader=None checkpoint_format='HF' kv_connector_config=None mm_encoder_only=False ray_worker_extension_cls=None ray_placement_config=None enable_sleep=False use_cute_dsl_blockscaling_mm=False use_cute_dsl_blockscaling_bmm=False disable_flashinfer_sampling=False max_stats_len=1000 layer_wise_benchmarks_config=LayerwiseBenchmarksConfig(calibration_mode='NONE', calibration_file_path=None, calibration_layer_indices=None)
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [I] RPCServer is bound to ipc:///tmp/rpc_test_85a24b81-0bf2-436d-ac92-6749282081d9
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [I] RPC Server started and listening on ipc:///tmp/rpc_test_85a24b81-0bf2-436d-ac92-6749282081d9
[03/16/2026-22:16:19] [TRT-LLM] [RANK 0] [I] RPC Server has started.
[03/16/2026-22:16:19] [TRT-LLM] [I] get signal from executor worker
INFO: Started server process [782]
INFO: Waiting for application startup.
INFO: Application startup complete.
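Once "Application startup complete." is logged, the server accepts OpenAI-compatible requests on port 8080. A minimal smoke test follows; the route is the standard OpenAI-compatible chat completions path that trtllm-serve exposes, while the prompt and sampling parameters are illustrative only.

# Smoke test against the OpenAI-compatible endpoint served on port 8080.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
    timeout=60,
)
resp.raise_for_status()
# Print the generated reply; draft-token acceptance happens server-side,
# so the response shape is the same as without speculative decoding.
print(resp.json()["choices"][0]["message"]["content"])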