Defaulted container "ray-worker" out of: ray-worker, wait-gcs-ready (init)
INFO 03-07 09:56:00 __init__.py:183] Automatically detected platform cuda.
INFO 03-07 09:56:01 api_server.py:838] vLLM API server version 0.7.1
INFO 03-07 09:56:01 api_server.py:839] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='facebook/opt-125m', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 03-07 09:56:01 api_server.py:204] Started engine process with PID 37
INFO 03-07 09:56:04 __init__.py:183] Automatically detected platform cuda.
INFO 03-07 09:56:09 config.py:526] This model supports multiple tasks: {'score', 'generate', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 03-07 09:56:13 config.py:526] This model supports multiple tasks: {'embed', 'generate', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
INFO 03-07 09:56:13 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 03-07 09:56:15 cuda.py:235] Using Flash Attention backend.
INFO 03-07 09:56:15 model_runner.py:1111] Starting to load model facebook/opt-125m...
INFO 03-07 09:56:16 weight_utils.py:251] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 8.95it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 8.93it/s]
INFO 03-07 09:56:17 model_runner.py:1116] Loading model weights took 0.2389 GB
INFO 03-07 09:56:18 worker.py:266] Memory profiling takes 0.36 seconds
INFO 03-07 09:56:18 worker.py:266] the current vLLM instance can use total_gpu_memory (79.14GiB) x gpu_memory_utilization (0.90) = 71.22GiB
INFO 03-07 09:56:18 worker.py:266] model weights take 0.24GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 0.47GiB; the rest of the memory reserved for KV Cache is 70.43GiB.
INFO 03-07 09:56:18 executor_base.py:108] # CUDA blocks: 128204, # CPU blocks: 7281
INFO 03-07 09:56:18 executor_base.py:113] Maximum concurrency for 2048 tokens per request: 1001.59x
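The block and concurrency figures above can be reproduced by hand. A minimal sketch in Python, assuming vLLM's default CUDA block_size of 16 tokens and the facebook/opt-125m geometry (12 layers, hidden size 768, float16 KV entries); neither assumption is printed in the log:

GiB = 1024 ** 3

# Figures reported by the memory profiler above (all in GiB).
total_gpu_memory = 79.14
gpu_memory_utilization = 0.90
weights, non_torch, activation = 0.24, 0.09, 0.47

usable = total_gpu_memory * gpu_memory_utilization       # ~71.22 GiB, as logged
kv_cache = usable - weights - non_torch - activation     # ~70.43 GiB, as logged

# Assumed: block_size=16 and OPT-125m geometry (12 layers, hidden 768),
# with K and V stored per token at 2 bytes per float16 element.
bytes_per_block = 2 * 12 * 16 * 768 * 2                  # 589,824 bytes per KV block
num_gpu_blocks = int(kv_cache * GiB / bytes_per_block)   # ~128,2xx; the log reports 128204

# Maximum concurrency for 2048-token requests, matching the 1001.59x above.
print(num_gpu_blocks, 128204 * 16 / 2048)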
INFO 03-07 09:56:21 model_runner.py:1435] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:10<00:00, 3.42it/s]
INFO 03-07 09:56:32 model_runner.py:1563] Graph capturing finished in 10 secs, took 0.12 GiB
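The "35/35" shapes correspond to the cudagraph_capture_sizes list printed in the engine config above. A quick check, using only what that config line shows:

# Capture sizes from the compilation_config: 256 down to 8 in steps of 8, plus 4, 2, 1.
capture_sizes = list(range(256, 7, -8)) + [4, 2, 1]
print(len(capture_sizes))  # 35, matching "Capturing CUDA graph shapes: 35/35"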
INFO 03-07 09:56:32 llm_engine.py:429] init engine (profile, create kv cache, warmup model) took 14.51 seconds
INFO 03-07 09:56:33 api_server.py:754] Using supplied chat template:
INFO 03-07 09:56:33 api_server.py:754] None
INFO 03-07 09:56:33 launcher.py:19] Available routes are:
INFO 03-07 09:56:33 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 03-07 09:56:33 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 03-07 09:56:33 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 03-07 09:56:33 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 03-07 09:56:33 launcher.py:27] Route: /health, Methods: GET
INFO 03-07 09:56:33 launcher.py:27] Route: /ping, Methods: POST, GET
INFO 03-07 09:56:33 launcher.py:27] Route: /tokenize, Methods: POST
INFO 03-07 09:56:33 launcher.py:27] Route: /detokenize, Methods: POST
INFO 03-07 09:56:33 launcher.py:27] Route: /v1/models, Methods: GET
INFO 03-07 09:56:33 launcher.py:27] Route: /version, Methods: GET
INFO 03-07 09:56:33 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 03-07 09:56:33 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 03-07 09:56:33 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 03-07 09:56:33 launcher.py:27] Route: /pooling, Methods: POST
INFO 03-07 09:56:33 launcher.py:27] Route: /score, Methods: POST
INFO 03-07 09:56:33 launcher.py:27] Route: /v1/score, Methods: POST
INFO 03-07 09:56:33 launcher.py:27] Route: /rerank, Methods: POST
INFO 03-07 09:56:33 launcher.py:27] Route: /v1/rerank, Methods: POST
INFO 03-07 09:56:33 launcher.py:27] Route: /v2/rerank, Methods: POST
INFO 03-07 09:56:33 launcher.py:27] Route: /invocations, Methods: POST
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
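Once Uvicorn reports startup complete, the routes listed above expose an OpenAI-compatible API on port 8000. A minimal client sketch, assuming the pod's port 8000 is reachable from localhost (e.g. via kubectl port-forward) and the openai Python package is installed:

from openai import OpenAI

# The client requires an api_key string, but the server was started without --api-key,
# so any placeholder value works here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="facebook/opt-125m",   # served_model_name from the engine config above
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)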