DeepSeek 对 Red Hat AI Inference Server模型不适应GPU的能力的诊断
您当前尝试的模型需要较新的GPU架构。最简单的解决方法是更换为。
经测试,换模型就可以解决问题。DeepSeek判断准确。
问题
比之前好像有进步,但是还是有错误:
[root@rhel-9 work]# podman run --rm -it \
--device nvidia.com/gpu=all \
--security-opt=label=disable \
--shm-size=4g -p 8000:8000 \
--group-add=video --group-add=render \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
--env VLLM_LOGGING_LEVEL=DEBUG \
-v /rhaiis-cache:/opt/app-root/src/.cache:Z \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
--tensor-parallel-size 1
DEBUG 03-03 08:15:26 [plugins/__init__.py:35] No plugins for group vllm.platform_plugins found.
DEBUG 03-03 08:15:26 [platforms/__init__.py:36] Checking if TPU platform is available.
DEBUG 03-03 08:15:26 [platforms/__init__.py:55] TPU platform is not available because: No module named 'libtpu'
DEBUG 03-03 08:15:26 [platforms/__init__.py:61] Checking if CUDA platform is available.
DEBUG 03-03 08:15:26 [platforms/__init__.py:84] Confirmed CUDA platform is available.
DEBUG 03-03 08:15:26 [platforms/__init__.py:112] Checking if ROCm platform is available.
DEBUG 03-03 08:15:26 [platforms/__init__.py:126] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 03-03 08:15:26 [platforms/__init__.py:133] Checking if XPU platform is available.
DEBUG 03-03 08:15:26 [platforms/__init__.py:153] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 03-03 08:15:26 [platforms/__init__.py:160] Checking if CPU platform is available.
DEBUG 03-03 08:15:26 [platforms/__init__.py:61] Checking if CUDA platform is available.
DEBUG 03-03 08:15:26 [platforms/__init__.py:84] Confirmed CUDA platform is available.
DEBUG 03-03 08:15:26 [platforms/__init__.py:228] Automatically detected platform cuda.
DEBUG 03-03 08:15:30 [plugins/__init__.py:43] Available plugins for group vllm.general_plugins:
DEBUG 03-03 08:15:30 [plugins/__init__.py:45] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 03-03 08:15:30 [plugins/__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
(APIServer pid=1) INFO 03-03 08:15:30 [entrypoints/openai/api_server.py:1346] vLLM API server version 0.13.0+rhai11
(APIServer pid=1) INFO 03-03 08:15:30 [entrypoints/utils.py:265] non-default args: {'model': 'RedHatAI/Llama-3.2-1B-Instruct-FP8'}
config.json: 2.04kB [00:00, 8.22MB/s]
(APIServer pid=1) DEBUG 03-03 08:15:32 [model_executor/models/registry.py:633] Cached model info file for class vllm.model_executor.models.llama.LlamaForCausalLM not found
(APIServer pid=1) DEBUG 03-03 08:15:32 [model_executor/models/registry.py:693] Cache model info for class vllm.model_executor.models.llama.LlamaForCausalLM miss. Loading model instead.
(APIServer pid=1) DEBUG 03-03 08:15:41 [model_executor/models/registry.py:703] Loaded model info for class vllm.model_executor.models.llama.LlamaForCausalLM
(APIServer pid=1) DEBUG 03-03 08:15:41 [utils/import_utils.py:85] Loading module triton_kernels from /opt/app-root/lib64/python3.12/site-packages/vllm/third_party/triton_kernels/__init__.py.
(APIServer pid=1) DEBUG 03-03 08:15:41 [logging_utils/log_time.py:29] Registry inspect model class: Elapsed time 9.0211376 secs
(APIServer pid=1) INFO 03-03 08:15:41 [config/model.py:514] Resolved architecture: LlamaForCausalLM
(APIServer pid=1) WARNING 03-03 08:15:41 [config/model.py:1955] Your device 'Tesla T4' (with compute capability 7.5) doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility.
(APIServer pid=1) WARNING 03-03 08:15:41 [config/model.py:2005] Casting torch.bfloat16 to torch.float16.
(APIServer pid=1) INFO 03-03 08:15:41 [config/model.py:1661] Using max model len 131072
(APIServer pid=1) DEBUG 03-03 08:15:41 [_ipex_ops.py:15] Import error msg: No module named 'intel_extension_for_pytorch'
(APIServer pid=1) DEBUG 03-03 08:15:41 [config/model.py:1718] Generative models support chunked prefill.
(APIServer pid=1) DEBUG 03-03 08:15:41 [config/model.py:1770] Generative models support prefix caching.
(APIServer pid=1) DEBUG 03-03 08:15:41 [engine/arg_utils.py:1860] Enabling chunked prefill by default
(APIServer pid=1) DEBUG 03-03 08:15:41 [engine/arg_utils.py:1890] Enabling prefix caching by default
(APIServer pid=1) DEBUG 03-03 08:15:41 [engine/arg_utils.py:1968] Defaulting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
(APIServer pid=1) DEBUG 03-03 08:15:41 [engine/arg_utils.py:1978] Defaulting max_num_seqs to 256 for OPENAI_API_SERVER usage context.
(APIServer pid=1) INFO 03-03 08:15:41 [config/scheduler.py:230] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) DEBUG 03-03 08:15:41 [plugins/__init__.py:35] No plugins for group vllm.stat_logger_plugins found.
(APIServer pid=1) DEBUG 03-03 08:15:42 [tokenizers/registry.py:63] Loading CachedHfTokenizer for tokenizer_mode='hf'
tokenizer_config.json: 54.6kB [00:00, 27.9MB/s]
tokenizer.json: 9.09MB [00:00, 20.1MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 325/325 [00:00<00:00, 2.97MB/s]
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 184/184 [00:00<00:00, 1.76MB/s]
(APIServer pid=1) DEBUG 03-03 08:15:48 [plugins/io_processors/__init__.py:33] No IOProcessor plugins requested by the model
DEBUG 03-03 08:15:52 [plugins/__init__.py:35] No plugins for group vllm.platform_plugins found.
DEBUG 03-03 08:15:52 [platforms/__init__.py:36] Checking if TPU platform is available.
DEBUG 03-03 08:15:52 [platforms/__init__.py:55] TPU platform is not available because: No module named 'libtpu'
DEBUG 03-03 08:15:52 [platforms/__init__.py:61] Checking if CUDA platform is available.
DEBUG 03-03 08:15:52 [platforms/__init__.py:84] Confirmed CUDA platform is available.
DEBUG 03-03 08:15:52 [platforms/__init__.py:112] Checking if ROCm platform is available.
DEBUG 03-03 08:15:52 [platforms/__init__.py:126] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 03-03 08:15:52 [platforms/__init__.py:133] Checking if XPU platform is available.
DEBUG 03-03 08:15:52 [platforms/__init__.py:153] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 03-03 08:15:52 [platforms/__init__.py:160] Checking if CPU platform is available.
DEBUG 03-03 08:15:52 [platforms/__init__.py:61] Checking if CUDA platform is available.
DEBUG 03-03 08:15:52 [platforms/__init__.py:84] Confirmed CUDA platform is available.
DEBUG 03-03 08:15:52 [platforms/__init__.py:228] Automatically detected platform cuda.
DEBUG 03-03 08:15:56 [utils/import_utils.py:85] Loading module triton_kernels from /opt/app-root/lib64/python3.12/site-packages/vllm/third_party/triton_kernels/__init__.py.
(EngineCore_DP0 pid=92) DEBUG 03-03 08:15:56 [v1/engine/core.py:803] Waiting for init message from front-end.
(APIServer pid=1) DEBUG 03-03 08:15:56 [v1/engine/utils.py:1063] HELLO from local core engine process 0.
(EngineCore_DP0 pid=92) DEBUG 03-03 08:15:56 [v1/engine/core.py:814] Received init message: EngineHandshakeMetadata(addresses=EngineZmqAddresses(inputs=['ipc:///tmp/4b9b175b-a832-440d-a469-5d7776f78902'], outputs=['ipc:///tmp/d0e8d14d-1650-40b8-8ad0-36d077fbe951'], coordinator_input=None, coordinator_output=None, frontend_stats_publish_address=None), parallel_config={'data_parallel_master_ip': '127.0.0.1', 'data_parallel_master_port': 0, '_data_parallel_master_port_list': [], 'data_parallel_size': 1}, parallel_config_hash=None)
(EngineCore_DP0 pid=92) DEBUG 03-03 08:15:56 [v1/engine/core.py:623] Has DP Coordinator: False, stats publish address: None
(EngineCore_DP0 pid=92) DEBUG 03-03 08:15:56 [plugins/__init__.py:43] Available plugins for group vllm.general_plugins:
(EngineCore_DP0 pid=92) DEBUG 03-03 08:15:56 [plugins/__init__.py:45] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
(EngineCore_DP0 pid=92) DEBUG 03-03 08:15:56 [plugins/__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
(EngineCore_DP0 pid=92) INFO 03-03 08:15:56 [v1/engine/core.py:93] Initializing a V1 LLM engine (v0.13.0+rhai11) with config: model='RedHatAI/Llama-3.2-1B-Instruct-FP8', speculative_config=None, tokenizer='RedHatAI/Llama-3.2-1B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=RedHatAI/Llama-3.2-1B-Instruct-FP8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=92) /opt/app-root/lib64/python3.12/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /mnt/work-dir/torch-2.9.0/torch-2.9.0/aten/src/ATen/Context.cpp:80.)
(EngineCore_DP0 pid=92) _C._set_float32_matmul_precision(precision)
(EngineCore_DP0 pid=92) DEBUG 03-03 08:15:57 [distributed/parallel_state.py:1161] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.88.0.6:37571 backend=nccl
(EngineCore_DP0 pid=92) INFO 03-03 08:15:57 [distributed/parallel_state.py:1203] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.88.0.6:37571 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=92) DEBUG 03-03 08:15:57 [distributed/parallel_state.py:1247] Detected 1 nodes in the distributed environment
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=92) INFO 03-03 08:15:57 [distributed/parallel_state.py:1411] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=92) DEBUG 03-03 08:15:57 [compilation/decorators.py:194] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.deepseek_v2.DeepseekV2Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(EngineCore_DP0 pid=92) DEBUG 03-03 08:15:57 [compilation/decorators.py:194] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:57 [attention/utils/fa_utils.py:73] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(EngineCore_DP0 pid=92) DEBUG 03-03 08:15:57 [v1/sample/ops/topk_topp_sampler.py:53] FlashInfer top-p/top-k sampling is available but disabled by default. Set VLLM_USE_FLASHINFER_SAMPLER=1 to opt in after verifying accuracy for your workloads.
(EngineCore_DP0 pid=92) DEBUG 03-03 08:15:57 [v1/sample/logits_processor/__init__.py:63] No logitsprocs plugins installed (group vllm.logits_processors).
(EngineCore_DP0 pid=92) INFO 03-03 08:15:57 [v1/worker/gpu_model_runner.py:3562] Starting to load model RedHatAI/Llama-3.2-1B-Instruct-FP8...
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] EngineCore failed to start.
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] Traceback (most recent call last):
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] super().__init__(
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] self._init_executor()
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] self.driver_worker.load_model()
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 289, in load_model
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3581, in load_model
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] self.model = model_loader.load_model(
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] model = initialize_model(
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 48, in initialize_model
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/models/llama.py", line 566, in __init__
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] self.model = self._init_model(
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/models/llama.py", line 611, in _init_model
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] return LlamaModel(vllm_config=vllm_config, prefix=prefix, layer_type=layer_type)
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/compilation/decorators.py", line 291, in __init__
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] old_init(self, **kwargs)
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/models/llama.py", line 393, in __init__
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] ^^^^^^^^^^^^
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/models/utils.py", line 606, in make_layers
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/models/llama.py", line 395, in <lambda>
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] lambda prefix: layer_type(vllm_config=vllm_config, prefix=prefix),
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/models/llama.py", line 302, in __init__
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] self.self_attn = LlamaAttention(
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/models/llama.py", line 165, in __init__
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] self.qkv_proj = QKVParallelLinear(
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 935, in __init__
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] super().__init__(
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 467, in __init__
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] super().__init__(
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 283, in __init__
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] self.quant_method = quant_config.get_quant_method(self, prefix=prefix)
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 158, in get_quant_method
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] quant_scheme = self.get_scheme(layer=layer, layer_name=prefix)
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 760, in get_scheme
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] self._check_scheme_supported(scheme.get_min_capability())
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 324, in _check_scheme_supported
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] raise RuntimeError(
(EngineCore_DP0 pid=92) ERROR 03-03 08:15:58 [v1/engine/core.py:866] RuntimeError: ('Quantization scheme is not supported for ', 'the current GPU. Min capability: 80. ', 'Current capability: 75.')
(EngineCore_DP0 pid=92) Process EngineCore_DP0:
(EngineCore_DP0 pid=92) Traceback (most recent call last):
(EngineCore_DP0 pid=92) File "/usr/lib64/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=92) self.run()
(EngineCore_DP0 pid=92) File "/usr/lib64/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=92) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core.py", line 870, in run_engine_core
(EngineCore_DP0 pid=92) raise e
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=92) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=92) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=92) super().__init__(
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=92) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=92) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=92) self._init_executor()
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=92) self.driver_worker.load_model()
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 289, in load_model
(EngineCore_DP0 pid=92) self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3581, in load_model
(EngineCore_DP0 pid=92) self.model = model_loader.load_model(
(EngineCore_DP0 pid=92) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(EngineCore_DP0 pid=92) model = initialize_model(
(EngineCore_DP0 pid=92) ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 48, in initialize_model
(EngineCore_DP0 pid=92) return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=92) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/models/llama.py", line 566, in __init__
(EngineCore_DP0 pid=92) self.model = self._init_model(
(EngineCore_DP0 pid=92) ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/models/llama.py", line 611, in _init_model
(EngineCore_DP0 pid=92) return LlamaModel(vllm_config=vllm_config, prefix=prefix, layer_type=layer_type)
(EngineCore_DP0 pid=92) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/compilation/decorators.py", line 291, in __init__
(EngineCore_DP0 pid=92) old_init(self, **kwargs)
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/models/llama.py", line 393, in __init__
(EngineCore_DP0 pid=92) self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore_DP0 pid=92) ^^^^^^^^^^^^
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/models/utils.py", line 606, in make_layers
(EngineCore_DP0 pid=92) maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(EngineCore_DP0 pid=92) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/models/llama.py", line 395, in <lambda>
(EngineCore_DP0 pid=92) lambda prefix: layer_type(vllm_config=vllm_config, prefix=prefix),
(EngineCore_DP0 pid=92) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/models/llama.py", line 302, in __init__
(EngineCore_DP0 pid=92) self.self_attn = LlamaAttention(
(EngineCore_DP0 pid=92) ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/models/llama.py", line 165, in __init__
(EngineCore_DP0 pid=92) self.qkv_proj = QKVParallelLinear(
(EngineCore_DP0 pid=92) ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 935, in __init__
(EngineCore_DP0 pid=92) super().__init__(
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 467, in __init__
(EngineCore_DP0 pid=92) super().__init__(
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 283, in __init__
(EngineCore_DP0 pid=92) self.quant_method = quant_config.get_quant_method(self, prefix=prefix)
(EngineCore_DP0 pid=92) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 158, in get_quant_method
(EngineCore_DP0 pid=92) quant_scheme = self.get_scheme(layer=layer, layer_name=prefix)
(EngineCore_DP0 pid=92) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 760, in get_scheme
(EngineCore_DP0 pid=92) self._check_scheme_supported(scheme.get_min_capability())
(EngineCore_DP0 pid=92) File "/opt/app-root/lib64/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 324, in _check_scheme_supported
(EngineCore_DP0 pid=92) raise RuntimeError(
(EngineCore_DP0 pid=92) RuntimeError: ('Quantization scheme is not supported for ', 'the current GPU. Min capability: 80. ', 'Current capability: 75.')
[rank0]:[W303 08:15:59.368654524 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=1) File "<frozen runpy>", line 88, in _run_code
(APIServer pid=1) File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1464, in <module>
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/opt/app-root/lib64/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib64/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/opt/app-root/lib64/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1393, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1412, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 173, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 214, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 215, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 121, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 820, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 477, in __init__
(APIServer pid=1) with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib64/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/utils.py", line 903, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/utils.py", line 960, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
[root@rhel-9 work]#
解答
您遇到的错误现在已经明确:您使用的模型 RedHatAI/Llama-3.2-1B-Instruct-FP8 采用了 FP8 量化,而您的 GPU(Tesla T4,计算能力 7.5)不支持 FP8(需要计算能力 ≥ 8.0 的 GPU,如 A100、H100、L4 等)。因此容器启动失败。
错误核心信息
text
RuntimeError: ('Quantization scheme is not supported for ', 'the current GPU. Min capability: 80. ', 'Current capability: 75.')
-
最低要求:计算能力 8.0(对应 Ampere 架构或更新)
-
当前 GPU:Tesla T4(Turing 架构,计算能力 7.5)
解决方案
1. 更换为兼容的模型(推荐)
选择一个不需要 FP8 的模型,例如原生的 Llama-3.2-1B-Instruct(FP16)或其他适合 T4 的模型。
示例命令(使用 Meta 官方模型):
bash
podman run --rm -it \ --device nvidia.com/gpu=all \ --security-opt=label=disable \ --shm-size=4g -p 8000:8000 \ --group-add=video --group-add=render \ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \ --env "HF_HUB_OFFLINE=0" \ -v /rhaiis-cache:/opt/app-root/src/.cache:Z \ registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \ --model meta-llama/Llama-3.2-1B-Instruct
注意:
-
需要拥有 Meta 模型的访问权限,并在 Hugging Face 上登录并生成 token。
-
确保环境变量
HF_TOKEN已正确设置(export HF_TOKEN="hf_xxx")。
如果您希望使用其他模型,也可以指定 Hugging Face 上的任何兼容模型,例如:
-
mistralai/Mistral-7B-Instruct-v0.3 -
TinyLlama/TinyLlama-1.1B-Chat-v1.0
2. 检查红帽是否提供 T4 兼容的模型版本
红帽可能还提供了 INT8 或 FP16 版本的模型,命名可能不同(例如不带 -FP8 后缀)。您可以查阅红帽的模型目录或文档,寻找适合 T4 的版本。
3. 升级 GPU(如果条件允许)
如果必须使用 FP8 模型,则需要更换为支持 FP8 的 GPU,例如:
-
NVIDIA A100 / A30(计算能力 8.0)
-
NVIDIA H100(计算能力 9.0)
-
NVIDIA L4(计算能力 8.9)
-
等等。
4. 降级使用 CPU 推理(不推荐)
您可以强制使用 CPU 运行,但推理速度极慢,仅用于测试:
bash
--device nvidia.com/gpu=all # 去掉这一行,或者改为 --device nvidia.com/gpu=none
但 vLLM 主要针对 GPU 优化,CPU 模式可能不稳定。
验证 GPU 计算能力
您可以通过以下命令查看 GPU 计算能力:
bash
nvidia-smi --query-gpu=compute_cap --format=csv
输出示例:
text
compute_cap 7.5
这表明您的 T4 计算能力为 7.5,确实不支持 FP8。
总结
您当前尝试的模型需要较新的 GPU 架构。最简单的解决方法是更换为非 FP8 模型,例如 Meta 官方的 Llama-3.2-1B-Instruct(FP16)。请确保您的 Hugging Face token 有效,并按照上述命令重新运行即可。
更多推荐


所有评论(0)