# Chat test error when serving Qwen2.5-VL-7B-Instruct with the qwen2.5-vl-7b-instruct image (MindIE, Ascend Forum)

## Overview

This document is a technical write-up generated from a post on the Ascend community forum.

**Original thread**: https://www.hiascend.com/forum/thread-0279187752695142047-1-1.html
**Generated**: 2025-08-27 10:33:58

---

## Problem description

- **Image**: swr.cn-south-1.myhuaweicloud.com/ascendhub/qwen2.5-vl-7b-instruct, tag 7.1.T2-800I-A2-aarch64 (image ID cbc1e2e038cf, 14.8 GB)
- **Model**: https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct
- **Reference document**: https://www.hiascend.com/developer/ascendhub/detail/9eedc82e0c0644b2a2a9d0821ed5e7ad

The service starts normally, but the chat test request against the OpenAI-compatible endpoint returns a 500 error (details below).

### Startup command

```
# Set the container name
export CONTAINER_NAME=qwen2.5-vl-7b-instruct
# Select the image
export IMG_NAME=swr.cn-south-1.myhuaweicloud.com/ascendhub/qwen2.5-vl-7b-instruct:7.1.T2-800I-A2-aarch64
# Start the inference microservice; ASCEND_VISIBLE_DEVICES selects the NPU IDs (range [0,7]); this run uses cards 4 and 5
docker run -itd \
    --name=$CONTAINER_NAME \
    -e ASCEND_VISIBLE_DEVICES=4,5 \
    -e MIS_CONFIG=atlas800ia2-2x64gb-bf16-vllm-default \
    -e MIS_LIMIT_VIDEO_PER_PROMPT=1 \
    -v $LOCAL_CACHE_PATH:/opt/mis/.cache \
    -p 8000:8000 \
    --shm-size 1gb \
    $IMG_NAME
```

### Container startup log

The service starts and the OpenAI-compatible routes are registered:

```
INFO 07-12 06:46:15 [__init__.py:44] plugin ascend loaded.
INFO 07-12 06:46:15 [__init__.py:230] Platform plugin ascend is activated WARNING 07-12 06:46:17 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'") INFO 07-12 06:46:18 mis_launcher:8] Local model path is /opt/mis/.cache/MindSDK/Qwen2.5-VL-7B-Instruct INFO 07-12 06:46:18 mis_launcher:8] Found model weight cached in path /opt/mis/.cache/MindSDK/Qwen2.5-VL-7B-Instruct, local model weight will be used INFO 07-12 06:46:18 __init__.py:61] MIS API server INFO 07-12 06:46:18 __init__.py:61] args: cache_path='/opt/mis/.cache' model='/opt/mis/.cache/MindSDK/Qwen2.5-VL-7B-Instruct' engine_type='vllm' served_model_name='Qwen2.5-VL-7B-Instruct' max_model_len=None enable_prefix_caching=False mis_config='atlas800ia2-2x32gb-bf16-vllm-default' host=None port=8000 inner_port=9090 ssl_keyfile=None ssl_certfile=None ssl_ca_certs=None ssl_cert_reqs=0 log_level='INFO' max_log_len=None disable_log_requests=False disable_log_stats=False api_key=None disable_fastapi_docs=False allowed_local_media_path='/opt' limit_image_per_prompt=0 limit_video_per_prompt=1 limit_audio_per_prompt=0 uvicorn_log_level='info' engine_optimization_config={'dtype': 'bfloat16', 'tensor_parallel_size': 2, 'pipeline_parallel_size': 1, 'distributed_executor_backend': 'mp', 'max_num_seqs': 128, 'max_model_len': 16384, 'max_num_batched_tokens': 16384, 'max_seq_len_to_capture': 16384, 'gpu_memory_utilization': 0.9, 'block_size': 32, 'swap_space': 4, 'cpu_offload_gb': 0, 'scheduling_policy': 'fcfs', 'enforce_eager': True} INFO 07-12 06:46:18 contextlib.py:199] Using vllm backend INFO 07-12 06:46:18 [__init__.py:30] Available plugins for group vllm.general_plugins: INFO 07-12 06:46:18 [__init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model INFO 07-12 06:46:28 [config.py:717] This model supports multiple tasks: {'generate', 'reward', 'score', 'classify', 'embed'}. Defaulting to 'generate'. INFO 07-12 06:46:28 [arg_utils.py:1669] npu is experimental on VLLM_USE_V1=1. Falling back to V0 Engine. INFO 07-12 06:46:28 [config.py:1804] Disabled the custom all-reduce kernel because it is not supported on current platform. WARNING 07-12 06:46:28 [platform.py:125] NPU compilation support pending. Will be available in future CANN and torch_npu releases. NPU graph mode is currently experimental and disabled by default. You can just adopt additional_config={'enable_graph_mode': True} to serve deepseek models with NPU graph mode on vllm-ascend with V0 engine. INFO 07-12 06:46:28 [platform.py:133] Compilation disabled, using eager mode by default INFO 07-12 06:46:28 [config.py:717] This model supports multiple tasks: {'generate', 'reward', 'score', 'classify', 'embed'}. Defaulting to 'generate'. INFO 07-12 06:46:28 [config.py:1804] Disabled the custom all-reduce kernel because it is not supported on current platform. WARNING 07-12 06:46:28 [platform.py:125] NPU compilation support pending. Will be available in future CANN and torch_npu releases. NPU graph mode is currently experimental and disabled by default. You can just adopt additional_config={'enable_graph_mode': True} to serve deepseek models with NPU graph mode on vllm-ascend with V0 engine. 
INFO 07-12 06:46:28 [platform.py:133] Compilation disabled, using eager mode by default INFO 07-12 06:46:28 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/opt/mis/.cache/MindSDK/Qwen2.5-VL-7B-Instruct', speculative_config=None, tokenizer='/opt/mis/.cache/MindSDK/Qwen2.5-VL-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen2.5-VL-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False, WARNING 07-12 06:46:28 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 192 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. (VllmWorkerProcess pid=273) INFO 07-12 06:46:28 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks (VllmWorkerProcess pid=273) WARNING 07-12 06:46:30 [utils.py:2522] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffdbcc45030> WARNING 07-12 06:46:30 [utils.py:2522] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffdbcc44ee0> INFO 07-12 06:46:49 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_a3682db0'), local_subscribe_addr='ipc:///tmp/63165690-a5fa-4d37-9f5f-c2d04efe8acd', remote_subscribe_addr=None, remote_addr_ipv6=False) (VllmWorkerProcess pid=273) INFO 07-12 06:46:49 [parallel_state.py:1004] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1 INFO 07-12 06:46:49 [parallel_state.py:1004] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0 INFO 07-12 06:46:49 [model_runner.py:943] Starting to load model /opt/mis/.cache/MindSDK/Qwen2.5-VL-7B-Instruct... (VllmWorkerProcess pid=273) INFO 07-12 06:46:49 [model_runner.py:943] Starting to load model /opt/mis/.cache/MindSDK/Qwen2.5-VL-7B-Instruct... (VllmWorkerProcess pid=273) INFO 07-12 06:46:49 [config.py:3614] cudagraph sizes specified by model runner [] is overridden by config [] INFO 07-12 06:46:49 [config.py:3614] cudagraph sizes specified by model runner [] is overridden by config [] (VllmWorkerProcess pid=273) WARNING 07-12 06:46:49 [platform.py:125] NPU compilation support pending. Will be available in future CANN and torch_npu releases. NPU graph mode is currently experimental and disabled by default. 
You can just adopt additional_config={'enable_graph_mode': True} to serve deepseek models with NPU graph mode on vllm-ascend with V0 engine. (VllmWorkerProcess pid=273) INFO 07-12 06:46:49 [platform.py:133] Compilation disabled, using eager mode by default WARNING 07-12 06:46:49 [platform.py:125] NPU compilation support pending. Will be available in future CANN and torch_npu releases. NPU graph mode is currently experimental and disabled by default. You can just adopt additional_config={'enable_graph_mode': True} to serve deepseek models with NPU graph mode on vllm-ascend with V0 engine. INFO 07-12 06:46:49 [platform.py:133] Compilation disabled, using eager mode by default Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:02, 1.73it/s] Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:00<00:01, 2.92it/s] Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:01<00:00, 2.14it/s] Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:02<00:00, 1.82it/s] (VllmWorkerProcess pid=273) INFO 07-12 06:46:52 [loader.py:458] Loading weights took 2.66 seconds Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00, 1.67it/s] Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00, 1.83it/s] INFO 07-12 06:46:52 [loader.py:458] Loading weights took 2.85 seconds (VllmWorkerProcess pid=273) INFO 07-12 06:46:52 [model_runner.py:948] Loading model weights took 7.8691 GB INFO 07-12 06:46:53 [model_runner.py:948] Loading model weights took 7.8691 GB Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. (VllmWorkerProcess pid=273) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. /opt/vllm-ascend/vllm/vllm/model_executor/models/qwen2_5_vl.py:668: UserWarning: current tensor is running as_strided, don't perform inplace operations on the returned value. If you encounter this warning and have precision issues, you can try torch.npu.config.allow_internal_format = False to resolve precision issues. (Triggered internally at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:128.) hidden_states = hidden_states[window_index, :, :] (VllmWorkerProcess pid=273) /opt/vllm-ascend/vllm/vllm/model_executor/models/qwen2_5_vl.py:668: UserWarning: current tensor is running as_strided, don't perform inplace operations on the returned value. If you encounter this warning and have precision issues, you can try torch.npu.config.allow_internal_format = False to resolve precision issues. (Triggered internally at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:128.) (VllmWorkerProcess pid=273) hidden_states = hidden_states[window_index, :, :] /usr/local/lib/python3.10/dist-packages/torch_npu/distributed/distributed_c10d.py:117: UserWarning: HCCL doesn't support gather at the moment. Implemented with allgather instead. warnings.warn("HCCL doesn't support gather at the moment. 
Implemented with allgather instead.")
INFO 07-12 06:47:33 [executor_base.py:112] # npu blocks: 47616, # CPU blocks: 4681
INFO 07-12 06:47:33 [executor_base.py:117] Maximum concurrency for 16384 tokens per request: 93.00x
INFO 07-12 06:47:34 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 41.98 seconds
WARNING 07-12 06:47:36 [config.py:1239] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 07-12 06:47:36 [serving_chat.py:118] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
INFO 07-12 06:47:36 [launcher.py:28] Available routes are:
INFO 07-12 06:47:36 [launcher.py:36] Route: /openapi.json, Methods: GET, HEAD
INFO 07-12 06:47:36 [launcher.py:36] Route: /docs, Methods: GET, HEAD
INFO 07-12 06:47:36 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 07-12 06:47:36 [launcher.py:36] Route: /redoc, Methods: GET, HEAD
INFO 07-12 06:47:36 [launcher.py:36] Route: /openai/v1/models, Methods: GET
INFO 07-12 06:47:36 [launcher.py:36] Route: /openai/v1/chat/completions, Methods: POST
INFO: Started server process [73]
INFO: Waiting for application startup.
INFO: Application startup complete.
```

## Related image

![cke_612.png](https://fileserver.developer.huaweicloud.com/FileServer/getFile/cmtybbs/cc4/12e/958/39430f23a1cc412e95877834266352f7.20250720084551.15563410970905277726184222861520:20250827021045:2400:C96AE0FFA7C8C54D634248121B0DD560601624740ADFAD03EBB62E61864AC945.png)

**Image description**: cke_612.png
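Before sending test requests, it can be worth confirming that the container actually reached the `Application startup complete` state shown in the log above. A minimal check from the host is sketched below, assuming the container name from the startup command (standard Docker CLI; adjust the name if you changed it):

```
# Confirm the container is up
docker ps --filter "name=qwen2.5-vl-7b-instruct"

# Follow the container log until the startup marker appears, then exit
docker logs -f qwen2.5-vl-7b-instruct 2>&1 | grep -m1 "Application startup complete"
```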
### Querying model information

```
curl http://192.168.2.201:8000/openai/v1/models
```

Response:

```
{"object":"list","data":[{"id":"Qwen2.5-VL-7B-Instruct","object":"model","created":1752370119,"owned_by":"vllm","max_model_len":16384}]}
```

### Chat test

```
curl http://192.168.2.201:8000/openai/v1/chat/completions -X POST -d '{
    "model": "Qwen2.5-7B-Instruct",
    "prompt": "解释量子力学基础概念",
    "max_tokens": 200,
    "temperature": 0.7
}'
```

### Error

```
INFO: 192.168.2.201:40374 - "GET /openai/v1/models HTTP/1.1" 200 OK
INFO: 192.168.2.201:41152 - "POST /openai/v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 714, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 734, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 291, in app
    solved_result = await solve_dependencies(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 666, in solve_dependencies
    ) = await request_body_to_args(  # body_params checked above
  File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 906, in request_body_to_args
    v_, errors_ = _validate_value_with_model_field(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 706, in _validate_value_with_model_field
    v_, errors_ = field.validate(value, values, loc=loc)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/_compat.py", line 129, in validate
    self._type_adapter.validate_python(value, from_attributes=True),
  File "/usr/local/lib/python3.10/dist-packages/pydantic/type_adapter.py", line 421, in validate_python
    return self.validator.validate_python(
  File "/opt/vllm-ascend/vllm/vllm/entrypoints/openai/protocol.py", line 56, in __log_extra_fields__
    result = handler(data)
  File "/opt/vllm-ascend/vllm/vllm/entrypoints/openai/protocol.py", line 723, in check_generation_prompt
    if data.get("continue_final_message") and data.get(
AttributeError: 'bytes' object has no attribute 'get'
```
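The test request above is not in the shape the `/openai/v1/chat/completions` route expects: the model name (`Qwen2.5-7B-Instruct`) does not match the served model (`Qwen2.5-VL-7B-Instruct`, as returned by `/openai/v1/models`), the body uses a `prompt` field instead of the chat `messages` array, and no `Content-Type: application/json` header is sent, which appears consistent with the body reaching validation as raw bytes (the `'bytes' object has no attribute 'get'` failure). The reply quoted in the next section also points at the model-name mismatch. A minimal corrected request in the standard OpenAI chat-completions format is sketched below; the exact set of accepted fields may vary with the image version:

```
curl http://192.168.2.201:8000/openai/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen2.5-VL-7B-Instruct",
        "messages": [
          {"role": "user", "content": "解释量子力学基础概念"}
        ],
        "max_tokens": 200,
        "temperature": 0.7
      }'
```

With a JSON Content-Type and a `messages` array, a malformed request should come back as a structured 4xx validation error rather than a 500 from the ASGI layer.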
"/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 288, in handle await self.app(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 76, in app await wrap_app_handling_exceptions(app, request)(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app raise exc File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app await app(scope, receive, sender) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 73, in app response = await f(request) File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 291, in app solved_result = await solve_dependencies( File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 666, in solve_dependencies ) = await request_body_to_args( # body_params checked above File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 906, in request_body_to_args v_, errors_ = _validate_value_with_model_field( File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 706, in _validate_value_with_model_field v_, errors_ = field.validate(value, values, loc=loc) File "/usr/local/lib/python3.10/dist-packages/fastapi/_compat.py", line 129, in validate self._type_adapter.validate_python(value, from_attributes=True), File "/usr/local/lib/python3.10/dist-packages/pydantic/type_adapter.py", line 421, in validate_python return self.validator.validate_python( File "/opt/vllm-ascend/vllm/vllm/entrypoints/openai/protocol.py", line 56, in __log_extra_fields__ result = handler(data) File "/opt/vllm-ascend/vllm/vllm/entrypoints/openai/protocol.py", line 723, in check_generation_prompt if data.get("continue_final_message") and data.get( AttributeError: 'bytes' object has no attribute 'get'复制 您好请问下您使用的模型是Qwen2.5-VL-7B 还是 Qwen2.5-7B 模型使用的2.5vl7B配置文件的模型名称配置不严谨了问题已经通过使用MindIE容器+2.5VL适配代码解决了 -e MIS_CONFIG=atlas800ia2-2x64gb-bf16-vllm-default 310p3 用什么参数 --- ## 技术要点总结 基于以上内容,主要技术要点包括: 1. **问题类型**: 错误处理 2. **涉及技术**: MindIE, HTTPS, SSL, GPU, NPU, Docker, Atlas, 昇腾, CANN, AI 3. **解决方案**: 请参考完整内容中的解决方案 ## 相关资源 - 昇腾社区: https://www.hiascend.com/ - 昇腾论坛: https://www.hiascend.com/forum/ --- *本文档由AI自动生成,仅供参考。如有疑问,请参考原始帖子。*