# Chat test error when serving Qwen2.5-VL-7B-Instruct with the qwen2.5-vl-7b-instruct image (MindIE, Ascend Forum)

## Overview

This document is a technical write-up generated from a post on the Ascend community forum.

**Original thread**: https://www.hiascend.com/forum/thread-0279187752695142047-1-1.html
**Generated**: 2025-08-27 10:33:58

---

## Problem description

- **Image**: swr.cn-south-1.myhuaweicloud.com/ascendhub/qwen2.5-vl-7b-instruct, tag 7.1.T2-800I-A2-aarch64 (image ID cbc1e2e038cf, 14.8 GB)
- **Model**: https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct
- **Reference document**: https://www.hiascend.com/developer/ascendhub/detail/9eedc82e0c0644b2a2a9d0821ed5e7ad

The service starts normally, but the chat test request against the OpenAI-compatible endpoint returns a 500 error (details below).

### Startup command

```
# Set the container name
export CONTAINER_NAME=qwen2.5-vl-7b-instruct
# Select the image
export IMG_NAME=swr.cn-south-1.myhuaweicloud.com/ascendhub/qwen2.5-vl-7b-instruct:7.1.T2-800I-A2-aarch64
# Start the inference microservice; ASCEND_VISIBLE_DEVICES selects the NPU IDs (range [0,7]); this run uses cards 4 and 5
docker run -itd \
    --name=$CONTAINER_NAME \
    -e ASCEND_VISIBLE_DEVICES=4,5 \
    -e MIS_CONFIG=atlas800ia2-2x64gb-bf16-vllm-default \
    -e MIS_LIMIT_VIDEO_PER_PROMPT=1 \
    -v $LOCAL_CACHE_PATH:/opt/mis/.cache \
    -p 8000:8000 \
    --shm-size 1gb \
    $IMG_NAME
```

### Container startup log

The service starts and the OpenAI-compatible routes are registered:

```
INFO 07-12 06:46:15 [__init__.py:44] plugin ascend loaded.
INFO 07-12 06:46:15 [__init__.py:230] Platform plugin ascend is activated WARNING 07-12 06:46:17 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'") INFO 07-12 06:46:18 mis_launcher:8] Local model path is /opt/mis/.cache/MindSDK/Qwen2.5-VL-7B-Instruct INFO 07-12 06:46:18 mis_launcher:8] Found model weight cached in path /opt/mis/.cache/MindSDK/Qwen2.5-VL-7B-Instruct, local model weight will be used INFO 07-12 06:46:18 __init__.py:61] MIS API server INFO 07-12 06:46:18 __init__.py:61] args: cache_path='/opt/mis/.cache' model='/opt/mis/.cache/MindSDK/Qwen2.5-VL-7B-Instruct' engine_type='vllm' served_model_name='Qwen2.5-VL-7B-Instruct' max_model_len=None enable_prefix_caching=False mis_config='atlas800ia2-2x32gb-bf16-vllm-default' host=None port=8000 inner_port=9090 ssl_keyfile=None ssl_certfile=None ssl_ca_certs=None ssl_cert_reqs=0 log_level='INFO' max_log_len=None disable_log_requests=False disable_log_stats=False api_key=None disable_fastapi_docs=False allowed_local_media_path='/opt' limit_image_per_prompt=0 limit_video_per_prompt=1 limit_audio_per_prompt=0 uvicorn_log_level='info' engine_optimization_config={'dtype': 'bfloat16', 'tensor_parallel_size': 2, 'pipeline_parallel_size': 1, 'distributed_executor_backend': 'mp', 'max_num_seqs': 128, 'max_model_len': 16384, 'max_num_batched_tokens': 16384, 'max_seq_len_to_capture': 16384, 'gpu_memory_utilization': 0.9, 'block_size': 32, 'swap_space': 4, 'cpu_offload_gb': 0, 'scheduling_policy': 'fcfs', 'enforce_eager': True} INFO 07-12 06:46:18 contextlib.py:199] Using vllm backend INFO 07-12 06:46:18 [__init__.py:30] Available plugins for group vllm.general_plugins: INFO 07-12 06:46:18 [__init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model INFO 07-12 06:46:28 [config.py:717] This model supports multiple tasks: {'generate', 'reward', 'score', 'classify', 'embed'}. Defaulting to 'generate'. INFO 07-12 06:46:28 [arg_utils.py:1669] npu is experimental on VLLM_USE_V1=1. Falling back to V0 Engine. INFO 07-12 06:46:28 [config.py:1804] Disabled the custom all-reduce kernel because it is not supported on current platform. WARNING 07-12 06:46:28 [platform.py:125] NPU compilation support pending. Will be available in future CANN and torch_npu releases. NPU graph mode is currently experimental and disabled by default. You can just adopt additional_config={'enable_graph_mode': True} to serve deepseek models with NPU graph mode on vllm-ascend with V0 engine. INFO 07-12 06:46:28 [platform.py:133] Compilation disabled, using eager mode by default INFO 07-12 06:46:28 [config.py:717] This model supports multiple tasks: {'generate', 'reward', 'score', 'classify', 'embed'}. Defaulting to 'generate'. INFO 07-12 06:46:28 [config.py:1804] Disabled the custom all-reduce kernel because it is not supported on current platform. WARNING 07-12 06:46:28 [platform.py:125] NPU compilation support pending. Will be available in future CANN and torch_npu releases. NPU graph mode is currently experimental and disabled by default. You can just adopt additional_config={'enable_graph_mode': True} to serve deepseek models with NPU graph mode on vllm-ascend with V0 engine. 
INFO 07-12 06:46:28 [platform.py:133] Compilation disabled, using eager mode by default INFO 07-12 06:46:28 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/opt/mis/.cache/MindSDK/Qwen2.5-VL-7B-Instruct', speculative_config=None, tokenizer='/opt/mis/.cache/MindSDK/Qwen2.5-VL-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen2.5-VL-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False, WARNING 07-12 06:46:28 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 192 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. (VllmWorkerProcess pid=273) INFO 07-12 06:46:28 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks (VllmWorkerProcess pid=273) WARNING 07-12 06:46:30 [utils.py:2522] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffdbcc45030> WARNING 07-12 06:46:30 [utils.py:2522] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffdbcc44ee0> INFO 07-12 06:46:49 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_a3682db0'), local_subscribe_addr='ipc:///tmp/63165690-a5fa-4d37-9f5f-c2d04efe8acd', remote_subscribe_addr=None, remote_addr_ipv6=False) (VllmWorkerProcess pid=273) INFO 07-12 06:46:49 [parallel_state.py:1004] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1 INFO 07-12 06:46:49 [parallel_state.py:1004] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0 INFO 07-12 06:46:49 [model_runner.py:943] Starting to load model /opt/mis/.cache/MindSDK/Qwen2.5-VL-7B-Instruct... (VllmWorkerProcess pid=273) INFO 07-12 06:46:49 [model_runner.py:943] Starting to load model /opt/mis/.cache/MindSDK/Qwen2.5-VL-7B-Instruct... (VllmWorkerProcess pid=273) INFO 07-12 06:46:49 [config.py:3614] cudagraph sizes specified by model runner [] is overridden by config [] INFO 07-12 06:46:49 [config.py:3614] cudagraph sizes specified by model runner [] is overridden by config [] (VllmWorkerProcess pid=273) WARNING 07-12 06:46:49 [platform.py:125] NPU compilation support pending. Will be available in future CANN and torch_npu releases. NPU graph mode is currently experimental and disabled by default. 
You can just adopt additional_config={'enable_graph_mode': True} to serve deepseek models with NPU graph mode on vllm-ascend with V0 engine. (VllmWorkerProcess pid=273) INFO 07-12 06:46:49 [platform.py:133] Compilation disabled, using eager mode by default WARNING 07-12 06:46:49 [platform.py:125] NPU compilation support pending. Will be available in future CANN and torch_npu releases. NPU graph mode is currently experimental and disabled by default. You can just adopt additional_config={'enable_graph_mode': True} to serve deepseek models with NPU graph mode on vllm-ascend with V0 engine. INFO 07-12 06:46:49 [platform.py:133] Compilation disabled, using eager mode by default Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:02, 1.73it/s] Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:00<00:01, 2.92it/s] Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:01<00:00, 2.14it/s] Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:02<00:00, 1.82it/s] (VllmWorkerProcess pid=273) INFO 07-12 06:46:52 [loader.py:458] Loading weights took 2.66 seconds Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00, 1.67it/s] Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00, 1.83it/s] INFO 07-12 06:46:52 [loader.py:458] Loading weights took 2.85 seconds (VllmWorkerProcess pid=273) INFO 07-12 06:46:52 [model_runner.py:948] Loading model weights took 7.8691 GB INFO 07-12 06:46:53 [model_runner.py:948] Loading model weights took 7.8691 GB Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. (VllmWorkerProcess pid=273) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. /opt/vllm-ascend/vllm/vllm/model_executor/models/qwen2_5_vl.py:668: UserWarning: current tensor is running as_strided, don't perform inplace operations on the returned value. If you encounter this warning and have precision issues, you can try torch.npu.config.allow_internal_format = False to resolve precision issues. (Triggered internally at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:128.) hidden_states = hidden_states[window_index, :, :] (VllmWorkerProcess pid=273) /opt/vllm-ascend/vllm/vllm/model_executor/models/qwen2_5_vl.py:668: UserWarning: current tensor is running as_strided, don't perform inplace operations on the returned value. If you encounter this warning and have precision issues, you can try torch.npu.config.allow_internal_format = False to resolve precision issues. (Triggered internally at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:128.) (VllmWorkerProcess pid=273) hidden_states = hidden_states[window_index, :, :] /usr/local/lib/python3.10/dist-packages/torch_npu/distributed/distributed_c10d.py:117: UserWarning: HCCL doesn't support gather at the moment. Implemented with allgather instead. warnings.warn("HCCL doesn't support gather at the moment. 
Implemented with allgather instead.")
INFO 07-12 06:47:33 [executor_base.py:112] # npu blocks: 47616, # CPU blocks: 4681
INFO 07-12 06:47:33 [executor_base.py:117] Maximum concurrency for 16384 tokens per request: 93.00x
INFO 07-12 06:47:34 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 41.98 seconds
WARNING 07-12 06:47:36 [config.py:1239] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 07-12 06:47:36 [serving_chat.py:118] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
INFO 07-12 06:47:36 [launcher.py:28] Available routes are:
INFO 07-12 06:47:36 [launcher.py:36] Route: /openapi.json, Methods: GET, HEAD
INFO 07-12 06:47:36 [launcher.py:36] Route: /docs, Methods: GET, HEAD
INFO 07-12 06:47:36 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 07-12 06:47:36 [launcher.py:36] Route: /redoc, Methods: GET, HEAD
INFO 07-12 06:47:36 [launcher.py:36] Route: /openai/v1/models, Methods: GET
INFO 07-12 06:47:36 [launcher.py:36] Route: /openai/v1/chat/completions, Methods: POST
INFO: Started server process [73]
INFO: Waiting for application startup.
INFO: Application startup complete.
```

## Related image

![cke_612.png](https://fileserver.developer.huaweicloud.com/FileServer/getFile/cmtybbs/cc4/12e/958/39430f23a1cc412e95877834266352f7.20250720084551.15563410970905277726184222861520:20250827021045:2400:C96AE0FFA7C8C54D634248121B0DD560601624740ADFAD03EBB62E61864AC945.png)

**Image description**: cke_612.png
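Before sending test requests, it can be worth confirming that the container actually reached the `Application startup complete` state shown in the log above. A minimal check from the host is sketched below, assuming the container name from the startup command (standard Docker CLI; adjust the name if you changed it):

```
# Confirm the container is up
docker ps --filter "name=qwen2.5-vl-7b-instruct"

# Follow the container log until the startup marker appears, then exit
docker logs -f qwen2.5-vl-7b-instruct 2>&1 | grep -m1 "Application startup complete"
```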
### Querying model information

```
curl http://192.168.2.201:8000/openai/v1/models
```

Response:

```
{"object":"list","data":[{"id":"Qwen2.5-VL-7B-Instruct","object":"model","created":1752370119,"owned_by":"vllm","max_model_len":16384}]}
```

### Chat test

```
curl http://192.168.2.201:8000/openai/v1/chat/completions -X POST -d '{
    "model": "Qwen2.5-7B-Instruct",
    "prompt": "解释量子力学基础概念",
    "max_tokens": 200,
    "temperature": 0.7
}'
```

### Error

```
INFO: 192.168.2.201:40374 - "GET /openai/v1/models HTTP/1.1" 200 OK
INFO: 192.168.2.201:41152 - "POST /openai/v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 714, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 734, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 291, in app
    solved_result = await solve_dependencies(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 666, in solve_dependencies
    ) = await request_body_to_args(  # body_params checked above
  File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 906, in request_body_to_args
    v_, errors_ = _validate_value_with_model_field(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 706, in _validate_value_with_model_field
    v_, errors_ = field.validate(value, values, loc=loc)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/_compat.py", line 129, in validate
    self._type_adapter.validate_python(value, from_attributes=True),
  File "/usr/local/lib/python3.10/dist-packages/pydantic/type_adapter.py", line 421, in validate_python
    return self.validator.validate_python(
  File "/opt/vllm-ascend/vllm/vllm/entrypoints/openai/protocol.py", line 56, in __log_extra_fields__
    result = handler(data)
  File "/opt/vllm-ascend/vllm/vllm/entrypoints/openai/protocol.py", line 723, in check_generation_prompt
    if data.get("continue_final_message") and data.get(
AttributeError: 'bytes' object has no attribute 'get'
```
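The test request above is not in the shape the `/openai/v1/chat/completions` route expects: the model name (`Qwen2.5-7B-Instruct`) does not match the served model (`Qwen2.5-VL-7B-Instruct`, as returned by `/openai/v1/models`), the body uses a `prompt` field instead of the chat `messages` array, and no `Content-Type: application/json` header is sent, which appears consistent with the body reaching validation as raw bytes (the `'bytes' object has no attribute 'get'` failure). The reply quoted in the next section also points at the model-name mismatch. A minimal corrected request in the standard OpenAI chat-completions format is sketched below; the exact set of accepted fields may vary with the image version:

```
curl http://192.168.2.201:8000/openai/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen2.5-VL-7B-Instruct",
        "messages": [
          {"role": "user", "content": "解释量子力学基础概念"}
        ],
        "max_tokens": 200,
        "temperature": 0.7
      }'
```

With a JSON Content-Type and a `messages` array, a malformed request should come back as a structured 4xx validation error rather than a 500 from the ASGI layer.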
"/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 288, in handle await self.app(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 76, in app await wrap_app_handling_exceptions(app, request)(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app raise exc File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app await app(scope, receive, sender) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 73, in app response = await f(request) File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 291, in app solved_result = await solve_dependencies( File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 666, in solve_dependencies ) = await request_body_to_args( # body_params checked above File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 906, in request_body_to_args v_, errors_ = _validate_value_with_model_field( File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 706, in _validate_value_with_model_field v_, errors_ = field.validate(value, values, loc=loc) File "/usr/local/lib/python3.10/dist-packages/fastapi/_compat.py", line 129, in validate self._type_adapter.validate_python(value, from_attributes=True), File "/usr/local/lib/python3.10/dist-packages/pydantic/type_adapter.py", line 421, in validate_python return self.validator.validate_python( File "/opt/vllm-ascend/vllm/vllm/entrypoints/openai/protocol.py", line 56, in __log_extra_fields__ result = handler(data) File "/opt/vllm-ascend/vllm/vllm/entrypoints/openai/protocol.py", line 723, in check_generation_prompt if data.get("continue_final_message") and data.get( AttributeError: 'bytes' object has no attribute 'get'复制 您好请问下您使用的模型是Qwen2.5-VL-7B 还是 Qwen2.5-7B 模型使用的2.5vl7B配置文件的模型名称配置不严谨了问题已经通过使用MindIE容器+2.5VL适配代码解决了 -e MIS_CONFIG=atlas800ia2-2x64gb-bf16-vllm-default 310p3 用什么参数 --- ## 技术要点总结 基于以上内容,主要技术要点包括: 1. **问题类型**: 错误处理 2. **涉及技术**: MindIE, HTTPS, SSL, GPU, NPU, Docker, Atlas, 昇腾, CANN, AI 3. **解决方案**: 请参考完整内容中的解决方案 ## 相关资源 - 昇腾社区: https://www.hiascend.com/ - 昇腾论坛: https://www.hiascend.com/forum/ --- *本文档由AI自动生成,仅供参考。如有疑问,请参考原始帖子。*