# MindIE fails to start DeepSeek-R1-Distill-Qwen-32B (Ascend Forum)

## Overview

This document is a technical tutorial generated from a post on the Ascend community forum.

**Original link**: https://www.hiascend.com/forum/thread-02117188700572290225-1-1.html

**Generated**: 2025-08-27 10:33:58

---

## Problem Description

Environment:

- NPU: Atlas 300I Duo × 4
- OS: openEuler 22.03 LTS (arm64)
- MindIE image: mindie:2.0.RC2-800I-A2-py311-openeuler24.03-lts
- Driver version: 24.1.0.1
- Firmware version: 7.5.0.5.220

Host `npu-smi`:

```
No running processes found in NPU 1
No running processes found in NPU 2
No running processes found in NPU 4
No running processes found in NPU 5
```

`npu-smi` inside the container:

```
No running processes found in NPU 0
No running processes found in NPU 32
No running processes found in NPU 32768
No running processes found in NPU 32800
```

`/dev/davinci*` inside the container:

```
davinci0 davinci1 davinci2 davinci3 davinci4 davinci5 davinci6 davinci7 davinci_manager
```

Launch command:

```shell
docker run -it -d --net=host --shm-size=1g \
    --privileged \
    --name ai \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device=/dev/devmm_svm \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
    -v /usr/local/sbin:/usr/local/sbin:ro \
    -v /data/npu/modelscope:/path-to-weights:ro \
    swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.0.RC2-300I-Duo-py311-openeuler24.03-lts bash
```

The model config.json and MindIE config.json are reproduced in full in the Full Content section below.
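The thread's two configuration files pin down the key numbers: `worldSize` 2, `npuDeviceIds` `[[0, 1]]`, 40 attention heads, 8 KV heads. A minimal sanity check over those values can be sketched as follows; the divisibility rules are generic tensor-parallel assumptions, not an official MindIE validation step:

```python
# Hypothetical consistency check over values quoted in this thread.
# The divisibility rules below are common tensor-parallel assumptions,
# NOT an official MindIE validator.
model_cfg = {            # from the model's config.json
    "num_attention_heads": 40,
    "num_key_value_heads": 8,
    "hidden_size": 5120,
}
mindie_model_cfg = {     # from MindIE config.json -> ModelConfig
    "worldSize": 2,
    "npuDeviceIds": [[0, 1]],
}

world_size = mindie_model_cfg["worldSize"]
devices = mindie_model_cfg["npuDeviceIds"][0]

# Each model instance should see exactly worldSize devices.
assert len(devices) == world_size, "worldSize must equal the device count"
# Attention heads (and KV heads) are typically sharded across ranks.
assert model_cfg["num_attention_heads"] % world_size == 0
assert model_cfg["num_key_value_heads"] % world_size == 0
print("config values are mutually consistent for worldSize =", world_size)
```

If these assertions fail, `worldSize` and `npuDeviceIds` are worth re-checking before digging further into the startup logs.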
## Related Images

### Image 1

![cke_166.png](https://fileserver.developer.huaweicloud.com/FileServer/getFile/cmtybbs/3b4/b77/b8c/482ecb08413b4b77b8c2a04191392446.20250724014823.92434149821804185904910506930492:20250827020745:2400:E3F6D9A575A6AF1A37F4B7ABBCC7D05A8CF16B0CF2558D3A99821E50AAA2E79E.png)

**Image description**: cke_166.png

## Full Content

The environment details, `npu-smi` output, and launch command repeat what is shown above; the full configuration files follow.

Model config.json:

```json
{
  "architectures": ["Qwen2ForCausalLM"],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 27648,
  "max_position_embeddings": 131072,
  "max_window_layers": 64,
  "model_type": "qwen2",
  "num_attention_heads": 40,
  "num_hidden_layers": 64,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.43.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}
```

MindIE config.json:

```json
{
  "Version": "1.0.0",
  "ServerConfig": {
    "ipAddress": "127.0.0.1",
    "managementIpAddress": "127.0.0.2",
    "port": 1040,
    "managementPort": 1041,
    "metricsPort": 1042,
    "allowAllZeroIpListening": false,
    "maxLinkNum": 1000,
    "httpsEnabled": false,
    "fullTextEnabled": false,
    "tlsCaPath": "security/ca/",
    "tlsCaFile": ["ca.pem"],
    "tlsCert": "security/certs/server.pem",
    "tlsPk": "security/keys/server.key.pem",
    "tlsPkPwd": "security/pass/key_pwd.txt",
    "tlsCrlPath": "security/certs/",
    "tlsCrlFiles": ["server_crl.pem"],
    "managementTlsCaFile": ["management_ca.pem"],
    "managementTlsCert": "security/certs/management/server.pem",
    "managementTlsPk": "security/keys/management/server.key.pem",
    "managementTlsPkPwd": "security/pass/management/key_pwd.txt",
    "managementTlsCrlPath": "security/management/certs/",
    "managementTlsCrlFiles": ["server_crl.pem"],
    "kmcKsfMaster": "tools/pmt/master/ksfa",
    "kmcKsfStandby": "tools/pmt/standby/ksfb",
    "inferMode": "standard",
    "interCommTLSEnabled": true,
    "interCommPort": 1121,
    "interCommTlsCaPath": "security/grpc/ca/",
    "interCommTlsCaFiles": ["ca.pem"],
    "interCommTlsCert": "security/grpc/certs/server.pem",
    "interCommPk": "security/grpc/keys/server.key.pem",
    "interCommPkPwd": "security/grpc/pass/key_pwd.txt",
    "interCommTlsCrlPath": "security/grpc/certs/",
    "interCommTlsCrlFiles": ["server_crl.pem"],
    "openAiSupport": "vllm",
    "tokenTimeout": 600,
    "e2eTimeout": 600,
    "distDPServerEnabled": false
  },
  "BackendConfig": {
    "backendName": "mindieservice_llm_engine",
    "modelInstanceNumber": 1,
    "npuDeviceIds": [[0, 1]],
    "tokenizerProcessNumber": 8,
    "multiNodesInferEnabled": false,
    "multiNodesInferPort": 1120,
    "interNodeTLSEnabled": true,
    "interNodeTlsCaPath": "security/grpc/ca/",
    "interNodeTlsCaFiles": ["ca.pem"],
    "interNodeTlsCert": "security/grpc/certs/server.pem",
    "interNodeTlsPk": "security/grpc/keys/server.key.pem",
    "interNodeTlsPkPwd": "security/grpc/pass/mindie_server_key_pwd.txt",
    "interNodeTlsCrlPath": "security/grpc/certs/",
    "interNodeTlsCrlFiles": ["server_crl.pem"],
    "interNodeKmcKsfMaster": "tools/pmt/master/ksfa",
    "interNodeKmcKsfStandby": "tools/pmt/standby/ksfb",
    "ModelDeployConfig": {
      "maxSeqLen": 2560,
      "maxInputTokenLen": 2048,
      "truncation": false,
      "ModelConfig": [
        {
          "modelInstanceType": "Standard",
          "modelName": "qwen",
          "modelWeightPath": "/path-to-weights",
          "worldSize": 2,
          "cpuMemSize": 5,
          "npuMemSize": -1,
          "backendType": "atb",
          "trustRemoteCode": false
        }
      ]
    },
    "ScheduleConfig": {
      "templateType": "Standard",
      "templateName": "Standard_LLM",
      "cacheBlockSize": 128,
      "maxPrefillBatchSize": 50,
      "maxPrefillTokens": 8192,
      "prefillTimeMsPerReq": 150,
      "prefillPolicyType": 0,
      "decodeTimeMsPerReq": 50,
      "decodePolicyType": 0,
      "maxBatchSize": 200,
      "maxIterTimes": 512,
      "maxPreemptCount": 0,
      "supportSelectBatch": false,
      "maxQueueDelayMicroseconds": 5000
    }
  }
}
```

## Startup Log

```txt
[root@bm-node03 bin]# ./mindieservice_daemon
[WARNING] Check path: config.json failed, by:Check path: config.json failed, by:owner id diff: current process user id is 0, file owner id is 1000
The old environment variable ATB_LOG_TO_STDOUT will be deprecated on 2025/12/31.
Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible.
The old environment variable ATB_LOG_LEVEL will be deprecated on 2025/12/31.
Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible.
The old environment variable MINDIE_LLM_LOG_LEVEL will be deprecated on 2025/12/31.
Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible.
The old environment variable MINDIE_LLM_PYTHON_LOG_TO_STDOUT will be deprecated on 2025/12/31.
Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible.
The old environment variable MINDIE_LLM_LOG_TO_STDOUT will be deprecated on 2025/12/31.
Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible.
The old environment variable MINDIE_LLM_PYTHON_LOG_PATH will be deprecated on 2025/12/31.
Please use the new environment variable MINDIE_LOG_PATH as soon as possible.
The old environment variable ATB_LOG_TO_FILE will be deprecated on 2025/12/31.
Please use the new environment variable MINDIE_LOG_TO_FILE as soon as possible.
The old environment variable MINDIE_LLM_PYTHON_LOG_LEVEL will be deprecated on 2025/12/31.
Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible.
The old environment variable MINDIE_LLM_PYTHON_LOG_TO_FILE will be deprecated on 2025/12/31.
Please use the new environment variable MINDIE_LOG_TO_FILE as soon as possible.
The old environment variable MINDIE_LLM_LOG_TO_FILE will be deprecated on 2025/12/31.
Please use the new environment variable MINDIE_LOG_TO_FILE as soon as possible.
The old environment variable OCK_LOG_LEVEL will be deprecated on 2025/12/31.
Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible.
The old environment variable OCK_LOG_TO_STDOUT will be deprecated on 2025/12/31.
Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible.
Log default log dir is ~/mindie/log, your can use env MINDIE_LOG_PATH to change log saving dir.
Log default log dir is ~/mindie/log, your can use env MINDIE_LOG_PATH to change log saving dir.
Log default log dir is ~/mindie/log, your can use env MINDIE_LOG_PATH to change log saving dir.
[msservice_profiler] [PID:3180] [DEBUG] [ReadEnable:344] profile enable_: false
[msservice_profiler] [PID:3182] [DEBUG] [ReadEnable:344] profile enable_: false
[msservice_profiler] [PID:3180] [DEBUG] [ReadAclTaskTime:372] profile enableAclTaskTime_: false
[msservice_profiler] [PID:3182] [DEBUG] [ReadAclTaskTime:372] profile enableAclTaskTime_: false
[... the block of deprecation warnings above is repeated several more times; omitted ...]
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[... the TBE error above appears 8 times in total ...]
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Daemon is killing...
Killed
```

Follow-up replies from the thread:

- "Tried that; it didn't help."
- "What error does it report?"
- "The error message is posted below:"

```txt
[2025-07-24 14:50:20,668] torch.distributed.run: [WARNING]
[2025-07-24 14:50:20,668] torch.distributed.run: [WARNING] *****************************************
[2025-07-24 14:50:20,668] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2025-07-24 14:50:20,668] torch.distributed.run: [WARNING] *****************************************
The old environment variable ATB_LOG_TO_STDOUT will be deprecated on 2025/12/31.
Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible.
[... the same block of deprecation warnings repeats; the post is truncated here ...]
```
Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible. The old environment variable ATB_LOG_TO_STDOUT will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible. The old environment variable ATB_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible. The old environment variable MINDIE_LLM_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible. The old environment variable MINDIE_LLM_PYTHON_LOG_TO_STDOUT will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible. The old environment variable MINDIE_LLM_LOG_TO_STDOUT will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible. The old environment variable MINDIE_LLM_PYTHON_LOG_PATH will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_PATH as soon as possible. The old environment variable ATB_LOG_TO_FILE will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_FILE as soon as possible. The old environment variable MINDIE_LLM_PYTHON_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible. The old environment variable MINDIE_LLM_PYTHON_LOG_TO_FILE will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_FILE as soon as possible. The old environment variable MINDIE_LLM_LOG_TO_FILE will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_FILE as soon as possible. The old environment variable OCK_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible. The old environment variable OCK_LOG_TO_STDOUT will be deprecated on 2025/12/31. 
Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible. The old environment variable ATB_LOG_TO_STDOUT will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible. The old environment variable ATB_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible. The old environment variable MINDIE_LLM_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible. The old environment variable MINDIE_LLM_PYTHON_LOG_TO_STDOUT will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible. The old environment variable MINDIE_LLM_LOG_TO_STDOUT will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible. The old environment variable MINDIE_LLM_PYTHON_LOG_PATH will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_PATH as soon as possible. The old environment variable ATB_LOG_TO_FILE will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_FILE as soon as possible. The old environment variable MINDIE_LLM_PYTHON_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible. The old environment variable MINDIE_LLM_PYTHON_LOG_TO_FILE will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_FILE as soon as possible. The old environment variable MINDIE_LLM_LOG_TO_FILE will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_FILE as soon as possible. The old environment variable OCK_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible. The old environment variable OCK_LOG_TO_STDOUT will be deprecated on 2025/12/31. 
Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible. The old environment variable ATB_LOG_TO_STDOUT will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible. The old environment variable ATB_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible. The old environment variable MINDIE_LLM_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible. The old environment variable MINDIE_LLM_PYTHON_LOG_TO_STDOUT will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible. The old environment variable MINDIE_LLM_LOG_TO_STDOUT will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible. The old environment variable MINDIE_LLM_PYTHON_LOG_PATH will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_PATH as soon as possible. The old environment variable ATB_LOG_TO_FILE will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_FILE as soon as possible. The old environment variable MINDIE_LLM_PYTHON_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible. The old environment variable MINDIE_LLM_PYTHON_LOG_TO_FILE will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_FILE as soon as possible. The old environment variable MINDIE_LLM_LOG_TO_FILE will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_FILE as soon as possible. The old environment variable OCK_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible. The old environment variable OCK_LOG_TO_STDOUT will be deprecated on 2025/12/31. 
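The warnings above all point at the same migration. A pre-launch setup using the unified variables might look like the sketch below; only the variable names come from the warnings themselves, while the level, flags, and log path are illustrative assumptions:

```shell
# Replace the deprecated per-component variables (ATB_LOG_*, MINDIE_LLM_*,
# OCK_LOG_*) with the unified MINDIE_LOG_* family before starting the service.
# The values below are illustrative, not recommendations.
export MINDIE_LOG_LEVEL=INFO              # replaces ATB_LOG_LEVEL, MINDIE_LLM_LOG_LEVEL, OCK_LOG_LEVEL, ...
export MINDIE_LOG_TO_STDOUT=1             # replaces ATB_LOG_TO_STDOUT, MINDIE_LLM_LOG_TO_STDOUT, ...
export MINDIE_LOG_TO_FILE=1               # replaces ATB_LOG_TO_FILE, MINDIE_LLM_LOG_TO_FILE, ...
export MINDIE_LOG_PATH="$HOME/mindie/log" # replaces MINDIE_LLM_PYTHON_LOG_PATH (path is an assumption)

# Unset the old variables, or the startup warnings will keep appearing.
unset ATB_LOG_LEVEL ATB_LOG_TO_STDOUT ATB_LOG_TO_FILE \
      MINDIE_LLM_LOG_LEVEL MINDIE_LLM_LOG_TO_STDOUT MINDIE_LLM_LOG_TO_FILE \
      MINDIE_LLM_PYTHON_LOG_LEVEL MINDIE_LLM_PYTHON_LOG_TO_STDOUT \
      MINDIE_LLM_PYTHON_LOG_TO_FILE MINDIE_LLM_PYTHON_LOG_PATH \
      OCK_LOG_LEVEL OCK_LOG_TO_STDOUT
```

Setting these in the shell (or Dockerfile/compose environment) that launches `mindieservice_daemon` or `run_pa.py` is enough; the processes inherit them.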
The run then proceeds normally through CPU binding and model load before failing in warm-up:

```
[2025-07-24 14:50:31,572] [925] [281473776835520] [llmmodels] [INFO] [cpu_binding.py-254] : rank_id: 1, device_id: 1, numa_id: 0, shard_devices: [0, 1, 2, 3], cpus: [0, 1, ..., 23]
[2025-07-24 14:50:31,576] [925] [281473776835520] [llmmodels] [INFO] [cpu_binding.py-280] : process 925, new_affinity is [6, 7, 8, 9, 10, 11], cpu count 6
[2025-07-24 14:50:31,715] [926] [281473398946752] [llmmodels] [INFO] [cpu_binding.py-254] : rank_id: 2, device_id: 2, numa_id: 0, shard_devices: [0, 1, 2, 3], cpus: [0, 1, ..., 23]
[2025-07-24 14:50:31,718] [926] [281473398946752] [llmmodels] [INFO] [cpu_binding.py-280] : process 926, new_affinity is [12, 13, 14, 15, 16, 17], cpu count 6
[2025-07-24 14:50:32,115] [924] [281473518373824] [llmmodels] [INFO] [cpu_binding.py-254] : rank_id: 0, device_id: 0, numa_id: 0, shard_devices: [0, 1, 2, 3], cpus: [0, 1, ..., 23]
[2025-07-24 14:50:32,118] [924] [281473518373824] [llmmodels] [INFO] [cpu_binding.py-280] : process 924, new_affinity is [0, 1, 2, 3, 4, 5], cpu count 6
[2025-07-24 14:50:32,233] [927] [281473191488448] [llmmodels] [INFO] [cpu_binding.py-254] : rank_id: 3, device_id: 3, numa_id: 0, shard_devices: [0, 1, 2, 3], cpus: [0, 1, ..., 23]
[2025-07-24 14:50:32,236] [927] [281473191488448] [llmmodels] [INFO] [cpu_binding.py-280] : process 927, new_affinity is [18, 19, 20, 21, 22, 23], cpu count 6
[2025-07-24 14:50:32,791] [924] [281473518373824] [llmmodels] [INFO] [model_runner.py-154] : model_runner.quantize: None, model_runner.kv_quant_type: None, model_runner.fa_quant_type: None, model_runner.dtype: torch.float16
[2025-07-24 14:50:41,554] [927] [281473191488448] [llmmodels] [INFO] [dist.py-81] : initialize_distributed has been Set
[2025-07-24 14:50:41,810] [927] [281473191488448] [llmmodels] [INFO] [flash_causal_qwen2.py-152] : >>>> qwen_QwenDecoderModel is called.
(the same two lines appear for ranks 2, 0 and 1, processes 926, 924 and 925)
[2025-07-24 14:50:42,454] [924] [281473518373824] [llmmodels] [INFO] [model_runner.py-176] : init tokenizer done
[2025-07-24 14:50:59,004] [924] [281473518373824] [llmmodels] [INFO] [model_runner.py-269] : model: FlashQwen2ForCausalLM(
  (rotary_embedding): PositionRotaryEmbedding()
  (attn_mask): AttentionMask()
  (transformer): FlashQwenModel(
    (wte): TensorParallelEmbedding()
    (h): ModuleList(
      (0-47): 48 x FlashQwenLayer(
        (attn): FlashQwenAttention(
          (rotary_emb): PositionRotaryEmbedding()
          (c_attn): TensorParallelColumnLinear( (linear): FastLinear() )
          (c_proj): TensorParallelRowLinear( (linear): FastLinear() )
        )
        (mlp): QwenMLP(
          (act): SiLU()
          (w2_w1): TensorParallelColumnLinear( (linear): FastLinear() )
          (c_proj): TensorParallelRowLinear( (linear): FastLinear() )
        )
        (ln_1): QwenRMSNorm()
        (ln_2): QwenRMSNorm()
      )
    )
    (ln_f): QwenRMSNorm()
  )
  (lm_head): TensorParallelHead( (linear): FastLinear() )
)
[2025-07-24 14:51:00,588] [926] [281473398946752] [llmmodels] [INFO] [cache.py-153] : kv cache will allocate 0.052734375GB memory
[2025-07-24 14:51:00,993] [927] [281473191488448] [llmmodels] [INFO] [cache.py-153] : kv cache will allocate 0.052734375GB memory
[2025-07-24 14:51:01,921] [924] [281473518373824] [llmmodels] [INFO] [run_pa.py-131] : hbm_capacity(GB): 43.2421875, init_memory(GB): 9.080859374254942
[2025-07-24 14:51:01,921] [924] [281473518373824] [llmmodels] [INFO] [run_pa.py-546] : pa_runner: PARunner(model_path=/path-to-weights/, input_text=None,
    max_position_embeddings=None, max_input_length=1024, max_output_length=20, max_prefill_tokens=-1, load_tokenizer=True, enable_atb_torch=False,
    max_prefill_batch_size=None, max_batch_size=1, dtype=torch.float16, block_size=128,
    model_config=ModelConfig(num_heads=10, num_kv_heads=2, num_kv_heads_origin=8, head_size=128, k_head_size=128, v_head_size=128, num_layers=48,
    device=npu:0, dtype=torch.float16, soc_info=NPUSocInfo(soc_name='', soc_version=202, need_nz=True, matmul_nd_nz=False),
    kv_quant_type=None, fa_quant_type=None,
    mapping=Mapping(world_size=4, rank=0, num_nodes=1, pp_rank=0, pp_groups=[[0], [1], [2], [3]], micro_batch_size=1,
    attn_dp_groups=[[0], [1], [2], [3]], attn_tp_groups=[[0, 1, 2, 3]], attn_inner_sp_groups=[[0], [1], [2], [3]],
    attn_o_proj_tp_groups=[[0], [1], [2], [3]], mlp_tp_groups=[[0, 1, 2, 3]], moe_ep_groups=[[0], [1], [2], [3]], moe_tp_groups=[[0, 1, 2, 3]]),
    cla_share_factor=1, model_type=qwen2, nz_cache=False), max_memory=46430945280,
[2025-07-24 14:51:01,922] [924] [281473518373824] [llmmodels] [INFO] [run_pa.py-244] : ---------------begin warm_up---------------
[2025-07-24 14:51:01,922] [924] [281473518373824] [llmmodels] [INFO] [cache.py-153] : kv cache will allocate 0.052734375GB memory
[2025-07-24 14:51:01,926] [924] [281473518373824] [llmmodels] [INFO] [generate.py-1055] : ------total req num: 1, infer start--------
[2025-07-24 14:51:02,086] [925] [281473776835520] [llmmodels] [INFO] [cache.py-153] : kv cache will allocate 0.052734375GB memory
[2025-07-24 14:51:03,170] [924] [281473518373824] [llmmodels] [INFO] [flash_causal_qwen2.py-556] : <<<<<<< ori k_caches[0].shape=torch.Size([9, 16, 128, 16])
[2025-07-24 14:51:03,172] [924] [281473518373824] [llmmodels] [INFO] [flash_causal_qwen2.py-560] : <<<<<<<after transdata k_caches[0].shape=torch.Size([9, 16, 128, 16])
[2025-07-24 14:51:03,173] [924] [281473518373824] [llmmodels] [INFO] [flash_causal_qwen2.py-580] : >>>>>>id of kcache is 281470384633424 id of vcache is 281470384633520
[2025-07-24 14:51:03,173] [924] [281473518373824] [llmmodels] [INFO] [flash_causal_lm.py-498] : flash_causal_lm reset: True
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/Ascend/atb-models/examples/run_pa.py", line 547, in <module>
    pa_runner.warm_up()
  File "/usr/local/Ascend/atb-models/examples/run_pa.py", line 273, in warm_up
    generate_req(req_list, self.model, self.max_batch_size, self.max_prefill_tokens, self.cache_manager)
  File "/usr/local/Ascend/atb-models/examples/server/generate.py", line 1143, in generate_req
    generate_token_with_clocking(model, cache_manager, batch)
  File "/usr/local/Ascend/atb-models/examples/server/generate.py", line 810, in generate_token_with_clocking
    res = generate_token(model, cache_manager, input_batch_in)
  File "/usr/local/Ascend/atb-models/examples/server/generate.py", line 587, in generate_token
    logits = model.forward(
  File "/usr/local/Ascend/atb-models/atb_llm/runner/model_runner.py", line 297, in forward
    res = self.model.forward(**kwargs)
  File "/usr/local/Ascend/atb-models/atb_llm/models/base/flash_causal_lm.py", line 493, in forward
    self.init_ascend_weight()
  File "/usr/local/Ascend/atb-models/atb_llm/models/qwen2/flash_causal_qwen2.py", line 282, in init_ascend_weight
    self.acl_encoder_operation.set_param(json.dumps({**encoder_param}))
RuntimeError: External Comm Manager: Create the hccl communication group failed. export ASDOPS_LOG_LEVEL=ERROR, export ASDOPS_LOG_TO_STDOUT=1 to see more details. Default log path is $HOME/atb/log.
(an identical traceback is raised by the second failing rank)
[2025-07-24 14:51:03,333] [925] [281473776835520] [llmmodels] [INFO] [flash_causal_qwen2.py-560] : <<<<<<<after transdata k_caches[0].shape=torch.Size([9, 16, 128, 16])
[ERROR] 2025-07-24-14:51:14 (PID:926, Device:2, RankID:-1) ERR99999 UNKNOWN application exception
[ERROR] 2025-07-24-14:51:14 (PID:927, Device:3, RankID:-1) ERR99999 UNKNOWN application exception
[2025-07-24 14:51:25,678] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 924 closing signal SIGTERM
[2025-07-24 14:51:25,678] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 925 closing signal SIGTERM
[2025-07-24 14:51:26,042] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 926) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples.run_pa FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-07-24_14:51:25
  host      : bm-node03
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 927)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-07-24_14:51:25
  host      : bm-node03
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 926)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!   (repeated)
```
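As a sanity check on the log before the failure: the "kv cache will allocate 0.052734375GB" line can be reproduced from the logged ModelConfig (num_layers=48, per-rank num_kv_heads=2, head_size=128, block_size=128, torch.float16) and the 9 blocks visible in `k_caches[0].shape`. The back-of-the-envelope sketch below assumes the usual paged layout of one K and one V tensor per layer; that layout is our assumption, the numbers are from the log:

```python
# Reproduce the logged per-rank kv-cache size from the logged config values.
# Layout assumption: per layer, one K and one V tensor, each holding
# num_blocks * block_size token slots of num_kv_heads * head_size fp16 values.
num_blocks = 9        # k_caches[0].shape = [9, 16, 128, 16] -> 9 blocks
block_size = 128      # tokens per block (PARunner config)
num_kv_heads = 2      # per-rank KV heads (world_size=4, num_kv_heads_origin=8)
head_size = 128
num_layers = 48
bytes_per_elem = 2    # torch.float16

kv_bytes = (num_blocks * block_size * num_kv_heads * head_size
            * bytes_per_elem * 2 * num_layers)  # * 2 for K and V
kv_gib = kv_bytes / 2**30
print(kv_gib)  # → 0.052734375, matching the cache.py log line
```

The exact match suggests the warm-up allocation itself is fine; the crash comes from HCCL group creation, not from memory sizing.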
```
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```

Solution (from the replies): enable BAR-space copy for all cards, then restart the OS for the setting to take effect. See https://support.huawei.com/enterprise/zh/doc/EDOC1100468889/788a553e?idPath=23710424|251366513|254884019|261408772|252764743
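If the HCCL group-creation error recurs after the BAR-space change, the RuntimeError text itself names the switches for collecting more detail. A minimal sketch, using exactly the variables the error message suggests:

```shell
# Switches suggested by the RuntimeError for diagnosing the failed
# HCCL communication-group creation.
export ASDOPS_LOG_LEVEL=ERROR
export ASDOPS_LOG_TO_STDOUT=1
# Re-run the failing workload in this shell, then also inspect the
# default ATB log directory, $HOME/atb/log, for details.
```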
yg9538
2025年8月27日 11:06