CANN/cannbot-skills：npugraph_ex后端使用指南-创锋一号

npugraph_ex 后端使用指南

【免费下载链接】cannbot-skillsCANNBot 是面向 CANN 开发的用于提升开发效率的系列智能体，本仓库为其提供可复用的 Skills 模块。项目地址: https://gitcode.com/cann/cannbot-skills

npugraph_ex 是基于 torch.compile 的 aclgraph 图模式加速方案，通过捕获模式 (Capture & Replay) 构建模型运行实例。

适用场景

在线推理场景：追求简单快速适配
熟悉 CUDAGraph 模式：使用习惯类似
LLM Decode 阶段：固定 shape 的单 token 输入

使用约束

约束项	说明
PyTorch版本	需要 2.6.0 及以上版本
支持场景	在线推理场景，不支持反向流程 capture
随机数算子	不支持 capture（randn、dropout 等）
动态控制流	不支持，需保证图静态
Stream同步	不支持 stream sync 操作
成熟度	试验特性，暂不支持商用产品

快速上手

import torch import torch_npu model = YourModel().npu() opt_model = torch.compile(model, backend="npugraph_ex", fullgraph=True, dynamic=False) output = opt_model(input_tensor)

详细说明见 npugraph_ex 快速上手

问题定界流程

问题发生 │ ├─→ aot_eager 验证 ──失败──→ 修复用户脚本 │ ↓ 正常 │ ├─→ force_eager=True ──失败──→ 修复用户脚本 │ ↓ 正常 │ └─→ npugraph_ex/FX图问题

options 配置速查

opt_model = torch.compile( model, backend="npugraph_ex", fullgraph=True, options={ # ========== 调试 ========== "force_eager": False, # 强制 eager 模式调试 # ========== FX图优化 ========== "inplace_pass": True, # 原地操作优化 "input_inplace_pass": True, # 输入原地优化 "pattern_fusion_pass": True, # 算子融合 # ========== 内存优化 ========== "reuse_graph_pool_in_same_fx": True, # 图池复用 "clone_input": True, # 克隆输入 "clone_output": False, # 克隆输出 "use_graph_pool": None, # 图池配置 # ========== 性能优化 ========== "static_kernel_compile": False, # 静态Kernel编译 "remove_noop_ops": True, # 移除空操作 "frozen_parameter": False, # 冻结参数 # ========== 捕获控制 ========== "capture_limit": 64, # 重捕获次数限制 } )

核心 API

API	用途	文档路径
`compile_fx()`	自定义 backend	compile_fx.md
`register_replacement()`	自定义算子融合	register_replacement.md
`cache_compile()`	编译缓存	cache_compile.md
`limit_core_num()`	限核功能	limit_core_num.md

功能文档索引

在线文档：gitcode.com/Ascend/torchair/docs/zh/npugraph_ex

基础功能

功能	在线链接
npugraph_ex 后端概述	npugraph_ex.md
快速上手	quick_start.md
基础功能总览	basic/
force_eager	force_eager.md
FX图优化Pass配置	inplace_pass.md
内存复用	memory_reuse.md
静态Kernel编译	static_kernel_compile.md
冗余算子消除	remove_noop_ops.md
重捕获次数限制	capture_limit.md
集合通信入图	communication_graph.md

进阶功能

功能	在线链接
进阶功能总览	advanced/
模型编译缓存	compile_cache.md
多流表达功能	multi_stream.md
AI Core/Vector Core 限核	limit_cores.md
自定义FX图优化Pass	post_grad_custom_pass.md

DFX 功能

功能	在线链接
DFX 总览	dfx/
图编译Debug信息保存	debug_save.md
算子Data Dump	data_dump.md

API 参考

API	在线链接
API 总览	api/
API 列表	api_list.md
torch.npu.npugraph_ex	torch-npu-npugraph_ex.md
compile_fx	compile_fx.md
register_replacement	register_replacement.md
inference (cache_compile, readable_cache)	inference/
scope (limit_core_num)	scope/

常见问题

1. 如何判断是否应该使用 npugraph_ex？

适合：LLM decode 阶段、固定 shape 推理、简单快速适配
不适合：需要动态 shape、生产环境稳定性优先、训练场景

2. 报错 "不支持 capture" 怎么办？

检查代码中是否包含：

随机数算子（randn、dropout）
动态控制流（if/while 基于 tensor 值）
.item()调用

3. 性能劣化怎么办？

开启重编译日志：torch._logging.set_logs(recompiles=True)
检查是否发生重编译
参考llm-model-guide.md中的改造指南

企业官网建设流程全解析

npugraph_ex 后端使用指南

适用场景

使用约束

快速上手

问题定界流程

options 配置速查

核心 API

功能文档索引

基础功能

进阶功能

DFX 功能

API 参考

常见问题

1. 如何判断是否应该使用 npugraph_ex？

2. 报错 "不支持 capture" 怎么办？

3. 性能劣化怎么办？

相关文档

热门文章

文章分类

标签云

需要专业的网站建设服务？

企业官网建设流程全解析

npugraph_ex 后端使用指南

适用场景

使用约束

快速上手

问题定界流程

options 配置速查

核心 API

功能文档索引

基础功能

进阶功能

DFX 功能

API 参考

常见问题

1. 如何判断是否应该使用 npugraph_ex？

2. 报错 "不支持 capture" 怎么办？

3. 性能劣化怎么办？

相关文档

热门文章

文章分类

标签云

相关文章

昇思大模型推理评估指标详解

泰山派3M-RK3576-Ai应用-YOLO11-分割模型

CANN/shmem调度GMM组合示例

需要专业的网站建设服务？