CANN/cannbot-skills A2 Double-Bridge Pattern
2026/5/9 13:27:33

a2 Cube-to-Vec-to-Cube Pattern (Double GM Bridge, One-Tile Lookahead)

[Free download link] cannbot-skills: CANNBot is a family of agents aimed at improving development efficiency for CANN; this repository provides its reusable Skills modules. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when writing an a2 (easyasc.a2, device b3) kernel with:

  • one cube stage that produces a tile
  • vec logic that transforms that tile
  • a later cube stage that consumes the vec result

Do not use this file for a5 kernels. The a5 path is materially different because it can publish vec output directly to L1.

When to use

  • the formula is structurally cube -> vec -> cube
  • the vec result must be consumed by a later cube matmul
  • the later cube stage naturally runs one iteration behind the producer stage
  • the vec output is large enough that pretending it stays purely on chip would be misleading

Typical example:

  • score_j = q @ k_j^T
  • vec computes p_j = exp(score_j - running_max).half()
  • delayed cube stage computes pv_j = p_j.float() @ v_j.float()

Why a2 needs a special pattern

Two a2 hardware/software constraints dominate the design:

  1. l0c_to_ub is unavailable. Cube output cannot go directly from L0C to UB.

  2. ub_to_l1_nd2nz/ub_to_l1_nz are a5-only. Vec output cannot go directly from UB to L1.

Practical consequence:

  • cube -> vec must bridge through GM workspace
  • vec -> cube must also bridge through GM workspace

So the real a2 flow is:

GM(q,k) -> L1 -> L0 -> L0C -> GM(score_ws) -> UB -> vec -> GM(p_ws) -> L1 -> L0 -> L0C -> GM(pv)

This is the core difference from a5.

Stable schedule: warmup, steady state, drain

The clean control structure is a one-tile lookahead loop:

for ni in range(0, tiles_n + 1):
    if ni < tiles_n:
        ...  # stage 1: produce current tile p_j
    if ni > 0:
        ...  # stage 2: consume previous tile p_{j-1}

Meaning:

  • ni < tiles_n: produce tile j = ni
  • ni > 0: consume tile j = ni - 1

This creates:

  • warmup: first iteration only produces
  • steady state: middle iterations produce j while consuming j-1
  • drain: last iteration only consumes the final tile

Do not force both stages into the same tile index inside one iteration. The delayed consumer is the point of the pattern.
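The warmup/steady-state/drain behavior above can be checked with a plain-Python model of the loop; the "produce"/"consume" events are stand-ins for the real cube/vec stage bodies, not easyasc code:

```python
# Plain-Python model of the one-tile lookahead loop. Only the control flow
# is real; the event tuples replace the actual cube/vec stage bodies.
def lookahead_schedule(tiles_n):
    events = []
    for ni in range(0, tiles_n + 1):
        if ni < tiles_n:
            events.append(("produce", ni))      # stage 1: tile j = ni
        if ni > 0:
            events.append(("consume", ni - 1))  # stage 2: tile j = ni - 1
    return events
```

For tiles_n = 3 this yields one produce-only warmup iteration, two steady-state iterations that produce j while consuming j-1, and one consume-only drain iteration.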

Workspace layout

Use two separate GM workspaces:

  1. score_ws

    • dtype: float
    • shape: [GetCubeNum(), 2, TILE_M, TILE_N]
    • purpose: bridge L0C (score) -> UB
  2. p_ws

    • dtype: half
    • shape: [GetCubeNum(), 2, TILE_M, TILE_N]
    • purpose: bridge UB (p_j) -> L1

Why two workspaces:

  • stage-1 score is naturally float
  • stage-2 cube input should consume half if the target contract is p_j.half().float() @ v_j.float()
  • keeping them separate makes dtype intent explicit and avoids hidden casts
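A minimal sketch of the corresponding slot addressing, assuming the row-major [GetCubeNum(), 2, TILE_M, TILE_N] layout above; the helper name ws_slot_offset is hypothetical, not part of the kernel API:

```python
# Hypothetical helper: flat element offset of one double-buffered workspace
# slot, assuming row-major [GetCubeNum(), 2, TILE_M, TILE_N].
TILE_M = TILE_N = 128

def ws_slot_offset(cube_idx, slot):
    # slot alternates 0/1 with the producer counter (e.g. stage1_cnt % 2)
    return (cube_idx * 2 + slot) * TILE_M * TILE_N
```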

Buffer ownership and reuse

1. Reuse one L0C family across both cube stages

On a2, TILE_M = TILE_N = 128 with float accumulation already fills the entire 128 KB L0C. That leaves no room for a second full-size L0C family.

Stable rule for this specific pattern:

  • reuse one physical l0c = DBuff(DT.float, [TILE_M, TILE_N], Position.L0C)
  • let both cube stages write into that same family
  • advance one shared l0c_cnt

Why this is safe here:

  • stage 1 and stage 2 do not need L0C simultaneously
  • stage 1 publishes L0C -> score_ws before stage 2 reuses the slot
  • stage 2 publishes L0C -> output before the next stage-1 reuse

Do not generalize this into "all delayed stages should share counters". This is a targeted, capacity-driven exception for one serially reused L0C family.

2. Keep other lifetimes separate

Even though l0c_cnt is shared, other stage-owned lifetimes should stay separate:

  • l1q/l1k and l1p/l1v should not share one counter
  • delayed slot ownership should use stage1_cnt and stage2_cnt

Recommended split:

  • l1qk_cnt: stage-1 operand loads
  • l1pv_cnt: stage-2 operand loads
  • l0c_cnt: shared physical L0C family
  • stage1_cnt: score_ws/p_ws producer slot rhythm
  • stage2_cnt: delayed consumer slot rhythm

3. If a delayed consumer reuses a producer operand, match buffer depth to the overlap

Sometimes the delayed cube stage needs not only the vec result, but also one of the original stage-1 operands again.

Concrete example from dense attention backward:

  • stage 1 loads k_j and computes qk_j = q @ k_j^T
  • vec computes dqk_j
  • delayed cube stage later computes gq += dqk_j @ k_j

If you want to avoid reloading k_j from GM, keep that operand family on chip and reuse it from the delayed stage.

Important overlap rule:

  • for a one-tile lookahead loop, DBuff is often not enough for a reused producer operand
  • while the delayed stage is still consuming tile j, the producer may already be starting tile j+2
  • with only two slots, tile j+2 can overwrite slot j before the delayed consumer is done

Stable rule for this case:

  • promote only the reused delayed operand family to TBuff
  • keep unrelated families such as v on DBuff if they are not reused by the delayed stage
  • let the delayed consumer index that TBuff by its own delayed-stage lineage, not by the immediate producer slot

Practical outcome:

  • k may need TBuff
  • v may still stay DBuff
  • the extra on-chip slot can be cheaper than a second GM read on every tile

This is a lifetime decision, not a micro-optimization accident. Choose the buffer depth from the real overlap window.
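That overlap window can be sanity-checked with a toy slot model, under the assumption stated above that tile j may still be in use when the producer starts tile j+2:

```python
# Toy slot-lifetime check: tile j lives in slot j % depth and must survive
# until the producer of tile j+2 picks its slot (the one-tile lookahead
# overlap window described above).
def overwrite_hazard(depth, tiles=8):
    for j in range(tiles):
        if (j + 2) % depth == j % depth:  # producer reuses a live slot
            return True
    return False
```

overwrite_hazard(2) reports the DBuff collision; depth 3 (one extra TBuff slot) clears it, which is exactly the "choose buffer depth from the real overlap window" rule.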

Cross-side synchronization

This pattern has two ownership edges.

Edge 1: cube -> vec (score)

Use:

CvMutex(0, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)

Reason:

  • producer ends with l0c_to_gm_nz2nd on FIX
  • vec consumer starts with gm_to_ub_pad on MTE2

Edge 2: vec -> cube (p_j)

Use:

VcMutex(1, src_end_pipe=Pipe.MTE3, dst_end_pipe=Pipe.FIX)

Reason:

  • vec producer ends with ub_to_gm_pad on MTE3
  • cube consumer eventually finishes the delayed use after gm_to_l1 -> l1_to_l0 -> mmad -> writeback
  • for this pattern, conservative release is safer: free only after the cube stage finishes the tile

This conservative dst_end_pipe=Pipe.FIX matches the "do not release early" rule for delayed reuse.

Two-sub-block publication rule

Each a2 cube core has 2 vec sub-blocks. Each vec sub-block owns only HALF_M rows in UB.

So stage 1 should:

  • read HALF_M rows from score_ws
  • compute p_j for only those rows
  • write those rows into the shared p_ws slot

Typical write pattern:

sb = GetSubBlockIdx()
sb_row = Var(sb * HALF_M)
p_ws[cube_idx, slot, sb_row:sb_row + HALF_M, 0:TILE_N] <<= ub_p

Then stage 2 cube waits on the VcMutex and reads the full tile:

l1p[...] <<= p_ws[cube_idx, slot, 0:TILE_M, 0:TILE_N]

Important simulator/runtime fact:

  • cube-side wait_vec() completes only after both vec lanes have produced their tokens
  • this makes the full-tile read safe without an extra manual barrier
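A NumPy sketch of the publication rule, assuming TILE_M = 128 and HALF_M = 64; the loop over sb stands in for the two vec lanes running the same code with different GetSubBlockIdx() results:

```python
# Each vec lane writes only its HALF_M rows of the shared p_ws slot; the
# cube side then reads the full [TILE_M, TILE_N] tile after both publish.
import numpy as np

TILE_M, TILE_N = 128, 128
HALF_M = TILE_M // 2

p_ws_slot = np.zeros((TILE_M, TILE_N), dtype=np.float16)
for sb in range(2):                    # stands in for GetSubBlockIdx()
    sb_row = sb * HALF_M
    ub_p = np.full((HALF_M, TILE_N), float(sb + 1), dtype=np.float16)
    p_ws_slot[sb_row:sb_row + HALF_M, :] = ub_p  # half-tile publication

full_tile = p_ws_slot                  # what l1p <<= p_ws[...] would read
```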

Row-max state rules

If the vec stage uses running row max across tiles:

  • keep the running state in [HALF_M, 1] scalar format
  • initialize with dup(neg_large) where neg_large is a sufficiently large finite negative sentinel
  • update with vmax(ub_rmax_s, ub_rmax_s, ub_max_s)
  • broadcast only after the scalar update using brcb

For TILE_N = 128, the stable sequence is:

  1. vmax between the two 64-column halves
  2. cmax to [HALF_M, 1]
  3. vmax with running state in [HALF_M, 1]
  4. brcb to [HALF_M, 8]
  5. sliced sub on [0:64] and [64:128]
  6. exp
  7. cast to half

Do not:

  • update running max in [HALF_M, 8] broadcast format
  • subtract a narrow max buffer from an unsliced [HALF_M, 128] tile
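The seven steps can be mirrored in NumPy to check the math; shapes and names are illustrative only, since the kernel operates on UB buffers with vmax/cmax/brcb rather than arrays:

```python
# NumPy mirror of steps 1-7 for one [HALF_M, TILE_N] score tile.
import numpy as np

HALF_M, TILE_N = 64, 128
rng = np.random.default_rng(0)
score = rng.standard_normal((HALF_M, TILE_N)).astype(np.float32)
running = np.full((HALF_M, 1), -3.0e38, dtype=np.float32)  # dup(neg_large)

half = np.maximum(score[:, :64], score[:, 64:])  # 1. vmax of the two halves
tile_max = half.max(axis=1, keepdims=True)       # 2. cmax to [HALF_M, 1]
running = np.maximum(running, tile_max)          # 3. vmax with running state
bcast = np.repeat(running, 8, axis=1)            # 4. brcb to [HALF_M, 8]
p = np.empty_like(score)
p[:, :64] = score[:, :64] - bcast[:, 0:1]        # 5. sliced sub on [0:64]
p[:, 64:] = score[:, 64:] - bcast[:, 0:1]        #    and on [64:128]
p = np.exp(p).astype(np.float16)                 # 6. exp  7. cast to half
```

On the first tile the running state equals the tile row max, so p matches a direct exp(score - rowmax(score)) up to float16 rounding.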

Stage ordering inside one loop iteration

For this a2 pattern, a stable order is:

  1. stage 1 cube computes score_j
  2. stage 1 vec computes p_j and writes p_ws
  3. stage 2 cube consumes delayed p_{j-1}

In other words: "produce current tile first, then consume previous tile".

Why this order is helpful:

  • the reused L0C family is naturally free after score_j -> score_ws
  • the delayed cube stage can then reuse that same L0C family safely
  • one shared l0c_cnt remains easy to reason about

If the delayed stage also reuses stage-1 k_j on chip:

  • the schedule is still "produce j, then consume j-1"
  • but the k buffer family now lives longer than the immediate v family
  • reflect that longer lifetime in the buffer depth (TBuff) and in the counter choice

Output layout rule

For flattened GM output that preserves [B, H, tiles_n, S1, D], a stable write index is:

out_row = Var((bh * tiles_n + tile_n) * S1 + local_row)

That corresponds to the physical layout:

[(bh * tiles_n + tile_n) * S1 + row, D]

Use this when the user wants to preserve the logical [B, H, tiles_n, S1, D] grouping while still flattening BH in the kernel contract.
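A quick plain-Python check that this index is a bijection over (bh, tile_n, local_row); the sizes below are illustrative examples, not part of the contract:

```python
# Verify the flattened write index covers each output row exactly once,
# for example sizes BH = 2, tiles_n = 4, S1 = 256.
BH, TILES_N, S1 = 2, 4, 256

def out_row(bh, tile_n, local_row):
    return (bh * TILES_N + tile_n) * S1 + local_row

rows = [out_row(bh, t, r)
        for bh in range(BH) for t in range(TILES_N) for r in range(S1)]
```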

Validation checklist

For the first runnable version, keep the contract narrow and explicit:

  • S1 % 128 == 0
  • S2 % 128 == 0
  • D == 128
  • scale passed in as an explicit kernel scalar

Reference formula to compare against:

for j in range(0, S2, 128):
    score_j = q.float() @ k_j.float().t() * scale
    m = maximum(m, rowmax(score_j))
    p_j = exp(score_j - m).half()
    pv_j = p_j.float() @ v_j.float()

Good first validation order:

  1. (B,H,S1,S2,D) = (1,1,256,512,128)
  2. multi-head small case
  3. full aligned case such as (1, 3, 2048, 4096, 128)
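For host-side comparison, the reference formula can be written as an executable NumPy sketch; this is a hedged illustration of the per-tile contract (running row max only, no running sum), not the kernel itself:

```python
# NumPy reference for the per-tile formula: half-precision p_j,
# float accumulation for both matmuls, running row max across tiles.
import numpy as np

def reference_tiles(q, k, v, scale):
    S1 = q.shape[0]
    m = np.full((S1, 1), -3.0e38, dtype=np.float32)  # neg_large sentinel
    outs = []
    for j in range(0, k.shape[0], 128):
        k_j, v_j = k[j:j + 128], v[j:j + 128]
        score_j = (q.astype(np.float32) @ k_j.astype(np.float32).T) * scale
        m = np.maximum(m, score_j.max(axis=1, keepdims=True))
        p_j = np.exp(score_j - m).astype(np.float16)
        pv_j = p_j.astype(np.float32) @ v_j.astype(np.float32)
        outs.append(pv_j)
    return outs
```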

Common mistakes

  • trying to use UB -> L1 directly on a2
  • allocating separate full-size L0C families for both cube stages
  • sharing every counter just because l0c_cnt is shared
  • forgetting the tiles_n + 1 warmup/drain loop
  • consuming tile j in the same iteration that is supposed to produce tile j
  • writing only one vec sub-block's rows into p_ws
  • releasing the vec -> cube mutex before the delayed cube stage really finishes
  • documenting the kernel as "online softmax" when it only keeps running max and does not maintain running sum

Files to study

  • agent/example/kernels/a2/flash_attn_score_pv.py
  • agent/example/kernels/a2/flash_attn_score_iter.py
  • agent/references/patterns/a2-cube-vec.md
  • agent/references/constraints/a2-device.md
  • agent/references/constraints/vec-reduction-a2.md
  • agent/references/constraints/vec-stride.md

