Summary
ttl.math.reduce_max and ttl.math.reduce_sum with dims=[1] silently return all zeros when the input tile is fp32. The same kernel on bfloat16 tiles returns correct per-row values. dims=[0] and dims=[0, 1] work correctly in fp32; only dims=[1] is affected.
The bug appears to be in the fp32-accumulation lowering for reduce: passing options="--no-ttl-reduce-full-fp32" to @ttl.operation makes fp32 dims=[1] return correct values.
Environment
- ttlang 1.0.0.dev4 (from the Python error traceback)
- Blackhole hardware (bh-qbae-15 / sterling-all container)
- layout=ttnn.TILE_LAYOUT, tested with both DRAM_MEMORY_CONFIG and L1_MEMORY_CONFIG
Minimal repro
```python
import torch, ttnn, ttl

TILE = 32

@ttl.operation(grid=(1, 1))
def reduce_max_kernel(x, scaler, out):
    x_dfb = ttl.make_dataflow_buffer_like(x, shape=(1, 1), block_count=2)
    sc_dfb = ttl.make_dataflow_buffer_like(scaler, shape=(1, 1), block_count=1)
    red_dfb = ttl.make_dataflow_buffer_like(x, shape=(1, 1), block_count=2)
    out_dfb = ttl.make_dataflow_buffer_like(out, shape=(1, 1), block_count=2)

    @ttl.compute()
    def compute():
        sc = sc_dfb.wait()
        xb = x_dfb.wait()
        red_dfb.reserve().store(ttl.math.reduce_max(xb, sc, dims=[1]))
        mb = out_dfb.reserve()
        mb.store(ttl.math.broadcast(red_dfb.wait(), mb, dims=[1]))

    @ttl.datamovement()
    def dm_read():
        ttl.copy(scaler[0, 0], sc_dfb.reserve()).wait()
        ttl.copy(x[0, 0], x_dfb.reserve()).wait()

    @ttl.datamovement()
    def dm_write():
        ttl.copy(out_dfb.wait(), out[0, 0]).wait()

# Input: row r filled with value r -> row max should be r.
for dtype_t, dtype_nn in [(torch.bfloat16, ttnn.bfloat16), (torch.float32, ttnn.float32)]:
    x = torch.zeros(TILE, TILE, dtype=dtype_t)
    for r in range(TILE):
        x[r, :] = float(r)
    sc = torch.ones(TILE, TILE, dtype=dtype_t)
    out = torch.zeros_like(x)
    cfg = dict(dtype=dtype_nn, layout=ttnn.TILE_LAYOUT,
               memory_config=ttnn.DRAM_MEMORY_CONFIG)
    device = ttnn.open_device(device_id=0)
    x_tt = ttnn.from_torch(x, device=device, **cfg)
    sc_tt = ttnn.from_torch(sc, device=device, **cfg)
    out_tt = ttnn.from_torch(out, device=device, **cfg)
    reduce_max_kernel(x_tt, sc_tt, out_tt)
    print(dtype_t, ttnn.to_torch(out_tt).float()[:8, 0])
    ttnn.close_device(device)
```
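For comparison, the expected result can be computed on the host with plain torch. This is a sketch of my reading of the intended semantics (reduce with dims=[1] takes the per-row max, which the kernel then broadcasts back across each row); it is a host-side reference, not part of the ttl API:

```python
import torch

TILE = 32

# Same input as the repro: row r filled with value r.
x = torch.zeros(TILE, TILE, dtype=torch.float32)
for r in range(TILE):
    x[r, :] = float(r)

# Per-row max (reduce over dim 1), broadcast back across each row.
expected = x.max(dim=1, keepdim=True).values.expand(TILE, TILE)

print(expected[:8, 0])  # rows 0..7 -> 0, 1, 2, ..., 7
```

The bf16 device run matches this reference; the fp32 run returns all zeros instead.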
Expected vs observed
| dtype | out[:, 0] first 8 |
| --- | --- |
| bf16 | [0, 1, 2, 3, 4, 5, 6, 7] (correct) |
| fp32 | [0, 0, 0, 0, 0, 0, 0, 0] (wrong) |
The same pattern exercised in test/python/simple_reduce_bcast.py works fine but uses bf16 + dims=[0, 1], so does not exercise this path.
Knobs tried
| Config | fp32 dims=[1] result |
| --- | --- |
| default | zeros |
| fp32_dest_acc_en=True | zeros |
| fp32_dest_acc_en=True, dst_full_sync_en=False | zeros |
| L1_MEMORY_CONFIG instead of DRAM | zeros |
| options="--no-ttl-maximize-dst" | zeros |
| options="--no-ttl-reduce-full-fp32" | correct |
Nested-with layout (matching simple_reduce_bcast.py) vs flat layout makes no difference; only --no-ttl-reduce-full-fp32 changes the outcome.
Scope
- reduce_max and reduce_sum are both affected.
- Only dims=[1] on fp32 tiles. dims=[0] and dims=[0, 1] are correct.
- bf16 tiles are unaffected regardless of dims.
Workaround
Pass options="--no-ttl-reduce-full-fp32" to @ttl.operation until the fp32-accumulation reduce lowering is fixed for dims=[1].
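Concretely, the only change to the repro above is the decorator line (a sketch; the kernel body is unchanged, and I am assuming options is forwarded to the compiler as in the knobs table):

```python
# Workaround: disable the fp32-accumulation lowering for this op's reduces.
@ttl.operation(grid=(1, 1), options="--no-ttl-reduce-full-fp32")
def reduce_max_kernel(x, scaler, out):
    ...  # body identical to the minimal repro
```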