reduce_max/reduce_sum dims=[1] returns zeros for fp32 tiles (fp32_reduce_acc path) #533

@zoecarver

Description

Summary

ttl.math.reduce_max and ttl.math.reduce_sum with dims=[1] silently return all zeros when the input tile is fp32. The same kernel on bfloat16 tiles returns correct per-row values. dims=[0] and dims=[0, 1] work correctly in fp32; only dims=[1] is affected.
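
For reference, dims=[1] here means a per-row reduction of the 32x32 tile. As a host-side analogy (a sketch, assuming dims follows the same axis convention as torch's dim argument), the three cases correspond to:

import torch

x = torch.arange(32, dtype=torch.float32).unsqueeze(1).expand(32, 32)  # row r holds value r
row_max  = x.max(dim=1).values  # dims=[1]:    per-row max    -> [0, 1, ..., 31]  (zeros in fp32 on device)
col_max  = x.max(dim=0).values  # dims=[0]:    per-column max -> all 31           (correct in fp32)
full_max = x.max()              # dims=[0, 1]: global max     -> 31               (correct in fp32)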

The bug is in the fp32-accumulation lowering for reduce: passing options="--no-ttl-reduce-full-fp32" to @ttl.operation makes fp32 dims=[1] return correct values.

Environment

  • ttlang 1.0.0.dev4 (from the Python error traceback)
  • Blackhole hardware (bh-qbae-15 / sterling-all container)
  • layout=ttnn.TILE_LAYOUT, tested with both DRAM_MEMORY_CONFIG and L1_MEMORY_CONFIG

Minimal repro

import torch, ttnn, ttl
TILE = 32

@ttl.operation(grid=(1, 1))
def reduce_max_kernel(x, scaler, out):
    x_dfb   = ttl.make_dataflow_buffer_like(x,      shape=(1, 1), block_count=2)
    sc_dfb  = ttl.make_dataflow_buffer_like(scaler, shape=(1, 1), block_count=1)
    red_dfb = ttl.make_dataflow_buffer_like(x,      shape=(1, 1), block_count=2)
    out_dfb = ttl.make_dataflow_buffer_like(out,    shape=(1, 1), block_count=2)

    @ttl.compute()
    def compute():
        sc = sc_dfb.wait()
        xb = x_dfb.wait()
        # Reduce along dims=[1] (per-row max), then broadcast the result back across the row.
        red_dfb.reserve().store(ttl.math.reduce_max(xb, sc, dims=[1]))
        mb = out_dfb.reserve()
        mb.store(ttl.math.broadcast(red_dfb.wait(), mb, dims=[1]))

    @ttl.datamovement()
    def dm_read():
        ttl.copy(scaler[0, 0], sc_dfb.reserve()).wait()
        ttl.copy(x[0, 0],      x_dfb.reserve()).wait()

    @ttl.datamovement()
    def dm_write():
        ttl.copy(out_dfb.wait(), out[0, 0]).wait()

# Input: row r filled with value r -> row max should be r.
for dtype_t, dtype_nn in [(torch.bfloat16, ttnn.bfloat16), (torch.float32, ttnn.float32)]:
    x  = torch.zeros(TILE, TILE, dtype=dtype_t)
    for r in range(TILE): x[r, :] = float(r)
    sc  = torch.ones(TILE, TILE, dtype=dtype_t)
    out = torch.zeros_like(x)
    cfg = dict(dtype=dtype_nn, layout=ttnn.TILE_LAYOUT,
               memory_config=ttnn.DRAM_MEMORY_CONFIG)
    device = ttnn.open_device(device_id=0)
    x_tt  = ttnn.from_torch(x,  device=device, **cfg)
    sc_tt = ttnn.from_torch(sc, device=device, **cfg)
    out_tt = ttnn.from_torch(out, device=device, **cfg)
    reduce_max_kernel(x_tt, sc_tt, out_tt)
    print(dtype_t, ttnn.to_torch(out_tt).float()[:8, 0])
    ttnn.close_device(device)
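
A host-side check can be added inside the loop, just before ttnn.close_device (a sketch; torch.equal gives an exact comparison, which is valid here because every value is a small integer exactly representable in bf16):

    # Expected: row r's max (== r) broadcast across the whole row.
    expected = x.float().max(dim=1, keepdim=True).values.expand(TILE, TILE)
    got = ttnn.to_torch(out_tt).float()
    print(dtype_t, "ok" if torch.equal(got, expected) else "mismatch")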

Expected vs observed

dtype  out[:8, 0]                  result
bf16   [0, 1, 2, 3, 4, 5, 6, 7]    correct
fp32   [0, 0, 0, 0, 0, 0, 0, 0]    wrong (expected [0, 1, 2, 3, 4, 5, 6, 7])

The same pattern is exercised in test/python/simple_reduce_bcast.py and works fine, but that test uses bf16 + dims=[0, 1], so it does not hit this path.

Knobs tried

Config                                          fp32 dims=[1] result
default                                         zeros
fp32_dest_acc_en=True                           zeros
fp32_dest_acc_en=True, dst_full_sync_en=False   zeros
L1_MEMORY_CONFIG instead of DRAM                zeros
options="--no-ttl-maximize-dst"                 zeros
options="--no-ttl-reduce-full-fp32"             correct

Using the nested-with layout from simple_reduce_bcast.py instead of the flat layout above makes no difference; only --no-ttl-reduce-full-fp32 changes the outcome.

Scope

  • reduce_max and reduce_sum both affected.
  • Only dims=[1] on fp32 tiles. dims=[0] and dims=[0, 1] are correct.
  • bf16 tiles unaffected regardless of dims.

Workaround

Pass options="--no-ttl-reduce-full-fp32" to @ttl.operation until the fp32-accumulation reduce lowering is fixed for dims=[1].
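
Applied to the repro above (a sketch; this assumes options is accepted as a keyword argument of @ttl.operation alongside grid, as in the knobs table):

@ttl.operation(grid=(1, 1), options="--no-ttl-reduce-full-fp32")
def reduce_max_kernel(x, scaler, out):
    ...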
