Summary
ttl.math.reduce_max and ttl.math.reduce_sum with dims=[1] silently return all zeros when the input tile is fp32. The same kernel on bfloat16 tiles returns correct per-row values. dims=[0] and dims=[0, 1] work correctly in fp32; only dims=[1] is affected.
The bug appears to be in the fp32-accumulation lowering for reduce: passing options="--no-ttl-reduce-full-fp32" to @ttl.operation makes fp32 dims=[1] return correct values.
Environment
- ttlang 1.0.0.dev4 (from the Python error traceback)
- Blackhole hardware (bh-qbae-15 / sterling-all container)
- layout=ttnn.TILE_LAYOUT, tested with both DRAM_MEMORY_CONFIG and L1_MEMORY_CONFIG
Minimal repro
```python
import torch, ttnn, ttl

TILE = 32

@ttl.operation(grid=(1, 1))
def reduce_max_kernel(x, scaler, out):
    x_dfb = ttl.make_dataflow_buffer_like(x, shape=(1, 1), block_count=2)
    sc_dfb = ttl.make_dataflow_buffer_like(scaler, shape=(1, 1), block_count=1)
    red_dfb = ttl.make_dataflow_buffer_like(x, shape=(1, 1), block_count=2)
    out_dfb = ttl.make_dataflow_buffer_like(out, shape=(1, 1), block_count=2)

    @ttl.compute()
    def compute():
        sc = sc_dfb.wait()
        xb = x_dfb.wait()
        red_dfb.reserve().store(ttl.math.reduce_max(xb, sc, dims=[1]))
        mb = out_dfb.reserve()
        mb.store(ttl.math.broadcast(red_dfb.wait(), mb, dims=[1]))

    @ttl.datamovement()
    def dm_read():
        ttl.copy(scaler[0, 0], sc_dfb.reserve()).wait()
        ttl.copy(x[0, 0], x_dfb.reserve()).wait()

    @ttl.datamovement()
    def dm_write():
        ttl.copy(out_dfb.wait(), out[0, 0]).wait()

# Input: row r filled with value r -> row max should be r.
for dtype_t, dtype_nn in [(torch.bfloat16, ttnn.bfloat16), (torch.float32, ttnn.float32)]:
    x = torch.zeros(TILE, TILE, dtype=dtype_t)
    for r in range(TILE):
        x[r, :] = float(r)
    sc = torch.ones(TILE, TILE, dtype=dtype_t)
    out = torch.zeros_like(x)
    cfg = dict(dtype=dtype_nn, layout=ttnn.TILE_LAYOUT,
               memory_config=ttnn.DRAM_MEMORY_CONFIG)
    device = ttnn.open_device(device_id=0)
    x_tt = ttnn.from_torch(x, device=device, **cfg)
    sc_tt = ttnn.from_torch(sc, device=device, **cfg)
    out_tt = ttnn.from_torch(out, device=device, **cfg)
    reduce_max_kernel(x_tt, sc_tt, out_tt)
    print(dtype_t, ttnn.to_torch(out_tt).float()[:8, 0])
    ttnn.close_device(device)
```
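For comparison, the expected result can be computed on the host with plain torch. This is a sketch of my reading of the intended semantics (reduce with dims=[1] takes the per-row max, which the kernel then broadcasts back across each row); it is a host-side reference, not part of the ttl API:

```python
import torch

TILE = 32

# Same input as the repro: row r filled with value r.
x = torch.zeros(TILE, TILE, dtype=torch.float32)
for r in range(TILE):
    x[r, :] = float(r)

# Per-row max (reduce over dim 1), broadcast back across each row.
expected = x.max(dim=1, keepdim=True).values.expand(TILE, TILE)

print(expected[:8, 0])  # rows 0..7 -> 0, 1, 2, ..., 7
```

The bf16 device run matches this reference; the fp32 run returns all zeros instead.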
Expected vs observed
| dtype | out[:, 0] first 8 |
| --- | --- |
| bf16 | [0, 1, 2, 3, 4, 5, 6, 7] (correct) |
| fp32 | [0, 0, 0, 0, 0, 0, 0, 0] (wrong) |
The same pattern exercised in test/python/simple_reduce_bcast.py works fine but uses bf16 + dims=[0, 1], so does not exercise this path.
Knobs tried
| Config | fp32 dims=[1] result |
| --- | --- |
| default | zeros |
| fp32_dest_acc_en=True | zeros |
| fp32_dest_acc_en=True, dst_full_sync_en=False | zeros |
| L1_MEMORY_CONFIG instead of DRAM | zeros |
| options="--no-ttl-maximize-dst" | zeros |
| options="--no-ttl-reduce-full-fp32" | correct |
Nested-with layout (matching simple_reduce_bcast.py) vs flat layout makes no difference; only --no-ttl-reduce-full-fp32 changes the outcome.
Scope
- reduce_max and reduce_sum are both affected.
- Only dims=[1] on fp32 tiles. dims=[0] and dims=[0, 1] are correct.
- bf16 tiles are unaffected regardless of dims.
Workaround
Pass options="--no-ttl-reduce-full-fp32" to @ttl.operation until the fp32-accumulation reduce lowering is fixed for dims=[1].
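Concretely, the only change to the repro above is the decorator line (a sketch; the kernel body is unchanged, and I am assuming options is forwarded to the compiler as in the knobs table):

```python
# Workaround: disable the fp32-accumulation lowering for this op's reduces.
@ttl.operation(grid=(1, 1), options="--no-ttl-reduce-full-fp32")
def reduce_max_kernel(x, scaler, out):
    ...  # body identical to the minimal repro
```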