Planner pushdowns: LHead, HAVING→WHERE, multi-key Top-N, row-group ordering (#19)
Merged
`SELECT * FROM big_table LIMIT 3` was decoding every row group of
every column before applying LIMIT. The materialization budget
guard (~70% of heap) blocked the query at small heaps with
"Query would materialize ~8 GB"; at 16 GB it ran for tens of
seconds spinning on full-column decode. Reported on
#stratum-clojure with a Parquet-backed table.
Root cause: the planner had no rewrite for `LLimit (LScan)` or
`LLimit (LProject ref-only LScan)`. `top-n-rewrite` only fires on
LSort+LLimit; bare LIMIT fell through to PProject which calls
`materialize-columns` over the full input.
The fix mirrors `LTopN`:
- New `LHead [limit select input]` IR record.
- New `head-rewrite` plan pass — recognizes `LLimit (LScan)` and
  `LLimit (LProject ref-only LScan)`, gates on no-WHERE / no-aggregate /
  no-window / no-OFFSET / `LIMIT <= *head-limit*` (default 100K).
  Runs before `strategy-selection` and `column-pruning`.
  (Sketched after this list.)
- New `execute-head-node` in the executor — reads the first N
rows of each referenced column via `cols/take-prefix-column`,
decodes dict-encoded strings inline, returns row maps. Mirrors
`execute-top-n-node` minus the heap.
- `cols/take-prefix-column` materializes only the requested
prefix: array → `Arrays/copyOfRange`, index →
`index/idx-materialize-to-array-prefix` (new — walks chunks in
order and stops once N elements are copied; a single chunk
suffices for typical small limits).
- `collect-all-refs` walks LHead like LTopN — explicit
`:select` items count as live, SELECT * (`:select nil`) keeps
every input scan column.
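A minimal sketch of the rewrite's shape, assuming map-based IR nodes and hypothetical helpers (`scan?`, `ref-only-project?`); the real pass works on the `LHead` record and the planner's own matchers:

```clojure
;; Hypothetical sketch, not the real stratum API. The structural
;; match on LLimit (LScan) / LLimit (LProject ref-only LScan)
;; subsumes the no-WHERE / no-aggregate / no-window gates.
(def ^:dynamic *head-limit* 100000)

(defn scan? [n] (= (:op n) :scan))

(defn ref-only-project? [n]
  (and (= (:op n) :project) (every? keyword? (:columns n))))

(defn head-rewrite [node]
  (if (and (= (:op node) :limit)
           (nil? (:offset node))             ; no-OFFSET gate
           (<= (:limit node) *head-limit*))  ; LIMIT <= *head-limit*
    (let [input (:input node)]
      (cond
        ;; LLimit (LScan): SELECT *, keep every scan column
        (scan? input)
        {:op :head :limit (:limit node) :select nil :input input}

        ;; LLimit (LProject ref-only LScan): peel the project,
        ;; its column refs become the LHead select list
        (and (ref-only-project? input) (scan? (:input input)))
        {:op :head :limit (:limit node)
         :select (:columns input) :input (:input input)}

        :else node))
    node))
```

Running the pass before `strategy-selection` means the bare-LIMIT shape never reaches PProject's full materialization.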
Verified end-to-end on a 5M-row × 8-column index dataset:
- `SELECT * FROM t LIMIT 3` warm: 6.7 ms (was OOM at 4 GB heap,
stalled for 30+ s at 16 GB heap).
- Plan shape: `LHead → PScan` (no PProject, no full materialize).
- Eligibility correctly skips +WHERE, +ORDER BY (routes to
LTopN), +OFFSET, +LIMIT > threshold.
- 924 tests / 4279 assertions green (10 new unit tests for the
rewrite + execution).
Three further local pushdown rewrites close concrete ORDER BY / HAVING
gaps the audit surfaced. Each mirrors a DuckDB optimizer pass to
keep the architecture consistent.
F12 — HAVING → WHERE pushdown (filter-through-aggregate)
When a HAVING predicate references only group columns (not
aggregate aliases), push it below LGroupBy as a pre-aggregate
WHERE filter — same pattern as DuckDB's
`FilterPushdown::PushdownAggregate`. Eliminates rows before they
hit the group-by hash table; matches the constraint-as-early-as-
possible principle of the existing predicate-pushdown-through-
join pass.
Implementation: new `having-to-where-pushdown` pass right after
`predicate-pushdown` in `optimize`. Walks the plan, classifies
each pred via `pred-columns` (set membership in the GroupBy's
`:group-keys`), pushes into a new (or merged) `LFilter` below
LGroupBy, keeps the rest on LHaving. Skips LGlobalAgg (no group
cols) and LWindow-wrapped LHaving (already covered by
`window-having-pushdown`).
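A hedged sketch of the predicate split, using an illustrative pred encoding (vectors of `[op arg*]`, aggregate calls as nested vectors); the real pass runs `pred-columns` over the actual IR:

```clojure
;; Illustrative encoding, not the real IR: preds are [op arg*],
;; aggregate calls like [:sum :amt] are nested vectors.
(defn pred-columns
  "Column refs of a pred; aggregate calls stay opaque vectors so
   they never match a group key."
  [pred]
  (set (keep #(cond (keyword? %) %
                    (vector? %)  %)
             (rest pred))))

(defn split-having-preds
  "Preds touching only group keys move below LGroupBy as WHERE;
   the rest stay on LHaving."
  [preds group-keys]
  (let [gks       (set group-keys)
        pushable? #(every? gks (pred-columns %))]
    {:where  (filterv pushable? preds)
     :having (vec (remove pushable? preds))}))

;; GROUP BY region HAVING region = 'EU' AND sum(amt) > 100
(split-having-preds [[:= :region "EU"] [:> [:sum :amt] 100]]
                    [:region])
;; => {:where [[:= :region "EU"]], :having [[:> [:sum :amt] 100]]}
```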
F13 — Multi-key Top-N
`LTopN` previously gated on single-column ORDER BY; multi-key
(`ORDER BY a ASC, b DESC LIMIT N`) fell through to materialize +
full sort, unnecessarily slow for the common composite-ordering
shape. Loosen the gate and rewrite the heap to handle any arity.
- `TopNEntry` field `^double key` → `^doubles keys` (one element
  per ORDER BY column, populated row-by-row from a reused
  scratch array; the permanent in-heap copy is allocated only on
  insert/eviction).
- `entry-cmp` takes an `^ints dirs` (`+1` for ASC, `-1` for
  DESC) and walks keys in declared order, returning the first
  non-zero per-key result; sketched after this list. Mixed
  asc/desc is supported.
- `find-top-n-on-array` / `find-top-n-on-index` consolidated
into multi-column `find-top-n-on-arrays` /
`find-top-n-on-indices`. Single-key callers pass a 1-element
vec — same code path, no perf regression.
- `top-n-eligible?` and the planner's `top-n-order-eligible?`
now accept N order specs (still numeric, still LIMIT bound).
- `LTopN`'s field renamed `order-spec` → `order-specs` (vec of
`[col dir]` pairs, mirroring DuckDB's
`vector<BoundOrderByNode>`).
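A compact sketch of the comparator under these assumptions (per-entry keys already extracted into a `^doubles` array, dirs as `+1/-1`); the names mirror the PR but the body is illustrative:

```clojure
;; Illustrative comparator: keys walked in declared ORDER BY
;; order, first non-zero per-key result wins.
(defn entry-cmp ^long [^doubles a ^doubles b ^ints dirs]
  (let [n (alength dirs)]
    (loop [i 0]
      (if (== i n)
        0
        (let [c (* (aget dirs i)
                   (Double/compare (aget a i) (aget b i)))]
          (if (zero? c) (recur (inc i)) c))))))

;; ORDER BY a ASC, b DESC: {1,50} sorts before {1,20}
(neg? (entry-cmp (double-array [1.0 50.0])
                 (double-array [1.0 20.0])
                 (int-array [1 -1])))
;; => true
```

Single-key callers pass 1-element arrays, so the loop degenerates to the old single-comparison path.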
F14 — Row-group ordering for ORDER BY+LIMIT on monotonic columns
When the primary ORDER BY column has chunk-level min/max stats
(always true for index-backed columns), iterate chunks in
primary-order — ASC: ascending min, DESC: descending max — and
short-circuit the loop the moment a chunk's primary bound can't
beat the heap's worst kept primary key. Mirrors DuckDB's
`RowGroupPruner` (`src/optimizer/row_group_pruner.cpp`,
`set_scan_order`) but applied to streaming top-N instead of a
separate scan-reorder pass.
Sorted timestamps / append-only logs hit the early-termination
path the moment the heap fills (typically after one chunk). Random
inputs see no overhead beyond an O(K log K) chunk-stats sort
(K = number of chunks, typically ≤ 1000).
- `chunk-iteration-order`: permutation of `[0..n-chunks)` sorted
  by primary stats (both helpers sketched after this list).
- `can-prune-rest?`: after the heap is full, `chunk-min >=
heap-worst` (ASC) or `chunk-max <= heap-worst` (DESC) means
every remaining chunk is worse — stop.
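A sketch of both helpers, assuming chunk stats arrive as a vector of `{:min .. :max ..}` maps (the real code reads index chunk metadata):

```clojure
;; Illustrative stats shape: [{:min 0 :max 9} {:min 10 :max 19} ...]
(defn chunk-iteration-order
  "Permutation of [0..n-chunks): ascending min for ASC,
   descending max for DESC."
  [chunk-stats asc?]
  (let [idxs (range (count chunk-stats))]
    (if asc?
      (sort-by #(:min (chunk-stats %)) idxs)
      (sort-by #(:max (chunk-stats %)) > idxs))))

(defn can-prune-rest?
  "With the heap full, a chunk whose primary bound can't beat the
   heap's worst kept key ends the scan; under the iteration order
   every later chunk is worse still."
  [stats asc? heap-full? heap-worst]
  (and heap-full?
       (if asc?
         (>= (:min stats) heap-worst)
         (<= (:max stats) heap-worst))))
```

On a sorted ASC column the first chunk fills the heap and `can-prune-rest?` fires on the next one, which is the single-chunk behavior measured below.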
Verified end-to-end:
- 5M-row sorted column ORDER BY ASC LIMIT 10: 1.04 ms (vs 4.18 ms
on random input — 4× speedup from chunk pruning).
- Multi-key ORDER BY produces identical ordering to a naive
sort+limit.
- Mixed ASC/DESC ordering produces the expected interleaved
output.
- HAVING(group-col) + HAVING(agg) splits correctly: the pushed
  plan matches WHERE(group-col) + HAVING(agg).
- 935 tests / 4292 assertions green (11 new tests across F12-F14).
Summary
Four local pushdown rewrites, each mirroring a corresponding DuckDB optimizer pass:
- `head-rewrite` (F11): bare `LIMIT N` → `LHead` (first-N rows from scan, no full materialize). DuckDB: `PhysicalStreamingLimit` + `LimitPushdown`.
- `having-to-where-pushdown` (F12): `HAVING g > x` (group col) → `WHERE g > x`. DuckDB: `FilterPushdown::PushdownAggregate`.
- Multi-key Top-N (F13). DuckDB: `PhysicalTopN` with `vector<BoundOrderByNode>`.
- Row-group ordering (F14). DuckDB: `RowGroupPruner::set_scan_order`.

All four close concrete user-visible gaps. F11 fixes Ryan's `SELECT * FROM big_table LIMIT 3` Parquet OOM (reported in #stratum-clojure). F12 lets composite HAVING filter rows pre-aggregate. F13 unblocks the most common `ORDER BY a, b LIMIT N` shape. F14 turns sorted timestamps / append-only logs into a single-chunk hit.

Verified
- `SELECT * LIMIT 3` warm = 6.7 ms (was OOM at 4 GB heap, 30+ s stall at 16 GB).
- `PDenseGroupBy → PScan` (no PHaving wrapper); pushed result = legacy result.
- Mixed ASC/DESC yields `{1,50},{1,20},{2,60},{2,30},{3,40},{3,10}` (expected interleaved order); 50K-row index multi-key matches naive sort+limit.

Test plan
- `test/stratum/limit_pushdown_test.clj`
- `clojure -M:ffix`

Architecture notes
- LHead and LTopN share the `[limit select input]` shape (LTopN adds `order-specs`); both peel an LProject through `peel-project`; both fetch only surviving rows; both run before `column-pruning`.
- Top-N keys live in a `^doubles` array (1 element per ORDER BY column). Casting int64 → double loses precision only for values > 2^53 (typical timestamps in millis are far below this). DuckDB packs keys into `string_t` (a radix-encoded blob) for proper byte comparison; if precision becomes an issue, stratum can move to the same encoding without further IR changes.
- F14 adds only an `O(K log K)` chunk-stats sort; no per-row overhead.

Follow-ups (separate PR)
The architectural pushdowns from the same audit share infrastructure (per-IR-node order-preservation metadata) and warrant a single follow-up PR: a generic pushdown-target walker (DuckDB's `GetPushdownFilterTargets`), filter+limit early-exit (push-pipeline + `FINISHED` propagation), late materialization for `SELECT * ORDER BY x LIMIT N` over wide tables, and a dynamic filter from the join build side to the probe scan. Tracked locally; will open as a sibling PR.