
Planner pushdowns: LHead, HAVING→WHERE, multi-key Top-N, row-group ordering #19

Merged
whilo merged 2 commits into main from feature/limit-pushdown on May 8, 2026

Conversation

@whilo whilo commented May 8, 2026

Summary

Four local pushdown rewrites, each mirroring a corresponding DuckDB optimizer pass:

| # | Pass | DuckDB analog |
|---|------|---------------|
| F11 | head-rewrite: bare LIMIT N → LHead (first-N rows from scan, no full materialize) | PhysicalStreamingLimit + LimitPushdown |
| F12 | having-to-where-pushdown: HAVING g > x (group col) → WHERE g > x | FilterPushdown::PushdownAggregate |
| F13 | Multi-key Top-N (was single-column only) | PhysicalTopN with vector<BoundOrderByNode> |
| F14 | Row-group ordering: visit chunks in primary-key min/max order, short-circuit when next chunk can't dethrone the heap | RowGroupPruner::set_scan_order |

All four close concrete user-visible gaps. F11 fixes Ryan's `SELECT * FROM big_table LIMIT 3` Parquet OOM (reported in #stratum-clojure). F12 lets the group-column part of a composite HAVING filter rows before aggregation. F13 unblocks the most common `ORDER BY a, b LIMIT N` shape. F14 turns sorted timestamps / append-only logs into a single-chunk hit.

Verified

  • F11: 5M × 8-column index, SELECT * LIMIT 3 warm = 6.7 ms (was OOM at 4 GB heap, 30+ s stall at 16 GB).
  • F12: pushed plan PDenseGroupBy → PScan (no PHaving wrapper); pushed result = legacy result.
  • F13: 6-row mixed ASC/DESC produces {1,50},{1,20},{2,60},{2,30},{3,40},{3,10} (first key ascending, second key descending, as expected); 50K-row index multi-key matches naive sort+limit.
  • F14: 5M sorted ASC LIMIT 10 = 1.04 ms vs 4.18 ms on random — 4× speedup from chunk pruning.

Test plan

  • 935 tests / 4292 assertions green
  • 14 new unit tests across test/stratum/limit_pushdown_test.clj
  • REPL end-to-end on 5M-row datasets (sorted, random, multi-key, mixed-direction)
  • clojure -M:ffix

Architecture notes

  • LHead and LTopN are deliberately parallel: same [limit select input] shape (LTopN adds order-specs); both peel an LProject through peel-project; both fetch only surviving rows; both run before column-pruning.
  • Multi-key sort-key uses a ^doubles array (1 element per ORDER BY column). Cast int64→double loses precision only for values > 2^53 (typical timestamps in millis are far below this; see the quick check after this list). DuckDB packs into string_t (radix-encoded blob) for proper byte-comparison; if precision becomes an issue, stratum can move to the same encoding without further IR changes.
  • F14's chunk pruning is consistent regardless of input: sorted data terminates after the first chunk; random data pays only O(K log K) chunk-stats sort overhead, no per-row overhead.
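
A quick REPL check of the 2^53 bound mentioned above (illustrative only, not project code):

```clojure
;; Integers up to 2^53 are exactly representable as doubles; 2^53 + 1 is the first that is not.
(let [p53 (bit-shift-left 1 53)]                          ; 9007199254740992
  [(== (double p53) (double (inc p53)))                   ; => true  — 2^53 and 2^53+1 collide as doubles
   (== (double (dec p53)) (double p53))                   ; => false — below 2^53 every long stays exact
   (< (double (System/currentTimeMillis)) (double p53))]) ; => true  — epoch millis are nowhere near 2^53
```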

Follow-ups (separate PR)

The architectural pushdowns from the same audit — generic pushdown-target walker (DuckDB's GetPushdownFilterTargets), filter+limit early-exit (push-pipeline + FINISHED propagation), late materialization for SELECT * ORDER BY x LIMIT N over wide tables, dynamic filter from join build-side to probe scan — share infrastructure (per-IR-node order-preservation metadata) and warrant a single follow-up PR. Tracked locally; will open as a sibling PR.

whilo added 2 commits May 8, 2026 09:22
`SELECT * FROM big_table LIMIT 3` was decoding every row group of
every column before applying LIMIT. The materialization budget
guard (~70% of heap) blocked the query at small heaps with
"Query would materialize ~8 GB"; at 16 GB it ran for tens of
seconds spinning on full-column decode. Reported on
#stratum-clojure with a Parquet-backed table.

Root cause: the planner had no rewrite for `LLimit (LScan)` or
`LLimit (LProject ref-only LScan)`. `top-n-rewrite` only fires on
LSort+LLimit; bare LIMIT fell through to PProject which calls
`materialize-columns` over the full input.
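
For illustration, the two eligible shapes can be expressed as a predicate
roughly like the following — a sketch assuming a map-based logical IR with
`:op`/`:input` keys; the actual stratum IR records and field names may differ:

```clojure
(defn bare-limit-over-scan?
  "True for `LLimit (LScan)` or `LLimit (LProject ref-only LScan)` —
   the two shapes head-rewrite turns into LHead."
  [node]
  (and (= :limit (:op node))
       (nil? (:offset node))                       ; no OFFSET
       (let [child (:input node)]
         (or (= :scan (:op child))
             (and (= :project (:op child))
                  (every? keyword? (:exprs child))  ; ref-only projection (plain column refs)
                  (= :scan (:op (:input child))))))))
```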

Mirroring `LTopN`:

  - New `LHead [limit select input]` IR record.
  - New `head-rewrite` plan pass — recognizes `LLimit (LScan)` and
    `LLimit (LProject ref-only LScan)`, gates on no-WHERE / no-
    aggregate / no-window / no-OFFSET / `LIMIT <= *head-limit*`
    (default 100K). Runs before `strategy-selection` and
    `column-pruning`.
  - New `execute-head-node` in the executor — reads the first N
    rows of each referenced column via `cols/take-prefix-column`,
    decodes dict-encoded strings inline, returns row maps. Mirrors
    `execute-top-n-node` minus the heap.
  - `cols/take-prefix-column` materializes only the requested
    prefix: array → `Arrays/copyOfRange`, index →
    `index/idx-materialize-to-array-prefix` (new — walks chunks in
    order and stops once N elements are copied; a single chunk
    suffices for typical small limits; sketched after this list).
  - `collect-all-refs` walks LHead like LTopN — explicit
    `:select` items count as live, SELECT * (`:select nil`) keeps
    every input scan column.
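
A rough sketch of the prefix materialization referenced in the
`cols/take-prefix-column` item above (the real signature and chunk
representation are assumptions here):

```clojure
(import '[java.util Arrays])

(defn take-prefix-longs
  "Array-backed column: copy only the first n values."
  ^longs [^longs col n]
  (Arrays/copyOfRange col 0 (int (min n (alength col)))))

(defn take-prefix-from-chunks
  "Index-backed column: walk chunks in order and stop as soon as n values
   have been collected — for small LIMITs a single chunk usually suffices."
  [chunks n]
  (into [] (comp cat (take n)) chunks))
```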

Verified end-to-end on a 5M × 8 column index dataset:
  - `SELECT * FROM t LIMIT 3` warm: 6.7 ms (was OOM at 4 GB heap,
    stalled for 30+ s at 16 GB heap).
  - Plan shape: `LHead → PScan` (no PProject, no full materialize).
  - Eligibility correctly skips +WHERE, +ORDER BY (routes to
    LTopN), +OFFSET, +LIMIT > threshold.
  - 924 tests / 4279 assertions green (10 new unit tests for the
    rewrite + execution).
Three local pushdown rewrites that close concrete ORDER BY / HAVING
gaps the audit surfaced. Each mirrors a DuckDB optimizer pass to
keep the architecture consistent.

F12 — HAVING → WHERE pushdown (filter-through-aggregate)

  When a HAVING predicate references only group columns (not
  aggregate aliases), push it below LGroupBy as a pre-aggregate
  WHERE filter — same pattern as DuckDB's
  `FilterPushdown::PushdownAggregate`. Eliminates rows before they
  hit the group-by hash table; matches the constraint-as-early-as-
  possible principle of the existing predicate-pushdown-through-
  join pass.

  Implementation: new `having-to-where-pushdown` pass right after
  `predicate-pushdown` in `optimize`. Walks the plan, classifies
  each pred via `pred-columns` (set membership in the GroupBy's
  `:group-keys`), pushes into a new (or merged) `LFilter` below
  LGroupBy, keeps the rest on LHaving. Skips LGlobalAgg (no group
  cols) and LWindow-wrapped LHaving (already covered by
  `window-having-pushdown`).
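
A minimal sketch of the classification step (the predicate representation
and `pred-columns` are treated abstractly here; the real pass operates on
the stratum IR records):

```clojure
(defn split-having-preds
  "Partition HAVING predicates: those touching only group columns are
   pushable below LGroupBy as a WHERE filter; the rest stay on LHaving."
  [preds group-keys pred-columns]
  (let [group-col? (set group-keys)
        pushable?  (fn [pred] (every? group-col? (pred-columns pred)))]
    {:push-to-where  (filterv pushable? preds)
     :keep-on-having (filterv (complement pushable?) preds)}))

;; With group-keys [:g], a predicate over #{:g} is pushed below the
;; group-by; a predicate over an aggregate alias like #{:total} stays
;; on LHaving.
```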

F13 — Multi-key Top-N

  `LTopN` previously gated on single-column ORDER BY; multi-key
  (`ORDER BY a ASC, b DESC LIMIT N`) fell through to materialize +
  full sort, unnecessarily slow for the common composite-ordering
  shape. Loosen the gate and rewrite the heap to handle any arity.

  - `TopNEntry` field `^double key` → `^doubles keys` (one entry
    per ORDER BY column, populated row-by-row from a reused
    scratch array; we allocate the permanent in-heap copy only on
    insert/eviction).
  - `entry-cmp` takes an `^ints dirs` (`+1` for ASC, `-1` for
    DESC) and walks keys in declared order, returning the first
    non-zero per-key result. Mixed asc/desc supported (sketched after this list).
  - `find-top-n-on-array` / `find-top-n-on-index` consolidated
    into multi-column `find-top-n-on-arrays` /
    `find-top-n-on-indices`. Single-key callers pass a 1-element
    vec — same code path, no perf regression.
  - `top-n-eligible?` and the planner's `top-n-order-eligible?`
    now accept N order specs (still numeric, still LIMIT bound).
  - `LTopN`'s field renamed `order-spec` → `order-specs` (vec of
    `[col dir]` pairs, mirroring DuckDB's
    `vector<BoundOrderByNode>`).
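
A sketch of the per-key comparison (illustrative — the real `entry-cmp`
compares TopNEntry records; raw key arrays stand in here):

```clojure
(defn compare-keys
  "Walk keys in declared ORDER BY order; dirs holds +1 for ASC and -1 for
   DESC. Returns the first non-zero per-key comparison, 0 if all keys tie."
  ^long [^doubles a ^doubles b ^ints dirs]
  (loop [i 0]
    (if (< i (alength dirs))
      (let [c (* (aget dirs i) (Double/compare (aget a i) (aget b i)))]
        (if (zero? c) (recur (inc i)) c))
      0)))

;; (compare-keys (double-array [1 50]) (double-array [1 20]) (int-array [1 -1]))
;; => -1 — a ties under ASC, b = 50 sorts before 20 under DESC, matching the
;;   {1,50}, {1,20}, ... ordering in the verified multi-key output.
```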

F14 — Row-group ordering for ORDER BY+LIMIT on monotonic columns

  When the primary ORDER BY column has chunk-level min/max stats
  (always true for index-backed columns), iterate chunks in
  primary-order — ASC: ascending min, DESC: descending max — and
  short-circuit the loop the moment a chunk's primary bound can't
  beat the heap's worst kept primary key. Mirrors DuckDB's
  `RowGroupPruner` (`src/optimizer/row_group_pruner.cpp`,
  `set_scan_order`) but applied to streaming top-N instead of a
  separate scan-reorder pass.

  Sorted timestamps / append-only logs hit early termination the
  moment the heap fills (typically after one chunk). Random
  inputs see no overhead beyond an O(K log K) chunk-stats sort
  (K = number of chunks, typically ≤ 1000).

  - `chunk-iteration-order`: permutation of `[0..n-chunks)` sorted
    by primary stats.
  - `can-prune-rest?`: after the heap is full, `chunk-min >=
    heap-worst` (ASC) or `chunk-max <= heap-worst` (DESC) means
    every remaining chunk is worse — stop.
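
A sketch of the two helpers (the chunk-stats shape — maps with `:idx`,
`:min`, `:max` for the primary ORDER BY column — is an assumption; the
real pass reads the index's chunk-level min/max stats):

```clojure
(defn chunk-iteration-order
  "Permutation of chunk indices ordered by the primary column's stats:
   ascending min for ASC, descending max for DESC."
  [chunk-stats asc?]
  (->> chunk-stats
       (sort-by (if asc? :min (comp - :max)))
       (mapv :idx)))

(defn can-prune-rest?
  "Once the heap is full, a chunk whose best possible primary value cannot
   beat the heap's worst kept key means every later chunk in this order is
   worse too — stop scanning."
  [chunk-stat asc? heap-worst]
  (if asc?
    (>= (:min chunk-stat) heap-worst)
    (<= (:max chunk-stat) heap-worst)))
```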

Verified end-to-end:
  - 5M-row sorted column ORDER BY ASC LIMIT 10: 1.04 ms (vs 4.18 ms
    on random input — 4× speedup from chunk pruning).
  - Multi-key ORDER BY produces identical ordering to a naive
    sort+limit.
  - Mixed ASC/DESC ordering produces the expected interleaved
    output.
  - HAVING(group-col) + HAVING(agg) splits correctly: pushed pred
    matches WHERE(group-col) + HAVING(agg).
  - 935 tests / 4292 assertions green (11 new tests across F12-F14).
@whilo whilo merged commit bc97f36 into main May 8, 2026
5 of 6 checks passed
@whilo whilo deleted the feature/limit-pushdown branch May 8, 2026 17:04