Planner pushdowns: LHead, HAVING→WHERE, multi-key Top-N, row-group ordering (#19)
Merged
`SELECT * FROM big_table LIMIT 3` was decoding every row group of
every column before applying LIMIT. The materialization budget
guard (~70% of heap) blocked the query at small heaps with
"Query would materialize ~8 GB"; at 16 GB it ran for tens of
seconds spinning on full-column decode. Reported on
#stratum-clojure with a Parquet-backed table.
Root cause: the planner had no rewrite for `LLimit (LScan)` or
`LLimit (LProject ref-only LScan)`. `top-n-rewrite` only fires on
LSort+LLimit; bare LIMIT fell through to PProject which calls
`materialize-columns` over the full input.
The fix mirrors `LTopN`:
- New `LHead [limit select input]` IR record.
- New `head-rewrite` plan pass — recognizes `LLimit (LScan)` and
  `LLimit (LProject ref-only LScan)`, gates on no-WHERE / no-aggregate /
  no-window / no-OFFSET / `LIMIT <= *head-limit*` (default 100K).
  Runs before `strategy-selection` and `column-pruning`.
  (Sketched after this list.)
- New `execute-head-node` in the executor — reads the first N
rows of each referenced column via `cols/take-prefix-column`,
decodes dict-encoded strings inline, returns row maps. Mirrors
`execute-top-n-node` minus the heap.
- `cols/take-prefix-column` materializes only the requested
prefix: array → `Arrays/copyOfRange`, index →
`index/idx-materialize-to-array-prefix` (new — walks chunks in
order and stops once N elements are copied; a single chunk
suffices for typical small limits).
- `collect-all-refs` walks LHead like LTopN — explicit
`:select` items count as live, SELECT * (`:select nil`) keeps
every input scan column.
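A minimal sketch of the rewrite's shape, assuming map-based IR nodes and hypothetical helpers (`scan?`, `ref-only-project?`); the real pass works on the `LHead` record and the planner's own matchers:

```clojure
;; Hypothetical sketch, not the real stratum API. The structural
;; match on LLimit (LScan) / LLimit (LProject ref-only LScan)
;; subsumes the no-WHERE / no-aggregate / no-window gates.
(def ^:dynamic *head-limit* 100000)

(defn scan? [n] (= (:op n) :scan))

(defn ref-only-project? [n]
  (and (= (:op n) :project) (every? keyword? (:columns n))))

(defn head-rewrite [node]
  (if (and (= (:op node) :limit)
           (nil? (:offset node))             ; no-OFFSET gate
           (<= (:limit node) *head-limit*))  ; LIMIT <= *head-limit*
    (let [input (:input node)]
      (cond
        ;; LLimit (LScan): SELECT *, keep every scan column
        (scan? input)
        {:op :head :limit (:limit node) :select nil :input input}

        ;; LLimit (LProject ref-only LScan): peel the project,
        ;; its column refs become the LHead select list
        (and (ref-only-project? input) (scan? (:input input)))
        {:op :head :limit (:limit node)
         :select (:columns input) :input (:input input)}

        :else node))
    node))
```

Running the pass before `strategy-selection` means the bare-LIMIT shape never reaches PProject's full materialization.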
Verified end-to-end on a 5M-row × 8-column index dataset:
- `SELECT * FROM t LIMIT 3` warm: 6.7 ms (was OOM at 4 GB heap,
stalled for 30+ s at 16 GB heap).
- Plan shape: `LHead → PScan` (no PProject, no full materialize).
- Eligibility correctly skips +WHERE, +ORDER BY (routes to
LTopN), +OFFSET, +LIMIT > threshold.
- 924 tests / 4279 assertions green (10 new unit tests for the
rewrite + execution).
Three further local pushdown rewrites close concrete ORDER BY / HAVING
gaps the audit surfaced. Each mirrors a DuckDB optimizer pass to
keep the architecture consistent.
F12 — HAVING → WHERE pushdown (filter-through-aggregate)
When a HAVING predicate references only group columns (not
aggregate aliases), push it below LGroupBy as a pre-aggregate
WHERE filter — same pattern as DuckDB's
`FilterPushdown::PushdownAggregate`. Eliminates rows before they
hit the group-by hash table; matches the constraint-as-early-as-
possible principle of the existing predicate-pushdown-through-
join pass.
Implementation: new `having-to-where-pushdown` pass right after
`predicate-pushdown` in `optimize`. Walks the plan, classifies
each pred via `pred-columns` (set membership in the GroupBy's
`:group-keys`), pushes into a new (or merged) `LFilter` below
LGroupBy, keeps the rest on LHaving. Skips LGlobalAgg (no group
cols) and LWindow-wrapped LHaving (already covered by
`window-having-pushdown`).
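A hedged sketch of the predicate split, using an illustrative pred encoding (vectors of `[op arg*]`, aggregate calls as nested vectors); the real pass runs `pred-columns` over the actual IR:

```clojure
;; Illustrative encoding, not the real IR: preds are [op arg*],
;; aggregate calls like [:sum :amt] are nested vectors.
(defn pred-columns
  "Column refs of a pred; aggregate calls stay opaque vectors so
   they never match a group key."
  [pred]
  (set (keep #(cond (keyword? %) %
                    (vector? %)  %)
             (rest pred))))

(defn split-having-preds
  "Preds touching only group keys move below LGroupBy as WHERE;
   the rest stay on LHaving."
  [preds group-keys]
  (let [gks       (set group-keys)
        pushable? #(every? gks (pred-columns %))]
    {:where  (filterv pushable? preds)
     :having (vec (remove pushable? preds))}))

;; GROUP BY region HAVING region = 'EU' AND sum(amt) > 100
(split-having-preds [[:= :region "EU"] [:> [:sum :amt] 100]]
                    [:region])
;; => {:where [[:= :region "EU"]], :having [[:> [:sum :amt] 100]]}
```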
F13 — Multi-key Top-N
`LTopN` previously gated on single-column ORDER BY; multi-key
(`ORDER BY a ASC, b DESC LIMIT N`) fell through to materialize +
full sort, unnecessarily slow for the common composite-ordering
shape. Loosen the gate and rewrite the heap to handle any arity.
- `TopNEntry` field `^double key` → `^doubles keys` (one element
  per ORDER BY column, populated row-by-row from a reused
  scratch array; the permanent in-heap copy is allocated only on
  insert/eviction).
- `entry-cmp` takes an `^ints dirs` (`+1` for ASC, `-1` for
  DESC) and walks keys in declared order, returning the first
  non-zero per-key result; sketched after this list. Mixed
  asc/desc is supported.
- `find-top-n-on-array` / `find-top-n-on-index` consolidated
into multi-column `find-top-n-on-arrays` /
`find-top-n-on-indices`. Single-key callers pass a 1-element
vec — same code path, no perf regression.
- `top-n-eligible?` and the planner's `top-n-order-eligible?`
now accept N order specs (still numeric, still LIMIT bound).
- `LTopN`'s field renamed `order-spec` → `order-specs` (vec of
`[col dir]` pairs, mirroring DuckDB's
`vector<BoundOrderByNode>`).
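A compact sketch of the comparator under these assumptions (per-entry keys already extracted into a `^doubles` array, dirs as `+1/-1`); the names mirror the PR but the body is illustrative:

```clojure
;; Illustrative comparator: keys walked in declared ORDER BY
;; order, first non-zero per-key result wins.
(defn entry-cmp ^long [^doubles a ^doubles b ^ints dirs]
  (let [n (alength dirs)]
    (loop [i 0]
      (if (== i n)
        0
        (let [c (* (aget dirs i)
                   (Double/compare (aget a i) (aget b i)))]
          (if (zero? c) (recur (inc i)) c))))))

;; ORDER BY a ASC, b DESC: {1,50} sorts before {1,20}
(neg? (entry-cmp (double-array [1.0 50.0])
                 (double-array [1.0 20.0])
                 (int-array [1 -1])))
;; => true
```

Single-key callers pass 1-element arrays, so the loop degenerates to the old single-comparison path.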
F14 — Row-group ordering for ORDER BY+LIMIT on monotonic columns
When the primary ORDER BY column has chunk-level min/max stats
(always true for index-backed columns), iterate chunks in
primary-order — ASC: ascending min, DESC: descending max — and
short-circuit the loop the moment a chunk's primary bound can't
beat the heap's worst kept primary key. Mirrors DuckDB's
`RowGroupPruner` (`src/optimizer/row_group_pruner.cpp`,
`set_scan_order`) but applied to streaming top-N instead of a
separate scan-reorder pass.
Sorted timestamps / append-only logs hit the early-termination
path the moment the heap fills (typically after one chunk). Random
inputs see no overhead beyond an O(K log K) chunk-stats sort
(K = number of chunks, typically ≤ 1000).
- `chunk-iteration-order`: permutation of `[0..n-chunks)` sorted
  by primary stats (both helpers sketched after this list).
- `can-prune-rest?`: after the heap is full, `chunk-min >=
heap-worst` (ASC) or `chunk-max <= heap-worst` (DESC) means
every remaining chunk is worse — stop.
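A sketch of both helpers, assuming chunk stats arrive as a vector of `{:min .. :max ..}` maps (the real code reads index chunk metadata):

```clojure
;; Illustrative stats shape: [{:min 0 :max 9} {:min 10 :max 19} ...]
(defn chunk-iteration-order
  "Permutation of [0..n-chunks): ascending min for ASC,
   descending max for DESC."
  [chunk-stats asc?]
  (let [idxs (range (count chunk-stats))]
    (if asc?
      (sort-by #(:min (chunk-stats %)) idxs)
      (sort-by #(:max (chunk-stats %)) > idxs))))

(defn can-prune-rest?
  "With the heap full, a chunk whose primary bound can't beat the
   heap's worst kept key ends the scan; under the iteration order
   every later chunk is worse still."
  [stats asc? heap-full? heap-worst]
  (and heap-full?
       (if asc?
         (>= (:min stats) heap-worst)
         (<= (:max stats) heap-worst))))
```

On a sorted ASC column the first chunk fills the heap and `can-prune-rest?` fires on the next one, which is the single-chunk behavior measured below.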
Verified end-to-end:
- 5M-row sorted column ORDER BY ASC LIMIT 10: 1.04 ms (vs 4.18 ms
on random input — 4× speedup from chunk pruning).
- Multi-key ORDER BY produces identical ordering to a naive
sort+limit.
- Mixed ASC/DESC ordering produces the expected interleaved
output.
- HAVING(group-col) + HAVING(agg) splits correctly: the pushed
  plan matches WHERE(group-col) + HAVING(agg).
- 935 tests / 4292 assertions green (11 new tests across F12-F14).
Summary
Four local pushdown rewrites, each mirroring a corresponding DuckDB optimizer pass:
- `head-rewrite` (F11): bare `LIMIT N` → `LHead` (first-N rows from scan, no full materialize). DuckDB: `PhysicalStreamingLimit` + `LimitPushdown`.
- `having-to-where-pushdown` (F12): `HAVING g > x` (group col) → `WHERE g > x`. DuckDB: `FilterPushdown::PushdownAggregate`.
- Multi-key Top-N (F13). DuckDB: `PhysicalTopN` with `vector<BoundOrderByNode>`.
- Row-group ordering (F14). DuckDB: `RowGroupPruner::set_scan_order`.

All four close concrete user-visible gaps. F11 fixes Ryan's `SELECT * FROM big_table LIMIT 3` Parquet OOM (reported in #stratum-clojure). F12 lets composite HAVING filter rows pre-aggregate. F13 unblocks the most common `ORDER BY a, b LIMIT N` shape. F14 turns sorted timestamps / append-only logs into a single-chunk hit.

Verified
- `SELECT * LIMIT 3` warm = 6.7 ms (was OOM at 4 GB heap, 30+ s stall at 16 GB).
- `PDenseGroupBy → PScan` (no PHaving wrapper); pushed result = legacy result.
- Mixed ASC/DESC yields `{1,50},{1,20},{2,60},{2,30},{3,40},{3,10}` (expected interleaved order); 50K-row index multi-key matches naive sort+limit.

Test plan
- `test/stratum/limit_pushdown_test.clj`
- `clojure -M:ffix`

Architecture notes
- LHead and LTopN share the `[limit select input]` shape (LTopN adds `order-specs`); both peel an LProject through `peel-project`; both fetch only surviving rows; both run before `column-pruning`.
- Top-N keys live in a `^doubles` array (1 element per ORDER BY column). Casting int64 → double loses precision only for values > 2^53 (typical timestamps in millis are far below this). DuckDB packs keys into `string_t` (a radix-encoded blob) for proper byte comparison; if precision becomes an issue, stratum can move to the same encoding without further IR changes.
- F14 adds only an `O(K log K)` chunk-stats sort; no per-row overhead.

Follow-ups (separate PR)
The architectural pushdowns from the same audit share infrastructure (per-IR-node order-preservation metadata) and warrant a single follow-up PR: a generic pushdown-target walker (DuckDB's `GetPushdownFilterTargets`), filter+limit early-exit (push-pipeline + `FINISHED` propagation), late materialization for `SELECT * ORDER BY x LIMIT N` over wide tables, and a dynamic filter from the join build side to the probe scan. Tracked locally; will open as a sibling PR.