Planner completeness: flip *use-planner* default-on #18
Merged
Phase A of the planner-completeness work. The IR planner was missing
an entire frontend pre-processing layer the legacy `q` body has
always run inline. Without it, raw expression vectors and unresolved
string predicates reached the executor and exploded at
`eval-expr-polymorphic` ('Unsupported vectorized expr') or in
`prepare-aggregation` ('Cannot load from long array because
parameter1 is null') for queries that are routine on the legacy path.
Three changes:
1. **`stratum.query.prepare/prepare-query` shared helper.** Lifts the
legacy lowering passes into a single module both `q`-legacy and
`executor/run-query` can call:
a. normalize predicates and aggregates
b. pre-materialize string-producing predicate expressions
(`LOWER(name) = 'bob'` → `__pred_str_N`)
c. pre-materialize numeric predicate expressions
(`x + y > 10` → `__expr_N`)
d. materialize string predicates (LIKE / CONTAINS) into mask
columns
e. resolve dict-encoded equality predicates by mapping the
right-hand string/keyword to its dict-id
f. compile non-SIMD predicates (OR, IN, NOT-IN, :fn) into a
single mask column referenced as `[:__mask :eq 1]`
g. pre-materialize string-producing exprs in GROUP BY / aggs /
SELECT into dict-encoded temp columns
Returns `{:preds :aggs :group :select :columns :columns-meta}`.
`executor/run-query` now binds `expr/*columns-meta*` from the
returned `:columns-meta` so downstream expression eval (windows,
`eval-expr-polymorphic`, etc.) sees temp dict-encoded columns.
2. **`build-logical-plan` honors a `::pre-normalized?` flag.** The
legacy normalize-{pred,agg,select-item} fns aren't idempotent, so
the planner path tells `build-logical-plan` it's already
normalized. Plan-internal references to the (now redundant)
private `normalize-select-item` were retired in favor of
`stratum.query.execution/normalize-select-item`, which is more
complete (handles `:as`, literals, expressions, and keywords).
3. **`collect-all-refs` walks every column-bearing slot recursively.**
Previously project items only contributed `:ref` (so an item with
`:expr` was invisible to column-pruning), single-agg nodes only
contributed `:col` (so `:cols` for sum-product and `:expr`
for inline expressions disappeared), and group keys with
non-keyword shapes weren't recursed into. The new
`collect-expr-refs!` helper walks normalized expression maps
(`{:op ... :args ... :branches ...}`) and pred-style vectors
uniformly. Also `rewrite-expr-group-keys` now normalizes the
group-key expression before handing it to `PMaterializeExpr` so
`eval-expr-vectorized` sees the `{:op :date-trunc :args ...}`
form rather than the raw `[:date-trunc ...]` vector.
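A minimal sketch of the recursive walk described in change 3, assuming simplified shapes: a bare keyword is a column reference, an expression map carries `:args` and `:branches` (taken here to be `[test value]` pairs), and a pred-style vector is `[col op & args]`. The names `collect-refs` and `slot-refs` are hypothetical stand-ins, not the planner's `collect-expr-refs!` / `collect-all-refs`, and the sketch treats every keyword in argument position as a column (so a unit keyword such as a date-trunc unit would need extra handling).

```clojure
(defn collect-refs
  "Return the set of column keywords referenced by `form`.
  Assumed (illustrative) shapes:
    keyword                               -> column reference
    {:op .. :args [..]}                   -> normalized expression map
    {:op .. :branches [[test value] ..]}  -> branching expression
    [col op & args]                       -> pred-style vector (index 1 is the op)"
  [form]
  (cond
    (keyword? form) #{form}
    (map? form)     (into #{} (mapcat collect-refs)
                          (concat (:args form)
                                  (apply concat (:branches form))))
    ;; skip the operator at index 1 so it is not mistaken for a column
    (vector? form)  (into #{} (mapcat collect-refs)
                          (cons (first form) (drop 2 form)))
    :else           #{}))

(defn slot-refs
  "Collect refs from every column-bearing slot of a project item or agg node,
  not just :ref or :col."
  [item]
  (into #{} (mapcat collect-refs)
        (concat [(:ref item) (:col item) (:expr item)]
                (:cols item))))

;; An agg with :cols and a project item with :expr both contribute columns.
(slot-refs {:op :sum-product :cols [:price :qty]})
(slot-refs {:expr {:op :add :args [:x :y]} :as :total})
```

Walking all slots through one helper is what keeps column-pruning from dropping inputs that only appear inside an expression.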
Effect with `*use-planner*` bound to `true` on the existing test
suite:
query-test + sql-test + parquet-test:
113 failures (start) → 41 (29 fail + 12 error)
The remaining failures cluster on shapes that need substantive new
work, not lowering: window functions (~10), top-N pushdown
correctness (~9, top-N is currently a legacy-only optimization),
anomaly-score / `ANOMALY_*` (~4), CAST edge cases (~6), SQL `COUNT`
shape divergence (~3). These are addressed in subsequent commits.
Legacy regression check (planner OFF, default): 424 tests / 1444
assertions all pass.
Two correctness gaps surfaced by running the test suite with
*use-planner* on:
1. **`:as` alias dropped on COUNT paths.** `PFusedSIMDCount`,
`PChunkedSIMDCount`, and `PBlockSkipCount` had no field for the
normalized agg, and their executors hard-coded
`{:op :count :as nil}` when calling `format-fused-result`. So
`SELECT COUNT(*) AS cnt FROM …` returned `:count` instead of
`:cnt`, and any test that read the result by alias got nil.
Each of the three IR records now carries the agg, and the
constructors in `select-global-agg-strategy` thread the
normalized first-agg through. The executors prefer the carried
agg, falling back to `{:op :count :as nil}` for older callers.
2. **`*columns-meta*` re-bound to `{}` inside `execute-physical`.**
`run-query` binds `expr/*columns-meta*` from `prepare-query`'s
output so downstream expression eval (e.g. `LENGTH` on a dict-
encoded column) sees the dict info. `execute-physical` then
shadowed it with `{}`, losing the binding and making string
functions return `0.0` instead of computed values. Removed the
redundant binding; the var's root value (`{}`) still applies if
`execute-physical` is called directly without a prior `binding`.
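The shadowing in gap 2 is ordinary dynamic-var behaviour and can be reproduced in isolation. The sketch below uses a local `*columns-meta*` var and hypothetical function names (`run`, `execute-shadowing`, `execute-inheriting`); it is not the executor's code.

```clojure
(def ^:dynamic *columns-meta* {})

(defn dict-encoded?
  "True when the current binding of *columns-meta* knows about `col`."
  [col]
  (contains? *columns-meta* col))

(defn execute-shadowing
  "Re-binds the var to {} and so hides whatever the caller bound."
  [col]
  (binding [*columns-meta* {}]
    (dict-encoded? col)))

(defn execute-inheriting
  "No inner binding: sees the caller's binding, or the root value {}."
  [col]
  (dict-encoded? col))

(defn run
  "Stand-in for the outer entry point that binds the metadata once."
  [execute col]
  (binding [*columns-meta* {:name {:dict ["bob"]}}]
    (execute col)))

(run execute-shadowing :name)   ; inner {} wins, metadata is lost
(run execute-inheriting :name)  ; outer binding is visible
(execute-inheriting :name)      ; called directly: root value {} applies
```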
Effect on the test suite with planner ON:
41 failures (after Phase A) → 30 failures (21 fail + 9 error)
Fixed: length-function-test, e2e-{simple-count,in,between},
cast-string-to-{double,long}, cast-invalid-string,
sql window-having-pushdown alias path.
Legacy regression: 424 / 1444 still green.
window so partition keys survive
Two structural fixes for SQL window-function queries through the
planner:
1. **`execute-window` now handles a column context input.** Previously
it only returned a result when the input was already a vector of row
maps; for queries without group-by/aggregate (the common
`SELECT col, ROW_NUMBER() OVER (...) FROM t` shape) it fell through
and returned the input ctx unchanged, so window functions never
executed. The new path materializes columns, calls
`win/execute-window-functions`, and threads the augmented column map
back into the ctx for downstream PProject / PHaving / PSort to
consume.
2. **`build-logical-plan` defers `LProject` past `LWindow`** when a
window is present. SQL evaluation order is FROM → WHERE → GROUP BY →
HAVING → window → SELECT; projecting before window strips
`:partition-by` / `:order-by` columns that the window needs. The new
ordering applies LWindow first, then LProject; window-output columns
(`:as` of each spec) are auto-appended to the select list if the user
didn't list them explicitly, and a synthetic select is built when the
user wrote no SELECT at all. Mirrors the legacy `q.clj:807-826`
injection.
Effect on the test suite with planner ON:
30 failures (after Phase B count fixes) → 20 failures (11 fail + 9 error).
Window-function-execution-test, window-frame-test, and
ntile-percent-rank-cume-dist-test all pass.
Remaining: top-N pushdown (12, port to IR) and anomaly model (6,
hardcoded to legacy).
Legacy regression: 424 / 1444 still green.
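To illustrate why the window has to run before the projection, here is a toy row-map sketch. `row-number` and `project` are hypothetical helpers written for this example, not the planner's operators.

```clojure
(defn row-number
  "Number rows within each partition, ordered by `order-key`."
  [rows partition-key order-key out-key]
  (->> (group-by partition-key rows)
       vals
       (mapcat (fn [part]
                 (map-indexed (fn [i row] (assoc row out-key (inc i)))
                              (sort-by order-key part))))
       vec))

(defn project
  "Keep only the selected keys of each row."
  [rows ks]
  (mapv #(select-keys % ks) rows))

(def rows
  [{:g :a :ts 2 :v 10}
   {:g :a :ts 1 :v 11}
   {:g :b :ts 1 :v 12}])

;; Window first, then project: partitions :a and :b are numbered separately.
(def window-then-project
  (project (row-number rows :g :ts :rn) [:v :rn]))

;; Project first: :g and :ts are gone, so every row lands in one partition.
(def project-then-window
  (row-number (project rows [:v]) :g :ts :rn))
```

In the second ordering the partition and order keys are already stripped, so the numbering runs 1, 2, 3 across all rows instead of restarting per partition.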
in apply-distinct
Two small but blocking fixes:
1. **`prepare-query` defers predicate lowering when `:join` is set.**
The legacy `q` runs predicate lowering AFTER the join has merged
columns, so the WHERE predicate references both sides. The
planner ran prepare-query upfront against the left-side `:from`
columns only — for `WHERE right.cat = 1` that meant
`pred/compile-pred-mask` couldn't resolve `:cat` and emitted code
with an `aget` call that wouldn't compile ('More than one
matching method found'). prepare-query now skips numeric / string
/ dict / non-SIMD-mask passes when joins are present; the
executor's per-filter `prepare-preds` (executor.clj:56-101)
handles the lowering at LFilter execution time, when joined
columns are in scope. Predicates are still normalized so
`build-logical-plan` sees a consistent shape.
2. **`apply-distinct` canonicalizes -0.0 → +0.0** before hashing.
Java's `HashSet<Double>` uses bit-pattern equality, so SQL's
`SELECT DISTINCT v` returned `-0.0` and `+0.0` as separate rows
on the planner path (legacy path went through a streaming
primitive that already canonicalized). The planner now matches.
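The `-0.0` / `+0.0` split in fix 2 can be reproduced with a plain `java.util.HashSet`. The sketch below is illustrative only; `canonical-zero` and `distinct-count` are hypothetical names, not the code in `apply-distinct`.

```clojure
(import '(java.util HashSet))

(defn canonical-zero
  "Map any double zero (including -0.0) to +0.0; leave other values alone."
  [x]
  (if (and (instance? Double x) (zero? x)) 0.0 x))

(defn distinct-count
  "Count distinct values the way a HashSet<Double> would (Double.equals)."
  [xs]
  (let [seen (HashSet.)]
    (doseq [x xs] (.add seen x))
    (.size seen)))

;; Double.equals compares bit patterns, so the two zeros stay separate
;; unless they are canonicalized before being added.
(distinct-count [-0.0 0.0])
(distinct-count (map canonical-zero [-0.0 0.0]))
```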
Effect on the test suite with planner ON:
20 failures (after window) → 19 failures (11 fail + 8 error).
Fixes: join-with-filter-test, distinct-double-zero-canonicalization-test.
Remaining: top-N pushdown (12, port to IR) and anomaly model (6).
Legacy regression: 424 / 1444 still green.
Top-N (`ORDER BY col [DESC] LIMIT N`) was a legacy-only fast path:
the planner fell through to materialize-and-sort, giving up the
performance gain the bugfix branch had added. Port the optimization to
the IR so the planner matches the legacy path on these shapes.
Wiring:
- New `ir.LTopN` node carrying `[order-spec limit select input]`.
No separate physical record — the executor recognizes LTopN
directly and delegates to the existing
`stratum.query.top-n/execute-top-n` primitive (heap of size N
+ per-row column fetch from surviving chunks).
- New `plan.top-n-rewrite` optimization pass detects
`LLimit { input: LSort [single-spec] (LScan or LProject(LScan)) }`
with N ≤ `*top-n-limit*` (default 1024), no offset, numeric
non-string-dict order column. Runs BEFORE strategy-selection so
the LLimit/LSort haven't been converted to PLimit/PSort yet, and
BEFORE column-pruning so the LScan keeps every column the
surviving rows might project. When LSort sits over an LProject,
the project items are absorbed into LTopN's `:select` field
(top-N's executor handles row-level projection itself).
- `collect-all-refs` walks LTopN: order column + project items, or
every scan column for SELECT *. Without this, column-pruning
would drop everything except the order key.
- `executor/execute-top-n-node` translates the LTopN's normalized
shape back into the synthetic query map `top-n/execute-top-n`
expects.
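For readers unfamiliar with the "heap of size N" idea the primitive relies on, here is a minimal sketch over a plain vector. It is not `top-n/execute-top-n`: `top-n-desc` is a hypothetical helper that ignores chunks, projection, and ascending order.

```clojure
(import '(java.util PriorityQueue))

(defn top-n-desc
  "Return the n largest values of `values` in descending order, keeping a
  min-heap of at most n row indices instead of sorting everything."
  [values n]
  {:pre [(pos? n)]}
  (let [cmp  (fn [i j] (compare (nth values i) (nth values j)))
        heap (PriorityQueue. (int n) ^java.util.Comparator cmp)]
    (dotimes [i (count values)]
      (cond
        ;; heap not full yet: always admit the row
        (< (.size heap) n)
        (.offer heap i)

        ;; row beats the smallest survivor: replace it
        (pos? (compare (nth values i) (nth values (.peek heap))))
        (do (.poll heap) (.offer heap i))))
    ;; polling yields ascending order, so reverse for DESC
    (->> (repeatedly (.size heap) #(.poll heap))
         vec
         rseq
         (mapv #(nth values %)))))

(top-n-desc [3 9 1 8 7 2] 3)
```

Only the surviving indices need their remaining columns fetched, which is the point of doing this before any full materialization.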
Effect with `*use-planner*` ON:
19 failures (after distinct fix) → 6 failures
(1 fail + 5 error). All 12 top-n-{pushdown-correctness,
split-chunk-id} tests pass. Remaining: anomaly model (6 tests,
Phase C2).
Legacy regression: 424 / 1444 still green.
anomaly+join
`[:anomaly-score "model" …]` and friends aren't recognized by
`normalize-expr`, so they have to be resolved into synthetic column
references *before* any other lowering. The legacy `q` runs
`resolve-anomaly-columns` inline; the planner needs the same
behaviour.
Wiring:
- Move `resolve-anomaly-columns` and helpers (`anomaly-ops`,
`collect-anomaly-exprs`, `rewrite-anomaly-exprs`, `select-alias-map`)
from `stratum.query` into the shared `stratum.query.prepare` ns. They
use only `expr` / `norm` / `x` / `iforest`, so the relocation is
mechanical. (`stratum.query` could call them via require but already
requires `stratum.query.executor`, creating a cycle if executor
required query; moving to prepare avoids it.)
- `executor/prepare-and-build` runs `resolve-anomaly-columns` before
`prepare-query` when the query map carries `:_anomaly-models`.
Mirrors the legacy `q` body.
Limitation: anomaly + join cannot be resolved before plan time
because the iforest features may live on the join's right side. The
planner's pre-plan resolution would see only the left-side columns
and throw `Column :offset not found in data`. For this shape the `q`
dispatch falls back to the legacy path, which resolves anomaly
post-join. Documented as a follow-up.
Effect with `*use-planner*` ON:
6 failures (after top-N port) → 0 failures (424 / 1444 all pass).
Legacy regression: 424 / 1444 still green.
…explain shape
Final batch of correctness fixes to take the IR planner the rest of
the way to test parity with the legacy `q` body, then flip
`*use-planner*` to `true`.
- predicate-pushdown: respect outer-join semantics. LEFT preserves
left rows on right miss → can't push right-side preds (and symmetric
for RIGHT/FULL). Anything we can't push stays above the join. Fixes
LEFT JOIN + WHERE-on-right tests that were silently dropping rows.
- bitmap-semi-join eligibility: any reference to a build-side column
from the parent disqualifies the rewrite (including the join key),
since the rewrite discards the right side after building the presence
bitmap. Mirrors the `(not has-select?)` clause in the legacy
`query.join` gate.
- build-join-tree: strip table-qualified namespaces off `:on` pairs
(`:t1/a` → `:a`) so they line up with the unqualified column-map
keys. Self-join no longer trips on missing columns.
- estimate/sample-estimate: skip when args are non-numeric or the
column is dict-string. The double-coercing path was throwing on
string equality predicates we used to short-circuit.
- executor/run-query: thread `::plan/order-only-keys` and
`::plan/having-only-keys` through `optimize` (preserving the
top-level metadata) and dissoc them from result rows. Matches the
legacy `(if (seq _order-only-keys) (mapv #(apply dissoc % …)))`.
- executor/explain-query: include `:n-rows` and `:columns` so callers
that probe the legacy explain shape keep working.
- query/*use-planner*: default flips to `true`.
The full 868-test suite (including sqllogictest) is green; A/B vs the
legacy path is within noise on TPC-H Q1 (B2) and Q6 (B1) at 6M rows.
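The outer-join rule in the predicate-pushdown item can be written as a small decision table. This is a sketch under assumptions: each predicate is tagged with the side it references (`:left`, `:right`, or `:both`), and `pushable?` / `split-preds` are hypothetical names rather than the pass's real functions.

```clojure
(defn pushable?
  "Can a predicate that references only `pred-side` be pushed below a join
  of `join-type`? Preserved sides of an outer join must not be pre-filtered
  on the other side's columns."
  [join-type pred-side]
  (if (= pred-side :both)
    false
    (case join-type
      :inner true
      :left  (= pred-side :left)
      :right (= pred-side :right)
      :full  false)))

(defn split-preds
  "Partition predicates into those pushed below the join and those kept above."
  [join-type preds]
  {:push (filterv #(pushable? join-type (:side %)) preds)
   :keep (filterv #(not (pushable? join-type (:side %))) preds)})

(split-preds :left
             [{:side :left  :pred [:a :gt 1]}
              {:side :right :pred [:cat :eq 1]}])
```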
H2O-Q8 (Top-N per partition via ROW_NUMBER + HAVING) was 2.7× slower
under the planner because `LHaving (LProject (LWindow ...))` had
`PProject` materialize all 6M post-window rows before `PHaving` could
filter them down to the surviving 120K. Mirror the legacy
`q.clj:758-815` window-having pushdown:
- new `window-having-pushdown` pass rewrites
`LHaving preds (LProject items (LWindow specs in))` →
`LProject items (LHaving preds (LWindow specs in))` when the project
items are bare column refs and the having predicates only reference
columns visible after `LWindow` (window outputs + scan inputs).
Predicates are normalized in-place since `LHaving` keeps the user's
raw form.
- `execute-having` gets a column-context fast path: filter on raw
arrays, gather only surviving indices, return a column ctx for the
parent `PProject` to finish materializing.
H2O-Q8 NT 6M rows: 1908ms → 395ms (4.8× speedup, 1.7× faster than
legacy's 683ms). Plan after rewrite:
PProject -> PHaving -> PWindow -> PSIMDFilter -> PScan.
868 tests still green.
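The column-context fast path amounts to "compute surviving indices once, then gather every column by those indices". A minimal sketch with vectors standing in for the raw arrays; `having-on-columns` is a hypothetical name, not `execute-having`.

```clojure
(defn having-on-columns
  "Filter a column map by a predicate on one column, without building row maps.
  Returns a column map containing only the surviving rows."
  [columns col-key keep?]
  (let [idx (into []
                  (keep-indexed (fn [i v] (when (keep? v) i)))
                  (get columns col-key))]
    (into {}
          (map (fn [[k col]] [k (mapv #(nth col %) idx)]))
          columns)))

;; Keep rows whose window output :rn is at most 2.
(having-on-columns {:id [10 11 12 13] :rn [1 2 1 3]}
                   :rn
                   #(<= % 2))
```

Because the result is still columnar, a downstream projection only materializes the rows that survived the filter.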
`prepare.clj` passes 5a/5b materialize string-producing expressions in GROUP BY / aggregates / SELECT into temp columns sized to the PRE-join row count. When a query has both a `:join` and such an expression, the temp columns end up the wrong length once the join runs and the executor reads past the array end (or sees stale data). The planner doesn't have a post-join materialization pass yet, so fall back to the legacy `q` body for this combination, which materializes string exprs at the right point. Symmetric with the existing `:join + :_anomaly-models` fallback. Tracked as a follow-up to lift these passes into a post-join planner stage. 868 tests still green.
Implements the eight follow-ups surfaced by the post-flip planner
audit. Every change is paired with the legacy reference it mirrors
or the gap it closes; 868 tests / 4119 assertions remain green.
F1 — LSetOp executor dispatch
Add `LSetOp` case to `execute-node` (uses
`requiring-resolve 'stratum.query/q` to avoid the require cycle)
so `compile-physical` / `explain-query` callers don't crash on
UNION/INTERSECT/EXCEPT queries. The runtime path in `q` already
short-circuits set ops; this lets debug entry points share that
semantics.
F2 — Window-having pushdown brittleness
Make `normalize-pred` idempotent: detect already-normalized
`[col op & args]` form via `normalized-pred?` and return early.
Then move HAVING normalization into `build-logical-plan` (always
— `prepare-query`'s `pre-normalized?` flag only covers `:where`
/ `:agg`) and drop the redundant re-normalization in both the
`window-having-pushdown` rewrite and `having-fast-path-on-ctx`
in the executor. Verified the H2O-Q8 pushdown still fires after
the change (PProject → PHaving → PWindow plan shape preserved).
F3 — Float `:gte` / `:lte` selectivity overshoot
`estimate.clj` was computing `:gte t` selectivity as
`:gt (t-1)`, which is correct for ints but overshoots on
doubles (e.g. with chunk min `mn = 4.5` and `t = 5.0`, the chunk is
mistakenly treated as fully passing because `4.5 > 4.0`). Add direct
`zone-map-estimate-gte` and `zone-map-estimate-lte` with inclusive
boundary tests and route the dispatcher through them.
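The overshoot is easy to check by hand. The sketch below reduces the zone-map test to "does the whole chunk pass?" using only the chunk minimum; the function names are hypothetical and the real estimators return selectivities rather than booleans.

```clojure
(defn fully-passes-gt?
  "Every value in the chunk is > t when the chunk minimum is > t."
  [chunk-min t]
  (> chunk-min t))

(defn fully-passes-gte?
  "Direct inclusive test: every value is >= t when the chunk minimum is >= t."
  [chunk-min t]
  (>= chunk-min t))

(defn gte-via-gt?
  "The int-style reduction: treat (>= x t) as (> x (dec t))."
  [chunk-min t]
  (fully-passes-gt? chunk-min (dec t)))

;; Integers: both formulations agree.
(= (gte-via-gt? 5 5) (fully-passes-gte? 5 5))

;; Doubles: 4.5 > 4.0 holds, but 4.5 >= 5.0 does not.
(gte-via-gt? 4.5 5.0)
(fully-passes-gte? 4.5 5.0)
```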
F4 — `PMaterializeExpr` long-target detection
`rewrite-expr-group-keys` now emits `:int64` target (group keys
are discrete by definition); `execute-materialize-expr` honors
it by calling `eval-expr-to-long`, which returns `long[]` directly
for date-trunc / date-add / extract ops. Skips the long → double
→ long round-trip the planner inherited and unblocks the dense
group-by all-long fast path. CB-Q43 (date-trunc minute → group
by) lands at 198ms — within noise of legacy 186ms.
F5 — `PFusedExtractCount` emission + executor
Port the legacy `q.clj:680-908` fused EXTRACT(unit, col) +
COUNT fast path. New `try-fused-extract-count` in
`strategy-selection` recognises the post-`expr-materialization`
shape (LGroupBy over PMaterializeExpr {:op #{:minute :hour
:second :day-of-week}}) and emits `PFusedExtractCount`,
bypassing the materialization. Executor case dispatches to
`ColumnOpsExt/fusedExtractCountDenseParallel` and decodes per
the legacy block. Closes the CB-Q19 10× gap.
F6 — String predicate sampling
Replace static heuristics (0.05 / 0.10) for `:like`, `:ilike`,
`:contains`, `:starts-with`, `:ends-with` and their negations
with 256-entry dict (or raw `String[]`) sampling. New
`like->regex` compiles SQL LIKE to a `Pattern` (only `%` / `_`
are wildcards; everything else is quoted). Wired into
`estimate-selectivity` between numeric sampling and the
heuristic fallback.
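A rough sketch of the LIKE-to-regex translation and the sampled selectivity, assuming the sample is simply the first 256 dictionary entries (the source does not say how entries are chosen). `like->regex-sketch` and `sampled-selectivity` are hypothetical names and do not reproduce the real `like->regex` or `estimate-selectivity` signatures.

```clojure
(import '(java.util.regex Pattern))

(defn like->regex-sketch
  "Compile a SQL LIKE pattern to a regex: % matches any run of characters,
  _ matches exactly one, and every other character is quoted literally."
  [like]
  (re-pattern
   (apply str "(?s)"
          (map (fn [ch]
                 (case ch
                   \% ".*"
                   \_ "."
                   (Pattern/quote (str ch))))
               like))))

(defn sampled-selectivity
  "Fraction of sampled dictionary entries that match `like`, or `fallback`
  when there is nothing to sample."
  [like dict fallback]
  (let [sample (take 256 dict)
        re     (like->regex-sketch like)]
    (if (empty? sample)
      fallback
      (/ (double (count (filter #(re-matches re %) sample)))
         (count sample)))))

(sampled-selectivity "a%" ["apple" "apricot" "banana" "avocado"] 0.05)
```

Quoting everything except `%` and `_` matters for patterns such as `1.5%`, where the dot must not act as a regex wildcard.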
F7 — NDV from chunk stats
Add `estimate-ndv` to `estimate.clj`. Dict-encoded string →
`dict.length`. Indexed int64 → `min(length, max-min+1)` from
chunk stats. Otherwise the legacy `length/10` heuristic.
Provides a callable distinct-count primitive the rest of the
planner can lean on.
F8 — NDV-based join cardinality
`propagate-est-rows` for `PHashJoin` switches from the
degenerate `min(L,R) × selectivity` heuristic to the textbook
formula:
output = probe_rows × build_rows / max(probe_ndv, build_ndv)
on the first join key, falling back to the prior heuristic when
the join key column / scan isn't reachable (multi-key chains,
wrapped sides). Tightens DP join ordering and dense-vs-hash
group-by routing.
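The F7 and F8 formulas are small enough to work through numerically. The helpers below are hypothetical (`estimate-ndv-int`, `estimate-join-rows`) and take the statistics as plain arguments rather than reading chunk stats or plan nodes.

```clojure
(defn estimate-ndv-int
  "F7, indexed int64 case: distinct values are bounded by both the row count
  and the width of the [mn, mx] range."
  [length mn mx]
  (min length (inc (- mx mn))))

(defn estimate-join-rows
  "F8: probe_rows * build_rows / max(probe_ndv, build_ndv), guarding against
  a zero denominator."
  [probe-rows build-rows probe-ndv build-ndv]
  (let [d (max 1 probe-ndv build-ndv)]
    (long (/ (* (double probe-rows) build-rows) d))))

;; 6M probe rows joined to a 1000-row build side on a key with 1000 distinct
;; values: each probe row is expected to match about one build row.
(estimate-join-rows 6000000 1000 1000 800)
```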
Bench (T1+T2 olap, 6M rows, planner-on, all PASS):
H2O-Q8 NT (window TOP-N) 479ms (was 1908ms before F2)
CB-Q19 (extract+count) 4.4ms (planner ≈ legacy)
B1/B3/B5/B6 / H2O-Q1..Q10 within ±10% of legacy
Bitmap semi-join (Q1/Q3) 3-12× faster than legacy
Closes the last two divergence points between the planner and the
legacy `q` body — both queries with `:join` + `:_anomaly-models`
and queries with `:join` + string-producing exprs in `:group` /
`:select` now run end-to-end through the IR planner. The legacy
fallback in `query.clj` is gone; `*use-planner* true` is the only
runtime path.
F9 — Anomaly + join
- prepare.clj split: `anomaly-spec` does pure rewriting (collect
every `[:anomaly-* …]` expression, assign synthetic
`__<op>_<model>` columns, rewrite `:select`/`:where`/`:having`/
`:order`); `materialize-anomaly` runs the iforest scoring
against a column ctx. `resolve-anomaly-columns` is now a
one-liner that calls both for the no-join path.
- New `LAnomaly` IR record, placed after `LJoin` by
`build-logical-plan` when the frontend supplied a spec.
`executor.clj` adds an LAnomaly case that calls
`materialize-anomaly` against the post-join column ctx and
returns a column ctx with the synthetic columns added.
- `column-pruning`'s `collect-all-refs` walks each anomaly
expression's argument list (long form) or the model's
`:feature-names` (short form) so the pre-join scans keep the
columns the iforest needs.
F10 — String-producing exprs + join
- prepare.clj passes 5a / 5b grew deferred siblings:
`string-expr-spec-group-agg` and `string-expr-spec-select`
rewrite the slots and emit `[{:col-name :__str_expr_N
:expr <normalized>}]` items without calling
`eval-string-expr`. `prepare-query` returns the items as
`:string-items`; the no-join path keeps the legacy eager
materialization untouched.
- New `LStringMaterialize` IR record. Placed after `LAnomaly`
(and `LJoin`) by `build-logical-plan` when items are present.
`executor.clj` runs `expr/eval-string-expr` per item against
the post-join column ctx.
- `column-pruning`'s `collect-all-refs` walks each item's
expression so referenced columns survive on the scans.
The brittle `try/catch normalize-expr` gate I added earlier in
`query.clj` is gone too — the planner handles every shape now,
so there's no fallback to choose between.
Verified end-to-end:
- sql-anomaly-join-scoring-test: passes (was the canonical
join + ANOMALY_SCORE test; legacy result matches)
- GROUP BY UPPER(cat) over a join: planner returns the same
rows as legacy (synthetic `__str_expr_1` column key)
- 868 tests / 4119 assertions green
Summary
Brings the IR query planner (`stratum.query.executor/plan/prepare`) to test parity with the legacy `cond` dispatcher in `stratum.query/q`, flips `*use-planner*` to `true` by default, and lands the ten audit follow-ups (F1–F10), including post-join handling of anomaly scoring and string-producing expressions, the last two paths that used to fall back to the legacy body. With this PR `*use-planner* true` is the only runtime path; there are no escape hatches left in `q`. 868 tests / 4119 assertions green.
Commits
- `prepare.clj` shared frontend, planner rewrites (predicate pushdown, top-N, window-having pushdown, etc.), executor wiring, default flip.
- `:gte` / `:lte` selectivity, `:int64` materialize-expr target, `PFusedExtractCount` emission, string predicate sampling, NDV from chunk stats, NDV-based join cardinality.
- `LAnomaly` + `LStringMaterialize` IR records: anomaly scoring and string-producing expressions in `:group` / `:select` / etc. now run end-to-end in the planner over a join. Drops every legacy-fallback condition in `query.clj`.
Audit follow-ups (resolved in this PR)
- F1 (`LSetOp` execute-node dispatch): `requiring-resolve` to re-enter `q` for sub-queries; debug entry points (`compile-physical`, `explain-query`) no longer crash on UNION/INTERSECT/EXCEPT.
- F2 (`window-having-pushdown` brittleness): `normalize-pred` is now idempotent (`normalized-pred?` short-circuits already-normalized form); HAVING is normalized at `build-logical-plan` time (always, since `prepare-query`'s `pre-normalized?` flag only covers `:where` / `:agg`); redundant re-normalizations dropped.
- F3 (`:gte` / `:lte` selectivity overshoot): `zone-map-estimate-gte` / `-lte` with inclusive-bound tests; the dispatcher uses these instead of the int-style `(dec t)` / `(inc t)` reduction.
- F4 (`PMaterializeExpr` long-target): `:int64` target; executor calls `eval-expr-to-long`, skipping the long → double → long round-trip.
- F5 (`PFusedExtractCount` emission): `try-fused-extract-count` recognises `EXTRACT(unit, col) GROUP BY → COUNT` and emits `PFusedExtractCount` (executor dispatches to `ColumnOpsExt/fusedExtractCountDenseParallel`); closes the CB-Q19 10× gap.
- F6: `String[]` sampling for `:like`, `:ilike`, `:contains`, `:starts-with`, `:ends-with` (and negations); SQL LIKE → regex via `like->regex`.
- F7: `estimate-ndv`: dict-encoded → `dict.length`; indexed int64 → `min(length, max-min+1)`; fallback `length/10`.
- F8: `propagate-est-rows` for `PHashJoin` uses the textbook `probe_rows × build_rows / max(probe_ndv, build_ndv)` on the first join key; falls back to the legacy heuristic for multi-key / wrapped sides.
- F9: `resolve-anomaly-columns` split into `anomaly-spec` (pure rewrite) + `materialize-anomaly` (runtime scoring). New `LAnomaly` IR node placed after `LJoin`; executor case scores against the post-join column ctx. `column-pruning` walks anomaly expression args + model `:feature-names` so the pre-join scans keep the right columns.
- F10: `string-expr-spec-group-agg` / `-select` rewrite group/aggs/select to `__str_expr_N` synthetic refs without calling `eval-string-expr`. New `LStringMaterialize` IR node placed after `LAnomaly`; executor runs `eval-string-expr` per item against the post-join column ctx. `column-pruning` walks the items' expressions.
The brittle `try/catch normalize-expr` gate in `query.clj` is gone; the planner handles every shape now.
Performance
T1+T2 olap bench, 6M rows, 8-core Lunar Lake (planner-on, all 26 PASS / 0 FAIL).
vs. the legacy path (planner-off, same JVM, same data): all queries within ±10% on standard shapes; planner faster on bitmap semi-joins (3-13×), window TOP-N (1.35×), and now also on the EXTRACT-COUNT and the post-join anomaly + string-expr paths.
Coverage
The planner now handles every shape `q` exposes:
- Window functions (`row_number`, `rank`, `lag`, `lead`, `running-sum`, etc.) with HAVING pushdown
- Top-N (`LIMIT ≤ 1024` over `ORDER BY` + scan/project)
- Anomaly scoring (`ANOMALY_SCORE`, `ANOMALY_PREDICT`, `ANOMALY_PROBA`, `ANOMALY_CONFIDENCE`), with or without join
- String-producing expressions (`UPPER`, `LOWER`, `CONCAT`, `TRIM`, `SUBSTR`, `REPLACE`) in GROUP BY / SELECT / aggs, with or without join
There are no remaining legacy fallbacks. `q` always routes through `executor/run-query`.
Test plan
- `clojure -M:test`: 868 tests, 4119 assertions, 0 failures
- `PFusedExtractCount` produces 60 rows
- `sql-anomaly-join-scoring-test` (canonical join + ANOMALY_SCORE)
- `clojure -M:ffix` (cljfmt)