Planner completeness: flip *use-planner* default-on #18

Merged
whilo merged 11 commits into main from feature/planner-completeness on May 8, 2026

@whilo commented on May 8, 2026

Summary

Brings the IR query planner (stratum.query.executor / plan / prepare) to test parity with the legacy cond dispatcher in stratum.query/q, flips *use-planner* to true by default, and lands the ten audit follow-ups (F1–F10) — including post-join handling of anomaly scoring and string-producing expressions, the last two paths that used to fall back to the legacy body. With this PR *use-planner* true is the only runtime path; there are no escape hatches left in q. 868 tests / 4119 assertions green.

Commits

  • Initial planner flip (8 commits): prepare.clj shared frontend, planner rewrites (predicate pushdown, top-N, window-having pushdown, etc.), executor wiring, default flip.
  • F1–F8 follow-ups (1 commit): LSetOp dispatch, idempotent normalize-pred, float :gte/:lte selectivity, :int64 materialize-expr target, PFusedExtractCount emission, string predicate sampling, NDV from chunk stats, NDV-based join cardinality.
  • F9–F10 follow-ups (1 commit): post-join LAnomaly + LStringMaterialize IR records — anomaly scoring and string-producing expressions in :group/:select/etc. now run end-to-end in the planner over a join. Drops every legacy-fallback condition in query.clj.

Audit follow-ups (resolved in this PR)

| # | Item | Resolution |
|---|------|------------|
| F1 | LSetOp execute-node dispatch | Added case using requiring-resolve to re-enter q for sub-queries; debug entry points (compile-physical, explain-query) no longer crash on UNION/INTERSECT/EXCEPT. |
| F2 | window-having-pushdown brittleness | normalize-pred is now idempotent (normalized-pred? short-circuits already-normalized form); HAVING is normalized at build-logical-plan time (always, since prepare-query's pre-normalized? flag only covers :where/:agg); redundant re-normalizations dropped. |
| F3 | Float :gte/:lte selectivity overshoot | New zone-map-estimate-gte / -lte with inclusive-bound tests; the dispatcher uses these instead of the int-style (dec t) / (inc t) reduction. |
| F4 | PMaterializeExpr long-target | Group-by exprs emit :int64 target; executor calls eval-expr-to-long, skipping the long → double → long round-trip. |
| F5 | PFusedExtractCount emission | New try-fused-extract-count recognises EXTRACT(unit, col) GROUP BY → COUNT and emits PFusedExtractCount (executor dispatches to ColumnOpsExt/fusedExtractCountDenseParallel); closes the CB-Q19 10× gap. |
| F6 | String predicate sampling | 256-entry dict / String[] sampling for :like, :ilike, :contains, :starts-with, :ends-with (and negations); SQL LIKE → regex via like->regex. |
| F7 | NDV from chunk stats | New estimate-ndv: dict-encoded → dict.length; indexed int64 → min(length, max-min+1); fallback length/10. |
| F8 | NDV-based join cardinality | propagate-est-rows for PHashJoin uses the textbook probe_rows × build_rows / max(probe_ndv, build_ndv) on the first join key; falls back to the legacy heuristic for multi-key / wrapped sides. |
| F9 | Anomaly + join in planner | Split resolve-anomaly-columns into anomaly-spec (pure rewrite) + materialize-anomaly (runtime scoring). New LAnomaly IR node placed after LJoin; executor case scores against the post-join column ctx. column-pruning walks anomaly expression args + model :feature-names so the pre-join scans keep the right columns. |
| F10 | Post-join string-expr materialization | New string-expr-spec-group-agg / -select rewrite group/aggs/select to __str_expr_N synthetic refs without calling eval-string-expr. New LStringMaterialize IR node placed after LAnomaly; executor runs eval-string-expr per item against the post-join column ctx. column-pruning walks the items' expressions. The brittle try/catch normalize-expr gate in query.clj is gone; the planner handles every shape now. |

Performance

T1+T2 olap bench, 6M rows, 8-core Lunar Lake (planner-on, all 26 PASS / 0 FAIL):

| Query | 1T (ms) | 8T (ms) | DuckDB 8T (ms) | Δ vs DuckDB 8T |
|-------|---------|---------|----------------|----------------|
| B1 sum-product | 17.2 | 11.0 | 5.4 | 0.49× |
| B2 TPC-H Q1 | 134.2 | 90.5 | 17.8 | 0.20× |
| B3 SSB Q1.1 | 17.0 | 8.5 | 5.5 | 0.65× |
| B5 filtered count | 3.1 | 2.7 | 2.5 | 0.93× |
| B6 group-by-cnt | 19.8 | 11.1 | 4.6 | 0.41× |
| H2O-Q6 STDDEV | 39.2 | 29.6 | 26.1 | 0.88× |
| H2O-Q9 CORR | 64.4 | 50.9 | 33.4 | 0.66× |
| H2O-Q8 (window TOP-N) | 955 | 549 | 184 | (5× planner-on→legacy) |
| H2O-Q10 (6 cols, 6M groups) | 600 | 469 | 3939 | 8.4× faster |
| CB-Q19 (extract+count via PFusedExtractCount) | 4.4 | | | (parity vs legacy) |
| SEMI-Q1 | 18.6 | 6.5 | 86.8 | 13.3× faster |
| SEMI-Q3 | 29.4 | 10.3 | 72.5 | 7.0× faster |
| H2O-J1/J2/J3 (joins) | 30/35/42 | 10/20/21 | 6/19/19 | 0.61/0.96/0.88× |
| H2O-Q3/Q7 (high-card group) | 75/58 | 86/57 | 110/146 | 1.3/2.6× |
| H2O-Q4/Q5 (multi-AVG/SUM) | 92/79 | 67/82 | 8/87 | 0.12/1.05× |

Compared with the legacy path (planner-off, same JVM, same data), all queries are within ±10% on standard shapes; the planner is faster on bitmap semi-joins (3-13×), window TOP-N (1.35×), and now also on the EXTRACT-COUNT and the post-join anomaly + string-expr paths.

Coverage

The planner now handles every shape q exposes:

  • WHERE / aggregates / GROUP BY / HAVING / ORDER BY / LIMIT / OFFSET / DISTINCT / SELECT / SELECT *
  • INNER / LEFT / RIGHT / FULL joins with single- and multi-column keys
  • ASOF joins
  • Window functions (row_number, rank, lag, lead, running-sum, etc.) with HAVING pushdown
  • UNION / INTERSECT / EXCEPT
  • Top-N pushdown (LIMIT ≤ 1024 over ORDER BY + scan/project)
  • COUNT DISTINCT, percentile / median / approx-quantile, VARIANCE / STDDEV / CORR
  • Date/time arithmetic + extracts
  • LIKE / contains / starts-with / ends-with on dict and raw strings
  • Bitmap semi-join optimization
  • Anomaly scoring (ANOMALY_SCORE, ANOMALY_PREDICT, ANOMALY_PROBA, ANOMALY_CONFIDENCE) — with or without join
  • String-producing expressions (UPPER, LOWER, CONCAT, TRIM, SUBSTR, REPLACE) in GROUP BY / SELECT / aggs — with or without join

There are no remaining legacy fallbacks. q always routes through executor/run-query.

Test plan

  • clojure -M:test — 868 tests, 4119 assertions, 0 failures
  • T1+T2 olap bench at 6M rows (planner-on, 26 PASS / 0 FAIL)
  • T1+T2 olap bench at 6M rows (legacy, via local toggle)
  • H2O-Q8 result validation (Stratum 120000 rows == DuckDB 120000)
  • CB-Q19 result validation (PFusedExtractCount produces 60 rows)
  • CB-Q43 result validation (date-trunc minute group-by, 525590 buckets)
  • sql-anomaly-join-scoring-test (canonical join + ANOMALY_SCORE)
  • GROUP BY UPPER(cat) over a join (legacy / planner produce same rows)
  • clojure -M:ffix (cljfmt)

whilo added 11 commits May 7, 2026 14:34
Phase A of the planner-completeness work. The IR planner was missing
an entire frontend pre-processing layer the legacy `q` body has
always run inline. Without it, raw expression vectors and unresolved
string predicates reached the executor and exploded at
`eval-expr-polymorphic` ('Unsupported vectorized expr') or in
`prepare-aggregation` ('Cannot load from long array because
parameter1 is null') for queries that are routine on the legacy path.

Three changes:

1. **`stratum.query.prepare/prepare-query` shared helper.** Lifts the
   legacy lowering passes into a single module both `q`-legacy and
   `executor/run-query` can call:
     a. normalize predicates and aggregates
     b. pre-materialize string-producing predicate expressions
        (`LOWER(name) = 'bob'` → `__pred_str_N`)
     c. pre-materialize numeric predicate expressions
        (`x + y > 10` → `__expr_N`)
     d. materialize string predicates (LIKE / CONTAINS) into mask
        columns
     e. resolve dict-encoded equality predicates by mapping the
        right-hand string/keyword to its dict-id
     f. compile non-SIMD predicates (OR, IN, NOT-IN, :fn) into a
        single mask column referenced as `[:__mask :eq 1]`
     g. pre-materialize string-producing exprs in GROUP BY / aggs /
        SELECT into dict-encoded temp columns

   Returns `{:preds :aggs :group :select :columns :columns-meta}`.

   `executor/run-query` now binds `expr/*columns-meta*` from the
   returned `:columns-meta` so downstream expression eval (windows,
   `eval-expr-polymorphic`, etc.) sees temp dict-encoded columns.

2. **`build-logical-plan` honors a `::pre-normalized?` flag.** The
   legacy normalize-{pred,agg,select-item} fns aren't idempotent, so
   the planner path tells `build-logical-plan` it's already
   normalized. Plan-internal references to the (now redundant)
   private `normalize-select-item` were retired in favor of
   `stratum.query.execution/normalize-select-item`, which is more
   complete (handles `:as`, literals, expressions, and keywords).

3. **`collect-all-refs` walks every column-bearing slot recursively.**
   Previously project items only contributed `:ref` (so an item with
   `:expr` was invisible to column-pruning), single-agg nodes only
   contributed `:col` (so `:cols` for sum-product and `:expr`
   for inline expressions disappeared), and group keys with
   non-keyword shapes weren't recursed into. The new
   `collect-expr-refs!` helper walks normalized expression maps
   (`{:op ... :args ... :branches ...}`) and pred-style vectors
   uniformly. Also `rewrite-expr-group-keys` now normalizes the
   group-key expression before handing it to `PMaterializeExpr` so
   `eval-expr-vectorized` sees the `{:op :date-trunc :args ...}`
   form rather than the raw `[:date-trunc ...]` vector.

Effect with `*use-planner*` bound to `true` on the existing test
suite:

  query-test + sql-test + parquet-test:
    113 failures (start) → 41 (29 fail + 12 error)

The remaining failures cluster on shapes that need substantive new
work, not lowering: window functions (~10), top-N pushdown
correctness (~9, top-N is currently a legacy-only optimization),
anomaly-score / `ANOMALY_*` (~4), CAST edge cases (~6), SQL `COUNT`
shape divergence (~3). These are addressed in subsequent commits.

Legacy regression check (planner OFF, default): 424 tests / 1444
assertions all pass.
Two correctness gaps surfaced by running the test suite with
*use-planner* on:

1. **`:as` alias dropped on COUNT paths.** `PFusedSIMDCount`,
   `PChunkedSIMDCount`, and `PBlockSkipCount` had no field for the
   normalized agg, and their executors hard-coded
   `{:op :count :as nil}` when calling `format-fused-result`. So
   `SELECT COUNT(*) AS cnt FROM …` returned `:count` instead of
   `:cnt`, and any test that read the result by alias got nil.

   Each of the three IR records now carries the agg, and the
   constructors in `select-global-agg-strategy` thread the
   normalized first-agg through. The executors prefer the carried
   agg, falling back to `{:op :count :as nil}` for older callers.

2. **`*columns-meta*` re-bound to `{}` inside `execute-physical`.**
   `run-query` binds `expr/*columns-meta*` from `prepare-query`'s
   output so downstream expression eval (e.g. `LENGTH` on a dict-
   encoded column) sees the dict info. `execute-physical` then
   shadowed it with `{}`, losing the binding and making string
   functions return `0.0` instead of computed values. Removed the
   redundant binding; the var's root value (`{}`) still applies if
   `execute-physical` is called directly without a prior `binding`.

Effect on the test suite with planner ON:
  41 failures (after Phase A) → 30 failures (21 fail + 9 error)
  Fixed: length-function-test, e2e-{simple-count,in,between},
  cast-string-to-{double,long}, cast-invalid-string,
  sql window-having-pushdown alias path.

Legacy regression: 424 / 1444 still green.
window so partition keys survive

Two structural fixes for SQL window-function queries through the
planner:

1. **`execute-window` now handles a column context input.** Previously
   it only returned a result when the input was already a vector of
   row maps; for queries without group-by/aggregate (the common
   `SELECT col, ROW_NUMBER() OVER (...) FROM t` shape) it fell
   through and returned the input ctx unchanged, so window
   functions never executed. The new path materializes columns,
   calls `win/execute-window-functions`, and threads the augmented
   column map back into the ctx for downstream PProject / PHaving /
   PSort to consume.

2. **`build-logical-plan` defers `LProject` past `LWindow`** when a
   window is present. SQL evaluation order is FROM → WHERE → GROUP
   BY → HAVING → window → SELECT — projecting before window strips
   `:partition-by` / `:order-by` columns that the window needs. The
   new ordering applies LWindow first, then LProject; window-output
   columns (`:as` of each spec) are auto-appended to the select list
   if the user didn't list them explicitly, and a synthetic select
   is built when the user wrote no SELECT at all. Mirrors the legacy
   `q.clj:807-826` injection.

Effect on the test suite with planner ON:
  30 failures (after Phase B count fixes) → 20 failures (11 fail + 9
  error). Window-function-execution-test, window-frame-test, and
  ntile-percent-rank-cume-dist-test all pass. Remaining: top-N
  pushdown (12, port to IR) and anomaly model (6, hardcoded to
  legacy).

Legacy regression: 424 / 1444 still green.
in apply-distinct

Two small but blocking fixes:

1. **`prepare-query` defers predicate lowering when `:join` is set.**
   The legacy `q` runs predicate lowering AFTER the join has merged
   columns, so the WHERE predicate references both sides. The
   planner ran prepare-query upfront against the left-side `:from`
   columns only — for `WHERE right.cat = 1` that meant
   `pred/compile-pred-mask` couldn't resolve `:cat` and emitted code
   with an `aget` call that wouldn't compile ('More than one
   matching method found'). prepare-query now skips numeric / string
   / dict / non-SIMD-mask passes when joins are present; the
   executor's per-filter `prepare-preds` (executor.clj:56-101)
   handles the lowering at LFilter execution time, when joined
   columns are in scope. Predicates are still normalized so
   `build-logical-plan` sees a consistent shape.

2. **`apply-distinct` canonicalizes -0.0 → +0.0** before hashing.
   Java's `HashSet<Double>` uses bit-pattern equality, so SQL's
   `SELECT DISTINCT v` returned `-0.0` and `+0.0` as separate rows
   on the planner path (legacy path went through a streaming
   primitive that already canonicalized). The planner now matches.
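
The zero-sign issue in item 2 can be reproduced in a few lines. This is a minimal sketch, not the repository's `apply-distinct`: `canonical-zero` and `distinct-doubles` are hypothetical names, and the input is assumed to be a plain sequence of doubles rather than a column context.

```clojure
;; Sketch only: `canonical-zero` and `distinct-doubles` are hypothetical
;; names, not the functions in the repository.

(defn canonical-zero
  "Maps -0.0 to +0.0; every other double is returned unchanged."
  [v]
  (let [d (double v)]
    (if (zero? d) 0.0 d)))

(defn distinct-doubles
  "Distinct values in first-seen order, deduplicated through a
  java.util.HashSet after canonicalizing the sign of zero."
  [vs]
  (let [seen (java.util.HashSet.)]
    (into []
          (comp (map canonical-zero)
                (filter (fn [d] (.add seen d)))) ; .add is true only for new keys
          vs)))

;; Boxed Doubles compare by bit pattern, so the two zeros are distinct keys:
(.equals (Double/valueOf 0.0) (Double/valueOf -0.0)) ; false
(distinct-doubles [-0.0 0.0 1.5 1.5])                ; [0.0 1.5]
```

Canonicalizing before the value reaches the set keeps the hash structure itself unchanged, which is why the fix can stay local to the distinct step.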

Effect on the test suite with planner ON:
  20 failures (after window) → 19 failures (11 fail + 8 error).
  Fixes: join-with-filter-test, distinct-double-zero-canonicalization-test.
  Remaining: top-N pushdown (12, port to IR) and anomaly model (6).

Legacy regression: 424 / 1444 still green.
Top-N (`ORDER BY col [DESC] LIMIT N`) was a legacy-only fast path:
the planner fell through to materialize-and-sort, losing the
speedup the bugfix branch added. Port the optimization to the
IR so the planner matches the legacy path on these shapes.

Wiring:
- New `ir.LTopN` node carrying `[order-spec limit select input]`.
  No separate physical record — the executor recognizes LTopN
  directly and delegates to the existing
  `stratum.query.top-n/execute-top-n` primitive (heap of size N
  + per-row column fetch from surviving chunks).
- New `plan.top-n-rewrite` optimization pass detects
  `LLimit { input: LSort [single-spec] (LScan or LProject(LScan)) }`
  with N ≤ `*top-n-limit*` (default 1024), no offset, numeric
  non-string-dict order column. Runs BEFORE strategy-selection so
  the LLimit/LSort haven't been converted to PLimit/PSort yet, and
  BEFORE column-pruning so the LScan keeps every column the
  surviving rows might project. When LSort sits over an LProject,
  the project items are absorbed into LTopN's `:select` field
  (top-N's executor handles row-level projection itself).
- `collect-all-refs` walks LTopN: order column + project items, or
  every scan column for SELECT *. Without this, column-pruning
  would drop everything except the order key.
- `executor/execute-top-n-node` translates the LTopN's normalized
  shape back into the synthetic query map `top-n/execute-top-n`
  expects.
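
For readers unfamiliar with the "heap of size N" primitive mentioned above, here is a minimal sketch of the idea for the `DESC` case. It is not `stratum.query.top-n/execute-top-n`: the function name `top-n-indices` is hypothetical, it works on a single in-memory `double[]`, and it ignores chunking and projection.

```clojure
(import '(java.util PriorityQueue Comparator))

;; Sketch only: `top-n-indices` is a hypothetical helper, not the library's API.
(defn top-n-indices
  "Row indices of the n largest values in the double array `col`,
  largest first. Keeps a min-heap of at most n indices, so extra
  memory stays O(n) regardless of the column length."
  [^doubles col n]
  (let [by-value (reify Comparator
                   (compare [_ a b]
                     (Double/compare (aget col (int a)) (aget col (int b)))))
        heap     (PriorityQueue. (int (max 1 n)) ^Comparator by-value)]
    (dotimes [i (alength col)]
      (.add heap i)
      (when (> (.size heap) n)
        (.poll heap)))                       ; evict the current smallest
    ;; Polling yields ascending order; reverse it for ORDER BY ... DESC.
    (vec (reverse (repeatedly (.size heap) #(.poll heap))))))

(top-n-indices (double-array [3.0 9.0 1.0 7.0 5.0]) 2) ; [1 3]
```

Only the surviving indices need their other columns fetched afterwards, which is where the saving over materialize-and-sort comes from.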

Effect with `*use-planner*` ON:
  19 failures (after distinct fix) → 6 failures
  (1 fail + 5 error). All 12 top-n-{pushdown-correctness,
  split-chunk-id} tests pass. Remaining: anomaly model (6 tests,
  Phase C2).

Legacy regression: 424 / 1444 still green.
anomaly+join

`[:anomaly-score "model" …]` and friends aren't recognized by
`normalize-expr`, so they have to be resolved into synthetic
column references *before* any other lowering. The legacy `q`
runs `resolve-anomaly-columns` inline; the planner needs the same
behaviour.

Wiring:
- Move `resolve-anomaly-columns` and helpers (`anomaly-ops`,
  `collect-anomaly-exprs`, `rewrite-anomaly-exprs`,
  `select-alias-map`) from `stratum.query` into the shared
  `stratum.query.prepare` ns. They use only `expr` / `norm` /
  `x` / `iforest`, so the relocation is mechanical. (`stratum.query`
  could call them via require but already requires
  `stratum.query.executor`, creating a cycle if executor required
  query — moving to prepare avoids it.)
- `executor/prepare-and-build` runs `resolve-anomaly-columns`
  before `prepare-query` when the query map carries
  `:_anomaly-models`. Mirrors the legacy `q` body.

Limitation: anomaly + join cannot be resolved before plan time
because the iforest features may live on the join's right side.
The planner's pre-plan resolution would see only the left-side
columns and throw `Column :offset not found in data`. For this
shape the `q` dispatch falls back to the legacy path, which
resolves anomaly post-join. Documented as a follow-up.

Effect with `*use-planner*` ON:
  6 failures (after top-N port) → 0 failures (424 / 1444 all pass).

Legacy regression: 424 / 1444 still green.
…explain shape

Final batch of correctness fixes to take the IR planner the rest of the
way to test parity with the legacy `q` body, then flip
`*use-planner*` to `true`.

- predicate-pushdown: respect outer-join semantics. LEFT preserves
  left rows on right miss → can't push right-side preds (and
  symmetric for RIGHT/FULL). Anything we can't push stays above the
  join. Fixes LEFT JOIN + WHERE-on-right tests that were silently
  dropping rows.
- bitmap-semi-join eligibility: any reference to a build-side
  column from the parent disqualifies the rewrite (including the
  join key), since the rewrite discards the right side after
  building the presence bitmap. Mirrors the `(not has-select?)`
  clause in the legacy `query.join` gate.
- build-join-tree: strip table-qualified namespaces off `:on`
  pairs (`:t1/a` → `:a`) so they line up with the unqualified
  column-map keys. Self-join no longer trips on missing columns.
- estimate/sample-estimate: skip when args are non-numeric or the
  column is dict-string. The double-coercing path was throwing on
  string equality predicates we used to short-circuit.
- executor/run-query: thread `::plan/order-only-keys` and
  `::plan/having-only-keys` through `optimize` (preserving the
  top-level metadata) and dissoc them from result rows. Matches the
  legacy `(if (seq _order-only-keys) (mapv #(apply dissoc % …)))`.
- executor/explain-query: include `:n-rows` and `:columns` so
  callers that probe the legacy explain shape keep working.
- query/*use-planner*: default flips to `true`. The full 868-test
  suite (including sqllogictest) is green; A/B vs the legacy path
  is within noise on TPC-H Q1 (B2) and Q6 (B1) at 6M rows.
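
The outer-join rule in the predicate-pushdown item above can be sketched as a lookup from join type to the sides that may receive a pushed `WHERE` predicate. This is an illustration under assumptions, not the planner's code: `pushable-sides`, `split-pushdown`, and the `{:side ... :pred ...}` map shape are all hypothetical.

```clojure
;; Sketch only: names and data shapes here are hypothetical.
(def pushable-sides
  "Sides of a join that a WHERE predicate may be pushed below, by join type.
  An outer join must keep unmatched rows of its preserved side, so filtering
  the other side before the join would change the result."
  {:inner #{:left :right}
   :left  #{:left}
   :right #{:right}
   :full  #{}})

(defn split-pushdown
  "Groups preds into :left / :right (pushed below the join) and :above
  (kept over the join). Each pred is {:side :left|:right|:both, :pred form}."
  [join-type preds]
  (let [ok (pushable-sides join-type)]
    (group-by (fn [{:keys [side]}]
                (if (contains? ok side) side :above))
              preds)))

(split-pushdown :left [{:side :left  :pred [:a :gt 1]}
                       {:side :right :pred [:cat :eq 1]}])
;; {:left  [{:side :left,  :pred [:a :gt 1]}]
;;  :above [{:side :right, :pred [:cat :eq 1]}]}
```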
H2O-Q8 (Top-N per partition via ROW_NUMBER + HAVING) was 2.7× slower
under the planner because `LHaving (LProject (LWindow ...))` had
`PProject` materialize all 6M post-window rows before `PHaving` could
filter them down to the surviving 120K.

Mirror the legacy `q.clj:758-815` window-having pushdown:
- new `window-having-pushdown` pass rewrites
  `LHaving preds (LProject items (LWindow specs in))` →
  `LProject items (LHaving preds (LWindow specs in))` when the
  project items are bare column refs and the having predicates only
  reference columns visible after `LWindow` (window outputs + scan
  inputs). Predicates are normalized in-place since `LHaving` keeps
  the user's raw form.
- `execute-having` gets a column-context fast path: filter on raw
  arrays, gather only surviving indices, return a column ctx for the
  parent `PProject` to finish materializing.
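
The rewrite above can be pictured on plain maps. The sketch below is a simplified model, not the `window-having-pushdown` pass: plan nodes are assumed to be maps with `:node` and `:input` keys instead of IR records, predicates are assumed to be already in `[col op & args]` form, and `push-having-below-project` is a hypothetical name.

```clojure
;; Sketch only: plan nodes are modelled as plain maps, not the IR records.
(defn pred-cols
  "Columns referenced by normalized [col op & args] predicates."
  [preds]
  (set (map first preds)))

(defn push-having-below-project
  "Rewrites having(project(window(x))) to project(having(window(x))) when the
  project items are bare column keywords and every predicate column is in
  `visible-cols` (window outputs plus scan inputs). Otherwise returns the
  plan unchanged."
  [{:keys [node preds input] :as plan} visible-cols]
  (let [project input
        window  (:input project)]
    (if (and (= node :having)
             (= (:node project) :project)
             (= (:node window) :window)
             (every? keyword? (:items project))
             (every? visible-cols (pred-cols preds)))
      (assoc project :input (assoc plan :input window))
      plan)))

(push-having-below-project
 {:node :having
  :preds [[:rn :lte 2]]
  :input {:node :project
          :items [:id :rn]
          :input {:node :window :input {:node :scan}}}}
 #{:id :v :rn})
;; {:node :project, :items [:id :rn],
;;  :input {:node :having, :preds [[:rn :lte 2]],
;;          :input {:node :window, :input {:node :scan}}}}
```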

H2O-Q8 NT 6M rows: 1908ms → 395ms (4.8× speedup, 1.7× faster than
legacy's 683ms). Plan after rewrite:
  PProject -> PHaving -> PWindow -> PSIMDFilter -> PScan.

868 tests still green.
`prepare.clj` passes 5a/5b materialize string-producing expressions
in GROUP BY / aggregates / SELECT into temp columns sized to the
PRE-join row count. When a query has both a `:join` and such an
expression, the temp columns end up the wrong length once the join
runs and the executor reads past the array end (or sees stale
data).

The planner doesn't have a post-join materialization pass yet, so
fall back to the legacy `q` body for this combination, which
materializes string exprs at the right point. Symmetric with the
existing `:join + :_anomaly-models` fallback. Tracked as a
follow-up to lift these passes into a post-join planner stage.

868 tests still green.
Implements the eight follow-ups surfaced by the post-flip planner
audit. Every change is paired with the legacy reference it mirrors
or the gap it closes; 868 tests / 4119 assertions remain green.

F1 — LSetOp executor dispatch
  Add `LSetOp` case to `execute-node` (uses
  `requiring-resolve 'stratum.query/q` to avoid the require cycle)
  so `compile-physical` / `explain-query` callers don't crash on
  UNION/INTERSECT/EXCEPT queries. The runtime path in `q` already
  short-circuits set ops; this lets debug entry points share those
  semantics.

F2 — Window-having pushdown brittleness
  Make `normalize-pred` idempotent: detect already-normalized
  `[col op & args]` form via `normalized-pred?` and return early.
  Then move HAVING normalization into `build-logical-plan` (always
  — `prepare-query`'s `pre-normalized?` flag only covers `:where`
  / `:agg`) and drop the redundant re-normalization in both the
  `window-having-pushdown` rewrite and `having-fast-path-on-ctx`
  in the executor. Verified the H2O-Q8 pushdown still fires after
  the change (PProject → PHaving → PWindow plan shape preserved).
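
Idempotence here means that normalizing twice gives the same result as normalizing once. A minimal sketch of the early-return check follows. The normalized `[col op & args]` form is taken from the text above; the raw operator-first form (`[:> :x 10]`), the operator table, and the `sketch-` prefixed names are assumptions made for illustration and do not reflect the real `normalize-pred`.

```clojure
;; Sketch only: the raw input shape and operator table are assumptions.
(def raw->op {:> :gt, :>= :gte, :< :lt, :<= :lte, := :eq})
(def normalized-ops (set (vals raw->op)))

(defn sketch-normalized-pred?
  "True when pred already looks like [col op & args]."
  [pred]
  (and (vector? pred)
       (keyword? (first pred))
       (contains? normalized-ops (second pred))))

(defn sketch-normalize-pred
  "Rewrites [op col & args] to [col op & args]; returns already-normalized
  input untouched, so repeated calls are harmless."
  [pred]
  (if (sketch-normalized-pred? pred)
    pred
    (let [[op col & args] pred
          op' (or (raw->op op)
                  (throw (ex-info "Unknown operator" {:pred pred})))]
      (into [col op'] args))))

(sketch-normalize-pred [:> :x 10])                         ; [:x :gt 10]
(sketch-normalize-pred (sketch-normalize-pred [:> :x 10])) ; [:x :gt 10]
```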

F3 — Float `:gte` / `:lte` selectivity overshoot
  `estimate.clj` was computing `:gte t` selectivity as
  `:gt (t-1)`, which is correct for ints but overshoots on
  doubles (e.g. `mn = 4.5`, `t = 5.0` is mistakenly chunk-fully-
  passing). Add direct `zone-map-estimate-gte` and
  `-lte` with inclusive boundary tests and route the dispatcher
  through them.
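
The overshoot is easy to see with the numbers from the example above. The sketch below assumes a uniform distribution between the chunk's min and max; the function bodies are illustrative and are not copied from `estimate.clj`, and `sketch-gte` / `int-style-gte` are hypothetical names.

```clojure
;; Sketch only: uniform-distribution estimate over a chunk's [mn, mx] range.
(defn sketch-gte
  "Estimated fraction of a chunk satisfying `col >= t`, tested inclusively."
  [mn mx t]
  (let [mn (double mn) mx (double mx) t (double t)]
    (cond
      (<= t mn) 1.0                        ; every value is >= t
      (> t mx)  0.0                        ; no value can be >= t
      ;; Linear interpolation; yields 0.0 at t = mx although the max row matches.
      :else     (/ (- mx t) (- mx mn)))))

(defn int-style-gte
  "The integer reduction: `>= t` treated as `> (dec t)`."
  [mn mx t]
  (let [mn (double mn) mx (double mx) t' (dec (double t))]
    (cond
      (< t' mn)  1.0
      (>= t' mx) 0.0
      :else      (/ (- mx t') (- mx mn)))))

;; Chunk with mn = 4.5, mx = 9.5, predicate `>= 5.0`:
(int-style-gte 4.5 9.5 5.0) ; 1.0, chunk treated as fully passing
(sketch-gte    4.5 9.5 5.0) ; 0.9, values in [4.5, 5.0) are excluded
```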

F4 — `PMaterializeExpr` long-target detection
  `rewrite-expr-group-keys` now emits `:int64` target (group keys
  are discrete by definition); `execute-materialize-expr` honors
  it by calling `eval-expr-to-long`, which returns `long[]` directly
  for date-trunc / date-add / extract ops. Skips the long → double
  → long round-trip the planner inherited and unblocks the dense
  group-by all-long fast path. CB-Q43 (date-trunc minute → group
  by) lands at 198ms — within noise of legacy 186ms.

F5 — `PFusedExtractCount` emission + executor
  Port the legacy `q.clj:680-908` fused EXTRACT(unit, col) +
  COUNT fast path. New `try-fused-extract-count` in
  `strategy-selection` recognises the post-`expr-materialization`
  shape (LGroupBy over PMaterializeExpr {:op #{:minute :hour
  :second :day-of-week}}) and emits `PFusedExtractCount`,
  bypassing the materialization. Executor case dispatches to
  `ColumnOpsExt/fusedExtractCountDenseParallel` and decodes per
  the legacy block. Closes the CB-Q19 10× gap.
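
The gain from fusing comes from counting into a small dense array in a single pass instead of first materializing an extracted column and then grouping it. The sketch below shows that shape for `EXTRACT(minute ...)` only. It is single-threaded and is not `fusedExtractCountDenseParallel`; the name `minute-counts` and the assumption that timestamps are epoch seconds are mine.

```clojure
;; Sketch only: assumes epoch-second timestamps; `minute-counts` is hypothetical.
(defn minute-counts
  "Counts rows per minute-of-hour (0-59) in one pass, without building an
  intermediate 'minute' column."
  [^longs epoch-seconds]
  (let [^longs counts (long-array 60)]
    (dotimes [i (alength epoch-seconds)]
      (let [m (int (quot (mod (aget epoch-seconds i) 3600) 60))]
        (aset counts m (inc (aget counts m)))))
    (vec counts)))

(take 2 (minute-counts (long-array [0 59 60 3600 3661]))) ; (3 2)
```

The output always has 60 buckets for the minute unit, one per possible value of the extracted field.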

F6 — String predicate sampling
  Replace static heuristics (0.05 / 0.10) for `:like`, `:ilike`,
  `:contains`, `:starts-with`, `:ends-with` and their negations
  with 256-entry dict (or raw `String[]`) sampling. New
  `like->regex` compiles SQL LIKE to a `Pattern` (only `%` / `_`
  are wildcards; everything else is quoted). Wired into
  `estimate-selectivity` between numeric sampling and the
  heuristic fallback.
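
A sketch of the two pieces described above, LIKE-to-regex translation and sample-based selectivity, under stated assumptions: the `sketch-` names are hypothetical, escape characters in LIKE patterns are not handled, and the sample is simply the first `sample-size` dictionary entries (the text does not say how the real sample is drawn).

```clojure
(import '(java.util.regex Pattern))

;; Sketch only: not the repository's `like->regex`.
(defn sketch-like->regex
  "Compiles a SQL LIKE pattern: % matches any run of characters, _ matches
  exactly one, and every other character is quoted literally."
  [^String like]
  (let [sb (StringBuilder.)]
    (doseq [ch like]
      (case ch
        \% (.append sb ".*")
        \_ (.append sb ".")
        (.append sb (Pattern/quote (str ch)))))
    (Pattern/compile (.toString sb) Pattern/DOTALL)))

(defn sketch-sample-selectivity
  "Fraction of the first `sample-size` dictionary entries matching `p`."
  [^Pattern p dict sample-size]
  (let [sample (take sample-size dict)]
    (if (empty? sample)
      0.0
      (/ (double (count (filter #(.matches (.matcher p ^String %)) sample)))
         (count sample)))))

(sketch-sample-selectivity (sketch-like->regex "a%")
                           ["apple" "avocado" "banana" "a.b"]
                           256)
;; 0.75
```

Quoting every non-wildcard character is what keeps a literal `.` in the LIKE pattern from matching arbitrary characters.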

F7 — NDV from chunk stats
  Add `estimate-ndv` to `estimate.clj`. Dict-encoded string →
  `dict.length`. Indexed int64 → `min(length, max-min+1)` from
  chunk stats. Otherwise the legacy `length/10` heuristic.
  Provides a callable distinct-count primitive the rest of the
  planner can lean on.
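
The three cases above translate directly into a `cond`. This is a minimal sketch: the input map and its keys (`:dict`, `:length`, `:min-val`, `:max-val`) are hypothetical stand-ins for whatever column and chunk-stat structures `estimate-ndv` actually receives.

```clojure
;; Sketch only: the column-stats map shape is an assumption.
(defn sketch-estimate-ndv
  "Estimated number of distinct values for a column."
  [{:keys [dict length min-val max-val]}]
  (cond
    dict                  (count dict)                      ; dict-encoded string
    (and min-val max-val) (min length
                               (inc (- max-val min-val)))   ; int64 with stats
    :else                 (quot length 10)))                ; heuristic fallback

(sketch-estimate-ndv {:dict ["a" "b" "c"] :length 1000})       ; 3
(sketch-estimate-ndv {:length 1000 :min-val 10 :max-val 19})   ; 10
(sketch-estimate-ndv {:length 1000})                           ; 100
```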

F8 — NDV-based join cardinality
  `propagate-est-rows` for `PHashJoin` switches from the
  degenerate `min(L,R) × selectivity` heuristic to the textbook
  formula:

      output = probe_rows × build_rows / max(probe_ndv, build_ndv)

  on the first join key, falling back to the prior heuristic when
  the join key column / scan isn't reachable (multi-key chains,
  wrapped sides). Tightens DP join ordering and dense-vs-hash
  group-by routing.
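
A worked instance of the formula, as a minimal sketch. `est-join-rows` is a hypothetical name, and the `(max 1 ...)` guard against a zero NDV is my addition rather than something stated in the commit.

```clojure
;; Sketch only: `est-join-rows` is hypothetical; the zero guard is an addition.
(defn est-join-rows
  "probe_rows * build_rows / max(probe_ndv, build_ndv) on one join key."
  [probe-rows build-rows probe-ndv build-ndv]
  (/ (* (double probe-rows) build-rows)
     (max 1 probe-ndv build-ndv)))

;; 6M-row probe side joined to a 1,000-row build side on a key with
;; 1,000 distinct values on both sides: each probe row matches about one
;; build row, so the estimate stays at the probe-side row count.
(est-join-rows 6000000 1000 1000 1000) ; 6000000.0
```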

Bench (T1+T2 olap, 6M rows, planner-on, all PASS):

  H2O-Q8 NT  (window TOP-N)        479ms  (was 1908ms before F2)
  CB-Q19    (extract+count)        4.4ms  (planner ≈ legacy)
  B1/B3/B5/B6 / H2O-Q1..Q10        within ±10% of legacy
  Bitmap semi-join (Q1/Q3)         3-12× faster than legacy
Closes the last two divergence points between the planner and the
legacy `q` body — both queries with `:join` + `:_anomaly-models`
and queries with `:join` + string-producing exprs in `:group` /
`:select` now run end-to-end through the IR planner. The legacy
fallback in `query.clj` is gone; `*use-planner* true` is the only
runtime path.

F9 — Anomaly + join

  - prepare.clj split: `anomaly-spec` does pure rewriting (collect
    every `[:anomaly-* …]` expression, assign synthetic
    `__<op>_<model>` columns, rewrite `:select`/`:where`/`:having`/
    `:order`); `materialize-anomaly` runs the iforest scoring
    against a column ctx. `resolve-anomaly-columns` is now a
    one-liner that calls both for the no-join path.
  - New `LAnomaly` IR record, placed after `LJoin` by
    `build-logical-plan` when the frontend supplied a spec.
    `executor.clj` adds an LAnomaly case that calls
    `materialize-anomaly` against the post-join column ctx and
    returns a column ctx with the synthetic columns added.
  - `column-pruning`'s `collect-all-refs` walks each anomaly
    expression's argument list (long form) or the model's
    `:feature-names` (short form) so the pre-join scans keep the
    columns the iforest needs.

F10 — String-producing exprs + join

  - prepare.clj passes 5a / 5b grew deferred siblings:
    `string-expr-spec-group-agg` and `string-expr-spec-select`
    rewrite the slots and emit `[{:col-name :__str_expr_N
    :expr <normalized>}]` items without calling
    `eval-string-expr`. `prepare-query` returns the items as
    `:string-items`; the no-join path keeps the legacy eager
    materialization untouched.
  - New `LStringMaterialize` IR record. Placed after `LAnomaly`
    (and `LJoin`) by `build-logical-plan` when items are present.
    `executor.clj` runs `expr/eval-string-expr` per item against
    the post-join column ctx.
  - `column-pruning`'s `collect-all-refs` walks each item's
    expression so referenced columns survive on the scans.

The brittle `try/catch normalize-expr` gate I added earlier in
`query.clj` is gone too — the planner handles every shape now,
so there's no fallback to choose between.

Verified end-to-end:
  - sql-anomaly-join-scoring-test: passes (was the canonical
    join + ANOMALY_SCORE test; legacy result matches)
  - GROUP BY UPPER(cat) over a join: planner returns the same
    rows as legacy (synthetic `__str_expr_1` column key)
  - 868 tests / 4119 assertions green
@whilo merged commit adcb6e2 into main on May 8, 2026
5 of 6 checks passed
@whilo deleted the feature/planner-completeness branch on May 8, 2026 02:50