Skip to content

parquet: add Decoder::scan_filtered for miniblock-level predicate pushdown in DeltaBitPackDecoder #9785

@sahuagin

Description

@sahuagin

Range predicates (e.g. value >= X) applied to DELTA_BINARY_PACKED columns
currently require decoding every value before filtering. The DELTA format's
per-miniblock headers (min_delta, bit_width) make it possible to compute a
conservative [lo, hi] range for an entire miniblock without decoding
individual values — if the predicate rejects the range, the miniblock can be
skipped in O(1).

Proposed addition: A new provided method Decoder::scan_filtered(num_values, out, predicate) on the Decoder trait. The default implementation ignores the
predicate and decodes everything (safe fallback for all encodings that don't
carry per-region metadata). DeltaBitPackDecoder overrides it to:

  1. Check first_value as a point range [v, v] before entering the miniblock loop.
  2. Per miniblock, compute a conservative [lo, hi] using last_value,
    min_delta, bit_width, and the count of values in the miniblock.
  3. If the predicate rejects [lo, hi], skip the miniblock (O(1) for bw=0;
    decode-to-track-last_value for mid-stream bw>0 misses; bit-skip for terminal
    bw>0 misses).
  4. If the predicate accepts, decode and emit the miniblock normally.

Returns (values_emitted, values_consumed).

Measured improvement:

  • Wall-time scan on 1M-row monotone DELTA column: 1.96ms → 470µs (4.2x)

Note: scan_filtered is intentionally conservative — a range miss on a
miniblock is always safe (false positives are possible, false negatives are not).
Callers that need exact per-value filtering should apply a second pass on the
emitted values.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions