**Describe the bug**
The `skip` function in `DeltaBitPackDecoder` uses a `skip_buffer` to decode the values needed to keep `last_value` valid. This buffer has a fixed size of 32 for INT32 and 64 for INT64. While these are the sizes this crate uses when writing, the specification says nothing about sizes beyond stipulating that the block size must be a multiple of 128 and the number of values in a miniblock a multiple of 32. It is conceivable that a writer could be aggressive in its sizing when encoding long runs of the same value.

Consider a column of all zeros. `first_value` in the delta header would be 0, all deltas from that value would likewise be 0, so each miniblock would have a bit width of 0 as well. A "normal" encoder would emit multiple blocks of 128 values, each with 4 32-element miniblocks, but only the headers would be written since there is no bit-packed data. A clever writer could notice this and, instead of wastefully writing many headers, alter the initial parameters to write a single block with very large miniblocks.
To write 20000 zeros, one could have:
```
delta header:
  block_size      = 20096
  num_mini_blocks = 4
  total_values    = 20000
  first_value     = 0
block header:
  min_delta  = 0
  bit_widths = [0, 0, 0, 0]
end-of-page
```
The number of values per miniblock is 5024, which is divisible by 32, and the block size 20096 is likewise divisible by 128, so this header satisfies the spec. Trying to create a slice of 5024 elements from a vector of size 32 will panic.
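A minimal, self-contained sketch of the arithmetic above (this is not the crate's actual code, just a demonstration of why the hard-coded 32-element buffer cannot cover the declared miniblock length):

```rust
fn main() {
    // Parameters from the hypothetical delta header above.
    let block_size: usize = 20096;
    let num_mini_blocks: usize = 4;
    let values_per_mini_block = block_size / num_mini_blocks;

    // Both spec constraints are satisfied by this header.
    assert_eq!(values_per_mini_block, 5024);
    assert_eq!(block_size % 128, 0);
    assert_eq!(values_per_mini_block % 32, 0);

    // The decoder's buffer is hard-coded to 32 elements for INT32;
    // a slice of the declared miniblock length is out of range.
    let skip_buffer = vec![0i32; 32];
    assert!(skip_buffer.get(..values_per_mini_block).is_none());
    // The indexing form, &skip_buffer[..values_per_mini_block], would panic.
}
```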
Now, this particular scenario would not be an issue, since `skip_buffer` is no longer used when the miniblock bit width is 0, but nothing at all stops some future writer from deciding it can more efficiently encode miniblocks of, say, 128 elements and setting the block size to 512.
To be safe, I think the result of `block_size / mini_blocks_per_block` should be used when sizing `skip_buffer`. The only downside is that this leaves open the possibility of exhausting memory, since the writer controls those header values.
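A sketch of the suggested sizing (the function name and signature here are illustrative, not the crate's actual API):

```rust
// Hypothetical helper: derive the skip buffer length from the page's
// delta header instead of a hard-coded per-type constant.
fn skip_buffer_len(block_size: usize, mini_blocks_per_block: usize) -> usize {
    // Values per miniblock, as declared by the writer.
    block_size / mini_blocks_per_block
}

fn main() {
    // This crate's own writer: 128-value blocks with 4 miniblocks of 32.
    assert_eq!(skip_buffer_len(128, 4), 32);
    // The hypothetical aggressive writer described above.
    assert_eq!(skip_buffer_len(20096, 4), 5024);
}
```

A decoder adopting this would presumably also want to validate or cap the header values, given the memory-exhaustion concern noted above.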
**To Reproduce**
First, modify the delta binary packed encoder:
```diff
diff --git a/parquet/src/encodings/encoding/mod.rs b/parquet/src/encodings/encoding/mod.rs
index eeabcf4ba5..9aa7f89af0 100644
--- a/parquet/src/encodings/encoding/mod.rs
+++ b/parquet/src/encodings/encoding/mod.rs
@@ -324,11 +324,7 @@ impl<T: DataType> DeltaBitPackEncoder<T> {
         Self::assert_supported_type();
         // Size miniblocks so that they can be efficiently decoded
-        let mini_block_size = match T::T::PHYSICAL_TYPE {
-            Type::INT32 => 32,
-            Type::INT64 => 64,
-            _ => unreachable!(),
-        };
+        let mini_block_size = 128;
         let num_mini_blocks = DEFAULT_NUM_MINI_BLOCKS;
         let block_size = mini_block_size * num_mini_blocks;
```
Then run the `arrow_reader` benchmark:
```
% cargo bench -p parquet --all-features --bench arrow_reader -- '/Int32Array/binary.*skip'
    Finished `bench` profile [optimized] target(s) in 0.44s
     Running benches/arrow_reader.rs (/Users/seidl/src/arrow-rs/target/release/deps/arrow_reader-76405fea46be62fc)
Benchmarking arrow_array_reader/Int32Array/binary packed skip, mandatory, no NULLs: Warming up for 3.0000 s
thread 'main' (11785144) panicked at /Users/seidl/src/arrow-rs/parquet/src/encodings/decoding.rs:881:48:
range end index 128 out of range for slice of length 32
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
error: bench failed, to rerun pass `-p parquet --bench arrow_reader`
```
**Expected behavior**
`skip()` should not panic.
**Additional context**
This is a theoretical possibility. I know of no writers in widespread use that do anything like what is described here.