Parquet DeltaBitPackDecoder::skip could panic on "non-standard" miniblocks #9793

@etseidl

Description

Describe the bug
The skip function in DeltaBitPackDecoder uses a skip_buffer to decode the values needed to keep last_value valid. This buffer has a fixed size: 32 for INT32 and 64 for INT64. While these are the sizes this crate uses when writing, the specification says nothing about sizes beyond the stipulations that the block size must be a multiple of 128 and that the number of elements in a miniblock must be a multiple of 32. It is conceivable that a writer could be aggressive in its sizing when encoding long runs of the same value.
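For reference, those two rules fit in a few lines. This hypothetical check (the function name is illustrative, not from the crate) accepts both the sizes this crate writes and the oversized header constructed below:

fn header_sizes_are_legal(block_size: usize, num_mini_blocks: usize) -> bool {
    block_size % 128 == 0
        && block_size % num_mini_blocks == 0
        && (block_size / num_mini_blocks) % 32 == 0
}

fn main() {
    // 128 values in 4 miniblocks is what this crate writes for INT32...
    assert!(header_sizes_are_legal(128, 4));
    // ...but the much larger header below is equally legal
    assert!(header_sizes_are_legal(20096, 4));
}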

Consider a column of all zeros. first_value in the delta header would be 0, all deltas from that value would likewise be 0, and so each miniblock would have a bit_width of 0 as well. A "normal" encoder would emit multiple blocks of 128 values with four 32-element miniblocks each, but only the headers would actually be written, since there is no bit-packed data. A clever writer could notice this and, instead of wastefully writing many headers, alter the initial parameters to write a single block with very large miniblocks.

To write 20000 zeros, one could have:

delta header:
  block_size = 20096
  num_mini_blocks = 4
  total_values = 20000
  first_value = 0

block header:
  min_delta = 0
  bit_widths = [0, 0, 0, 0]

end-of-page

The number of values per miniblock is 5024, which is divisible by 32, and the block size of 20096 is likewise divisible by 128, so this header is legal. Trying to create a slice of 5024 elements from a vector of size 32 will panic.
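In miniature, the panic is plain out-of-range slicing. This standalone sketch (the local skip_buffer is a stand-in for the decoder's field, not the real code) reproduces it with the numbers above:

fn main() {
    let skip_buffer = vec![0i32; 32];      // fixed INT32 capacity
    let values_per_mini_block = 20096 / 4; // 5024, from the header above
    let _ = &skip_buffer[..values_per_mini_block]; // panics: range end index 5024 out of range for slice of length 32
}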

Now, this particular scenario is not actually a problem, since the skip_buffer is no longer used when the miniblock bit width is 0, but there is nothing at all stopping some future writer from deciding it can more efficiently encode miniblocks of, say, 128 elements and setting the block size to 512.

To be safe, I think the result of block_size / mini_blocks_per_block should be used when sizing skip_buffer, as sketched below. The only downside is that this leaves open the possibility of exhausting memory.
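A minimal sketch of that sizing, assuming only the block_size and mini_blocks_per_block values read from the page header (the helper name is illustrative, not proposed API):

fn required_skip_buffer_len(block_size: usize, mini_blocks_per_block: usize) -> usize {
    // Derive the capacity from the page's own header rather than
    // hard-coding the 32/64 this crate happens to write
    block_size / mini_blocks_per_block
}

fn main() {
    assert_eq!(required_skip_buffer_len(128, 4), 32);     // what this crate writes for INT32
    assert_eq!(required_skip_buffer_len(20096, 4), 5024); // the adversarial header above
}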

To Reproduce
First, modify the delta binary packed encoder:

diff --git a/parquet/src/encodings/encoding/mod.rs b/parquet/src/encodings/encoding/mod.rs
index eeabcf4ba5..9aa7f89af0 100644
--- a/parquet/src/encodings/encoding/mod.rs
+++ b/parquet/src/encodings/encoding/mod.rs
@@ -324,11 +324,7 @@ impl<T: DataType> DeltaBitPackEncoder<T> {
         Self::assert_supported_type();
 
         // Size miniblocks so that they can be efficiently decoded
-        let mini_block_size = match T::T::PHYSICAL_TYPE {
-            Type::INT32 => 32,
-            Type::INT64 => 64,
-            _ => unreachable!(),
-        };
+        let mini_block_size = 128;
 
         let num_mini_blocks = DEFAULT_NUM_MINI_BLOCKS;
         let block_size = mini_block_size * num_mini_blocks;

Then run the arrow_reader benchmark:

% cargo bench -p parquet --all-features --bench arrow_reader -- '/Int32Array/binary.*skip'
    Finished `bench` profile [optimized] target(s) in 0.44s
     Running benches/arrow_reader.rs (/Users/seidl/src/arrow-rs/target/release/deps/arrow_reader-76405fea46be62fc)
Benchmarking arrow_array_reader/Int32Array/binary packed skip, mandatory, no NULLs: Warming up for 3.0000 s
thread 'main' (11785144) panicked at /Users/seidl/src/arrow-rs/parquet/src/encodings/decoding.rs:881:48:
range end index 128 out of range for slice of length 32
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

error: bench failed, to rerun pass `-p parquet --bench arrow_reader`

Expected behavior
skip() should not panic

Additional context
This is a theoretical possibility. I know of no writers in widespread use that do anything like what is described here.
