Describe what's wrong
Reading N Parquet files takes almost N times longer than reading 1 file, even though the query pipeline has multiple source threads. It's possible this is specific to how my files are laid out, but none of the settings or output formats I tried made a difference.
Repro
import os, tempfile, time
from concurrent.futures import ThreadPoolExecutor
import pyarrow as pa, pyarrow.parquet as pq
import chdb

N_ROWS, N_COLS, N_FILES = 500_000, 10, 5

# write N_FILES identical Parquet files into a temp dir
tmpdir = tempfile.mkdtemp()
files = []
for i in range(N_FILES):
    t = pa.table({f"c{j}": pa.array(range(N_ROWS), type=pa.int64()) for j in range(N_COLS)})
    path = os.path.join(tmpdir, f"file_{i}.parquet")
    pq.write_table(t, path)
    files.append(path)
glob = f"{tmpdir}/*.parquet"

def chdb_arrow(sql):
    """Run a query on a fresh in-memory connection and return the result as an Arrow table."""
    conn = chdb.connect(":memory:")
    r = conn.send_query(sql, "Arrow")
    try:
        t = r.record_batch().read_all()
    finally:
        r.close()
    conn.close()
    return t

# arrow output: single query over the glob, results pulled back as Arrow
t0 = time.perf_counter()
chdb_arrow(f"SELECT * FROM file('{glob}', Parquet)")
t_arrow = time.perf_counter() - t0

# null output: same scan, but results are discarded instead of delivered to the caller
conn = chdb.connect(":memory:")
t0 = time.perf_counter()
conn.query(f"SELECT * FROM file('{glob}', Parquet) FORMAT Null")
t_null = time.perf_counter() - t0
conn.close()

# parallel separate connections: one query per file, concatenated on the Python side
def read_one(path):
    return chdb_arrow(f"SELECT * FROM file('{path}', Parquet)")

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=N_FILES) as pool:
    tables = list(pool.map(read_one, files))
pa.concat_tables(tables)
t_parallel = time.perf_counter() - t0

print(f"Arrow SELECT *:       {t_arrow:.3f}s")
print(f"Null SELECT *:        {t_null:.3f}s")
print(f"Arrow parallel conns: {t_parallel:.3f}s")

for f in files:
    os.unlink(f)
os.rmdir(tmpdir)
These are the results I got on a 36-core machine with 5 × 23 MB Parquet files:
Single file baseline:
  Arrow SELECT *:       0.162s
  Null SELECT *:        0.033s
  Arrow/Null:           4.9x
Multi-file (5 files):
  Arrow SELECT *:       0.692s
  Null SELECT *:        0.048s
  Arrow parallel conns: 0.315s
For the 5-file query, the pipeline topology is supposedly:
StorageFileSource × 5 → Resize 5→N → ExpressionTransform × N → LazyOutputFormat
but the measured Arrow/Null ratio is ~14.3x (we would expect ~5x even if fully serial), the Arrow/Parallel ratio is ~2.2x (the remaining gap probably being connection overhead), and the Multi/Single Arrow ratio is ~4.3x (we would expect ~1x with 5-way parallelism).
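For reference, the ratios above are just the quoted timings divided out (recomputed here from the rounded values):

# ratios quoted above, recomputed from the (rounded) timings
t_arrow_1, t_null_1 = 0.162, 0.033                    # single file
t_arrow_5, t_null_5, t_par_5 = 0.692, 0.048, 0.315    # 5 files

print(f"{t_arrow_1 / t_null_1:.1f}x")   # Arrow/Null, single file            -> ~4.9x
print(f"{t_arrow_5 / t_null_5:.1f}x")   # Arrow/Null, 5 files                -> ~14x
print(f"{t_arrow_5 / t_par_5:.1f}x")    # Arrow single conn / parallel conns -> ~2.2x
print(f"{t_arrow_5 / t_arrow_1:.1f}x")  # Arrow 5 files / 1 file             -> ~4.3x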
I only had a cursory look through the codebase, so I may be wrong, but my guess is that when query results are pulled through PullingAsyncPipelineExecutor (the standard path for any send_query() call or client result consumption, as far as I can tell), LazyOutputFormat's ConcurrentBoundedQueue with its hardcoded capacity of 2 creates backpressure that stalls the producer threads. That would affect any query with multiple pipeline sources (multi-file reads, etc.) whose output volume is high enough to saturate the queue. The only paths that avoid it are FORMAT Null (which uses CompletedPipelineExecutor, so no queue) and queries whose output is small enough that the queue never fills (e.g. aggregations). This would explain why multi-file SELECT * behaves as if it were serial.
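To illustrate the suspected mechanism, here is a toy pure-Python model (not chdb code, just an analogy): several producer threads feed a bounded queue of capacity 2 while a single consumer drains it. Adding producers barely changes wall-clock time, because they spend most of it blocked in put() behind the consumer.

import queue, threading, time

def simulate(n_producers, chunks_each=20, produce_cost=0.005, consume_cost=0.005):
    # producers ~ pipeline source threads, the queue ~ the capacity-2 hand-off,
    # the consumer ~ the caller pulling result chunks one by one
    q = queue.Queue(maxsize=2)

    def producer():
        for _ in range(chunks_each):
            time.sleep(produce_cost)   # cost of producing one chunk
            q.put(object())            # blocks once the queue is full

    def consumer(total):
        for _ in range(total):
            q.get()
            time.sleep(consume_cost)   # cost of handing one chunk to the caller

    producers = [threading.Thread(target=producer) for _ in range(n_producers)]
    c = threading.Thread(target=consumer, args=(n_producers * chunks_each,))
    t0 = time.perf_counter()
    for t in producers:
        t.start()
    c.start()
    for t in producers:
        t.join()
    c.join()
    return time.perf_counter() - t0

print(f"1 producer:  {simulate(1):.2f}s")
print(f"5 producers: {simulate(5):.2f}s")  # ~5x the chunks, ~5x the time: producers mostly wait in put()

If something like this is what happens in the real pipeline, it would match the observation: the source threads can run in parallel, but end-to-end time scales with total output volume instead of dropping with parallelism.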
This came up while I was benchmarking chdb against duckdb for reading hive-partitioned Parquet datasets from S3. On full-table scans across multiple partition directories (5 to 60+ files, ~90 MB each), chdb was consistently 5-10x slower than duckdb, which is quite significant. Initial investigation ruled out S3 throughput (downloading the files first and reading them locally showed the same gap) and CPU thread settings (max_parsing_threads, max_threads and max_download_threads had no effect), so my guess is that the bottleneck is in how chdb delivers results to the caller, not in how it reads the data.
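For concreteness, the kind of invocation I mean is something like the following (the path is a placeholder for the local copy of the partitioned dataset and the values are illustrative; none of these settings moved the numbers):

import chdb

conn = chdb.connect(":memory:")
res = conn.send_query(
    """
    SELECT *
    FROM file('/data/local_copy/**/*.parquet', Parquet)
    SETTINGS max_threads = 36,
             max_parsing_threads = 36,
             max_download_threads = 16
    """,
    "Arrow",
)
table = res.record_batch().read_all()
res.close()
conn.close()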
env:
- chdb: 4.1.0
- ClickHouse engine version: 26.1.2.1
- Python: 3.10