Describe what's wrong
Reading N Parquet files takes almost N times longer than reading 1 file, even though the query pipeline has multiple source threads. It's possible this is specific to how my files are laid out, but none of the settings or output formats I tried made a difference.
Repro
import os, tempfile, time
from concurrent.futures import ThreadPoolExecutor
import pyarrow as pa, pyarrow.parquet as pq
import chdb

N_ROWS, N_COLS, N_FILES = 500_000, 10, 5

# write N_FILES identical Parquet files into a temp dir
tmpdir = tempfile.mkdtemp()
files = []
for i in range(N_FILES):
    t = pa.table({f"c{j}": pa.array(range(N_ROWS), type=pa.int64()) for j in range(N_COLS)})
    path = os.path.join(tmpdir, f"file_{i}.parquet")
    pq.write_table(t, path)
    files.append(path)
glob = f"{tmpdir}/*.parquet"

def chdb_arrow(sql):
    """Run a query on a fresh in-memory connection and return the result as an Arrow table."""
    conn = chdb.connect(":memory:")
    r = conn.send_query(sql, "Arrow")
    try:
        t = r.record_batch().read_all()
    finally:
        r.close()
    conn.close()
    return t

# arrow output: single query over the glob, results pulled back as Arrow
t0 = time.perf_counter()
chdb_arrow(f"SELECT * FROM file('{glob}', Parquet)")
t_arrow = time.perf_counter() - t0

# null output: same scan, but results are discarded instead of delivered to the caller
conn = chdb.connect(":memory:")
t0 = time.perf_counter()
conn.query(f"SELECT * FROM file('{glob}', Parquet) FORMAT Null")
t_null = time.perf_counter() - t0
conn.close()

# parallel separate connections: one query per file, concatenated on the Python side
def read_one(path):
    return chdb_arrow(f"SELECT * FROM file('{path}', Parquet)")

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=N_FILES) as pool:
    tables = list(pool.map(read_one, files))
pa.concat_tables(tables)
t_parallel = time.perf_counter() - t0

print(f"Arrow SELECT *:       {t_arrow:.3f}s")
print(f"Null SELECT *:        {t_null:.3f}s")
print(f"Arrow parallel conns: {t_parallel:.3f}s")

for f in files:
    os.unlink(f)
os.rmdir(tmpdir)
These are the results I got on a 36-core machine with 5 × 23 MB Parquet files:
Single file baseline:
  Arrow SELECT *:       0.162s
  Null SELECT *:        0.033s
  Arrow/Null:           4.9x
Multi-file (5 files):
  Arrow SELECT *:       0.692s
  Null SELECT *:        0.048s
  Arrow parallel conns: 0.315s
For the 5-file query, the pipeline topology is supposedly:
StorageFileSource × 5 → Resize 5→N → ExpressionTransform × N → LazyOutputFormat
but the measured Arrow/Null ratio is ~14.3x (we would expect ~5x even if fully serial), the Arrow/Parallel ratio is ~2.2x (the remaining gap probably being connection overhead), and the Multi/Single Arrow ratio is ~4.3x (we would expect ~1x with 5-way parallelism).
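For reference, the ratios above are just the quoted timings divided out (recomputed here from the rounded values):

# ratios quoted above, recomputed from the (rounded) timings
t_arrow_1, t_null_1 = 0.162, 0.033                    # single file
t_arrow_5, t_null_5, t_par_5 = 0.692, 0.048, 0.315    # 5 files

print(f"{t_arrow_1 / t_null_1:.1f}x")   # Arrow/Null, single file            -> ~4.9x
print(f"{t_arrow_5 / t_null_5:.1f}x")   # Arrow/Null, 5 files                -> ~14x
print(f"{t_arrow_5 / t_par_5:.1f}x")    # Arrow single conn / parallel conns -> ~2.2x
print(f"{t_arrow_5 / t_arrow_1:.1f}x")  # Arrow 5 files / 1 file             -> ~4.3x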
I only had a cursory look through the codebase, so I may be wrong, but my guess is that when query results are pulled through PullingAsyncPipelineExecutor (the standard path for any send_query() call or client result consumption, as far as I can tell), LazyOutputFormat's ConcurrentBoundedQueue with its hardcoded capacity of 2 creates backpressure that stalls the producer threads. That would affect any query with multiple pipeline sources (multi-file reads, etc.) whose output volume is high enough to saturate the queue. The only paths that avoid it are FORMAT Null (which uses CompletedPipelineExecutor, so no queue) and queries whose output is small enough that the queue never fills (e.g. aggregations). This would explain why multi-file SELECT * behaves as if it were serial.
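To illustrate the suspected mechanism, here is a toy pure-Python model (not chdb code, just an analogy): several producer threads feed a bounded queue of capacity 2 while a single consumer drains it. Adding producers barely changes wall-clock time, because they spend most of it blocked in put() behind the consumer.

import queue, threading, time

def simulate(n_producers, chunks_each=20, produce_cost=0.005, consume_cost=0.005):
    # producers ~ pipeline source threads, the queue ~ the capacity-2 hand-off,
    # the consumer ~ the caller pulling result chunks one by one
    q = queue.Queue(maxsize=2)

    def producer():
        for _ in range(chunks_each):
            time.sleep(produce_cost)   # cost of producing one chunk
            q.put(object())            # blocks once the queue is full

    def consumer(total):
        for _ in range(total):
            q.get()
            time.sleep(consume_cost)   # cost of handing one chunk to the caller

    producers = [threading.Thread(target=producer) for _ in range(n_producers)]
    c = threading.Thread(target=consumer, args=(n_producers * chunks_each,))
    t0 = time.perf_counter()
    for t in producers:
        t.start()
    c.start()
    for t in producers:
        t.join()
    c.join()
    return time.perf_counter() - t0

print(f"1 producer:  {simulate(1):.2f}s")
print(f"5 producers: {simulate(5):.2f}s")  # ~5x the chunks, ~5x the time: producers mostly wait in put()

If something like this is what happens in the real pipeline, it would match the observation: the source threads can run in parallel, but end-to-end time scales with total output volume instead of dropping with parallelism.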
This came up while I was benchmarking chdb against duckdb for reading hive-partitioned Parquet datasets from S3. On full-table scans across multiple partition directories (5 to 60+ files, ~90 MB each), chdb was consistently 5-10x slower than duckdb, which is quite significant. Initial investigation ruled out S3 throughput (downloading the files first and reading them locally showed the same gap) and CPU thread settings (max_parsing_threads, max_threads and max_download_threads had no effect), so my guess is that the bottleneck is in how chdb delivers results to the caller, not in how it reads the data.
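For concreteness, the kind of invocation I mean is something like the following (the path is a placeholder for the local copy of the partitioned dataset and the values are illustrative; none of these settings moved the numbers):

import chdb

conn = chdb.connect(":memory:")
res = conn.send_query(
    """
    SELECT *
    FROM file('/data/local_copy/**/*.parquet', Parquet)
    SETTINGS max_threads = 36,
             max_parsing_threads = 36,
             max_download_threads = 16
    """,
    "Arrow",
)
table = res.record_batch().read_all()
res.close()
conn.close()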
env:
- chdb: 4.1.0
- ClickHouse engine version: 26.1.2.1
- Python: 3.10