Fix reading Parquet files from S3 by mneedham · Pull Request #562 · chdb-io/chdb

mneedham · 2026-04-17T16:06:23Z

Summary

Two fixes to enable reading Parquet files from S3.

1. `read_parquet` doesn't work with S3 URLs

In [2]: pd.read_parquet('s3://datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet')
Out[2]: E [chDB] Query failed: Code: 400. DB::ErrnoException: Cannot stat file /Users/markhneedham/projects/videos/20241029-chdbPandas/s3:/datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet: , errno: 2, strerror: No such file or directory: The table structure cannot be extracted from a Parquet format file. You can specify the structure manually: (in file/uri /Users/markhneedham/projects/videos/20241029-chdbPandas/s3:/datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet). (CANNOT_STAT)
...
SQL: DESCRIBE file('s3://datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet', 'Parquet')

s3:// paths were being routed to from_file(), which treats them as local paths. Fixed by routing s3:// paths in read_parquet() to DataStore.from_s3() instead.
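A minimal sketch of the routing idea, not the actual code in datastore/pandas_api.py: `pick_loader` and `resolve_as_local` are hypothetical helpers, and `/work` stands in for the current working directory. The second helper only illustrates why treating the URL as a relative local path produces the `.../s3:/...` shape seen in the error above; the real resolution happens inside the `file()` table function.

```python
import posixpath


def pick_loader(path: str) -> str:
    """Hypothetical helper: name the loader a path should be routed to."""
    # s3:// URLs must go to the S3 table function, not the local-file one.
    if path.startswith("s3://"):
        return "from_s3"
    return "from_file"


def resolve_as_local(path: str, cwd: str = "/work") -> str:
    """Illustrative only: what happens if an S3 URL is joined onto a directory."""
    # Path normalisation collapses the double slash after the scheme.
    return posixpath.normpath(posixpath.join(cwd, path))


print(pick_loader("s3://bucket/data.parquet"))
print(pick_loader("data.parquet"))
print(resolve_as_local("s3://bucket/data.parquet"))
```

Checking the scheme before any path handling keeps the local-file branch unchanged for ordinary paths.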

2. `DataStore.from_s3()` crashes without an explicit format

In [3]: pd.DataStore.from_s3('s3://datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet', nosign=True)
Out[3]: DataStore(execution failed: 'NoneType' object has no attribute 'lower')

Root cause: from_s3() stores {"format": None} in the table function params when no format is specified. dict.get("format", "") returns None (not "") when the key exists with value None; the default only applies when the key is absent entirely. Calling .lower() on that None then raises AttributeError in preserves_row_order().

Fixed by using (self.params.get("format") or "").lower() in both FileTableFunction and S3TableFunction.
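The `dict.get` behaviour described above can be reproduced in isolation. This is a sketch under assumptions: `normalized_format` is a hypothetical stand-in for the expression used in the fix, not a function from the chdb codebase.

```python
# Params as described above: the key exists, but its value is None.
params = {"format": None}

# The default is only used when the key is missing, so this is None, not "".
assert params.get("format", "") is None
assert {}.get("format", "") == ""

# The original pattern fails because None has no .lower() method.
try:
    params.get("format", "").lower()
    crashed = False
except AttributeError:
    crashed = True


def normalized_format(params: dict) -> str:
    """Hypothetical helper: lower-cased format, treating a missing key and None alike."""
    return (params.get("format") or "").lower()


print(crashed, repr(normalized_format(params)))
```

The `or ""` form also maps an empty string to `""`, so all three cases (absent, None, empty) are handled the same way.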

Changes

datastore/pandas_api.py: Route s3:// paths in read_parquet() to DataStore.from_s3()
datastore/table_functions.py: Fix None.lower() crash in FileTableFunction.preserves_row_order() and S3TableFunction.preserves_row_order()

Test plan

read_parquet("s3://...") routes to S3 table function instead of crashing
DataStore.from_s3("s3://...", nosign=True) no longer raises 'NoneType' object has no attribute 'lower'
DataStore.from_file("data.parquet") with no explicit format still works

🤖 Generated with Claude Code

CLAassistant · 2026-04-17T16:06:34Z

All committers have signed the CLA.

wudidapaopao · 2026-04-20T07:04:49Z

Thanks for your contribution and the fix!

mneedham · 2026-04-20T10:50:20Z

I'm not sure how to sign the CLA?

auxten · 2026-04-20T13:05:13Z

> I'm not sure how to sign the CLA?

I just made the chDB CLA identical to ClickHouse's, so everyone needs to re-sign the CLA.

It might just be loading very slowly. How about waiting a bit longer?

mneedham · 2026-04-20T15:19:33Z

I tried again now and it came up with the form!

fix: handle None format in preserves_row_order for File and S3 table functions

1e93c0f

dict.get("key", default) returns None when the key exists with value None, not the default. Using `or ""` handles the None case correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mneedham and others added 2 commits April 17, 2026 17:07

feat: support s3:// paths in read_parquet

8762339

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test: add regression tests for None format crash and read_parquet S3 support

913ad75

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mneedham changed the title ~~fix: handle None format in preserves_row_order~~ Fix reading Parquet files from S3 Apr 17, 2026

Merge branch 'main' into fix/preserves-row-order-none-format

e50353b

wudidapaopao approved these changes Apr 20, 2026

mneedham wants to merge 4 commits into chdb-io:main from mneedham:fix/preserves-row-order-none-format

4 participants
