Fix reading Parquet files from S3 #562

Open
mneedham wants to merge 4 commits into chdb-io:main from mneedham:fix/preserves-row-order-none-format

@mneedham mneedham commented Apr 17, 2026

Summary

Two fixes to enable reading Parquet files from S3.

1. read_parquet doesn't work with S3 URLs

In [2]: pd.read_parquet('s3://datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet')
Out[2]: E [chDB] Query failed: Code: 400. DB::ErrnoException: Cannot stat file /Users/markhneedham/projects/videos/20241029-chdbPandas/s3:/datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet: , errno: 2, strerror: No such file or directory: The table structure cannot be extracted from a Parquet format file. You can specify the structure manually: (in file/uri /Users/markhneedham/projects/videos/20241029-chdbPandas/s3:/datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet). (CANNOT_STAT)
...
SQL: DESCRIBE file('s3://datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet', 'Parquet')

s3:// paths were being routed to from_file(), which treats them as local paths. Fixed by routing s3:// paths in read_parquet() to DataStore.from_s3() instead.
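The routing decision described above can be sketched in a few lines. This is an illustration only, not the code in datastore/pandas_api.py: the helper names `is_s3_url` and `choose_reader` are hypothetical, and the case-insensitive scheme check is an assumption rather than something the PR states.

```python
def is_s3_url(path: str) -> bool:
    """Return True when the path uses the s3:// scheme.

    Assumption: the scheme is matched case-insensitively.
    """
    return path.lower().startswith("s3://")


def choose_reader(path: str) -> str:
    """Name the constructor a read_parquet()-style wrapper would dispatch to.

    Hypothetical helper: it returns a name instead of calling
    DataStore.from_s3() or DataStore.from_file() so the sketch runs
    without chDB installed.
    """
    return "from_s3" if is_s3_url(path) else "from_file"


# An S3 URL goes to the S3 reader; anything else is treated as a local path.
print(choose_reader("s3://bucket/data.parquet"))
print(choose_reader("data.parquet"))
```

Checking the prefix before any local path handling matters because, as the error output shows, an s3:// string passed to the file reader is resolved against the current working directory.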

2. DataStore.from_s3() crashes without an explicit format

In [3]: pd.DataStore.from_s3('s3://datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet', nosign=True)
Out[3]: DataStore(execution failed: 'NoneType' object has no attribute 'lower')

Root cause: from_s3() stores {"format": None} in table function params when no format is specified. dict.get("format", "") returns None (not "") when the key exists with value None — the default only applies when the key is absent entirely. None.lower() then crashes in preserves_row_order().

Fixed by using (self.params.get("format") or "").lower() in both FileTableFunction and S3TableFunction.
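The dict.get() behavior behind this crash can be reproduced with a plain dictionary. The snippet below is a standalone sketch; `params` stands in for the table function's stored parameters and is not taken from the chDB code.

```python
# Stand-in for the params stored when no format is passed to from_s3().
params = {"format": None}

# The default applies only when the key is missing, so this is still None.
value = params.get("format", "")
print(value is None)

# Calling .lower() on that value is what raised the AttributeError.
try:
    value.lower()
except AttributeError as exc:
    print(exc)

# `or ""` replaces any falsy value, including None, with an empty string.
fmt = (params.get("format") or "").lower()
print(repr(fmt))

# An explicitly supplied format is unaffected by the change.
explicit = ({"format": "Parquet"}.get("format") or "").lower()
print(explicit)
```

The `or ""` form also covers the case where the key is absent, so a separate default argument to get() is not needed.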

Changes

  • datastore/pandas_api.py: Route s3:// paths in read_parquet() to DataStore.from_s3()
  • datastore/table_functions.py: Fix None.lower() crash in FileTableFunction.preserves_row_order() and S3TableFunction.preserves_row_order()

Test plan

  • read_parquet("s3://...") routes to S3 table function instead of crashing
  • DataStore.from_s3("s3://...", nosign=True) no longer raises 'NoneType' object has no attribute 'lower'
  • DataStore.from_file("data.parquet") with no explicit format still works

🤖 Generated with Claude Code

…functions

dict.get("key", default) returns None when the key exists with value None,
not the default. Using `or ""` handles the None case correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

CLAassistant commented Apr 17, 2026

CLA assistant check
All committers have signed the CLA.

mneedham and others added 2 commits April 17, 2026 17:07
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…support

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mneedham changed the title from "fix: handle None format in preserves_row_order" to "Fix reading Parquet files from S3" on Apr 17, 2026
@wudidapaopao
Contributor

Thanks for your contribution and the fix!

@mneedham
Author

I'm not sure how to sign the CLA?

[Screenshot: 2026-04-20_11-50-03]

@auxten
Member

auxten commented Apr 20, 2026

> I'm not sure how to sign the CLA?
>
> [Screenshot: 2026-04-20_11-50-03]

I just made chDB's CLA identical to ClickHouse's, so everyone needs to re-sign it.

The page might just be loading very slowly. How about waiting a little longer?

@mneedham
Author

I tried again now and it came up with the form!
