Add automatic fsspec optimization for remote storage using fsspec.parquet #369
Co-authored-by: gitosaurus <[email protected]>
I don't like the idea of the keyword argument. We need to implement it through
@copilot This fails at runtime with the error
which appears to arise from here:
and ultimately fails here:
Evidently the keyword argument needs to be removed before it is applied to that point in the stack.
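One way to read the fix requested here, sketched with the standard library only: the wrapper pops its own option out of `kwargs` before forwarding the rest. The option name `use_fsspec_optimization` and both function names are hypothetical; the actual keyword and call stack are not shown in this thread.

```python
def low_level_reader(path, columns=None):
    """Stand-in for a downstream reader with a fixed signature (hypothetical)."""
    return {"path": path, "columns": columns}


def read_with_option(path, **kwargs):
    """Consume a wrapper-only option, then forward the remaining kwargs."""
    # pop() removes the key from kwargs, so the downstream call never sees it.
    # Forwarding it unchanged would raise TypeError in low_level_reader.
    use_optimization = kwargs.pop("use_fsspec_optimization", False)
    result = low_level_reader(path, **kwargs)
    result["optimized"] = use_optimization
    return result
```

Calling `low_level_reader` directly with the extra keyword raises `TypeError`, which is the general shape of failure described above.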
Co-authored-by: gitosaurus <[email protected]>
I've implemented the solution using
The implementation is more robust and addresses the runtime errors while providing better performance for remote storage. Commit: 0e1eb74
... Fixed! The issue was that I've completely rewritten the approach to use
All existing tests pass and the runtime error is resolved. Commit: 0e1eb74
@copilot Remove the keyword argument
…atic for remote resources Co-authored-by: gitosaurus <[email protected]>
Done! I've removed the Changes made:
The API is now simpler: users don't need to specify any options, and optimization happens transparently for remote resources. Commit: (being pushed)
Enables automatic fsspec optimization in read_parquet() to improve performance when reading Parquet files from remote storage systems (S3, GCS, HTTP, etc.). This change enables nested-pandas to work seamlessly with the fsspec optimization described in the NVIDIA blog post.
Problem
When reading Parquet files from remote storage (S3, HTTPS, GCS), performance can be significantly improved by using fsspec's intelligent precaching capabilities. Previously, nested-pandas didn't leverage these optimizations for remote file access.
Solution
This PR modifies nested-pandas to automatically use fsspec.parquet.open_parquet_file for remote storage access:
fsspec.parquet.open_parquet_file for better performance
Key Features
fsspec.parquet.open_parquet_file for remote storage with intelligent precaching
Usage Examples
Implementation Details
read_parquet() to automatically use fsspec.parquet.open_parquet_file for remote storage
_should_use_fsspec_optimization() to detect when optimization should be applied based on URL/path
_read_with_fsspec_optimization() to handle optimized reading with graceful fallback
ReadFewColumnsHTTPSWithOptimization to compare performance
Testing
All existing tests continue to pass, ensuring no regressions. New tests cover:
Benchmarking
Added ReadFewColumnsHTTPSWithOptimization benchmark class to compare performance between standard PyArrow reading and fsspec-optimized reading for remote storage.
This change enables nested-pandas to automatically benefit from fsspec's optimization features for remote storage while maintaining full compatibility with existing code and requiring no user intervention.
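The detection and fallback steps listed under Implementation Details could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the set of schemes, the function names without the leading underscore, and the reader callables passed in are all hypothetical, and the real helpers may decide differently.

```python
from urllib.parse import urlparse

# Schemes treated as remote in this sketch; the PR's actual list is not shown.
REMOTE_SCHEMES = {"s3", "gs", "gcs", "http", "https"}


def should_use_fsspec_optimization(path):
    """Return True when `path` is a string URL with a remote scheme."""
    if not isinstance(path, str):
        # File objects and Path instances go through the standard reader.
        return False
    return urlparse(path).scheme.lower() in REMOTE_SCHEMES


def read_with_fallback(path, optimized_reader, standard_reader):
    """Try the optimized reader for remote paths, else use the standard one."""
    if should_use_fsspec_optimization(path):
        try:
            return optimized_reader(path)
        except Exception:
            # Graceful fallback: any failure in the optimized path is
            # swallowed and the standard reader is used instead.
            pass
    return standard_reader(path)
```

In a real implementation, `optimized_reader` would presumably wrap `fsspec.parquet.open_parquet_file` and `standard_reader` the existing PyArrow-based path. Passing them in as callables keeps the sketch free of third-party imports.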