
@Copilot Copilot AI commented Sep 23, 2025

This PR implements the optimization described in the NVIDIA blog post on optimizing access to Parquet data with fsspec. It enables precaching on remote file systems (e.g. S3, GCS) to speed up reading Parquet files.

Problem

Reading Parquet files from remote storage such as S3 or GCS is slow for LSDB users because each file takes multiple HTTP requests to fetch its metadata and data. The fsspec library has built-in precaching that can reduce the number of these requests.

Solution

This PR adds optional fsspec optimization support that can be enabled through either a function parameter or environment variable:

import os
import lsdb

# Enable via parameter
catalog = lsdb.open_catalog(path, enable_fsspec_optimization=True)

# Enable via environment variable (set it before calling open_catalog)
os.environ["LSDB_ENABLE_FSSPEC_OPTIMIZATION"] = "true"
catalog = lsdb.open_catalog(path)

Implementation Details

The optimization works by adding {"precache_options": {"method": "parquet"}} to the open_file_options parameter when reading Parquet files. This leverages fsspec's built-in precaching mechanism to:

  • Reduce the number of HTTP requests to remote storage
  • Improve metadata access performance for Parquet files
  • Use a caching strategy tailored to the Parquet file layout
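
The merge step described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper name `with_parquet_precache` is hypothetical, and only the `{"precache_options": {"method": "parquet"}}` value comes from the description. It assumes the optimization should never overwrite a `precache_options` entry the caller already supplied.

```python
from typing import Any, Dict, Optional

# Value taken from the PR description; the helper below is a hypothetical sketch.
PARQUET_PRECACHE: Dict[str, Any] = {"method": "parquet"}


def with_parquet_precache(
    open_file_options: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return a copy of open_file_options with Parquet precaching added.

    The caller's dict is not mutated, and an existing "precache_options"
    entry is kept as-is.
    """
    merged = dict(open_file_options or {})
    # setdefault only inserts the key when the caller did not provide one.
    merged.setdefault("precache_options", dict(PARQUET_PRECACHE))
    return merged
```

Calling `with_parquet_precache({"block_size": 64 * 1024 * 1024})` keeps `block_size` and adds the precache entry, while a dict that already contains `precache_options` comes back with that entry unchanged.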

Key Features

  • Optional and configurable: Disabled by default for backward compatibility
  • Multiple control methods: Can be enabled via function parameter or environment variable
  • User option preservation: Does not override existing user-provided open_file_options or precache_options
  • Comprehensive environment variable support: Accepts true/false, 1/0, yes/no, on/off (case-insensitive)
  • Performance testing ready: Allows easy comparison of performance with and without optimization
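
The environment variable handling listed above could look roughly like this. It is a sketch under assumptions: the function name `env_flag` and the `environ` argument are hypothetical, and falling back to the default for unrecognized values is a design choice made here, not something the PR description states.

```python
import os
from typing import Mapping, Optional

_TRUE_VALUES = {"true", "1", "yes", "on"}
_FALSE_VALUES = {"false", "0", "no", "off"}


def env_flag(
    name: str,
    default: bool = False,
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Parse a boolean flag from the environment, ignoring case.

    `environ` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE_VALUES:
        return True
    if value in _FALSE_VALUES:
        return False
    # Assumption: unrecognized values fall back to the default.
    return default
```

With this shape, `env_flag("LSDB_ENABLE_FSSPEC_OPTIMIZATION")` returns False when the variable is unset, which matches the "disabled by default" behavior described above.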

Testing

Added comprehensive test coverage including:

  • Parameter-based control testing
  • Environment variable handling with various formats
  • User option preservation verification
  • Integration testing with existing functionality
  • Backward compatibility validation

All existing tests continue to pass, ensuring no regressions were introduced.

Usage Examples

# Basic usage with optimization
catalog = lsdb.open_catalog("s3://bucket/catalog", enable_fsspec_optimization=True)

# Combined with existing options (preserved)
catalog = lsdb.open_catalog(
    "s3://bucket/catalog",
    enable_fsspec_optimization=True,
    open_file_options={"block_size": 64 * 1024 * 1024}
)

# Environment variable control for testing (run in the shell, not in Python)
export LSDB_ENABLE_FSSPEC_OPTIMIZATION=true
# Every open_catalog call in processes started from this shell now uses the optimization

This implementation provides the foundation for performance testing and optimization of remote Parquet file access in LSDB workflows.



- Add enable_fsspec_optimization parameter to open_catalog function
- Add LSDB_ENABLE_FSSPEC_OPTIMIZATION environment variable support
- Update HatsLoadingConfig to handle optimization settings
- Apply optimization by adding precache_options to open_file_options
- Add comprehensive tests for new functionality
- Preserve user-provided open_file_options and precache_options

Co-authored-by: gitosaurus <[email protected]>
@Copilot Copilot AI changed the title from "[WIP] Read this article <https://developer.nvidia.com/blog/optimizing-access-to-parquet-data-with-fsspec/> and then implement the optimization it describes for the LSDB project, when opening Parquet files as part of the call chain that starts with `open_cata..." to "Add fsspec optimization for improved Parquet file access performance" Sep 23, 2025
@Copilot Copilot AI requested a review from gitosaurus September 23, 2025 16:58
Copilot finished work on behalf of gitosaurus September 23, 2025 16:58
@gitosaurus
Contributor

We decided against threading a new keyword argument into the call stack here. The fsspec optimization being done in nested-pandas will depend on detecting the Parquet files as remote.
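
For readers wondering what "detecting the Parquet files as remote" might involve, here is a rough standard-library sketch. It is not the nested-pandas implementation: the function name `looks_remote` and the set of schemes treated as local are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Assumption: these schemes (or no scheme at all) mean a local path.
LOCAL_SCHEMES = {"", "file", "local"}


def looks_remote(path: str) -> bool:
    """Guess whether a path or URL points at remote storage."""
    scheme = urlsplit(path).scheme.lower()
    # A single-letter scheme is most likely a Windows drive letter ("C:\\data").
    if len(scheme) == 1:
        return False
    return scheme not in LOCAL_SCHEMES
```

With a check like this, precaching could be applied only for paths such as `s3://bucket/catalog`, with no extra keyword argument in the call stack.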

@gitosaurus gitosaurus closed this Oct 7, 2025