Proposal to allow third-party engines for readers and writers #61584

Open
datapythonista opened this issue Jun 6, 2025 · 0 comments
Labels
API Design, Enhancement, IO Data (IO issues that don't fit into a more specific label)

Comments

@datapythonista
Member

PDEP-9 discussed the possibility of allowing third-party packages to automatically add pandas.read_<format> functions and DataFrame.to_<format> methods. One main challenge kept the proposal from moving forward: the complexity of managing multiple packages for the same format (conflicting names, differences in signatures...).

What I propose here is similar, but instead of registering readers/writers for whole formats, third-party packages would register engines for the existing formats. This is less ambitious, since it doesn't allow adding new formats to pandas (e.g. pandas.read_dicom, a format for medical data), but it still has the rest of the advantages of PDEP-9:

  • It still allows third-party packages to provide the code for pandas readers/writers (e.g. a faster csv reader, a new excel reader wrapping another excel library...)
  • It opens the door to removing from our code base connectors that can be better maintained elsewhere. As an example, engines like fastparquet for parquet, as well as others, are basically a mapping between our function signatures and theirs, with a bit of extra logic. I think the engines are far more likely to need changes because of changes in the wrapped library than because of changes in our function signature, so to me it would make things simpler and easier to maintain if the engines were part of the fastparquet and pyarrow libraries. Moving engines out of pandas is something for the future, and it can be discussed individually, since it probably makes sense to keep many and move out some
  • There would be no need to deal with optional dependencies for the engines using this system. Dealing with optional dependencies adds complexity that we can avoid
  • It would simplify our dependencies significantly (if moving engines out of pandas happens), as well as our tests. We had problems in the past because we skip tests depending on whether a library can be imported, and for a while we were not running many pandas tests. Having fewer optional dependencies would help prevent this sort of problem.
  • Conflicts in this case seem unlikely. Most of the engines are named after the library they wrap, as opposed to libraries "fighting" to register a format name. There could still be conflicts in some cases, but only for users with both of the conflicting packages installed, and we can warn in this case.
  • We will continue to control the signature for all readers and writers, which for the users means that the formats are fixed, and every format has a unique signature which is documented in our docs
  • In some cases we already use **kwargs for engine specific parameters. This provides extra flexibility while keeping most of the signature unified
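
The conflict warning mentioned above could look roughly like the following. This is only a sketch under stated assumptions: the entry point group name `pandas.io_engine`, the helper names, and the example package paths are all hypothetical and not part of the proposal. It only shows that duplicated engine names are easy to detect from entry point metadata.

```python
import warnings
from collections import defaultdict
from importlib.metadata import EntryPoint

# Hypothetical group name; the proposal does not specify one.
ENGINE_GROUP = "pandas.io_engine"


def find_engine_conflicts(entry_points):
    """Return {engine name: [providers]} for names registered more than once."""
    providers = defaultdict(list)
    for ep in entry_points:
        providers[ep.name].append(ep.value)
    return {name: values for name, values in providers.items() if len(values) > 1}


def warn_on_conflicts(entry_points):
    """Emit one warning per engine name provided by several packages."""
    conflicts = find_engine_conflicts(entry_points)
    for name, values in conflicts.items():
        warnings.warn(
            f"Engine {name!r} is provided by multiple packages: {', '.join(values)}",
            RuntimeWarning,
            stacklevel=2,
        )
    return conflicts


# Simulated entry points with made-up package paths. In practice they would
# come from importlib.metadata.entry_points(group=ENGINE_GROUP).
demo_entry_points = [
    EntryPoint("fastcsv", "pkg_a.engine:CsvEngine", ENGINE_GROUP),
    EntryPoint("fastcsv", "pkg_b.engine:CsvEngine", ENGINE_GROUP),
    EntryPoint("arrow-rs", "pkg_c.engine:CsvEngine", ENGINE_GROUP),
]
```

Calling `warn_on_conflicts(demo_entry_points)` would warn once, for `fastcsv`, and stay silent about `arrow-rs`, which matches the point that only users with both conflicting packages installed are affected.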

Implementing this would have no impact on users unless they call a reader/writer with an engine value that is unknown. At that point, instead of raising as we do now, we would first check for registered entry points, and if one exists for the format (e.g. "csv") and the provided engine name (e.g. "arrow-rs", a possible new reader based on Rust's Arrow implementation, if someone implements that), then the function provided by the entry point would handle the request.
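
To make the lookup concrete, here is a minimal sketch of the fallback, not an actual implementation. It assumes a hypothetical entry point group `pandas.io_engine`, assumes the entry point name is the engine name, and assumes the loaded object exposes a `read_<format>` callable. None of these details are fixed by this proposal, and `find_reader`, `FakeEntryPoint` and `ArrowRsEngine` are illustrative names only.

```python
from importlib.metadata import entry_points

# Hypothetical group name; the proposal does not specify one.
ENGINE_GROUP = "pandas.io_engine"

# Engines pandas.read_csv accepts today.
BUILTIN_CSV_ENGINES = {"c", "python", "pyarrow"}


def find_reader(fmt, engine, builtin_engines, candidates=None):
    """Return a third-party reader for (fmt, engine), or None for built-in engines.

    Raises ValueError when the engine is neither built in nor registered.
    """
    if engine in builtin_engines:
        return None  # pandas would use its own implementation, as it does now
    if candidates is None:
        # Selecting by group via keyword requires Python 3.10+.
        candidates = entry_points(group=ENGINE_GROUP)
    for ep in candidates:
        if ep.name == engine:
            # Assumption: the loaded object has a read_<format> attribute.
            reader = getattr(ep.load(), f"read_{fmt}", None)
            if reader is not None:
                return reader
    raise ValueError(f"Unknown engine {engine!r} for format {fmt!r}")


class FakeEntryPoint:
    """Stand-in for an installed entry point, used only for this demo."""

    def __init__(self, name, obj):
        self.name = name
        self._obj = obj

    def load(self):
        return self._obj


class ArrowRsEngine:
    """Hypothetical third-party engine; it returns a string instead of a DataFrame."""

    @staticmethod
    def read_csv(path, **kwargs):
        return f"arrow-rs would read {path}"


registered = [FakeEntryPoint("arrow-rs", ArrowRsEngine)]
reader = find_reader("csv", "arrow-rs", BUILTIN_CSV_ENGINES, registered)
```

With this shape, existing calls such as `engine="c"` never touch the entry point machinery, and an unregistered name still raises, as it does today.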

The only small drawback I can find is that, since engines would be generic, the API pages of the documentation won't be able to provide engine-specific information for the engines not in pandas itself. I think this is very reasonable, and we can keep a registry of known connectors in the Ecosystem page with links to their docs, as we usually do.

@datapythonista added the Enhancement, IO Data, and API Design labels Jun 6, 2025