PDEP-9 discussed the possibility of allowing third-party packages to automatically add pandas.read_<format> functions and DataFrame.to_<format> methods. One main challenge kept that proposal from moving forward: the complexity of managing multiple packages for the same format (conflicting names, differences in signatures...).
What I propose here is similar, but instead of registering readers/writers for whole formats, packages would register engines for the existing formats. This is less ambitious, since it doesn't allow adding new formats to pandas (e.g. pandas.read_dicom, a format for medical data), but it still has the rest of the advantages of PDEP-9:
It still allows third-party packages to provide the code for pandas readers/writers (e.g. a faster csv reader, a new excel reader wrapping another excel library...)
It opens the door to removing from our code base connectors that can be better maintained elsewhere. As an example, engines like fastparquet for parquet, as well as others, are basically a mapping between our function signatures and the wrapped library's function signatures, with a bit of extra logic. I think the engines are far more likely to need changes because of changes in the wrapped library than because of changes in our function signatures, so to me things would be simpler and easier to maintain if each engine were part of the library it wraps (fastparquet, pyarrow...). Moving engines out of pandas is something for the future, and it can be discussed case by case, since it probably makes sense to keep many and move out some
There would be no need to deal with optional dependencies for the engines using this system. Dealing with optional dependencies adds complexity that we can avoid
It would simplify our dependencies significantly (if moving engines out of pandas happens), as well as our tests. We had problems in the past because we skip tests depending on whether a library can be imported or not, and for a while many pandas tests were not being run. Having fewer optional dependencies would help prevent this sort of problem.
Conflicts in this case seem unlikely. Most of the engines are named after the library they wrap, as opposed to libraries "fighting" to register a format name. There could still be conflicts in some cases, but only for users with both of the conflicting packages installed, and we can warn in this case.
We will continue to control the signature for all readers and writers, which for the users means that the formats are fixed, and every format has a unique signature which is documented in our docs
In some cases we already use **kwargs for engine-specific parameters. This provides extra flexibility while keeping most of the signature unified
Implementing this would have no impact on users unless they call a reader/writer with an engine value that is unknown. At that point, instead of raising as we do now, we would first check the registered entry points, and if one exists for the format (e.g. "csv") and the provided engine name (e.g. "arrow-rs", a possible new reader based on Rust's Arrow implementation, if someone implements that), then the function provided by the entry point would handle the request.
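To make the lookup concrete, here is a minimal sketch of how it could work using importlib.metadata from the standard library. Everything specific in it is an assumption for illustration and not something pandas provides today: the entry point group name "pandas.io.engines.<format>", the find_engine helper, and its candidates parameter (which only exists so the example can run without installing a real engine package).

```python
import warnings
from importlib.metadata import EntryPoint, entry_points

# Hypothetical naming scheme for the entry point group; the proposal does not fix one.
GROUP_TEMPLATE = "pandas.io.engines.{fmt}"


def find_engine(fmt, engine, candidates=None):
    """Return the callable registered for (fmt, engine), or raise ValueError.

    `candidates` lets the caller inject entry points for illustration;
    by default the installed packages are queried.
    """
    if candidates is None:
        # The `group=` keyword requires Python 3.10 or later.
        candidates = entry_points(group=GROUP_TEMPLATE.format(fmt=fmt))

    matches = [ep for ep in candidates if ep.name == engine]
    if not matches:
        # Nothing registered under that name: fail as for any unknown engine.
        raise ValueError(f"Unknown engine {engine!r} for format {fmt!r}")
    if len(matches) > 1:
        # Two installed packages registered the same engine name: warn and
        # use the first one found.
        warnings.warn(
            f"Multiple packages register engine {engine!r} for format "
            f"{fmt!r}; using {matches[0].value}",
            RuntimeWarning,
            stacklevel=2,
        )
    # Import the module named by the entry point and return the target object.
    return matches[0].load()


# Stand-in entry point: "json:loads" is only a placeholder target so that
# load() has something importable; a real engine would point at its own reader.
fake = [EntryPoint(name="arrow-rs", value="json:loads", group="pandas.io.engines.csv")]
reader = find_engine("csv", "arrow-rs", candidates=fake)
```

Under this sketch, a third-party package would declare an entry point in that group in its packaging metadata, and pandas would only call something like find_engine after failing to recognise the engine name among its built-in ones, so existing calls would behave exactly as before.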
The only small drawback I can find is that, since engines would be generic, the API pages of the documentation won't be able to provide engine-specific information for the engines not in pandas itself. I think this is very reasonable, and we can keep a registry of known connectors in the Ecosystem page with links to their docs, as we usually do.