Revise README for Azure Blob and Data Lake Storage #514
- Removed references to ADLS Gen1, which is retired.
- Added information on connecting to Blobs as well as ADLS Gen2.
Updated project description to reflect support for ADLS Gen2.
Looks good! Just had a few small comments.
README.md
Outdated
- `location_mode`: valid values are "primary" or "secondary" and apply to RA-GRS accounts | ||
|
||
For more argument details see all arguments for [`AzureBlobFileSystem` here](https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/adlfs/spec.py#L328) and [`AzureDatalakeFileSystem` here](https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/adlfs/spec.py#L69). | ||
For more argument details see all arguments for [`AzureBlobFileSystem` here](https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/adlfs/spec.py#L328) |
Instead of linking to the code, let's link to the rendered HTML docs: https://fsspec.github.io/adlfs/api/#adlfs.AzureBlobFileSystem. The current link hardcodes a commit SHA, so if arguments are added in the future, it will drift away from what is in main.
README.md
Outdated
By default, write operations create BlockBlobs in Azure, which, once written can not be appended. It is possible to create an AppendBlob using `mode="ab"` when creating and operating on blobs. Currently, AppendBlobs are not available if hierarchical namespaces are enabled. | ||
|
||
### Older versions | ||
ADLS Gen1 filesystem has officially been [retired](https://learn.microsoft.com/en-us/lifecycle/products/azure-data-lake-storage-gen1)). Hence the older versions of this package, which was designed to connect to ADLS Gen1 is obsolete. |
A couple of suggestions on this section:
- It looks like there is an extra `)` at the end of the link that we should remove.
- Maybe we can remove this second sentence and just add on: "and support in `adlfs` is obsolete"? Mainly, the AzureBlobFileSystem has been around for several years so technically these older versions should work just fine if customers were using `az` or `abfs`.
Sort of related, but something we should do outside of this PR... Right now there is a deprecation warning that ADLS Gen 1 support will be moved to an optional dependency. It would probably make sense to update this warning to say that it will be removed in a future version of |
Looks good! @martindurant this should be good for a review.
|
||
The `adl://` and `abfs://` protocols are included in fsspec's known_implementations registry | ||
The `az://` and `abfs://` protocols are included in fsspec's known_implementations registry | ||
in fsspec > 0.6.1, otherwise users must explicitly inform fsspec about the supported adlfs protocols. |
Could remove this mention of ancient fsspec; I doubt things here would still work that far back.
``` | ||
|
||
To use the Gen2 filesystem you can use the protocol `abfs` or `az`: | ||
To connect to Blobs or Azure Data Lake Storage (ADLS) Gen2 filesystem you can use the protocol `abfs` or `az`: |
abfs: also for adls2?
Yep. `abfs://` supports both non-hierarchical blob and ADLS Gen 2 accounts.
Right now, adlfs only uses the blob endpoint no matter the account type, which works correctly for both types of accounts since ADLS Gen 2 is built on Blob.
However, this also means that adlfs is missing out on potential optimizations, especially for renames and recursive deletes, that it could get by detecting whether a storage account is ADLS Gen 2 enabled and using the ADLS Gen 2 endpoint instead. Other Azure Storage filesystem tools, like the ABFS driver and BlobFuse, follow this pattern: they treat ADLS Gen 2 as a feature of Azure Blob and determine which endpoints and operations to use based on the storage account type and the operation. That reduces the cognitive overhead of having to think through account types and which API set needs to be accessed (e.g., blob endpoint or ADLS Gen 2 endpoint). So, moving adlfs in that direction would bring those benefits and also make it more consistent with other Azure Storage related software.
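To make the endpoint-selection idea concrete, here is a minimal sketch of what such a decision could look like. It is not how adlfs works today (adlfs currently always uses the blob endpoint). The `StorageAccount` class, the `pick_endpoint` helper, and the `HIERARCHICAL_OPS` set are all hypothetical names made up for illustration, and the URLs assume the public Azure cloud endpoint suffixes.

```python
from dataclasses import dataclass

# Operations singled out above as likely to benefit from the ADLS Gen 2 API.
# Illustrative only; this set is an assumption, not taken from adlfs.
HIERARCHICAL_OPS = {"rename", "recursive_delete"}


@dataclass(frozen=True)
class StorageAccount:
    """Hypothetical description of a storage account."""

    name: str
    hns_enabled: bool  # True when hierarchical namespace (ADLS Gen 2) is on


def pick_endpoint(account: StorageAccount, operation: str) -> str:
    """Return the service endpoint a client could target for an operation.

    Assumes the public Azure cloud suffixes: blob.core.windows.net for the
    blob endpoint and dfs.core.windows.net for the ADLS Gen 2 endpoint.
    """
    if account.hns_enabled and operation in HIERARCHICAL_OPS:
        return f"https://{account.name}.dfs.core.windows.net"
    # Every other case falls back to the blob endpoint, which works for
    # both account types.
    return f"https://{account.name}.blob.core.windows.net"


if __name__ == "__main__":
    gen2 = StorageAccount(name="myaccount", hns_enabled=True)
    print(pick_endpoint(gen2, "rename"))
    print(pick_endpoint(gen2, "read"))
```

In a real implementation the `hns_enabled` flag would have to be discovered from the service rather than passed in by the caller, which is the detection step the comment above refers to.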
Azure Datalake Gen1 and Azure Datalake Gen2, that facilitate | ||
interactions between both Azure Datalake implementations and Dask. This is done leveraging the | ||
[intake/filesystem_spec](https://github.com/intake/filesystem_spec/tree/master/fsspec) base class and Azure Python SDKs. | ||
The package includes pythonic filesystem implementations for both [Azure Blobs](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-overview) and [Azure Datalake Gen2 (ADLS)](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction), that facilitate interactions between these implementations and Dask. This is done leveraging the [intake/filesystem_spec](https://github.com/intake/filesystem_spec/tree/master/fsspec) base class and Azure Python SDKs. |