32 changes: 10 additions & 22 deletions README.md
@@ -1,4 +1,4 @@
Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage
Filesystem interface to Azure Blob and Data Lake Storage (Gen2)
------------------------------------------------------------


@@ -16,20 +16,10 @@ or

`conda install -c conda-forge adlfs`

The `adl://` and `abfs://` protocols are included in fsspec's known_implementations registry
The `az://` and `abfs://` protocols are included in fsspec's known_implementations registry
in fsspec > 0.6.1, otherwise users must explicitly inform fsspec about the supported adlfs protocols.
Member: Could remove this mention of ancient fsspec; I doubt things here would still work that far back.
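For those older fsspec releases, explicit registration might look like the following minimal sketch; `register_implementation` is fsspec's public registry hook, and the dotted class path assumes only that `adlfs` is installed:

```python
import fsspec

# Only needed on fsspec <= 0.6.1, where the adlfs protocols are not
# yet present in the known_implementations registry.
fsspec.register_implementation("az", "adlfs.AzureBlobFileSystem")
fsspec.register_implementation("abfs", "adlfs.AzureBlobFileSystem")
```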


To use the Gen1 filesystem:

```python
import dask.dataframe as dd

storage_options={'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}

dd.read_csv('adl://{STORE_NAME}/{FOLDER}/*.csv', storage_options=storage_options)
```

To use the Gen2 filesystem you can use the protocol `abfs` or `az`:
To connect to Blobs or Azure Data Lake Storage (ADLS) Gen2 filesystem you can use the protocol `abfs` or `az`:
Member: abfs: also for adls2?

Collaborator: Yep. abfs:// supports both non-hierarchical blob and ADLS Gen 2 accounts.

Right now, adlfs only uses the blob endpoint regardless of account type, which functionally works for both kinds of accounts, since ADLS Gen 2 is built on Blob.

However, this also means adlfs misses out on potential optimizations, especially for renames and recursive deletes, that it could gain by detecting whether a storage account is ADLS Gen 2 enabled and using the ADLS Gen 2 endpoint instead. This is the pattern followed by other Azure Storage filesystem tools like the ABFS driver and BlobFuse: they treat ADLS Gen 2 as a feature of Azure Blob and smartly determine which endpoints and operations to use depending on the storage account type and the operation, which reduces the cognitive overhead of thinking through account types and which API set needs to be accessed (e.g., the blob endpoint or the ADLS Gen 2 endpoint). Moving adlfs in that direction would capture those benefits and also make it more consistent with other Azure Storage software.
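To illustrate, detection could look something like this sketch (not current adlfs code; it assumes a recent `azure-storage-blob` where `get_account_information` surfaces the service's `x-ms-is-hns-enabled` header as `is_hns_enabled`):

```python
from azure.storage.blob import BlobServiceClient

def is_adls_gen2(account_url: str, credential=None) -> bool:
    # GetAccountInformation reports whether the hierarchical namespace
    # is enabled, i.e. whether the account is ADLS Gen 2.
    client = BlobServiceClient(account_url=account_url, credential=credential)
    info = client.get_account_information()
    return bool(info.get("is_hns_enabled"))
```

A filesystem could run this check once at instantiation and route renames and recursive deletes through the DFS endpoint when it returns True.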


```python
import dask.dataframe as dd
@@ -41,6 +31,7 @@ ddf = dd.read_parquet('az://{CONTAINER}/folder.parquet', storage_options=storage
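The diff collapses the body of this example; an uncollapsed sketch of the same pattern (the `read_csv` line and the `account_name`/`account_key` credentials are illustrative, not the only accepted options):

```python
import dask.dataframe as dd

storage_options = {"account_name": "ACCOUNT_NAME", "account_key": "ACCOUNT_KEY"}

# Both protocols resolve to the same AzureBlobFileSystem implementation.
ddf = dd.read_csv("abfs://{CONTAINER}/{FOLDER}/*.csv", storage_options=storage_options)
ddf = dd.read_parquet("az://{CONTAINER}/folder.parquet", storage_options=storage_options)
```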

Accepted protocol / uri formats include:
'PROTOCOL://container/path-part/file'
'PROTOCOL://[email protected]/path-part/file'
'PROTOCOL://[email protected]/path-part/file'

or optionally, if AZURE_STORAGE_ACCOUNT_NAME and an AZURE_STORAGE_<CREDENTIAL> is
@@ -58,15 +49,9 @@ ddf = dd.read_parquet('az://nyctlc/green/puYear=2019/puMonth=*/*.parquet', stora

Details
-------
The package includes pythonic filesystem implementations for both
Azure Datalake Gen1 and Azure Datalake Gen2, that facilitate
interactions between both Azure Datalake implementations and Dask. This is done leveraging the
[intake/filesystem_spec](https://github.com/intake/filesystem_spec/tree/master/fsspec) base class and Azure Python SDKs.
The package includes pythonic filesystem implementations for both [Azure Blobs](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-overview) and [Azure Datalake Gen2 (ADLS)](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction), that facilitate interactions between these implementations and Dask. This is done leveraging the [intake/filesystem_spec](https://github.com/intake/filesystem_spec/tree/master/fsspec) base class and Azure Python SDKs.

Operations against both Gen1 Datalake currently only work with an Azure ServicePrincipal
with suitable credentials to perform operations on the resources of choice.

Operations against the Gen2 Datalake are implemented by leveraging [Azure Blob Storage Python SDK](https://github.com/Azure/azure-sdk-for-python).
Operations against Azure Blobs and ADLS Gen2 are implemented by leveraging [Azure Blob Storage Python SDK](https://github.com/Azure/azure-sdk-for-python).

### Setting credentials
The `storage_options` can be instantiated with a variety of keyword arguments depending on the filesystem. The most commonly used arguments are:
@@ -81,7 +66,7 @@ The `storage_options` can be instantiated with a variety of keyword arguments de
anonymous access will not be attempted. Otherwise the value for `anon` resolves to True.
- `location_mode`: valid values are "primary" or "secondary" and apply to RA-GRS accounts

For more argument details see all arguments for [`AzureBlobFileSystem` here](https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/adlfs/spec.py#L328) and [`AzureDatalakeFileSystem` here](https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/adlfs/spec.py#L69).
For more argument details see all arguments for [`AzureBlobFileSystem` here](https://fsspec.github.io/adlfs/api/#adlfs.AzureBlobFileSystem).
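As a small illustration, anonymous access to a public account might look like this sketch (it assumes the public `azureopendatastorage` open-data account from the NYC taxi example above):

```python
import adlfs

# No credential is supplied and the account allows public access,
# so `anon` resolves to True and anonymous access is attempted.
fs = adlfs.AzureBlobFileSystem(account_name="azureopendatastorage")
print(fs.ls("nyctlc/green"))
```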

The following environmental variables can also be set and picked up for authentication:
- "AZURE_STORAGE_CONNECTION_STRING"
@@ -102,3 +87,6 @@ The filesystem can be instantiated for different use cases based on a variety of
The `AzureBlobFileSystem` accepts [all of the Async BlobServiceClient arguments](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python).
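For instance, a tuning argument could be passed through like this sketch (`max_single_put_size` is a BlobServiceClient parameter, named here as an illustration rather than an adlfs-specific option):

```python
import adlfs

# Extra keyword arguments are handed to the underlying BlobServiceClient;
# this lowers the single-request upload threshold to 4 MiB.
fs = adlfs.AzureBlobFileSystem(
    account_name="ACCOUNT_NAME",
    account_key="ACCOUNT_KEY",
    max_single_put_size=4 * 1024 * 1024,
)
```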

By default, write operations create BlockBlobs in Azure, which, once written, cannot be appended. It is possible to create an AppendBlob using `mode="ab"` when creating and operating on blobs. Currently, AppendBlobs are not available if hierarchical namespaces are enabled.
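A minimal sketch of append-blob usage, with placeholder account and path names:

```python
import adlfs

fs = adlfs.AzureBlobFileSystem(account_name="ACCOUNT_NAME", account_key="ACCOUNT_KEY")

# mode="ab" creates an AppendBlob on first write; reopening with "ab"
# appends to it instead of overwriting. Not available when hierarchical
# namespaces are enabled.
with fs.open("{CONTAINER}/logs/events.log", mode="ab") as f:
    f.write(b"first batch of records\n")
```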

### Older versions
The ADLS Gen1 filesystem has officially been [retired](https://learn.microsoft.com/en-us/lifecycle/products/azure-data-lake-storage-gen1). Hence the `adl://` protocol, which was designed to connect to ADLS Gen1, is obsolete.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -7,7 +7,7 @@ write_to = "adlfs/_version.py"

[project]
name = "adlfs"
description = "Access Azure Datalake Gen1 with fsspec and dask"
description = "Access Azure Blobs and Data Lake Storage (ADLS) Gen2 with fsspec and dask"
readme = "README.md"
license = {text = "BSD"}
maintainers = [{ name = "Greg Hayes", email = "[email protected]"}]