What happened + What you expected to happen
I am trying to read a parquet file that was unintentionally stored in S3 with hive-style partitioning. When I set partitioning=None and columns=["my_column"], read_parquet fails with this stack trace:
Traceback (most recent call last):
File "/root/lab42_vr/test.py", line 6, in <module>
ds = ray.data.read_parquet(input_s3_path, partitioning=None, columns=columns)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.venv/lib/python3.12/site-packages/ray/data/read_api.py", line 950, in read_parquet
datasource = ParquetDatasource(
^^^^^^^^^^^^^^^^^^
File "/root/.venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py", line 262, in __init__
data_columns, partition_columns = _infer_data_and_partition_columns(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.venv/lib/python3.12/site-packages/ray/data/_internal/datasource/parquet_datasource.py", line 817, in _infer_data_and_partition_columns
return data_columns, partition_columns
^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'partition_columns' where it is not associated with a value
It would probably make sense to allow setting partitioning=None along with columns.
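Based on the stack trace alone, this looks like the classic pattern where a variable is only assigned inside a conditional branch that is skipped when partitioning is None. Below is a minimal, hypothetical sketch of that failure pattern and an obvious guard; the function names and logic are illustrative only, not Ray's actual source for _infer_data_and_partition_columns:

```python
# Hypothetical sketch of the suspected failure pattern (NOT Ray's actual code).
# When `partitioning` is None, the assigning branch never runs, so the
# final `return` hits an unbound local name.

def infer_columns_buggy(columns, partitioning):
    if partitioning is not None:
        data_columns = [c for c in columns if c not in partitioning]
        partition_columns = [c for c in columns if c in partitioning]
    # BUG: with partitioning=None, neither name was ever bound.
    return data_columns, partition_columns


def infer_columns_fixed(columns, partitioning):
    # Guard: with no partitioning, every requested column is a data column.
    if partitioning is None:
        return list(columns), []
    data_columns = [c for c in columns if c not in partitioning]
    partition_columns = [c for c in columns if c in partitioning]
    return data_columns, partition_columns


try:
    infer_columns_buggy(["text"], None)
except UnboundLocalError as e:
    print(f"reproduced: {e}")

print(infer_columns_fixed(["text"], None))  # → (['text'], [])
```

If the real function follows this shape, an early return (or default assignments) for the partitioning=None case would allow partitioning=None and columns to be combined, as suggested above.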
Versions / Dependencies
I verified this bug is present in at least Ray 2.46.0 and 2.48.0.
Reproduction script
import ray
import pandas as pd
import os
# Create sample data with a text column
sample_data = {
"text": [
"This is the first sample text document for testing purposes.",
"Here is another example of text data that could be used in research.",
"Machine learning models often require large amounts of text data for training.",
"Natural language processing is a fascinating field of artificial intelligence.",
"This parquet file contains sample text data for reproducible testing.",
"Ray Data provides efficient distributed data processing capabilities.",
"Parquet is a columnar storage format that works well with big data tools.",
"Testing with synthetic data ensures reproducibility across environments.",
"Data preprocessing is a crucial step in machine learning pipelines.",
"Sample datasets help validate code functionality before production use."
]
}
# Create DataFrame and save as parquet
df = pd.DataFrame(sample_data)
parquet_file_path = "sample_text_data.parquet"
df.to_parquet(parquet_file_path, index=False)
print(f"Created parquet file: {parquet_file_path}")
print(f"File size: {os.path.getsize(parquet_file_path)} bytes")
# Read the parquet file with Ray; this should fail
ds = ray.data.read_parquet(parquet_file_path, partitioning=None, columns=["text"])
ds.materialize()
print(f"Successfully loaded {ds.count()} rows from the parquet file")
print("Sample data:")
print(ds.take(2))
# Clean up the created file (optional)
os.remove(parquet_file_path)
Issue Severity
There is a simple workaround: don't pass partitioning=None when also setting the columns arg, e.g.
ds = ray.data.read_parquet(parquet_file_path, columns=["text"])