Skip to content

chore: _read_gbq_colab supports querying a pandas DataFrame #1801

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 30 commits into from
Jun 13, 2025

Conversation

tswast
Copy link
Collaborator

@tswast tswast commented Jun 9, 2025

Work-in-progress:

  • Finish unit tests
  • Implement "happy path" (query after load job)
  • Implement dry run

Fixes internal issue b/406027008, b/409317722 🦕

@product-auto-label product-auto-label bot added size: s Pull request size is small. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Jun 9, 2025
@tswast tswast marked this pull request as ready for review June 10, 2025 20:19
@tswast tswast requested review from a team as code owners June 10, 2025 20:19
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: s Pull request size is small. labels Jun 10, 2025
@tswast tswast requested a review from TrevorBergeron June 10, 2025 20:19
# Note: this uses the sql from the executor, so is coupled tightly to execution
# implementaton. It will reference cached tables instead of original data sources.
# Maybe should just compile raw BFET? Depends on user intent.
sql = self.session._executor.to_sql(
array_value.rename_columns(substitutions), enable_cache=enable_cache
Copy link
Collaborator Author

@tswast tswast Jun 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename_columns call moved to _array_value_for_output

@tswast
Copy link
Collaborator Author

tswast commented Jun 13, 2025

FAILED tests/unit/pandas/io/test_api.py::test_read_gbq_colab_dry_run_doesnt_call_set_location - AssertionError: Expected

I suspect this is due to the dependency on global state for "is_started". I'll see if I can mock something more out to make this more independent.

Edit: I think this has been fixed. I added an "else" block to avoid setting the default location if its a dry run.

@tswast
Copy link
Collaborator Author

tswast commented Jun 13, 2025

Windows failures look like real ones too:

FAILED tests/unit/session/test_read_gbq_colab.py::test_read_gbq_colab_includes_formatted_values_in_dry_run[True]
FAILED tests/unit/session/test_read_gbq_colab.py::test_read_gbq_colab_includes_formatted_values_in_dry_run[False]
E       TypeError: Unexpected Arrow data type int32. Share your usecase with the BigQuery DataFrames team at the [https://bit.ly/bigframes-feedback](https://www.google.com/url?q=https://bit.ly/bigframes-feedback&sa=D) survey. You are currently running BigFrames version 2.6.0.

Edit: I believe this has been fixed by 07b0fa9

Comment on lines 44 to 47
def _pandas_df_to_sql_dry_run(pd_df: pandas.DataFrame) -> str:
managed_table = bigframes.core.local_data.ManagedArrowTable.from_pandas(pd_df)
bqschema = managed_table.schema.to_bigquery()
return bigquery_schema.to_sql_dry_run(bqschema)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the schema here might drift a bit from the eventual "real" schema when it comes to duplicate labels? Elsewhere, I think we disambiguate just before calling ManagedArrowTable.from_pandas, and we might need to push that logic into from_pandas itself

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends on the case. In some cases I think we try to preserve the pandas-y names. In this case we want it to be compatible with SQL, so I can run the de-duper before calling from_pandas.

@@ -444,6 +444,23 @@ def dtype_for_etype(etype: ExpressionType) -> Dtype:
if mapping.arrow_dtype is not None
}

# Include types that aren't 1:1 to BigQuery but allowed to be loaded in to BigQuery:
_ARROW_TO_BIGFRAMES.update(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we maybe only use this extended definition for cases where we want to be lenient(eg accepting external data sources), while still being strict for most internal stuff? worry this could allows us to accidentally have our internal types drift in places.

idx_labels,
)

def to_view(self, include_index: bool) -> bigquery.TableReference:
def to_view(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since not necessarily a view anymore, maybe like to_placeholder_sql or something like that?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case there is a session, so it's either a view or a table. Renamed to _to_placeholder_table since in the BQ API views are table resources.

@tswast tswast merged commit 8ebfa57 into main Jun 13, 2025
24 checks passed
@tswast tswast deleted the b406027008-read_gbq_colab-local-df branch June 13, 2025 21:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants