chore: _read_gbq_colab supports querying a pandas DataFrame #1801
Conversation
# Note: this uses the sql from the executor, so is coupled tightly to execution
# implementation. It will reference cached tables instead of original data sources.
# Maybe should just compile raw BFET? Depends on user intent.
sql = self.session._executor.to_sql(
    array_value.rename_columns(substitutions), enable_cache=enable_cache
rename_columns call moved to _array_value_for_output
…df' into b406027008-read_gbq_colab-local-df
I suspect this is due to the dependency on global state for "is_started". I'll see if I can mock something more out to make this more independent. Edit: I think this has been fixed. I added an "else" block to avoid setting the default location if it's a dry run.
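A minimal sketch of the "else" fix described above: only touch the session's default location when the query actually runs, so a dry run never mutates shared state. The function and state names here are illustrative assumptions, not the actual bigframes internals.

```python
# Hypothetical sketch: a dry run must not set session-wide defaults such
# as the default location. Names are illustrative, not bigframes APIs.
def run_query(sql: str, *, dry_run: bool, session_state: dict) -> dict:
    if dry_run:
        # Dry runs only validate the query; leave session defaults untouched.
        return {"sql": sql, "dry_run": True}
    else:
        # Real execution may lazily initialize the default location.
        session_state.setdefault("location", "US")
        return {"sql": sql, "dry_run": False, "location": session_state["location"]}
```

The point of the pattern is that the dry-run branch returns before any assignment to shared state, which is what the added "else" block accomplishes.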
Windows failures look like real ones too:
Edit: I believe this has been fixed by 07b0fa9
bigframes/core/pyformat.py
Outdated
def _pandas_df_to_sql_dry_run(pd_df: pandas.DataFrame) -> str:
    managed_table = bigframes.core.local_data.ManagedArrowTable.from_pandas(pd_df)
    bqschema = managed_table.schema.to_bigquery()
    return bigquery_schema.to_sql_dry_run(bqschema)
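For intuition, a schema-only "dry run" SQL string can be built as a SELECT of NULL literals cast to each column's type, so BigQuery can validate the query shape without any data. This is a hedged sketch of the idea; the helper name and exact output format are assumptions, not the real bigquery_schema.to_sql_dry_run.

```python
# Illustrative stand-in for a schema-to-SQL dry-run helper: emit one
# CAST(NULL AS <type>) expression per (name, type) column pair.
def schema_to_dry_run_sql(columns: list) -> str:
    exprs = [f"CAST(NULL AS {typ}) AS `{name}`" for name, typ in columns]
    return "SELECT " + ", ".join(exprs)

sql = schema_to_dry_run_sql([("id", "INT64"), ("name", "STRING")])
```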
I think the schema here might drift a bit from the eventual "real" schema when it comes to duplicate labels? Elsewhere, I think we disambiguate just before calling ManagedArrowTable.from_pandas, and we might need to push that logic into from_pandas itself.
I think it depends on the case. In some cases I think we try to preserve the pandas-y names. In this case we want it to be compatible with SQL, so I can run the de-duper before calling from_pandas.
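The de-duper mentioned above could look like the following sketch: rewrite duplicate column labels before handing the DataFrame to ManagedArrowTable.from_pandas so the derived SQL schema matches the eventual "real" one. The suffixing scheme ("_1", "_2", ...) is an illustrative assumption, not the actual bigframes disambiguation logic.

```python
import pandas as pd

# Hypothetical label de-duplicator: returns a copy of the DataFrame with
# SQL-safe, unique column names, leaving the input untouched.
def disambiguate_labels(pd_df: pd.DataFrame) -> pd.DataFrame:
    seen = {}
    new_cols = []
    for col in map(str, pd_df.columns):
        if col in seen:
            seen[col] += 1
            new_cols.append(f"{col}_{seen[col]}")
        else:
            seen[col] = 0
            new_cols.append(col)
    out = pd_df.copy()
    out.columns = new_cols
    return out
```

Running this just before from_pandas keeps pandas-y names elsewhere while guaranteeing SQL-compatible, unique names on this path.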
bigframes/dtypes.py
Outdated
@@ -444,6 +444,23 @@ def dtype_for_etype(etype: ExpressionType) -> Dtype:
    if mapping.arrow_dtype is not None
}

# Include types that aren't 1:1 to BigQuery but are allowed to be loaded into BigQuery:
_ARROW_TO_BIGFRAMES.update(
Can we maybe only use this extended definition for cases where we want to be lenient (e.g. accepting external data sources), while still being strict for most internal stuff? I worry this could allow our internal types to accidentally drift in places.
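The reviewer's suggestion can be sketched as keeping the strict 1:1 mapping for internal type resolution and applying the lenient extensions only at ingestion boundaries. The mapping contents below are placeholders, not the real _ARROW_TO_BIGFRAMES table.

```python
# Placeholder mappings: strict 1:1 Arrow-to-BigFrames types, plus lenient
# extensions used only when accepting external data.
_STRICT_ARROW_TO_BIGFRAMES = {"int64": "Int64", "string": "string"}
_LENIENT_EXTENSIONS = {"date32": "date32[day][pyarrow]"}

def arrow_to_bigframes(arrow_name: str, *, lenient: bool = False) -> str:
    # Internal callers use the strict table; ingestion paths opt in to
    # the extended one, so internal types cannot silently drift.
    mapping = dict(_STRICT_ARROW_TO_BIGFRAMES)
    if lenient:
        mapping.update(_LENIENT_EXTENSIONS)
    return mapping[arrow_name]
```

With this split, a lookup of an extension-only type fails fast on internal paths instead of quietly resolving.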
bigframes/core/blocks.py
Outdated
    idx_labels,
)

-def to_view(self, include_index: bool) -> bigquery.TableReference:
+def to_view(
Since it's not necessarily a view anymore, maybe something like to_placeholder_sql?
In this particular case there is a session, so it's either a view or a table. Renamed to _to_placeholder_table since in the BQ API views are table resources.
Work-in-progress:
Fixes internal issues b/406027008 and b/409317722 🦕