feat!: add allow_large_results option to read_gbq_query, aligning with bpd.options.compute.allow_large_results option #1935

Open · wants to merge 24 commits into main
Conversation

@tswast tswast commented Jul 24, 2025

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Jul 24, 2025
@tswast tswast marked this pull request as ready for review July 24, 2025 18:27
@tswast tswast requested review from a team as code owners July 24, 2025 18:27
@tswast tswast requested a review from chelsea-lin July 24, 2025 18:27
@tswast tswast changed the title feat: add allow_large_results option to read_gbq_query feat: add allow_large_results option to read_gbq_query. Set to False to enable faster queries Jul 24, 2025
@@ -215,6 +217,7 @@ def read_gbq(
use_cache: Optional[bool] = None,
col_order: Iterable[str] = (),
dry_run: bool = False,
allow_large_results: bool = True,
Contributor commented:
Should allow_large_results also default to None? This would allow it to inherit its value from ComputeOptions.allow_large_results.

@tswast (Collaborator, Author) replied Aug 20, 2025:

If we do so, it'd technically be a breaking change. Users with query results > 10 GB would have to set this option to True.

Might be worth it for consistency with other places, though?

@tswast (Collaborator, Author) replied:

done.
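The compromise above (a `None` default that inherits from `ComputeOptions.allow_large_results`) can be sketched as a small resolution helper. This is an illustrative stand-in, not the actual bigframes internals; `ComputeOptions` and `resolve_allow_large_results` here are hypothetical names.

```python
# Sketch of per-call-argument-inherits-from-global-option resolution.
# Names are illustrative, not the real bigframes implementation.
from typing import Optional


class ComputeOptions:
    """Stand-in for bpd.options.compute."""

    def __init__(self, allow_large_results: bool = True):
        self.allow_large_results = allow_large_results


def resolve_allow_large_results(
    explicit: Optional[bool], options: ComputeOptions
) -> bool:
    # An explicit True/False at the call site wins; None falls back to
    # the session-wide option, keeping the two knobs consistent.
    if explicit is not None:
        return explicit
    return options.allow_large_results


opts = ComputeOptions(allow_large_results=False)
print(resolve_allow_large_results(None, opts))  # inherits -> False
print(resolve_allow_large_results(True, opts))  # explicit wins -> True
```

The breaking-change concern follows directly: once the effective default can be `False`, queries with results over the 10 GB limit fail unless the user opts back in.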


tswast commented Aug 21, 2025

Doctest looks like a real failure:

_______________ [doctest] bigframes.pandas.io.api.read_gbq_query _______________
[gw3] linux -- Python 3.12.7 /tmpfs/src/github/python-bigquery-dataframes/.nox/doctest/bin/python
EXAMPLE LOCATION UNKNOWN, not showing all tests of that example
??? >>> df.head(2)
Differences (unified diff with -expected +actual):
    @@ -1,6 +1,5 @@
    -         pitcherFirstName pitcherLastName  averagePitchSpeed
    -rowindex
    -1                Albertin         Chapman          96.514113
    -2                 Zachary         Britton          94.591039
    +   rowindex pitcherFirstName pitcherLastName  averagePitchSpeed
    +0         1         Albertin         Chapman          96.514113
    +1         2          Zachary         Britton          94.591039
     <BLANKLINE>
    -[2 rows x 3 columns]
    +[2 rows x 4 columns]

It seems I'm not setting the index correctly.

@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Aug 21, 2025

tswast commented Aug 21, 2025

It seems I'm not setting the index correctly.

This has been fixed. I've confirmed the doctest passes locally and have added two tests for the columns and index_col arguments.
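The doctest diff above shows the symptom in plain pandas terms: when `rowindex` is not promoted to the index, it stays as a fourth column and the frame keeps its default `RangeIndex`. A minimal pandas analogue of the fix (the data values are taken from the doctest output above):

```python
import pandas as pd

# Without set_index, "rowindex" remains an ordinary column
# (4 columns, RangeIndex) -- the "actual" side of the doctest diff.
raw = pd.DataFrame(
    {
        "rowindex": [1, 2],
        "pitcherFirstName": ["Albertin", "Zachary"],
        "pitcherLastName": ["Chapman", "Britton"],
        "averagePitchSpeed": [96.514113, 94.591039],
    }
)

# Promoting it to the index reproduces the expected 3-column frame.
fixed = raw.set_index("rowindex")
print(fixed.shape)  # (2, 3) -- matches the expected doctest output
```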

@tswast tswast requested a review from chelsea-lin August 21, 2025 16:07

tswast commented Aug 21, 2025

Looks like I have some more failing tests to address:

FAILED tests/system/small/test_session.py::test_read_gbq_wildcard[all-read_gbq]
FAILED tests/system/small/test_session.py::test_read_gbq_wildcard[all-read_gbq_table]
FAILED tests/system/small/test_session.py::test_read_gbq_table_dry_run_with_max_results
FAILED tests/system/small/test_session.py::test_read_gbq_wildcard[max_results-read_gbq]
FAILED tests/system/small/test_session.py::test_read_gbq_wildcard[max_results-read_gbq_table]
FAILED tests/system/small/test_session.py::test_read_gbq_with_configuration[config2]
FAILED tests/system/small/test_session.py::test_read_gbq_w_max_results[two_rows_in_table]
FAILED tests/system/small/test_session.py::test_read_gbq_w_script_no_select

Getting started on that now.

chelsea-lin
chelsea-lin previously approved these changes Aug 21, 2025
@tswast tswast enabled auto-merge (squash) August 21, 2025 19:13
@tswast tswast changed the title feat: add allow_large_results option to read_gbq_query. Set to False to enable faster queries feat!: add allow_large_results option to read_gbq_query. Set to False to enable faster queries Aug 21, 2025
@tswast tswast disabled auto-merge August 21, 2025 20:00

tswast commented Aug 21, 2025

A lot more failures now.

FAILED tests/system/small/test_session.py::test_read_gbq_duplicate_columns_xfail[query_input_columns_dup]
FAILED tests/system/small/test_unordered.py::test_unordered_mode_read_gbq - A...
FAILED tests/system/small/test_dataframe_io.py::test_to_sql_query_unnamed_index_included
FAILED tests/system/small/test_dataframe_io.py::test_to_sql_query_named_index_included
FAILED tests/system/small/bigquery/test_vector_search.py::test_vector_search_different_params_with_query
FAILED tests/system/small/bigquery/test_vector_search.py::test_vector_search_df_with_query_column_to_search
FAILED tests/system/small/ml/test_core.py::test_model_centroids - AssertionEr...
FAILED tests/system/small/ml/test_decomposition.py::test_pca_components_ - As...
FAILED tests/system/small/ml/test_core.py::test_pca_model_principal_components
FAILED tests/system/small/ml/test_core.py::test_model_forecast[id] - Assertio...
FAILED tests/system/small/ml/test_forecasting.py::test_arima_plus_predict_params[id]
FAILED tests/system/small/ml/test_decomposition.py::test_pca_predict - Assert...
FAILED tests/system/small/ml/test_forecasting.py::test_arima_plus_score_series[id]
FAILED tests/system/small/ml/test_forecasting.py::test_arima_plus_score[id]
FAILED tests/system/small/ml/test_forecasting.py::test_arima_plus_predict_default[id]
FAILED tests/system/small/ml/test_forecasting.py::test_arima_plus_predict_explain_default[id]

I'll take a look at this when I get back I guess.


tswast commented Aug 21, 2025

___________________ test_to_sql_query_unnamed_index_included ___________________
[gw19] linux -- Python 3.11.10 /tmpfs/src/github/python-bigquery-dataframes/.nox/system-3-11/bin/python

session = <bigframes.session.Session object at 0x15486703c150>
scalars_df_default_index =    bool_col                                          bytes_col    date_col  \
0      True                             ...99+00:00  0 days 00:00:00.000004  
8                              <NA>         5 days 00:00:00  

[9 rows x 15 columns]
scalars_pandas_df_default_index =    bool_col  ...            duration_col
0      True  ...  0 days 00:00:00.000004
1     False  ...       -1 days +23:5...          <NA>
7      True  ...  0 days 00:00:00.000004
8     False  ...         5 days 00:00:00

[9 rows x 15 columns]

    def test_to_sql_query_unnamed_index_included(
        session: bigframes.Session,
        scalars_df_default_index: bpd.DataFrame,
        scalars_pandas_df_default_index: pd.DataFrame,
    ):
        bf_df = scalars_df_default_index.reset_index(drop=True).drop(columns="duration_col")
        sql, idx_ids, idx_labels = bf_df._to_sql_query(include_index=True)
        assert len(idx_labels) == 1
        assert len(idx_ids) == 1
        assert idx_labels[0] is None
        assert idx_ids[0].startswith("bigframes")
    
        pd_df = scalars_pandas_df_default_index.reset_index(drop=True).drop(
            columns="duration_col"
        )
        roundtrip = session.read_gbq(sql, index_col=idx_ids)
        roundtrip.index.names = [None]
>       utils.assert_pandas_df_equal(roundtrip.to_pandas(), pd_df, check_index_type=False)

tests/system/small/test_dataframe_io.py:1036: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
bigframes/testing/utils.py:99: in assert_pandas_df_equal
    pd.testing.assert_frame_equal(df0, df1, **kwargs)
testing.pyx:55: in pandas._libs.testing.assert_almost_equal
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   AssertionError: DataFrame.index are different
E   
E   DataFrame.index values are different (88.88889 %)
E   [left]:  Index([8, 4, 0, 3, 5, 1, 2, 6, 7], dtype='Int64')
E   [right]: RangeIndex(start=0, stop=9, step=1)
E   At positional index 0, first diff: 8 != 0

testing.pyx:173: AssertionError

I think this is a real failure. We aren't sorting by the index_col.
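The assertion error above (left index `[8, 4, 0, ...]` vs. an expected `RangeIndex`) is the classic symptom of rows coming back from the query in arbitrary order. A pandas stand-in for the missing step, using a `bigframes_index_0`-style generated index column as in the failing test:

```python
import pandas as pd

# Rows arrive from BigQuery in arbitrary order; after promoting the
# generated index column to the index, it must be sorted to line up
# with the original RangeIndex-based frame.
unordered = pd.DataFrame(
    {"bigframes_index_0": [8, 4, 0], "bool_col": [False, True, True]}
).set_index("bigframes_index_0")

restored = unordered.sort_index()
print(list(restored.index))  # [0, 4, 8]
```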

@tswast tswast enabled auto-merge (squash) August 21, 2025 22:42
chelsea-lin
chelsea-lin previously approved these changes Aug 21, 2025
@tswast tswast disabled auto-merge August 21, 2025 23:13
@tswast tswast changed the title feat!: add allow_large_results option to read_gbq_query. Set to False to enable faster queries feat!: add allow_large_results option to read_gbq_query, aligning with bpd.options.compute.allow_large_results option Aug 21, 2025
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: l Pull request size is large.
3 participants