
Order of aggregation with skipna matters #10759

@DominikStiller


What is your issue?

This is not a bug report, rather a pitfall that should maybe be documented.

I noticed that the order of aggregations matters if NaNs are present, skipna=True (the default), and the aggregation is done in separate calls. This is only a problem for aggregations that normalize by the number of valid values N, e.g., mean, but not for sum.

Example:

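import numpy as np
import xarray as xr

da = xr.DataArray(np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, np.nan, 9]
]), dims=["height", "lat"])
# Reducing over both dimensions in one call averages the 8 valid values
da.mean(["lat", "height"])     # -> 4.625 (correct)
da.mean(["height", "lat"])     # -> 4.625 (correct)
# Chained reductions give different results depending on the order
da.mean("lat").mean("height")  # -> 5.0
da.mean("height").mean("lat")  # -> 4.5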

The same happens when taking nanmeans with numpy, so this is not an xarray-only issue. The reason is that the second operation gives all intermediate means equal weight, even though they were not computed from the same number of data points in the first operation (some rows/columns have 2 valid data points, others 3).
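To make the weighting argument concrete, here is a minimal numpy sketch using the same values as the example above. The variable names are only illustrative. It shows that weighting the per-row means by the number of valid values in each row recovers the single-call result.

```python
import numpy as np

a = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, np.nan, 9],
])

# One call over all valid values: 37 / 8
joint = np.nanmean(a)

# Chained nanmeans: each intermediate mean gets equal weight
rows_then_cols = np.nanmean(np.nanmean(a, axis=1))  # mean of [2, 5, 8]
cols_then_rows = np.nanmean(np.nanmean(a, axis=0))  # mean of [4, 3.5, 6]

# Weight each row mean by the number of valid values it was computed from
row_means = np.nanmean(a, axis=1)            # [2, 5, 8]
row_counts = np.sum(~np.isnan(a), axis=1)    # [3, 3, 2]
weighted = np.average(row_means, weights=row_counts)

print(joint, rows_then_cols, cols_then_rows, weighted)
```

The weighted second step matches the joint mean because (3*2 + 3*5 + 2*8) / 8 is the same sum of valid values divided by the same total count.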

Xarray seems to be behaving correctly, and there may be no way around it without carrying weights across operations. However, I was still surprised by this behavior, so it might be worth documenting a warning, since it is not uncommon for users to perform aggregations in multiple steps, and skipna is True by default. The differences are largest when averaging over dimensions along which the number of NaNs varies a lot.
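As a rough illustration of what "carrying weights across operations" could look like in user code, the sketch below keeps the per-row count of valid values next to the intermediate result. This is one possible workaround under the assumptions of the example above, not something xarray does automatically. It only uses the existing `sum`, `count`, and `weighted` methods of `DataArray`.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, np.nan, 9],
]), dims=["height", "lat"])

# Option 1: carry sums and counts, and divide only at the very end
partial_sum = da.sum("lat")      # NaNs skipped by default for float data
partial_count = da.count("lat")  # number of non-NaN values per height
two_step_mean = partial_sum.sum("height") / partial_count.sum("height")

# Option 2: take the intermediate mean, then weight it by the counts
lat_mean = da.mean("lat")
weighted_mean = lat_mean.weighted(partial_count).mean("height")

print(float(two_step_mean), float(weighted_mean))
```

Both options reproduce the result of `da.mean(["lat", "height"])` for this array, at the cost of tracking an extra count array between steps.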
