Stack and unstack with PandasMultiIndex#

Highlights#

  1. An xarray.indexes.PandasMultiIndex is associated with multiple coordinate variables sharing the same dimension.

  2. A PandasMultiIndex can be created from coordinates backed by a PandasIndex using xarray.Dataset.stack(), and converted back with xarray.Dataset.unstack().

  3. Labels of coordinates associated with a PandasMultiIndex can be passed all at once to .sel.

Example#

Let’s open a tutorial dataset.

import xarray as xr
ds_air = xr.tutorial.open_dataset("air_temperature")
ds_air
<xarray.Dataset> Size: 31MB
Dimensions:  (time: 2920, lat: 25, lon: 53)
Coordinates:
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
Data variables:
    air      (time, lat, lon) float64 31MB ...
Attributes: (5)

Stack / Unstack#

Stacking the “lat” and “lon” dimensions of the example dataset produces stacked “lat” and “lon” coordinates that, by default, are both associated with a single PandasMultiIndex. The underlying data are reshaped so that the lat and lon dimensions collapse into a new space dimension.

stacked = ds_air.stack(space=("lat", "lon"))
stacked
<xarray.Dataset> Size: 31MB
Dimensions:  (time: 2920, space: 1325)
Coordinates:
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
  * space    (space) object 11kB MultiIndex
  * lat      (space) float32 5kB 75.0 75.0 75.0 75.0 ... 15.0 15.0 15.0 15.0
  * lon      (space) float32 5kB 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
Data variables:
    air      (time, space) float64 31MB 241.2 242.5 243.5 ... 296.5 296.2 295.7
Attributes: (5)

The multi-index makes it possible to recover the original, unstacked dataset, in which the “lat” and “lon” dimension coordinates each have their own PandasIndex.

unstacked = stacked.unstack("space")
unstacked
<xarray.Dataset> Size: 31MB
Dimensions:  (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Attributes: (5)

Assigning#

We can also directly associate a PandasMultiIndex with existing coordinates sharing the same dimension.

ds_air = (
    ds_air.assign_coords(season=ds_air.time.dt.season)
    .rename_vars(time="datetime")
    .drop_indexes("datetime")
)

ds_air
<xarray.Dataset> Size: 31MB
Dimensions:   (time: 2920, lat: 25, lon: 53)
Coordinates:
    datetime  (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
    season    (time) <U3 35kB 'DJF' 'DJF' 'DJF' 'DJF' ... 'DJF' 'DJF' 'DJF'
  * lat       (lat) float32 100B 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0
  * lon       (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
Dimensions without coordinates: time
Data variables:
    air       (time, lat, lon) float64 31MB ...
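Attributes: (5)

The “season” and “datetime” coordinates now share the time dimension but have no index. Passing both names to xarray.Dataset.set_xindex() builds a single PandasMultiIndex over them.

multi_indexed = ds_air.set_xindex(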
    ["season", "datetime"], xr.indexes.PandasMultiIndex
)
multi_indexed
<xarray.Dataset> Size: 31MB
Dimensions:   (time: 2920, lat: 25, lon: 53)
Coordinates:
  * time      (time) object 23kB MultiIndex
  * season    (time) <U3 35kB 'DJF' 'DJF' 'DJF' 'DJF' ... 'DJF' 'DJF' 'DJF'
  * datetime  (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
  * lat       (lat) float32 100B 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0
  * lon       (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
Data variables:
    air       (time, lat, lon) float64 31MB ...
Attributes: (5)

Indexing#

Unlike in the default PandasIndex example, labels for both coordinates of the time multi-index can be passed to a single xarray.Dataset.sel() call.

multi_indexed.sel(season="DJF", datetime="2013")
<xarray.Dataset> Size: 4MB
Dimensions:   (time: 360, lat: 25, lon: 53)
Coordinates:
  * time      (time) object 3kB MultiIndex
  * season    (time) <U3 4kB 'DJF' 'DJF' 'DJF' 'DJF' ... 'DJF' 'DJF' 'DJF' 'DJF'
  * datetime  (time) datetime64[ns] 3kB 2013-01-01 ... 2013-12-31T18:00:00
  * lat       (lat) float32 100B 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0
  * lon       (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
Data variables:
    air       (time, lat, lon) float64 4MB ...
Attributes: (5)

Chaining .sel calls on those coordinates, each with its own index, yields an equivalent selection.

single_indexed = ds_air.set_xindex("datetime").set_xindex("season")

single_indexed.sel(season="DJF").sel(datetime="2013")
<xarray.Dataset> Size: 4MB
Dimensions:   (time: 360, lat: 25, lon: 53)
Coordinates:
  * datetime  (time) datetime64[ns] 3kB 2013-01-01 ... 2013-12-31T18:00:00
  * season    (time) <U3 4kB 'DJF' 'DJF' 'DJF' 'DJF' ... 'DJF' 'DJF' 'DJF' 'DJF'
  * lat       (lat) float32 100B 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0
  * lon       (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
Dimensions without coordinates: time
Data variables:
    air       (time, lat, lon) float64 4MB ...
Attributes: (5)

Assigning a pandas.MultiIndex#

It is easy to wrap an existing pandas.MultiIndex object into a new Xarray Dataset or DataArray.

import pandas as pd

midx = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2]], names=("foo", "bar")
)
midx
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           names=['foo', 'bar'])

This can be done via xarray.Coordinates.from_pandas_multiindex().

midx_coords = xr.Coordinates.from_pandas_multiindex(midx, dim="x")

ds = xr.Dataset(coords=midx_coords)
ds
<xarray.Dataset> Size: 96B
Dimensions:  (x: 4)
Coordinates:
  * x        (x) object 32B MultiIndex
  * foo      (x) object 32B 'a' 'a' 'b' 'b'
  * bar      (x) int64 32B 1 2 1 2
Data variables:
    *empty*
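The wrapped coordinates behave like any other multi-indexed coordinates. As a small sketch, with data values made up for illustration, the same Coordinates object can be attached to a DataArray, which can then be selected by level labels or unstacked:

```python
import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("foo", "bar"))
midx_coords = xr.Coordinates.from_pandas_multiindex(midx, dim="x")

# Hypothetical data values, one per (foo, bar) pair.
da = xr.DataArray([10, 20, 30, 40], dims="x", coords=midx_coords)

# Select one element by giving a label for each level.
value = da.sel(foo="b", bar=1).item()

# Unstack the multi-index into separate "foo" and "bar" dimensions.
unstacked = da.unstack("x")
unstacked_sizes = dict(unstacked.sizes)
```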