Python: xarray Q&As

During the pandemic I ran an online xarray training session for a major commodities broker that wanted to explore this incredibly useful Python package for processing big weather data. After a presentation and a hands-on demonstration, I ran a Q&A for the audience of software engineers, who were already familiar with Pandas. Here is a transcript of the Q&A (note that this was quite a while ago now and there have been some major updates to xarray since, including much tighter integration with cloud-native data structures):

First, here are the links to the xarray docs and GitHub repository. Install xarray into your Python environment using conda or pip.

Q1. How does Pandas multi-indexing translate to xarray? Equivalent APIs on both?

In Pandas, multi-indexing (a.k.a. hierarchical indexing) enables multidimensional data to be stored in 2D dataframes by adding multiple index columns. Dimensions 2 to N are present in effect, through the nested levels of indexing, but not explicitly as they are in xarray.

There is native support for Pandas multi-indexing in xarray. Converting a multi-indexed Pandas dataframe into an xarray dataset is achievable using the built-in Pandas method

df.to_xarray()

In this case each of the index columns is interpreted as a separate dimension in the resulting xr.Dataset. Here is an example:

import pandas as pd

# Read a CSV with the first two columns as a two-level index
df = pd.read_csv(filename, sep=',', skiprows=1, header=None,
                 skipinitialspace=True, index_col=[0, 1])

df.columns = ['COL1', 'COL2', 'COL3']

df.index.names = ['CARS', 'LAPTOPS']

ds = df.to_xarray()

The resulting xarray dataset will have two dimensions: dim1 = CARS and dim2 = LAPTOPS. The coordinates are the values in those index columns of the dataframe, e.g. the coords on CARS could be [Fiesta, Focus, Corsa, Civic] and the coords on LAPTOPS could be [Thinkpad, Surface, MacBook].

Xarray structures can also persist multi-indexing, where coordinates can have nested levels of additional coordinates. For example, in the following dataarray the coordinate x is composed of the sub-coordinates C1, a text index with values 'a' and 'b', and C2, a numeric index containing the values 0 and 1:

Coordinates:
  * x        (x) MultiIndex
  - C1      (x) object 'a' 'b'
  - C2      (x) int64 0 1
  * y        (y) int64 0 1 2
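For reference, here is a minimal sketch of how such a multi-indexed dataarray might be constructed (the names and values are illustrative, not from the session):

import numpy as np
import pandas as pd
import xarray as xr

# Build a two-level index (levels C1 and C2) and attach it as coordinate x
midx = pd.MultiIndex.from_arrays([['a', 'b'], [0, 1]], names=('C1', 'C2'))
da = xr.DataArray(np.random.rand(2, 3),
                  coords={'x': midx, 'y': [0, 1, 2]},
                  dims=('x', 'y'))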

In this case, we can select and slice the dataarray using da.loc[] and da.sel(), e.g.

da.sel(x=[('a', 0), ('b', 1)])

or multi-index coordinates can simply be passed as individual kwargs:

da.sel(C1='a', C2=0)

Most of the related functionality is the same in xarray as it is in Pandas, and the APIs are mostly consistent between the two packages; however, there are some nuances that separate them. One is that xarray's stack function does not automatically drop missing values the way Pandas does; instead it persists NaNs into the stacked array. The reason for this is that consistently shaped arrays are required for native parallelization with Dask.
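To illustrate that difference, a quick sketch with made-up values:

import numpy as np
import pandas as pd
import xarray as xr

df = pd.DataFrame([[1.0, np.nan], [3.0, 4.0]],
                  index=['r0', 'r1'], columns=['c0', 'c1'])

df.stack()                                    # Pandas drops the NaN entry
xr.DataArray(df).stack(z=('dim_0', 'dim_1'))  # xarray keeps it as NaN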

The bottom line is that there is robust native support for Pandas multi-indexing in xarray. Some more complex use-cases, such as flattening arrays to send to scikit-learn or parallelizing with Dask, require some idiosyncratic data manipulation, but this is all very well documented at xarray.pydata.org. The majority of the time, multi-indexing behaviour is consistent between the two packages.

Q2. Does Dask allow arbitrary reduction functions?

Dask itself can parallelize operations over arrays, or it can be used to parallelize custom workloads unrelated to array manipulation, for example by decorating normal Python functions with @dask.delayed. This makes the function lazy: instead of running the function in loops, you can distribute all the iterations across threads/cores/machines. This makes Dask applicable to almost any function.
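A minimal sketch of that pattern (the function body is just a stand-in for real work):

import dask

@dask.delayed
def process(x):
    # Placeholder for an expensive, independent computation
    return x ** 2

# Calling process() now only builds a lazy task graph;
# dask.compute() executes all the tasks in parallel
results = dask.compute(*[process(i) for i in range(100)])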

In the context of xarray, as I mentioned in the presentation, there are many built-in functions that automatically parallelize using Dask as long as the 'chunks' argument is provided when the dataset is created, including many reduction functions such as ds.sum(), ds.max(), ds.min(), ds.mean() and ds.reduce(). Others can be parallelized using the apply_ufunc() method, and the final option is to pull the Dask arrays out of the xarray wrapper and use Dask decoupled from xarray, perhaps with the @dask.delayed decorator described above.
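For example, something like this (the filename and chunk size are placeholders):

import xarray as xr

# Providing 'chunks' backs the dataset's variables with Dask arrays
ds = xr.open_dataset('weather.nc', chunks={'time': 1000})

# Reductions now build a lazy task graph; .compute() runs it in parallel
time_mean = ds.mean(dim='time').compute()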

Q3. Can Dask handle more complex data dependencies, e.g. finite difference?

I have not used xarray/Dask for finite-difference problems myself yet, so I can't really give detailed advice on it, but I am aware of the ds.diff() function, which computes the nth-order finite difference along a given dimension, taking n and the dimension name as arguments.
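For example (a sketch with made-up values):

import numpy as np
import xarray as xr

da = xr.DataArray(np.array([1.0, 4.0, 9.0, 16.0]), dims='time')

# First-order finite difference along 'time'; the result is one
# element shorter per order n: [3.0, 5.0, 7.0]
da.diff('time', n=1)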

Q4. The apply_ufunc operation – what are the arguments to x and y? Individual floats or can they be high level objects?

With apply_ufunc, the necessary arguments are the function itself, plus the arguments required by that function. Those can be integers, floats or higher-level objects such as arrays, depending upon the custom function you wish to apply. So, for example, if we wish to apply a vector-norm function over the 'x' dimension of an array in a dataset, we can define the function, wrap it with apply_ufunc(), and then pass the dataset (ds1) as an argument.

# example from http://xarray.pydata.org/en/stable/computation.html
import numpy as np
import xarray as xr

def vector_norm(array, dim, ord=None):
    return xr.apply_ufunc(np.linalg.norm, array,
                          input_core_dims=[[dim]],
                          kwargs={'ord': ord, 'axis': -1})

vector_norm(ds1, dim='x')

The input_core_dims argument names the dimension(s) the function operates along. These core dimensions are moved to the last axis before the function is applied (hence axis=-1 above), and they do not appear in the output unless they are also listed in output_core_dims.
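Tying back to Q2, apply_ufunc can also hand the computation to Dask when the input is chunked. A minimal sketch under that assumption (parallel_norm is an illustrative name, not an xarray function):

import numpy as np
import xarray as xr

def parallel_norm(array, dim):
    # dask='parallelized' applies the wrapped function chunk-by-chunk
    return xr.apply_ufunc(np.linalg.norm, array,
                          input_core_dims=[[dim]],
                          kwargs={'axis': -1},
                          dask='parallelized',
                          output_dtypes=[array.dtype])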

Q5. What are the xarray equivalents to Pandas: a) stack/unstack; b) shift; c) renaming dims/coords; d) time resampling?

a) Stack/unstack

xarray also has built-in stack/unstack functions: simply ds.stack() and ds.unstack(). These operations can be applied to datasets or dataarrays, and they allow any number of existing dimensions to be stacked into a single new dimension, and the reverse. This also works for multi-indexed arrays. Overall the functionality is near identical to Pandas.
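A minimal sketch (the dimension names are illustrative):

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(6).reshape(2, 3), dims=('x', 'y'))

# Collapse x and y into a single multi-indexed dimension z...
stacked = da.stack(z=('x', 'y'))

# ...and recover the original dimensions
restored = stacked.unstack('z')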

b) shift

Again, shifting in xarray is almost identical to Pandas. The ds.shift() function shifts an array's values along a dimension by a given integer offset: positive integers shift the values to the right, negative integers shift them to the left, and any gaps opened up by the shift are filled with NaNs. It is important to note that in xarray only the values are shifted; the coordinates remain fixed, so the values move relative to the coordinates.
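A quick sketch of that behaviour (made-up values):

import xarray as xr

da = xr.DataArray([1, 2, 3, 4], dims='x', coords={'x': [0, 1, 2, 3]})

# Values move two places to the right; the x coords stay fixed and
# the vacated positions are filled with NaN: [nan, nan, 1.0, 2.0]
da.shift(x=2)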

c) renaming dims/coords

Renaming dims and coords is achieved using the ds.rename() function with a dict object as an argument. For example, to rename dimension 'x' to 'longitude':

renamed_ds = ds.rename({'x': 'longitude'})

There were some open GitHub issues regarding automatic renaming of coordinates when dimensions were renamed, which was not always the desired behaviour, but releases since the end of 2019 appear to have resolved that, and separate renaming of dims and coords now seems to be the default behaviour.

d) time resampling

Time series are obviously a major use-case for xarray. Datetime arrays created with Pandas/NumPy can be passed directly into xarray datasets and dataarrays; alternatively, if date-time information is stored in a NetCDF file being opened with xarray, it is automatically decoded into a date-time data type. All the functionality for time manipulation that one would expect from Pandas is then also available in xarray. Time resampling in xarray is powerful and syntactically simple; for example, if we have hourly data and we wish to downsample to 6-hourly, we can simply use:

resampled_ds = ds.resample(time='6H').mean()

and xarray will return a dataset with 6-hour time resolution. Note that resample() itself returns a grouped object; it is the aggregation (here .mean()) that produces the resampled dataset.

For upsampling, there are several methods available for temporal interpolation, all of which are also available in Pandas with essentially the same API. Those are:

ffill: forward fill (propagate the last valid value into the next empty space)

bfill: backfill (propagate the next valid value back into the last empty space)

pad: an alias for forward fill

nearest: nearest-neighbour filling

interpolate: apply SciPy's interp1d function.
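For example, upsampling 6-hourly data to hourly with linear interpolation might look like this (a sketch; the names and frequencies are illustrative):

import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range('2020-01-01', periods=4, freq='6H')
da = xr.DataArray(np.arange(4.0), coords={'time': times}, dims='time')

# Insert hourly time steps and fill them by linear interpolation
hourly = da.resample(time='1H').interpolate('linear')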
