
# Open a Zarr Store with Xarray

## Overview

Zarr stores on Borealis appear as directories whose paths follow the Zarr spec (`.zattrs`, `.zgroup`, `var/0.0`, etc.). `dataversefs` exposes them as a virtual filesystem so Xarray can traverse the tree without downloading the entire dataset upfront.
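What such a mapper actually serves can be sketched with a plain dict standing in for the store; the keys below are illustrative, not taken from the dataset used later on this page:

```python
import json

# A minimal (illustrative) Zarr v2 key layout, modeled as a dict.
# A real store holds the same keys as files inside the .zarr directory.
toy_store = {
    ".zgroup": json.dumps({"zarr_format": 2}),      # marks the root group
    ".zattrs": json.dumps({"title": "example"}),    # group-level attributes
    "var/.zarray": json.dumps({                     # array metadata for "var"
        "zarr_format": 2,
        "shape": [2, 2],
        "chunks": [1, 2],
        "dtype": "<f8",
        "compressor": None,
        "fill_value": 0,
        "order": "C",
        "filters": None,
    }),
    "var/0.0": b"\x00" * 16,   # first chunk: one 1x2 row of float64
    "var/1.0": b"\x00" * 16,   # second chunk
}

# zarr/xarray only ever ask the mapper for keys like these, which is why
# a virtual filesystem can serve them without downloading the whole tree.
chunk_keys = [k for k in toy_store
              if k[0] != "." and not k.endswith((".zarray", ".zattrs"))]
print(sorted(chunk_keys))  # → ['var/0.0', 'var/1.0']
```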

## Basic Usage

Create an `fs` object first, then pass `fs.get_mapper()` to `xr.open_zarr`.

```python
import fsspec
import xarray as xr

fs = fsspec.filesystem(
    "dataverse",
    host="borealisdata.ca",
    pid="doi:10.5683/SP3/7HF3IC",
)

store = fs.get_mapper("dual_heading.zarr")
ds = xr.open_zarr(store, consolidated=False)
print(ds)
```

!!! important
    Always pass `consolidated=False`. Borealis datasets do not include a consolidated `.zmetadata` file by default, so Zarr must query each metadata key individually.
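What consolidation would buy is one metadata fetch instead of many: every `.z*` document is merged into a single `.zmetadata` key, which is what `zarr.consolidate_metadata()` does for a real store. A stdlib-only sketch of the idea (the `consolidate` helper here is ours, not zarr or dataversefs code):

```python
import json

def consolidate(store: dict) -> None:
    """Merge all .z* metadata keys into one .zmetadata document,
    mirroring what zarr.consolidate_metadata() does for a real store."""
    metadata = {
        key: json.loads(store[key])
        for key in store
        if key.rsplit("/", 1)[-1] in (".zgroup", ".zattrs", ".zarray")
    }
    store[".zmetadata"] = json.dumps({"zarr_format": 2, "metadata": metadata})

store = {
    ".zgroup": json.dumps({"zarr_format": 2}),
    "heading/.zarray": json.dumps({"shape": [4], "chunks": [2]}),
}
consolidate(store)
print(".zmetadata" in store)  # → True
```

A store consolidated this way could then be opened without `consolidated=False`, since the reader finds all metadata in one request.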

!!! note "Works in Jupyter, scripts, and Dask"
    File downloads use synchronous requests in a thread executor, so they are safe to call from any event loop: Jupyter's, zarr's internal loop, fsspec's background loop, or a Dask worker. No special workarounds are needed on Windows or any other platform.
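The "sync requests in a thread executor" pattern mentioned above can be sketched with the standard library alone; the download function here is a stand-in, not dataversefs code:

```python
import asyncio

def blocking_download(key: str) -> bytes:
    """Stand-in for a synchronous HTTP request (e.g. requests.get)."""
    return b"chunk-bytes-for-" + key.encode()

async def fetch(key: str) -> bytes:
    # Running the sync call in the default thread executor keeps the
    # event loop responsive, whether it is Jupyter's, Dask's, or fsspec's.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_download, key)

print(asyncio.run(fetch("var/0.0")))  # → b'chunk-bytes-for-var/0.0'
```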

## With Authentication

```python
import os

import fsspec
import xarray as xr
from dotenv import load_dotenv

load_dotenv()

fs = fsspec.filesystem(
    "dataverse",
    host="borealisdata.ca",
    pid="doi:10.5683/SP3/7HF3IC",
    token=os.environ["DATAVERSE_API_TOKEN"],
)

store = fs.get_mapper("dual_heading.zarr")
ds = xr.open_zarr(store, consolidated=False)
```

## Lazy Loading and Dask

Variables in the returned dataset are backed by Dask arrays:

```python
print(ds.heading)                    # shows a Dask array; no data loaded yet
mean = ds.heading.mean().compute()   # triggers the actual network reads
```

Only the chunks needed for the computation are fetched, using HTTP `Range` requests directly against the Borealis S3 backend.
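Which chunks a computation touches follows directly from the slice bounds and the chunk size; a small sketch of that mapping, pure arithmetic and independent of dataversefs:

```python
def chunks_for_slice(start: int, stop: int, chunk_len: int) -> list[str]:
    """Return the 1-D Zarr chunk keys a half-open slice [start, stop) touches."""
    first = start // chunk_len
    last = (stop - 1) // chunk_len
    return [str(i) for i in range(first, last + 1)]

# A 170-element slice over 100-element chunks touches only 3 of the
# store's chunk objects, so only those three are fetched over HTTP.
print(chunks_for_slice(250, 420, 100))  # → ['2', '3', '4']
```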

## Common Issues

**`FileNotFoundError` on `.zmetadata`**: Set `consolidated=False`. Zarr tries to read `.zmetadata` first; if it does not exist and `consolidated` is not `False`, it raises an error.

**Slow first open**: The routing table is built by fetching the full dataset JSON from Dataverse. Even for datasets with hundreds of files this is a single API call, but it may take a second or two.
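That routing table maps Zarr key paths to Dataverse file IDs from the one dataset JSON response. A sketch of how such a table could be built; the field names (`latestVersion`, `directoryLabel`, `dataFile`) reflect our reading of the Dataverse native API's dataset JSON and should be checked against a real response:

```python
def build_routing_table(dataset_json: dict) -> dict[str, int]:
    """Map 'directoryLabel/filename' paths to dataFile ids from a
    Dataverse dataset JSON payload (a single API call)."""
    table = {}
    for entry in dataset_json["data"]["latestVersion"]["files"]:
        data_file = entry["dataFile"]
        directory = entry.get("directoryLabel", "")
        key = f"{directory}/{data_file['filename']}" if directory else data_file["filename"]
        table[key] = data_file["id"]
    return table

payload = {"data": {"latestVersion": {"files": [
    {"directoryLabel": "dual_heading.zarr/heading",
     "dataFile": {"filename": "0.0", "id": 101}},
    {"dataFile": {"filename": ".zgroup", "id": 102}},
]}}}
print(build_routing_table(payload))
# → {'dual_heading.zarr/heading/0.0': 101, '.zgroup': 102}
```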

**Pre-signed URL expiry**: Borealis pre-signed S3 URLs expire (typically after one hour). For very long-running Dask jobs, create a new filesystem instance to refresh the URL cache.
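One way to handle expiry in a long job is a retry wrapper that rebuilds the filesystem when a read fails. A generic sketch, where the factory and reader callables are placeholders rather than dataversefs API:

```python
from typing import Callable

def read_with_refresh(make_fs: Callable[[], object],
                      read: Callable[[object, str], bytes],
                      key: str) -> bytes:
    """Try a read; on failure, rebuild the filesystem once (which would
    re-request pre-signed URLs) and retry."""
    fs = make_fs()
    try:
        return read(fs, key)
    except OSError:  # an expired pre-signed URL typically surfaces as an HTTP error
        fs = make_fs()  # fresh instance, fresh URL cache
        return read(fs, key)

# Toy demonstration: the first filesystem instance always fails.
calls = {"n": 0}

def make_fs():
    calls["n"] += 1
    return calls["n"]

def read(fs, key):
    if fs == 1:
        raise OSError("403: URL expired")
    return b"data"

print(read_with_refresh(make_fs, read, "var/0.0"))  # → b'data'
```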