Zarr / Xarray Access via dataversefs¶
This notebook demonstrates end-to-end lazy access to a Zarr store hosted on Borealis Dataverse using dataversefs.
Dataset: doi:10.5683/SP3/7HF3IC — 465 files including dual_heading.zarr
Host: borealisdata.ca
Prerequisites¶
uv add dataversefs xarray zarr numpy matplotlib python-dotenv
Create a .env file with your token (or omit token below for public access):
DATAVERSE_API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
import os
import fsspec
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
from dotenv import load_dotenv
import dataversefs # noqa: F401 — registers 'dataverse' fsspec protocol
load_dotenv()
HOST = "borealisdata.ca"
PID = "doi:10.5683/SP3/7HF3IC"
TOKEN = os.environ.get("DATAVERSE_API_TOKEN") # None → public access
ZARR_STORE = "dual_heading.zarr"
print(f"Host : {HOST}")
print(f"PID : {PID}")
print(f"Token: {'set' if TOKEN else 'not set (public access)'}")
Enable Logging¶
dataversefs uses loguru for structured logging.
The library is silent by default; enable it by calling logger.enable("dataversefs").
Two useful levels:
| Level | What you see |
|---|---|
| INFO | Routing table build — number of files/dirs and elapsed time |
| DEBUG | Every HEAD redirect resolution and GET Range request with timing |
Run the cell below before creating the filesystem to capture the routing table build log.
import sys
from loguru import logger
# Remove the default handler (stderr) to avoid duplicate output in Jupyter.
# Replace it with a stdout handler scoped to dataversefs, no ANSI colour codes.
logger.remove()
logger.add(
sys.stdout,
level="INFO", # change to "DEBUG" for per-request timing
filter="dataversefs",
format="{time:HH:mm:ss.SSS} | {level:<7} | {message}",
colorize=False,
)
logger.enable("dataversefs")
print("dataversefs logging enabled at INFO level.")
print("Change level='INFO' to level='DEBUG' above for per-request detail.")
Mount the Filesystem¶
The filesystem is scoped to the dataset identified by pid. All paths are relative to the dataset root.
fs = fsspec.filesystem(
"dataverse",
host=HOST,
pid=PID,
token=TOKEN,
skip_instance_cache=True, # ensure a fresh session for this notebook
)
root_entries = fs.ls("", detail=True)
print(f"{len(root_entries)} entries at dataset root:")
for e in root_entries[:20]:
print(f" [{e['type']:9s}] {e['name']}")
if len(root_entries) > 20:
print(f" ... and {len(root_entries) - 20} more")
Discover the Zarr Store¶
List the top-level contents of dual_heading.zarr to confirm the store structure is intact.
zarr_entries = fs.ls(ZARR_STORE, detail=True)
print(f"Contents of {ZARR_STORE}:")
for e in zarr_entries:
size_str = f"{e['size']:>8d} B" if e["type"] == "file" else " "
print(f" [{e['type']:9s}] {size_str} {e['name']}")
Open with Xarray¶
Use fs.get_mapper() to hand a sync MutableMapping directly to xr.open_zarr.
This reuses the existing fs instance — same routing table, same S3 URL cache —
instead of letting zarr 3 create a second filesystem from a dataverse:// URI.
Pass consolidated=False — Borealis datasets do not include a consolidated
.zmetadata file, so Zarr must query each metadata key individually.
store = fs.get_mapper(ZARR_STORE)
ds = xr.open_zarr(store, consolidated=False)
print(ds)
Inspect the Dataset¶
print("Data variables:", list(ds.data_vars))
print("Dimensions :", dict(ds.dims))
print("Coordinates :", list(ds.coords))
print()
ds.info()
Lazy Loading¶
Variables are backed by Dask arrays — no data has been downloaded yet.
heading_var = ds[list(ds.data_vars)[0]]
print(f"Variable : {heading_var.name}")
print(f"Type : {type(heading_var.data)}")
print(f"Shape : {heading_var.shape}")
print(f"Chunks : {heading_var.chunks}")
print()
print("No data downloaded yet — this is a Dask graph.")
Compute a Statistic¶
.compute() triggers actual HTTP Range requests to the Borealis S3 backend.
first_var = list(ds.data_vars)[0]
var_mean = ds[first_var].mean().compute()
print(f"Mean of {first_var}: {float(var_mean):.6f} {ds[first_var].attrs.get('units', '')}")
# Show the S3 URL cache — one URL per unique file after all reads
print(f"\nCached S3 URLs: {len(fs._download_urls)}")
Plot¶
Load up to two variables and plot their time series.
vars_to_plot = list(ds.data_vars)[:2]
fig, axes = plt.subplots(len(vars_to_plot), 1, figsize=(12, 4 * len(vars_to_plot)), sharex=True)
if len(vars_to_plot) == 1:
axes = [axes]
# Determine the time-like dimension: prefer a name containing "time",
# otherwise fall back to the first dimension.
time_dim = next((d for d in ds.dims if "time" in d.lower()), None)
if time_dim is None:
    time_dim = next(iter(ds.dims), None)
# Only fetch the first 2000 points along the time dimension to keep
# the demo fast — the full variable may contain millions of points.
MAX_POINTS = 2000
for ax, var_name in zip(axes, vars_to_plot):
var = ds[var_name]
# Select first element of all non-time dims
for dim in var.dims:
if dim != time_dim and var.sizes[dim] > 1:
var = var.isel({dim: 0})
# Slice to MAX_POINTS before triggering any downloads
if time_dim and var.sizes.get(time_dim, 0) > MAX_POINTS:
var = var.isel({time_dim: slice(0, MAX_POINTS)})
data = var.compute()
ax.plot(data.values, linewidth=0.8)
ax.set_ylabel(f"{var_name}\n{var.attrs.get('units', '')}")
ax.set_title(var.attrs.get("long_name", var_name))
ax.grid(True, alpha=0.3)
axes[-1].set_xlabel(time_dim or "index")
fig.suptitle(f"First {MAX_POINTS} points — {ZARR_STORE} ({PID})", y=1.01)
plt.tight_layout()
plt.show()
What Happened Under the Hood¶
1. fsspec.filesystem("dataverse", ...) — instantiated DataverseFileSystem and registered the dataverse:// protocol.
2. Routing table construction — a single GET /api/datasets/:persistentId/?persistentId=doi:10.5683/SP3/7HF3IC fetched the full file list (465 files) via aiohttp in fsspec's background loop. _build_routing_table() reconstructed the virtual directory tree by inferring synthetic directory entries from directoryLabel metadata. Two INFO-level log lines are emitted (start and finish with timing); call logger.enable("dataversefs") before mounting the filesystem to see them (see the Enable Logging section above).
3. fs.get_mapper(ZARR_STORE) — returns an FSMap: a sync MutableMapping backed by the existing fs instance. Zarr uses this as its store interface, sharing the routing table and S3 URL cache already built by fs.ls().
4. xr.open_zarr(store, ...) — zarr 3 detects FSMap and creates an async FsspecStore. It calls await self.fs._cat_file(path) concurrently for all metadata keys (.zgroup, .zattrs, var/.zarray, …) via asyncio.gather. Each call:
   - Calls _resolve_download_url → loop.run_in_executor(None, _head_sync, ...) to resolve the 303 → S3 pre-signed URL (once per file, cached in fs._download_urls). A threading semaphore limits concurrent HEAD requests to 3 to avoid server rate limits.
   - Calls loop.run_in_executor(None, _get_bytes_sync, ...) to fetch the metadata bytes via sync requests. Running downloads in a thread executor means no event-loop affinity — safe on Windows and Jupyter alike.
5. Lazy array — after metadata was read, ds holds a Dask graph. No chunk data was downloaded until .compute() was called.
6. Chunk reads — .compute() triggered Range requests directly to S3, bypassing Dataverse for the actual data. The S3 URLs were cached in fs._download_urls, so the HEAD redirect was paid only once per file.