Zarr / Xarray Access via dataversefs¶
This notebook demonstrates end-to-end lazy access to a Zarr store hosted on Borealis Dataverse using dataversefs.
Dataset: doi:10.5683/SP3/7HF3IC — 465 files including dual_heading.zarr
Host: borealisdata.ca
Prerequisites¶
uv add dataversefs xarray zarr numpy matplotlib python-dotenv
Create a .env file with your token (or omit token below for public access):
DATAVERSE_API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
import os
import fsspec
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
from dotenv import load_dotenv
import dataversefs # noqa: F401 — registers 'dataverse' fsspec protocol
load_dotenv()
HOST = "borealisdata.ca"
PID = "doi:10.5683/SP3/7HF3IC"
TOKEN = os.environ.get("DATAVERSE_API_TOKEN") # None → public access
ZARR_STORE = "dual_heading.zarr"
print(f"Host : {HOST}")
print(f"PID : {PID}")
print(f"Token: {'set' if TOKEN else 'not set (public access)'}")
Enable Logging¶
dataversefs uses loguru for structured logging.
The library is silent by default; enable it by calling logger.enable("dataversefs").
Two useful levels:
| Level | What you see |
|---|---|
| INFO | Routing table build — number of files/dirs and elapsed time |
| DEBUG | Every HEAD redirect resolution and GET Range request with timing |
Run the cell below before creating the filesystem to capture the routing table build log.
import sys
from loguru import logger
# Remove the default handler (stderr) to avoid duplicate output in Jupyter.
# Replace it with a stdout handler scoped to dataversefs, no ANSI colour codes.
logger.remove()
logger.add(
sys.stdout,
level="INFO", # change to "DEBUG" for per-request timing
filter="dataversefs",
format="{time:HH:mm:ss.SSS} | {level:<7} | {message}",
colorize=False,
)
logger.enable("dataversefs")
print("dataversefs logging enabled at INFO level.")
print("Change level='INFO' to level='DEBUG' above for per-request detail.")
Mount the Filesystem¶
The filesystem is scoped to the dataset identified by pid. All paths are relative to the dataset root.
fs = fsspec.filesystem(
"dataverse",
host=HOST,
pid=PID,
token=TOKEN,
skip_instance_cache=True, # ensure a fresh session for this notebook
)
root_entries = fs.ls("", detail=True)
print(f"{len(root_entries)} entries at dataset root:")
for e in root_entries[:20]:
print(f" [{e['type']:9s}] {e['name']}")
if len(root_entries) > 20:
print(f" ... and {len(root_entries) - 20} more")
Discover the Zarr Store¶
List the top-level contents of dual_heading.zarr to confirm the store structure is intact.
zarr_entries = fs.ls(ZARR_STORE, detail=True)
print(f"Contents of {ZARR_STORE}:")
for e in zarr_entries:
size_str = f"{e['size']:>8d} B" if e["type"] == "file" else " "
print(f" [{e['type']:9s}] {size_str} {e['name']}")
Open with Xarray¶
Use fs.get_mapper() to hand a sync MutableMapping directly to xr.open_zarr.
This reuses the existing fs instance — same routing table, same S3 URL cache —
instead of letting zarr 3 create a second filesystem from a dataverse:// URI.
Pass consolidated=False — Borealis datasets do not include a consolidated
.zmetadata file, so Zarr must query each metadata key individually.
store = fs.get_mapper(ZARR_STORE)
ds = xr.open_zarr(store, consolidated=False)
print(ds)
Inspect the Dataset¶
print("Data variables:", list(ds.data_vars))
print("Dimensions :", dict(ds.dims))
print("Coordinates :", list(ds.coords))
print()
ds.info()
Lazy Loading¶
Variables are backed by Dask arrays — no data has been downloaded yet.
heading_var = ds[list(ds.data_vars)[0]]
print(f"Variable : {heading_var.name}")
print(f"Type : {type(heading_var.data)}")
print(f"Shape : {heading_var.shape}")
print(f"Chunks : {heading_var.chunks}")
print()
print("No data downloaded yet — this is a Dask graph.")
Compute a Statistic¶
.compute() triggers actual HTTP Range requests to the Borealis S3 backend.
first_var = list(ds.data_vars)[0]
var_mean = ds[first_var].mean().compute()
print(f"Mean of {first_var}: {float(var_mean):.6f} {ds[first_var].attrs.get('units', '')}")
# Show the S3 URL cache — one URL per unique file after all reads
print(f"\nCached S3 URLs: {len(fs._download_urls)}")
Plot¶
Load up to two variables and plot their time series.
vars_to_plot = list(ds.data_vars)[:2]
fig, axes = plt.subplots(len(vars_to_plot), 1, figsize=(12, 4 * len(vars_to_plot)), sharex=True)
if len(vars_to_plot) == 1:
axes = [axes]
# Determine the time-like dimension: prefer a name containing "time",
# otherwise fall back to the first dimension.
time_dim = next((d for d in ds.dims if "time" in d.lower()), None)
if time_dim is None:
    time_dim = next(iter(ds.dims), None)
# Only fetch the first 2000 points along the time dimension to keep
# the demo fast — the full variable may contain millions of points.
MAX_POINTS = 2000
for ax, var_name in zip(axes, vars_to_plot):
var = ds[var_name]
# Select first element of all non-time dims
for dim in var.dims:
if dim != time_dim and var.sizes[dim] > 1:
var = var.isel({dim: 0})
# Slice to MAX_POINTS before triggering any downloads
if time_dim and var.sizes.get(time_dim, 0) > MAX_POINTS:
var = var.isel({time_dim: slice(0, MAX_POINTS)})
data = var.compute()
ax.plot(data.values, linewidth=0.8)
ax.set_ylabel(f"{var_name}\n{var.attrs.get('units', '')}")
ax.set_title(var.attrs.get("long_name", var_name))
ax.grid(True, alpha=0.3)
axes[-1].set_xlabel(time_dim or "index")
fig.suptitle(f"First {MAX_POINTS} points — {ZARR_STORE} ({PID})", y=1.01)
plt.tight_layout()
plt.show()
What Happened Under the Hood¶
1. fsspec.filesystem("dataverse", ...) — instantiated DataverseFileSystem and registered the dataverse:// protocol.
2. Routing table construction — a single GET /api/datasets/:persistentId/?persistentId=doi:10.5683/SP3/7HF3IC fetched the full file list (465 files) via aiohttp in fsspec's background loop. _build_routing_table() reconstructed the virtual directory tree by inferring synthetic directory entries from directoryLabel metadata. Two INFO-level log lines are emitted (start and finish with timing); call logger.enable("dataversefs") before mounting the filesystem to see them (see the Enable Logging section above).
3. fs.get_mapper(ZARR_STORE) — returns an FSMap: a sync MutableMapping backed by the existing fs instance. Zarr uses this as its store interface, sharing the routing table and S3 URL cache already built by fs.ls().
4. xr.open_zarr(store, ...) — zarr 3 detects FSMap and creates an async FsspecStore. It calls await self.fs._cat_file(path) concurrently for all metadata keys (.zgroup, .zattrs, var/.zarray, …) via asyncio.gather. Each call:
   - Calls _resolve_download_url → loop.run_in_executor(None, _head_sync, ...) to resolve the 303 → S3 pre-signed URL (once per file, cached in fs._download_urls). A threading semaphore limits concurrent HEAD requests to 3 to avoid server rate limits.
   - Calls loop.run_in_executor(None, _get_bytes_sync, ...) to fetch the metadata bytes via sync requests. Running downloads in a thread executor means no event-loop affinity — safe on Windows and Jupyter alike.
5. Lazy array — after metadata was read, ds holds a Dask graph. No chunk data was downloaded until .compute() was called.
6. Chunk reads — .compute() triggered Range requests directly to S3, bypassing Dataverse for the actual data. The S3 URLs were cached in fs._download_urls, so the HEAD redirect was paid only once per file.