Getting Started

This tutorial walks you from a fresh environment to reading data from a Borealis Dataverse dataset using dataversefs.

Prerequisites

  • Python 3.12+
  • A Borealis account (optional — public datasets work without one)

1. Install

pip install dataversefs

Or, if you manage dependencies with uv:

uv add dataversefs

Verify the entry point is registered:

import fsspec
print(fsspec.filesystem("dataverse"))  # should not raise ImportError

2. Get an API Token (optional)

Public datasets work without a token. For restricted datasets:

  1. Log in to borealisdata.ca
  2. Go to Account → API Token
  3. Copy the token

Store it safely — never hard-code it in scripts. A .env file works well:

# .env
DATAVERSE_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
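To load the token at runtime you can use the python-dotenv package, or a minimal stdlib loader like this sketch (the KEY=VALUE parsing and the DATAVERSE_TOKEN variable name match the .env example above; load_env is a hypothetical helper, not part of dataversefs):

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE lines; blank lines and # comments skipped."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()
token = os.environ.get("DATAVERSE_TOKEN")  # None if unset, e.g. for public datasets
```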

3. Mount the Filesystem

import fsspec

fs = fsspec.filesystem(
    "dataverse",
    host="borealisdata.ca",
    pid="doi:10.5683/SP3/7HF3IC",  # demo dataset with 465 files
    token="your-token-here",         # omit for public datasets
)

The filesystem is scoped to the dataset identified by pid. All paths you pass to fs.ls(), fs.open(), etc. are relative to that dataset's root.

4. List Files

# Top-level contents
entries = fs.ls("")
for e in entries:
    print(e["name"], e["type"])

You should see output like:

# Output:
# README.md file
# dual_heading.zarr directory
# dual_heading_2.zarr directory
# ...
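Each entry returned by fs.ls is a dict with at least name and type keys. A small pure-Python helper to separate files from directories, using sample entries shaped like the output above (split_by_type is an illustrative helper, not part of the library):

```python
def split_by_type(entries):
    """Partition fs.ls()-style entry dicts into (file names, directory names)."""
    files = [e["name"] for e in entries if e["type"] == "file"]
    dirs = [e["name"] for e in entries if e["type"] == "directory"]
    return files, dirs

sample = [
    {"name": "README.md", "type": "file"},
    {"name": "dual_heading.zarr", "type": "directory"},
    {"name": "dual_heading_2.zarr", "type": "directory"},
]
files, dirs = split_by_type(sample)
print(files)  # ['README.md']
print(dirs)   # ['dual_heading.zarr', 'dual_heading_2.zarr']
```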

Drill into a subdirectory:

fs.ls("dual_heading.zarr")

5. Read a File

content = fs.cat("README.md")
print(content.decode())
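fsspec's file API is uniform across backends, so cat and open behave the same way here as they do on the dataverse filesystem. A network-free sketch against fsspec's built-in in-memory backend:

```python
import fsspec

fs = fsspec.filesystem("memory")
with fs.open("/README.md", "wb") as f:
    f.write(b"# Demo dataset\n")

raw = fs.cat("/README.md")  # raw bytes, same contract as fs.cat on any backend
print(raw.decode())
```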

6. Open a Zarr Store with Xarray

import xarray as xr

ds = xr.open_zarr(
    "dataverse://dual_heading.zarr",
    storage_options={
        "host": "borealisdata.ca",
        "pid": "doi:10.5683/SP3/7HF3IC",
    },
    consolidated=False,
)
print(ds)

!!! note
    Use consolidated=False unless you have generated a consolidated Zarr metadata file. Borealis datasets typically do not include .zmetadata.

Variables in ds are backed by Dask arrays — computation is lazy until you call .compute(). The first .compute() call triggers HTTP Range requests to fetch actual data and may take a few seconds depending on the dataset size and your connection.
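The lazy-then-compute pattern is easy to see with a plain Dask array, no network involved (assuming dask is installed, as it is when xarray's Zarr backend uses chunked arrays):

```python
import dask.array as da

x = da.ones((1000, 1000), chunks=(250, 250))  # builds a task graph; no data allocated yet
total = x.sum()                               # still lazy: just extends the graph
result = float(total.compute())               # executes the graph now
print(result)  # 1000000.0
```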

Next Steps