Getting Started

This tutorial walks you from a fresh environment to reading data from a Borealis Dataverse dataset using dataversefs.

Prerequisites

  • Python 3.12+
  • A Borealis account (optional — public datasets work without one)

1. Install

pip install dataversefs

Or, if you manage dependencies with uv:

uv add dataversefs

Verify the entry point is registered:

import fsspec
print(fsspec.filesystem("dataverse"))  # should not raise ImportError

2. Get an API Token (optional)

Public datasets work without a token. For restricted datasets:

  1. Log in to borealisdata.ca
  2. Go to Account → API Token
  3. Copy the token

Store it safely — never hard-code it in scripts. A .env file works well:

# .env
DATAVERSE_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
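To load the token at runtime you can use the python-dotenv package, or a minimal stdlib loader like this sketch (the KEY=VALUE parsing and the DATAVERSE_TOKEN variable name match the .env example above; load_env is a hypothetical helper, not part of dataversefs):

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE lines; blank lines and # comments skipped."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()
token = os.environ.get("DATAVERSE_TOKEN")  # None if unset, e.g. for public datasets
```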

3. Mount the Filesystem

import fsspec

fs = fsspec.filesystem(
    "dataverse",
    host="borealisdata.ca",
    pid="doi:10.5683/SP3/7HF3IC",  # demo dataset with 465 files
    token="your-token-here",         # omit for public datasets
)

The filesystem is scoped to the dataset identified by pid. All paths you pass to fs.ls(), fs.open(), etc. are relative to that dataset's root.

4. List Files

# Top-level contents
entries = fs.ls("")
for e in entries:
    print(e["name"], e["type"])

You should see output like:

# Output:
# README.md file
# dual_heading.zarr directory
# dual_heading_2.zarr directory
# ...
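Each entry returned by fs.ls is a dict with at least name and type keys. A small pure-Python helper to separate files from directories, using sample entries shaped like the output above (split_by_type is an illustrative helper, not part of the library):

```python
def split_by_type(entries):
    """Partition fs.ls()-style entry dicts into (file names, directory names)."""
    files = [e["name"] for e in entries if e["type"] == "file"]
    dirs = [e["name"] for e in entries if e["type"] == "directory"]
    return files, dirs

sample = [
    {"name": "README.md", "type": "file"},
    {"name": "dual_heading.zarr", "type": "directory"},
    {"name": "dual_heading_2.zarr", "type": "directory"},
]
files, dirs = split_by_type(sample)
print(files)  # ['README.md']
print(dirs)   # ['dual_heading.zarr', 'dual_heading_2.zarr']
```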

Drill into a subdirectory:

fs.ls("dual_heading.zarr")

5. Read a File

content = fs.cat("README.md")
print(content.decode())
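fsspec's file API is uniform across backends, so cat and open behave the same way here as they do on the dataverse filesystem. A network-free sketch against fsspec's built-in in-memory backend:

```python
import fsspec

fs = fsspec.filesystem("memory")
with fs.open("/README.md", "wb") as f:
    f.write(b"# Demo dataset\n")

raw = fs.cat("/README.md")  # raw bytes, same contract as fs.cat on any backend
print(raw.decode())
```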

6. Open a Zarr Store with Xarray

import xarray as xr

ds = xr.open_zarr(
    "dataverse://dual_heading.zarr",
    storage_options={
        "host": "borealisdata.ca",
        "pid": "doi:10.5683/SP3/7HF3IC",
    },
    consolidated=False,
)
print(ds)

!!! note
    Use consolidated=False unless you have generated a consolidated Zarr metadata file. Borealis datasets typically do not include .zmetadata.

Variables in ds are backed by Dask arrays — computation is lazy until you call .compute(). The first .compute() call triggers HTTP Range requests to fetch actual data and may take a few seconds depending on the dataset size and your connection.
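The lazy-then-compute pattern is easy to see with a plain Dask array, no network involved (assuming dask is installed, as it is when xarray's Zarr backend uses chunked arrays):

```python
import dask.array as da

x = da.ones((1000, 1000), chunks=(250, 250))  # builds a task graph; no data allocated yet
total = x.sum()                               # still lazy: just extends the graph
result = float(total.compute())               # executes the graph now
print(result)  # 1000000.0
```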

Next Steps