Reference

Complete API reference for dataversefs.


URI Scheme

dataverse://<path-within-dataset>

The host and pid (or alias) are passed via storage_options, not encoded in the URI. This keeps URIs short and avoids embedding credentials.

Examples:

# Open a Zarr store inside dataset doi:10.5683/SP3/7HF3IC
import xarray as xr

ds = xr.open_zarr(
    "dataverse://dual_heading.zarr",
    storage_options={"host": "borealisdata.ca", "pid": "doi:10.5683/SP3/7HF3IC"},
    consolidated=False,
)

# Open a CSV
import pandas as pd

df = pd.read_csv(
    "dataverse://data/values.csv",
    storage_options={"host": "borealisdata.ca", "pid": "doi:10.5683/SP3/7HF3IC"},
)
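Since the URI carries only the in-dataset path, fsspec strips the `dataverse://` prefix before the path reaches the filesystem. A minimal sketch of the equivalent split (the function name is illustrative, not part of the package):

```python
def strip_protocol(uri: str) -> str:
    """Return the path-within-dataset portion of a dataverse:// URI."""
    prefix = "dataverse://"
    if uri.startswith(prefix):
        return uri[len(prefix):]
    return uri  # already a bare path

print(strip_protocol("dataverse://dual_heading.zarr"))  # dual_heading.zarr
```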

Entry Point Registration

The dataverse protocol is registered automatically via the fsspec.specs entry point in pyproject.toml:

[project.entry-points."fsspec.specs"]
dataverse = "dataversefs:DataverseFileSystem"

After installing the package, fsspec.filesystem("dataverse", ...) and dataverse:// URIs work without any explicit import.


DataverseFileSystem

class DataverseFileSystem(fsspec.asyn.AsyncFileSystem)

Read-only fsspec filesystem scoped to a single Dataverse dataset or sub-dataverse.

Constructor

DataverseFileSystem(
    host: str,
    pid: str | None = None,
    alias: str | None = None,
    token: str | None = None,
    head_concurrency: int = 3,
    **storage_options,
)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| host | str | required | Dataverse instance hostname (e.g. "borealisdata.ca") |
| pid | str \| None | None | Dataset persistent identifier / DOI. Mutually exclusive with alias. |
| alias | str \| None | None | Sub-dataverse alias. Mutually exclusive with pid. |
| token | str \| None | None | API token. Sent as X-Dataverse-key header. Omit for public datasets. |
| head_concurrency | int | 3 | Maximum number of simultaneous HEAD requests to Dataverse. Raise to improve throughput; lower if you see rate-limit errors. See Explanation § HEAD concurrency. |
| **storage_options | | | Passed through to AsyncFileSystem.__init__() (e.g. skip_instance_cache). |

Exactly one of pid or alias must be provided; providing both or neither raises ValueError.

Logging

dataversefs uses loguru and disables its own namespace at import time so that importing the package never produces output:

from loguru import logger
logger.disable("dataversefs")  # called at module import

To enable logging in your code:

from loguru import logger
logger.enable("dataversefs")
| Level | Emitted for |
| --- | --- |
| INFO | Routing table fetch start/finish (with elapsed time) |
| DEBUG | Every HEAD redirect resolution and every GET Range request |

See the Enable Logging how-to guide for patterns including Jupyter setup and selective DEBUG output.

Public Methods

These methods are inherited from fsspec.asyn.AsyncFileSystem and delegated to the async implementations below.

ls(path, detail=True, **kwargs) → list

List the contents of a directory.

  • path: path relative to the dataset root (use "" for the root)
  • detail=True: returns list of dicts with name, size, type, id (files); detail=False returns list of name strings
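The two return shapes can be illustrated with hypothetical entries (file names, sizes, and IDs below are made up):

```python
# Hypothetical ls("data") result with detail=True (values illustrative).
detail_true = [
    {"name": "data/values.csv", "size": 1024, "type": "file", "id": 42},
    {"name": "data/notes.txt", "size": 10, "type": "file", "id": 43},
]

# With detail=False, only the names are returned.
detail_false = [entry["name"] for entry in detail_true]
print(detail_false)  # ['data/values.csv', 'data/notes.txt']
```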

info(path, **kwargs) → dict

Return metadata for a single file or directory.

Returns a dict with at minimum: name, size, type. Files also include id (Dataverse file ID), md5, content_type.

cat_file(path, start=None, end=None, **kwargs) → bytes

Return file contents, optionally a byte slice [start, end).

cat(path, recursive=False, on_error="raise", **kwargs) → bytes | dict

Return contents of one or more files. Inherited from fsspec.

open(path, mode="rb", **kwargs) → DataverseFile

Return a file-like object. Only mode="rb" is supported.

find(path, maxdepth=None, withdirs=False, **kwargs) → list | dict

Recursively list all files (and optionally directories) under path.

  • path: root of the search (use "" for the entire dataset)
  • maxdepth: limit search depth; None means unlimited
  • withdirs=True: include directory entries in results
  • detail=True (kwarg): return a {path: info} dict instead of a list

Implemented as a direct routing-table scan — no additional network requests.
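A find implemented as a pure scan over an in-memory routing table might look like the following. The table shape here is illustrative (the real internal structure is not documented); the point is that every query is answered from the dict, with no network I/O:

```python
# Illustrative routing table: path -> info dict.
TABLE = {
    "data": {"type": "directory", "size": 0},
    "data/values.csv": {"type": "file", "size": 1024},
    "dual_heading.zarr/.zattrs": {"type": "file", "size": 42},
}

def find(path="", maxdepth=None, withdirs=False):
    """Scan the table; no additional network requests needed."""
    prefix = path.rstrip("/") + "/" if path else ""
    out = []
    for name, info in TABLE.items():
        if not name.startswith(prefix):
            continue
        if info["type"] == "directory" and not withdirs:
            continue
        depth = name[len(prefix):].count("/") + 1
        if maxdepth is not None and depth > maxdepth:
            continue
        out.append(name)
    return sorted(out)
```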

glob(pattern, **kwargs) → list

Return paths matching a shell-style wildcard pattern. Inherited from fsspec; uses _find internally.

  • * matches any sequence of characters within a single path component (no /)
  • ** matches zero or more path components (crosses directory boundaries)
  • ? matches any single character
  • [seq] matches any character in seq

See the glob() how-to guide for a full pattern reference and worked examples.
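The wildcard rules above can be made concrete with a small translation to regular expressions. This is an approximation for intuition only, not fsspec's actual matcher (which also handles edge cases such as a leading `**/` matching zero components):

```python
import re

def wild_to_regex(pattern: str) -> str:
    """Approximate the wildcard rules: ** crosses /, * and ? do not."""
    out, i = [], 0
    while i < len(pattern):
        if pattern[i:i + 2] == "**":
            out.append(".*")          # crosses directory boundaries
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")       # stops at /
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")        # any single non-/ character
            i += 1
        elif pattern[i] == "[":
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1])  # character class passes through
            i = j + 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out) + "$"

print(re.fullmatch(wild_to_regex("*.csv"), "data/values.csv"))  # None: * stops at /
```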

get(rpath, lpath, recursive=False, **kwargs)

Download file(s) to local disk. rpath may be a path string, a list of paths, or a glob pattern. recursive=True mirrors an entire directory tree.

See the Download Files how-to guide for examples.

get_file(rpath, lpath, **kwargs)

Download a single remote file to a local path. Equivalent to get() for one file.

exists(path, **kwargs) → bool

Return True if path exists. Inherited from fsspec.

Read-Only Constraint

Write operations (mkdir, rm, put, pipe) raise NotImplementedError.

Async Internals

| Method | Description |
| --- | --- |
| _get_session() | Lazily create aiohttp session for routing table fetches only |
| _get_http() | Lazily create thread-safe requests.Session for file downloads |
| _head_sync(url) | Sync HEAD request to resolve 303 → S3 URL; serialized by _head_sem |
| _get_bytes_sync(url, start, end) | Sync GET with optional Range header; returns bytes |
| _resolve_download_url(file_id) | Await run_in_executor(_head_sync); cached in _download_urls |
| _build_routing_table() | Fetch dataset JSON via aiohttp; build virtual tree |
| _ls(path, detail) | List direct children of a path |
| _info(path) | Return metadata dict for a path |
| _find(path, maxdepth, withdirs, **kwargs) | Routing-table scan; overrides fsspec default to fix dotted filenames and avoid recursive _ls |
| _cat_file(path, start, end) | Await run_in_executor(_get_bytes_sync) |

Dask Serialization

__getstate__ drops _session, _http, _http_lock, and _head_sem (none are picklable). The semaphore is recreated using the stored _head_concurrency value, so workers honour the same concurrency limit as the driver. __setstate__ restores all other state; sessions and locks are recreated lazily on first use. _download_urls is preserved so cached S3 URLs survive serialization.
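The serialization strategy can be sketched with a toy class that drops the unpicklable members and recreates them on the other side. Attribute names mirror the description above, but the code is illustrative:

```python
import asyncio
import pickle

class Fs:
    def __init__(self, head_concurrency=3):
        self._head_concurrency = head_concurrency
        self._head_sem = asyncio.Semaphore(head_concurrency)
        self._session = None       # created lazily on first use
        self._download_urls = {}   # cached S3 URLs survive pickling

    def __getstate__(self):
        state = self.__dict__.copy()
        for key in ("_session", "_head_sem"):  # unpicklable members
            state.pop(key, None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._session = None
        # Recreate the semaphore with the same concurrency limit.
        self._head_sem = asyncio.Semaphore(self._head_concurrency)

fs = Fs(head_concurrency=5)
fs._download_urls[123] = "https://s3.example/presigned"
clone = pickle.loads(pickle.dumps(fs))
```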


DataverseFile

class DataverseFile(fsspec.spec.AbstractBufferedFile)

A read-only file-like object returned by DataverseFileSystem.open().

Constructor

DataverseFile(
    fs: DataverseFileSystem,
    path: str,
    file_id: int,
    size: int,
    mode: str = "rb",
    **kwargs,
)

Key Method

_fetch_range(start: int, end: int) → bytes

Fetch bytes [start, end) via HTTP Range request. Called internally by the AbstractBufferedFile base class during read(). Range is translated to the inclusive header bytes=start-(end-1).

The first call resolves and caches the pre-signed S3 URL; subsequent calls for the same file go directly to S3.
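The translation from the half-open [start, end) slice to HTTP's inclusive Range syntax is a one-liner; a sketch (the helper name is illustrative):

```python
def range_header(start: int, end: int) -> dict:
    """Translate half-open [start, end) to the inclusive HTTP Range header."""
    return {"Range": f"bytes={start}-{end - 1}"}

print(range_header(0, 1024))  # {'Range': 'bytes=0-1023'}
```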


Dataverse API Endpoints

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /api/datasets/:persistentId/?persistentId={PID} | GET | Fetch dataset metadata + file list |
| /api/access/datafile/{id} | HEAD | Resolve 303 → S3 pre-signed URL |
| /api/access/datafile/{id} | GET | Direct file download (fallback if no redirect) |
| /api/dataverses/{alias}/contents | GET | List child datasets (alias scope) |
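For reference, these endpoint URLs assemble from the host and identifiers as shown in the sketch below (helper names are illustrative):

```python
def dataset_metadata_url(host: str, pid: str) -> str:
    """URL that returns the dataset metadata + file list."""
    return f"https://{host}/api/datasets/:persistentId/?persistentId={pid}"

def datafile_url(host: str, file_id: int) -> str:
    """URL used for both the HEAD redirect resolution and the GET fallback."""
    return f"https://{host}/api/access/datafile/{file_id}"

print(datafile_url("borealisdata.ca", 42))
# https://borealisdata.ca/api/access/datafile/42
```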