Reference

Complete API reference for dataversefs.


URI Scheme

dataverse://<path-within-dataset>

The host and pid (or alias) are passed via storage_options, not encoded in the URI. This keeps URIs short and avoids embedding credentials.

Examples:

# Open a Zarr store inside dataset doi:10.5683/SP3/7HF3IC
import xarray as xr

ds = xr.open_zarr(
    "dataverse://dual_heading.zarr",
    storage_options={"host": "borealisdata.ca", "pid": "doi:10.5683/SP3/7HF3IC"},
    consolidated=False,
)

# Open a CSV
import pandas as pd

df = pd.read_csv(
    "dataverse://data/values.csv",
    storage_options={"host": "borealisdata.ca", "pid": "doi:10.5683/SP3/7HF3IC"},
)
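Since the URI carries only the in-dataset path, fsspec strips the `dataverse://` prefix before the path reaches the filesystem. A minimal sketch of the equivalent split (the function name is illustrative, not part of the package):

```python
def strip_protocol(uri: str) -> str:
    """Return the path-within-dataset portion of a dataverse:// URI."""
    prefix = "dataverse://"
    if uri.startswith(prefix):
        return uri[len(prefix):]
    return uri  # already a bare path

print(strip_protocol("dataverse://dual_heading.zarr"))  # dual_heading.zarr
```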

Entry Point Registration

The dataverse protocol is registered automatically via the fsspec.specs entry point in pyproject.toml:

[project.entry-points."fsspec.specs"]
dataverse = "dataversefs:DataverseFileSystem"

After installing the package, fsspec.filesystem("dataverse", ...) and dataverse:// URIs work without any explicit import.


DataverseFileSystem

class DataverseFileSystem(fsspec.asyn.AsyncFileSystem)

Read-only fsspec filesystem scoped to a single Dataverse dataset or sub-dataverse.

Constructor

DataverseFileSystem(
    host: str,
    pid: str | None = None,
    alias: str | None = None,
    token: str | None = None,
    head_concurrency: int = 3,
    **storage_options,
)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| host | str | required | Dataverse instance hostname (e.g. "borealisdata.ca") |
| pid | str \| None | None | Dataset persistent identifier / DOI. Mutually exclusive with alias. |
| alias | str \| None | None | Sub-dataverse alias. Mutually exclusive with pid. |
| token | str \| None | None | API token. Sent as X-Dataverse-key header. Omit for public datasets. |
| head_concurrency | int | 3 | Maximum number of simultaneous HEAD requests to Dataverse. Raise to improve throughput; lower if you see rate-limit errors. See Explanation § HEAD concurrency. |
| **storage_options | | | Passed through to AsyncFileSystem.__init__() (e.g. skip_instance_cache). |

Exactly one of pid or alias must be provided; providing both or neither raises ValueError.

Logging

dataversefs uses loguru and disables its own namespace at import time so that importing the package never produces output:

from loguru import logger
logger.disable("dataversefs")  # called at module import

To enable logging in your code:

from loguru import logger
logger.enable("dataversefs")
| Level | Emitted for |
| --- | --- |
| INFO | Routing table fetch start/finish (with elapsed time) |
| DEBUG | Every HEAD redirect resolution and every GET Range request |

See the Enable Logging how-to guide for patterns including Jupyter setup and selective DEBUG output.

Public Methods

These methods are inherited from fsspec.asyn.AsyncFileSystem and delegated to the async implementations below.

ls(path, detail=True, **kwargs) → list

List the contents of a directory.

  • path: path relative to the dataset root (use "" for the root)
  • detail=True: returns list of dicts with name, size, type, id (files); detail=False returns list of name strings
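The two return shapes can be illustrated with hypothetical entries (file names, sizes, and IDs below are made up):

```python
# Hypothetical ls("data") result with detail=True (values illustrative).
detail_true = [
    {"name": "data/values.csv", "size": 1024, "type": "file", "id": 42},
    {"name": "data/notes.txt", "size": 10, "type": "file", "id": 43},
]

# With detail=False, only the names are returned.
detail_false = [entry["name"] for entry in detail_true]
print(detail_false)  # ['data/values.csv', 'data/notes.txt']
```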

info(path, **kwargs) → dict

Return metadata for a single file or directory.

Returns a dict with at minimum: name, size, type. Files also include id (Dataverse file ID), md5, content_type.

cat_file(path, start=None, end=None, **kwargs) → bytes

Return file contents, optionally a byte slice [start, end).

cat(path, recursive=False, on_error="raise", **kwargs) → bytes | dict

Return contents of one or more files. Inherited from fsspec.

open(path, mode="rb", **kwargs) → DataverseFile

Return a file-like object. Only mode="rb" is supported.

find(path, maxdepth=None, withdirs=False, **kwargs) → list | dict

Recursively list all files (and optionally directories) under path.

  • path: root of the search (use "" for the entire dataset)
  • maxdepth: limit search depth; None means unlimited
  • withdirs=True: include directory entries in results
  • detail=True (kwarg): return a {path: info} dict instead of a list

Implemented as a direct routing-table scan — no additional network requests.
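A find implemented as a pure scan over an in-memory routing table might look like the following. The table shape here is illustrative (the real internal structure is not documented); the point is that every query is answered from the dict, with no network I/O:

```python
# Illustrative routing table: path -> info dict.
TABLE = {
    "data": {"type": "directory", "size": 0},
    "data/values.csv": {"type": "file", "size": 1024},
    "dual_heading.zarr/.zattrs": {"type": "file", "size": 42},
}

def find(path="", maxdepth=None, withdirs=False):
    """Scan the table; no additional network requests needed."""
    prefix = path.rstrip("/") + "/" if path else ""
    out = []
    for name, info in TABLE.items():
        if not name.startswith(prefix):
            continue
        if info["type"] == "directory" and not withdirs:
            continue
        depth = name[len(prefix):].count("/") + 1
        if maxdepth is not None and depth > maxdepth:
            continue
        out.append(name)
    return sorted(out)
```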

glob(pattern, **kwargs) → list

Return paths matching a shell-style wildcard pattern. Inherited from fsspec; uses _find internally.

  • * matches any sequence of characters within a single path component (no /)
  • ** matches zero or more path components (crosses directory boundaries)
  • ? matches any single character
  • [seq] matches any character in seq

See the glob() how-to guide for a full pattern reference and worked examples.
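The wildcard rules above can be made concrete with a small translation to regular expressions. This is an approximation for intuition only, not fsspec's actual matcher (which also handles edge cases such as a leading `**/` matching zero components):

```python
import re

def wild_to_regex(pattern: str) -> str:
    """Approximate the wildcard rules: ** crosses /, * and ? do not."""
    out, i = [], 0
    while i < len(pattern):
        if pattern[i:i + 2] == "**":
            out.append(".*")          # crosses directory boundaries
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")       # stops at /
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")        # any single non-/ character
            i += 1
        elif pattern[i] == "[":
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1])  # character class passes through
            i = j + 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out) + "$"

print(re.fullmatch(wild_to_regex("*.csv"), "data/values.csv"))  # None: * stops at /
```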

get(rpath, lpath, recursive=False, **kwargs)

Download file(s) to local disk. rpath may be a path string, a list of paths, or a glob pattern. recursive=True mirrors an entire directory tree.

See the Download Files how-to guide for examples.

get_file(rpath, lpath, **kwargs)

Download a single remote file to a local path. Equivalent to get() for one file.

exists(path, **kwargs) → bool

Return True if path exists. Inherited from fsspec.

Read-Only Constraint

Write operations (mkdir, rm, put, pipe) raise NotImplementedError.

Async Internals

| Method | Description |
| --- | --- |
| _get_session() | Lazily create aiohttp session for routing table fetches only |
| _get_http() | Lazily create thread-safe requests.Session for file downloads |
| _head_sync(url) | Sync HEAD request to resolve 303 → S3 URL; serialized by _head_sem |
| _get_bytes_sync(url, start, end) | Sync GET with optional Range header; returns bytes |
| _resolve_download_url(file_id) | Await run_in_executor(_head_sync); cached in _download_urls |
| _build_routing_table() | Fetch dataset JSON via aiohttp; build virtual tree |
| _ls(path, detail) | List direct children of a path |
| _info(path) | Return metadata dict for a path |
| _find(path, maxdepth, withdirs, **kwargs) | Routing-table scan; overrides fsspec default to fix dotted filenames and avoid recursive _ls |
| _cat_file(path, start, end) | Await run_in_executor(_get_bytes_sync) |

Dask Serialization

__getstate__ drops _session, _http, _http_lock, and _head_sem (none are picklable). The semaphore is recreated using the stored _head_concurrency value, so workers honour the same concurrency limit as the driver. __setstate__ restores all other state; sessions and locks are recreated lazily on first use. _download_urls is preserved so cached S3 URLs survive serialization.
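The serialization strategy can be sketched with a toy class that drops the unpicklable members and recreates them on the other side. Attribute names mirror the description above, but the code is illustrative:

```python
import asyncio
import pickle

class Fs:
    def __init__(self, head_concurrency=3):
        self._head_concurrency = head_concurrency
        self._head_sem = asyncio.Semaphore(head_concurrency)
        self._session = None       # created lazily on first use
        self._download_urls = {}   # cached S3 URLs survive pickling

    def __getstate__(self):
        state = self.__dict__.copy()
        for key in ("_session", "_head_sem"):  # unpicklable members
            state.pop(key, None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._session = None
        # Recreate the semaphore with the same concurrency limit.
        self._head_sem = asyncio.Semaphore(self._head_concurrency)

fs = Fs(head_concurrency=5)
fs._download_urls[123] = "https://s3.example/presigned"
clone = pickle.loads(pickle.dumps(fs))
```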


DataverseFile

class DataverseFile(fsspec.spec.AbstractBufferedFile)

A read-only file-like object returned by DataverseFileSystem.open().

Constructor

DataverseFile(
    fs: DataverseFileSystem,
    path: str,
    file_id: int,
    size: int,
    mode: str = "rb",
    **kwargs,
)

Key Method

_fetch_range(start: int, end: int) → bytes

Fetch bytes [start, end) via HTTP Range request. Called internally by the AbstractBufferedFile base class during read(). Range is translated to the inclusive header bytes=start-(end-1).

The first call resolves and caches the pre-signed S3 URL; subsequent calls for the same file go directly to S3.
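The translation from the half-open [start, end) slice to HTTP's inclusive Range syntax is a one-liner; a sketch (the helper name is illustrative):

```python
def range_header(start: int, end: int) -> dict:
    """Translate half-open [start, end) to the inclusive HTTP Range header."""
    return {"Range": f"bytes={start}-{end - 1}"}

print(range_header(0, 1024))  # {'Range': 'bytes=0-1023'}
```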


Dataverse API Endpoints

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /api/datasets/:persistentId/?persistentId={PID} | GET | Fetch dataset metadata + file list |
| /api/access/datafile/{id} | HEAD | Resolve 303 → S3 pre-signed URL |
| /api/access/datafile/{id} | GET | Direct file download (fallback if no redirect) |
| /api/dataverses/{alias}/contents | GET | List child datasets (alias scope) |
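For reference, these endpoint URLs assemble from the host and identifiers as shown in the sketch below (helper names are illustrative):

```python
def dataset_metadata_url(host: str, pid: str) -> str:
    """URL that returns the dataset metadata + file list."""
    return f"https://{host}/api/datasets/:persistentId/?persistentId={pid}"

def datafile_url(host: str, file_id: int) -> str:
    """URL used for both the HEAD redirect resolution and the GET fallback."""
    return f"https://{host}/api/access/datafile/{file_id}"

print(datafile_url("borealisdata.ca", 42))
# https://borealisdata.ca/api/access/datafile/42
```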