Reference¶
Complete API reference for dataversefs.
URI Scheme¶
```
dataverse://<path-within-dataset>
```
The host and pid (or alias) are passed via storage_options, not encoded in the
URI. This keeps URIs short and avoids embedding credentials.
Examples:
```python
import pandas as pd
import xarray as xr

# Open a Zarr store inside dataset doi:10.5683/SP3/7HF3IC
ds = xr.open_zarr(
    "dataverse://dual_heading.zarr",
    storage_options={"host": "borealisdata.ca", "pid": "doi:10.5683/SP3/7HF3IC"},
    consolidated=False,
)

# Open a CSV
df = pd.read_csv(
    "dataverse://data/values.csv",
    storage_options={"host": "borealisdata.ca", "pid": "doi:10.5683/SP3/7HF3IC"},
)
```
Entry Point Registration¶
The dataverse protocol is registered automatically via the fsspec.specs entry point
in pyproject.toml:
```toml
[project.entry-points."fsspec.specs"]
dataverse = "dataversefs:DataverseFileSystem"
```
After installing the package, `fsspec.filesystem("dataverse", ...)` and `dataverse://`
URIs work without any explicit import.
DataverseFileSystem¶
```python
class DataverseFileSystem(fsspec.asyn.AsyncFileSystem)
```
Read-only fsspec filesystem scoped to a single Dataverse dataset or sub-dataverse.
Constructor¶
```python
DataverseFileSystem(
    host: str,
    pid: str | None = None,
    alias: str | None = None,
    token: str | None = None,
    head_concurrency: int = 3,
    **storage_options,
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `host` | `str` | — | Dataverse instance hostname (e.g. `"borealisdata.ca"`) |
| `pid` | `str \| None` | `None` | Dataset persistent identifier / DOI. Mutually exclusive with `alias`. |
| `alias` | `str \| None` | `None` | Sub-dataverse alias. Mutually exclusive with `pid`. |
| `token` | `str \| None` | `None` | API token. Sent as `X-Dataverse-key` header. Omit for public datasets. |
| `head_concurrency` | `int` | `3` | Maximum number of simultaneous HEAD requests to Dataverse. Raise to improve throughput; lower if you see rate-limit errors. See Explanation § HEAD concurrency. |
| `**storage_options` | — | — | Passed through to `AsyncFileSystem.__init__()` (e.g. `skip_instance_cache`) |
Exactly one of pid or alias must be provided; providing both or neither raises
ValueError.
Logging¶
dataversefs uses loguru and disables logging for its own namespace at import time, so importing the package never produces output:

```python
from loguru import logger

logger.disable("dataversefs")  # called at module import
```

To enable logging in your code:

```python
from loguru import logger

logger.enable("dataversefs")
```
| Level | Emitted for |
|---|---|
| `INFO` | Routing table fetch start/finish (with elapsed time) |
| `DEBUG` | Every HEAD redirect resolution and every GET Range request |
See the Enable Logging how-to guide for patterns including Jupyter setup and selective DEBUG output.
Public Methods¶
These methods are inherited from fsspec.asyn.AsyncFileSystem and delegated to the
async implementations below.
ls(path, detail=True, **kwargs) → list¶
List the contents of a directory.
- `path`: path relative to the dataset root (use `""` for the root)
- `detail=True`: returns a list of dicts with `name`, `size`, `type`, `id` (files); `detail=False` returns a list of name strings
info(path, **kwargs) → dict¶
Return metadata for a single file or directory.
Returns a dict with at minimum: `name`, `size`, `type`. Files also include `id`
(Dataverse file ID), `md5`, `content_type`.
cat_file(path, start=None, end=None, **kwargs) → bytes¶
Return file contents, optionally a byte slice [start, end).
cat(path, recursive=False, on_error="raise", **kwargs) → bytes | dict¶
Return contents of one or more files. Inherited from fsspec.
open(path, mode="rb", **kwargs) → DataverseFile¶
Return a file-like object. Only mode="rb" is supported.
find(path, maxdepth=None, withdirs=False, **kwargs) → list | dict¶
Recursively list all files (and optionally directories) under path.
- `path`: root of the search (use `""` for the entire dataset)
- `maxdepth`: limit search depth; `None` means unlimited
- `withdirs=True`: include directory entries in results
- `detail=True` (kwarg): return a `{path: info}` dict instead of a list
Implemented as a direct routing-table scan — no additional network requests.
glob(pattern, **kwargs) → list¶
Return paths matching a shell-style wildcard pattern. Inherited from fsspec;
uses _find internally.
- `*` matches any sequence of characters within a single path component (no `/`)
- `**` matches zero or more path components (crosses directory boundaries)
- `?` matches any single character
- `[seq]` matches any character in `seq`
See the glob() how-to guide for a full pattern reference and worked examples.
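The four wildcard forms can be illustrated with a toy regex translation. This is not fsspec's actual implementation (which handles more edge cases, such as `**` matching zero components before a `/`); it only demonstrates the component-boundary rules listed above:

```python
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    """Toy translation of glob wildcards to a regex:
    '**' crosses '/', '*' and '?' stay within one component,
    '[seq]' passes through as a character class."""
    out, i = [], 0
    while i < len(pattern):
        if pattern[i:i + 2] == "**":
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        elif pattern[i] == "[":
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1])
            i = j + 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")
```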
get(rpath, lpath, recursive=False, **kwargs)¶
Download file(s) to local disk. rpath may be a path string, a list of paths, or a
glob pattern. recursive=True mirrors an entire directory tree.
See the Download Files how-to guide for examples.
get_file(rpath, lpath, **kwargs)¶
Download a single remote file to a local path. Equivalent to get() for one file.
exists(path, **kwargs) → bool¶
Return True if path exists. Inherited from fsspec.
Read-Only Constraint¶
Write operations (`mkdir`, `rm`, `put`, `pipe`) raise `NotImplementedError`.
Async Internals¶
| Method | Description |
|---|---|
| `_get_session()` | Lazily create `aiohttp` session for routing table fetches only |
| `_get_http()` | Lazily create thread-safe `requests.Session` for file downloads |
| `_head_sync(url)` | Sync HEAD request to resolve 303 → S3 URL; serialized by `_head_sem` |
| `_get_bytes_sync(url, start, end)` | Sync GET with optional Range header; returns bytes |
| `_resolve_download_url(file_id)` | Await `run_in_executor(_head_sync)`; cached in `_download_urls` |
| `_build_routing_table()` | Fetch dataset JSON via `aiohttp`; build virtual tree |
| `_ls(path, detail)` | List direct children of a path |
| `_info(path)` | Return metadata dict for a path |
| `_find(path, maxdepth, withdirs, **kwargs)` | Routing-table scan; overrides the fsspec default to fix dotted filenames and avoid recursive `_ls` |
| `_cat_file(path, start, end)` | Await `run_in_executor(_get_bytes_sync)` |
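Building the virtual tree amounts to joining each file's directory label and filename into a path. The sketch below assumes the Dataverse dataset JSON shape (`directoryLabel`, `dataFile.filename`, etc.); treat the exact field names as assumptions and check them against your instance's API response:

```python
def build_routing_table(files: list[dict]) -> dict[str, dict]:
    """Map dataset-relative path -> file info from a Dataverse-style
    file list. Sketch only; the real table also records directories."""
    table: dict[str, dict] = {}
    for entry in files:
        df = entry["dataFile"]
        directory = entry.get("directoryLabel", "")
        path = f"{directory}/{df['filename']}" if directory else df["filename"]
        table[path] = {
            "name": path,
            "size": df["filesize"],
            "type": "file",
            "id": df["id"],  # used later for /api/access/datafile/{id}
        }
    return table
```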
Dask Serialization¶
`__getstate__` drops `_session`, `_http`, `_http_lock`, and `_head_sem` (none are
picklable). The semaphore is recreated using the stored `_head_concurrency` value, so
workers honour the same concurrency limit as the driver. `__setstate__` restores all
other state; sessions and locks are recreated lazily on first use. `_download_urls` is
preserved so cached S3 URLs survive serialization.
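The drop-and-recreate pattern looks roughly like this. Names and the use of a threading semaphore here are illustrative, not dataversefs internals:

```python
import pickle
import threading

class SerializableClient:
    """Sketch of __getstate__/__setstate__ that drops unpicklable members
    and rebuilds the semaphore from the stored concurrency value."""

    def __init__(self, head_concurrency: int = 3):
        self._head_concurrency = head_concurrency
        self._head_sem = threading.Semaphore(head_concurrency)
        self._session = object()  # stand-in for an unpicklable session
        self._download_urls: dict[int, str] = {}  # survives pickling

    def __getstate__(self):
        state = self.__dict__.copy()
        for attr in ("_session", "_head_sem"):  # unpicklable members
            state.pop(attr, None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._session = None  # recreated lazily on first use
        self._head_sem = threading.Semaphore(self._head_concurrency)
```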
DataverseFile¶
```python
class DataverseFile(fsspec.spec.AbstractBufferedFile)
```
A read-only file-like object returned by DataverseFileSystem.open().
Constructor¶
```python
DataverseFile(
    fs: DataverseFileSystem,
    path: str,
    file_id: int,
    size: int,
    mode: str = "rb",
    **kwargs,
)
```
Key Method¶
_fetch_range(start: int, end: int) → bytes¶
Fetch bytes [start, end) via HTTP Range request. Called internally by the
AbstractBufferedFile base class during read(). Range is translated to the inclusive
header bytes=start-(end-1).
The first call resolves and caches the pre-signed S3 URL; subsequent calls for the same file go directly to S3.
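The half-open to inclusive translation described above can be sketched in one line (`range_header` is an illustrative helper, not a dataversefs function):

```python
def range_header(start: int, end: int) -> dict[str, str]:
    """Translate the half-open slice [start, end) into the inclusive
    HTTP Range header bytes=start-(end-1)."""
    return {"Range": f"bytes={start}-{end - 1}"}
```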
Dataverse API Endpoints¶
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/datasets/:persistentId/?persistentId={PID}` | GET | Fetch dataset metadata + file list |
| `/api/access/datafile/{id}` | HEAD | Resolve 303 → S3 pre-signed URL |
| `/api/access/datafile/{id}` | GET | Direct file download (fallback if no redirect) |
| `/api/dataverses/{alias}/contents` | GET | List child datasets (alias scope) |