
Explanation

Design decisions and background context for dataversefs.


1. Flat-to-Hierarchical Tree Reconstruction

The Problem

Dataverse stores files in a flat list. Each file entry has a directoryLabel field — a slash-separated string like "data/zarr/.zgroup" — but there is no actual directory object in the API response.

When Xarray opens a Zarr store, it calls fs.ls("dual_heading.zarr") to discover variables, then fs.ls("dual_heading.zarr/temperature") to find chunks. Without explicit directory entries for every path prefix, ls() on a parent directory would return an empty list even when files live beneath it.

The Solution

_build_routing_table() iterates over the flat file list and creates a synthetic directory entry for every ancestor path. For a file with directoryLabel = "data/zarr/.zgroup" and filename = "0.0":

Full path: data/zarr/.zgroup/0.0    (type: file)
Parents:   data/zarr/.zgroup        (type: directory, synthetic)
           data/zarr                (type: directory, synthetic)
           data                     (type: directory, synthetic)
           ""                       (type: directory, root)

All entries live in a single in-memory dict (_routing_table). _ls() collects keys whose path starts with the requested prefix and has no further / after the prefix — giving exactly the direct children.
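
The reconstruction can be sketched in a few lines of standalone Python. This mirrors the idea only; the real _build_routing_table also records file IDs, sizes, and download metadata:

```python
def build_routing_table(files):
    """Expand a flat Dataverse file list into a path -> info dict,
    adding a synthetic directory entry for every ancestor prefix."""
    table = {"": {"type": "directory"}}            # root entry
    for entry in files:
        label = entry.get("directoryLabel", "")
        path = f"{label}/{entry['filename']}" if label else entry["filename"]
        table[path] = {"type": "file"}
        parts = path.split("/")[:-1]
        for i in range(1, len(parts) + 1):         # every ancestor prefix
            table.setdefault("/".join(parts[:i]), {"type": "directory"})
    return table

def ls(table, path):
    """Direct children: keys under path with no further '/' after it."""
    prefix = f"{path}/" if path else ""
    return sorted(
        k for k in table
        if k and k.startswith(prefix) and "/" not in k[len(prefix):]
    )

files = [{"directoryLabel": "data/zarr/.zgroup", "filename": "0.0"}]
table = build_routing_table(files)
print(ls(table, "data/zarr"))          # ['data/zarr/.zgroup']
print(ls(table, "data/zarr/.zgroup"))  # ['data/zarr/.zgroup/0.0']
```

Because every ancestor is materialised, ls() on any intermediate directory returns its direct children even though Dataverse never sent a directory object.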

Why Zarr Breaks Without This

Zarr reads metadata by calling fs.cat(".zgroup"), fs.cat(".zattrs"), etc. at relative paths within the store root. If the filesystem root is not scoped to the dataset (e.g. a full Borealis instance), these relative paths would be ambiguous. The dataset-scoped root ensures "dual_heading.zarr/.zgroup" maps to exactly one file in the routing table, regardless of other datasets on the same host.


2. S3 Redirect and URL Caching

The Borealis Access Pattern

When you request GET /api/access/datafile/{id}, Borealis does not return the file directly. It returns an HTTP 303 redirect to a pre-signed Amazon S3 URL. The S3 URL carries time-limited credentials embedded as query parameters (X-Amz-Signature, etc.).

Client                      Borealis                               S3
  |--HEAD /api/access/...-->|                                      |
  |<--303 Location: https://s3.../file?X-Amz-...--|                |
  |--GET https://s3.../file?X-Amz-... (Range: bytes=0-1023)------->|
  |<--206 Partial Content------------------------------------------|
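
The exchange above can be reproduced with plain requests. This is an illustrative sketch, not dataversefs code: the injectable head parameter exists only so the function can be exercised without a live server, and X-Dataverse-key is the standard Dataverse API-token header:

```python
def resolve_s3_url(host, file_id, api_token, head=None):
    """Return the pre-signed S3 URL from the 303's Location header.

    `head` defaults to requests.head; it is injectable here only so this
    sketch can run without a live Borealis instance.
    """
    if head is None:
        import requests
        head = requests.head
    resp = head(
        f"https://{host}/api/access/datafile/{file_id}",
        headers={"X-Dataverse-key": api_token},   # Dataverse API token header
        allow_redirects=False,                    # capture, don't follow
    )
    return resp.headers["Location"]
```

With allow_redirects=False, requests hands back the 303 response itself instead of following it, so the pre-signed URL is readable from resp.headers["Location"].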

Why Naive Redirect-Following is Slow for Zarr

A Zarr store with 100 variables × 50 chunks = 5 000 chunk reads. If each read triggers a round-trip to Dataverse for the redirect, that is 5 000 extra HTTP calls to a server that may be in a different region than the S3 bucket. At even 20 ms per round-trip, that is 100 seconds of overhead.

The Cache

_resolve_download_url(file_id) issues a single HEAD request with allow_redirects=False, reads the Location header, and stores the S3 URL in self._download_urls[file_id]. Every subsequent _cat_file or _fetch_range call for that file goes directly to S3.

async def _resolve_download_url(self, file_id: int) -> str:
    if file_id in self._download_urls:
        return self._download_urls[file_id]   # cache hit
    # ... HEAD request, store result
    self._download_urls[file_id] = url
    return url

Pre-Signed URL Expiry

Pre-signed S3 URLs expire (Borealis typically uses 1-hour TTLs). For long-running Dask jobs, create a fresh DataverseFileSystem instance after expiry. The routing table is preserved across instances if you re-use the same storage_options; only the URL cache needs refreshing.
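
When exactly a cached URL goes stale can be computed from the URL itself. A minimal sketch, assuming the standard SigV4 query parameters (X-Amz-Date, X-Amz-Expires); this helper is not part of dataversefs:

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse, parse_qs

def presigned_url_expiry(url: str) -> datetime:
    """Compute the expiry instant of a SigV4 pre-signed URL from its
    X-Amz-Date (signing time) and X-Amz-Expires (TTL in seconds)."""
    qs = parse_qs(urlparse(url).query)
    signed_at = datetime.strptime(
        qs["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(qs["X-Amz-Expires"][0]))

url = ("https://s3.example.org/bucket/chunk"
       "?X-Amz-Date=20240101T120000Z&X-Amz-Expires=3600&X-Amz-Signature=abc")
print(presigned_url_expiry(url))   # 2024-01-01 13:00:00+00:00
```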


3. Async Architecture and Dask Serialization

AsyncFileSystem

dataversefs subclasses fsspec.asyn.AsyncFileSystem. Internal methods (_ls, _info, _cat_file, _build_routing_table) are all async def.

The base class auto-generates synchronous public methods (ls, info, cat_file) that call the async internals via fsspec.asyn.sync(). This means:

  • Synchronous callers (Pandas, plain scripts) get blocking calls that work correctly.
  • Async callers (Dask async workers) can await the _ methods directly.
  • You never need to manage an event loop yourself.
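
A toy version of the pattern (standalone; real fsspec dispatches sync wrappers onto a dedicated background event loop via fsspec.asyn.sync() rather than calling asyncio.run()):

```python
import asyncio

class ToyAsyncFS:
    """Illustrates the async-internals / sync-wrappers split only.
    Not the real AsyncFileSystem machinery."""

    async def _ls(self, path):
        await asyncio.sleep(0)           # stand-in for real async I/O
        return [f"{path}/a", f"{path}/b"]

    def ls(self, path):
        # In fsspec this wrapper is auto-generated; sync callers block here
        return asyncio.run(self._ls(path))

fs = ToyAsyncFS()
print(fs.ls("root"))                     # ['root/a', 'root/b']
```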

Split I/O Strategy

dataversefs uses two separate HTTP mechanisms:

Task                                          Mechanism                                  Why
_build_routing_table                          aiohttp.ClientSession                      Runs once in fsspec's stable background loop; async is fine here
File downloads (_head_sync, _get_bytes_sync)  sync requests.Session via run_in_executor  No event-loop affinity; safe from zarr 3's ProactorEventLoop, Jupyter, and Dask workers

zarr 3 creates its own ProactorEventLoop on Windows. If file downloads used aiohttp, two ProactorEventLoops would try to share TCP connections, causing ConnectionResetError (WinError 10054). A synchronous requests.Session running in a thread executor has no event-loop attachment and works from any calling context.

_head_sem is a threading.Semaphore that limits concurrent HEAD requests to Dataverse. When zarr fires dozens of simultaneous _cat_file calls via asyncio.gather, an unbounded fan-out of HTTPS connections to Borealis triggers server-side rate limiting. S3 GET requests are not throttled; only the Dataverse HEAD redirect resolution is limited.

HEAD Concurrency and the _head_sem

The default limit is 3 simultaneous HEAD requests. This is conservative: it avoids rate-limit errors on shared Borealis instances, at the cost of serialising redirect resolution for the first access to each file.

You can tune this with the head_concurrency constructor parameter:

# Default (safe for shared servers)
fs = fsspec.filesystem("dataverse", host=..., pid=..., head_concurrency=3)

# Higher concurrency — useful if you have a dedicated Borealis instance
# or are opening a dataset with many files for the first time
fs = fsspec.filesystem("dataverse", host=..., pid=..., head_concurrency=10)

Enable logging to measure the actual semaphore wait time and decide whether raising the limit helps:

from loguru import logger
logger.enable("dataversefs")
# Each HEAD log line includes semaphore wait time and total request time

The _head_concurrency value is preserved across Dask serialization so that all workers use the same limit as the driver process.

Dask Serialization

Dask distributes work across workers by pickling the filesystem object. Live HTTP sessions (aiohttp.ClientSession, requests.Session) and threading primitives cannot be pickled safely.

__getstate__ drops them all:

def __getstate__(self):
    state = self.__dict__.copy()
    state["_session"] = None      # aiohttp session (not picklable)
    state["_http"] = None         # requests.Session (recreated lazily)
    state["_http_lock"] = None    # threading.Lock (recreated lazily)
    state["_head_sem"] = None     # rebuilt from the preserved _head_concurrency
    return state

__setstate__ restores everything else. Sessions and locks are recreated lazily on first use by the worker, and the semaphore is rebuilt from the preserved _head_concurrency value, so every worker applies the same limit as the driver.

The _download_urls cache is picklable and is preserved — S3 URLs resolved before serialization are available on the worker without re-issuing HEAD requests.
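
The whole round-trip can be exercised with a standalone stand-in class. The attribute names follow the text above, but this is not the real DataverseFileSystem:

```python
import pickle
import threading

class Carrier:
    """Sketch of the drop-and-recreate serialization pattern."""

    def __init__(self, head_concurrency: int = 3):
        self._head_concurrency = head_concurrency
        self._http = object()                    # stand-in for requests.Session
        self._head_sem = threading.Semaphore(head_concurrency)
        self._download_urls = {42: "https://s3.example.org/42"}

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_http"] = None                    # recreated lazily on the worker
        state["_head_sem"] = None                # rebuilt in __setstate__
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._head_sem = threading.Semaphore(self._head_concurrency)

clone = pickle.loads(pickle.dumps(Carrier(head_concurrency=5)))
print(clone._head_concurrency, clone._download_urls)  # limit and URL cache survive
```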


4. Glob Pattern Semantics and the _find Override

Why the Default _find Broke Dotted Filenames

fsspec's default glob() pipeline is:

glob(pattern) → find(root, maxdepth) → walk(root) → recursive _ls()

_ls() lists direct children of a path. Repeated calls build a tree. The problem: when a path component starts with a dot (.zattrs, .zgroup) or contains an extension (.bin), the path normalization through walk_ls can silently drop or mismatch entries — returning an empty list even when the files exist in the routing table.

The Fix: Routing-Table Scan

dataversefs overrides _find to scan the routing table directly:

async def _find(self, path, maxdepth=None, withdirs=False, **kwargs):
    table = await self._build_routing_table()
    path = path.strip("/")
    prefix = f"{path}/" if path else ""

    out = {}
    for key, info in table.items():
        if not key or (path and key != path and not key.startswith(prefix)):
            continue
        relative = key[len(prefix):]
        if maxdepth is not None and relative.count("/") + 1 > maxdepth:
            continue
        if info["type"] == "directory" and not withdirs:
            continue
        out[key] = info
    ...

This is a single O(N) pass over the in-memory dict. It never calls _ls recursively, so path normalization issues cannot arise. All keys come directly from the routing table, where they were stored as plain "dir/subdir/file" strings without any protocol prefix or leading slash.

Glob Depth Semantics

glob() extracts a fixed prefix and a depth from the pattern, then calls _find(prefix, maxdepth=depth). The depth is the number of wildcard path components remaining after the prefix:

Pattern                 Fixed prefix      Depth passed to _find
*.bin                   "" (root)         1
data/*.csv              "data/"           1
0_raw/gnssa/*/*/*       "0_raw/gnssa/"    3
**/*.bin                "" (root)         None (unlimited)
0_raw/gnssa/**/*.bin    "0_raw/gnssa/"    None (unlimited)

The key consequence is that *.bin only finds files at the root level. The * wildcard does not cross directory boundaries. To find .bin files anywhere in the dataset, use **/*.bin.

This is standard POSIX glob behaviour — the same as ls *.bin in a shell or Python's pathlib.Path.glob("*.bin").
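
The same boundary behaviour can be checked against pathlib directly, using an assumed two-level layout unrelated to any real dataset:

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as root:
    r = pathlib.Path(root)
    (r / "top.bin").touch()                   # root-level file
    (r / "data").mkdir()
    (r / "data" / "deep.bin").touch()         # nested file

    shallow = sorted(p.name for p in r.glob("*.bin"))
    everywhere = sorted(p.name for p in r.glob("**/*.bin"))

print(shallow)     # ['top.bin'] -- '*' stops at directory boundaries
print(everywhere)  # ['deep.bin', 'top.bin']
```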

Enabling Logging to Diagnose Issues

If glob() returns an empty list unexpectedly, enable logging to inspect what the routing table contains:

from loguru import logger
logger.enable("dataversefs")

import fsspec
fs = fsspec.filesystem("dataverse", host=..., pid=...)
fs.ls("")        # INFO: routing table built — N files, M dirs
fs.glob("**/*")  # shows all paths; compare against your pattern