
Explanation

Design decisions and background context for dataversefs.


1. Flat-to-Hierarchical Tree Reconstruction

The Problem

Dataverse stores files in a flat list. Each file entry has a directoryLabel field — a slash-separated string like "data/zarr/.zgroup" — but there is no actual directory object in the API response.

When Xarray opens a Zarr store, it calls fs.ls("dual_heading.zarr") to discover variables, then fs.ls("dual_heading.zarr/temperature") to find chunks. Without explicit directory entries for every path prefix, ls() on a parent directory would return an empty list even when files live beneath it.

The Solution

_build_routing_table() iterates over the flat file list and creates a synthetic directory entry for every ancestor path. For a file with directoryLabel = "data/zarr/.zgroup" and filename = "0.0":

Full path: data/zarr/.zgroup/0.0    (type: file)
Parents:   data/zarr/.zgroup        (type: directory, synthetic)
           data/zarr                (type: directory, synthetic)
           data                     (type: directory, synthetic)
           ""                       (type: directory, root)

All entries live in a single in-memory dict (_routing_table). _ls() collects keys whose path starts with the requested prefix and has no further / after the prefix — giving exactly the direct children.
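
The reconstruction can be sketched in a few lines of standalone Python. This mirrors the idea only; the real _build_routing_table also records file IDs, sizes, and download metadata:

```python
def build_routing_table(files):
    """Expand a flat Dataverse file list into a path -> info dict,
    adding a synthetic directory entry for every ancestor prefix."""
    table = {"": {"type": "directory"}}            # root entry
    for entry in files:
        label = entry.get("directoryLabel", "")
        path = f"{label}/{entry['filename']}" if label else entry["filename"]
        table[path] = {"type": "file"}
        parts = path.split("/")[:-1]
        for i in range(1, len(parts) + 1):         # every ancestor prefix
            table.setdefault("/".join(parts[:i]), {"type": "directory"})
    return table

def ls(table, path):
    """Direct children: keys under path with no further '/' after it."""
    prefix = f"{path}/" if path else ""
    return sorted(
        k for k in table
        if k and k.startswith(prefix) and "/" not in k[len(prefix):]
    )

files = [{"directoryLabel": "data/zarr/.zgroup", "filename": "0.0"}]
table = build_routing_table(files)
print(ls(table, "data/zarr"))          # ['data/zarr/.zgroup']
print(ls(table, "data/zarr/.zgroup"))  # ['data/zarr/.zgroup/0.0']
```

Because every ancestor is materialised, ls() on any intermediate directory returns its direct children even though Dataverse never sent a directory object.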

Why Zarr Breaks Without This

Zarr reads metadata by calling fs.cat(".zgroup"), fs.cat(".zattrs"), etc. at relative paths within the store root. If the filesystem root is not scoped to the dataset (e.g. a full Borealis instance), these relative paths would be ambiguous. The dataset-scoped root ensures "dual_heading.zarr/.zgroup" maps to exactly one file in the routing table, regardless of other datasets on the same host.


2. S3 Redirect and URL Caching

The Borealis Access Pattern

When you request GET /api/access/datafile/{id}, Borealis does not return the file directly. It returns an HTTP 303 redirect to a pre-signed Amazon S3 URL. The S3 URL carries time-limited credentials embedded as query parameters (X-Amz-Signature, etc.).

Client                      Borealis                               S3
  |--HEAD /api/access/...-->|                                      |
  |<--303 Location: https://s3.../file?X-Amz-...--|                |
  |--GET https://s3.../file?X-Amz-... (Range: bytes=0-1023)------->|
  |<--206 Partial Content------------------------------------------|
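
The exchange above can be reproduced with plain requests. This is an illustrative sketch, not dataversefs code: the injectable head parameter exists only so the function can be exercised without a live server, and X-Dataverse-key is the standard Dataverse API-token header:

```python
def resolve_s3_url(host, file_id, api_token, head=None):
    """Return the pre-signed S3 URL from the 303's Location header.

    `head` defaults to requests.head; it is injectable here only so this
    sketch can run without a live Borealis instance.
    """
    if head is None:
        import requests
        head = requests.head
    resp = head(
        f"https://{host}/api/access/datafile/{file_id}",
        headers={"X-Dataverse-key": api_token},   # Dataverse API token header
        allow_redirects=False,                    # capture, don't follow
    )
    return resp.headers["Location"]
```

With allow_redirects=False, requests hands back the 303 response itself instead of following it, so the pre-signed URL is readable from resp.headers["Location"].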

Why Naive Redirect-Following is Slow for Zarr

A Zarr store with 100 variables × 50 chunks = 5 000 chunk reads. If each read triggers a round-trip to Dataverse for the redirect, that is 5 000 extra HTTP calls to a server that may be in a different region than the S3 bucket. At even 20 ms per round-trip, that is 100 seconds of overhead.

The Cache

_resolve_download_url(file_id) issues a single HEAD request with allow_redirects=False, reads the Location header, and stores the S3 URL in self._download_urls[file_id]. Every subsequent _cat_file or _fetch_range call for that file goes directly to S3.

async def _resolve_download_url(self, file_id: int) -> str:
    if file_id in self._download_urls:
        return self._download_urls[file_id]   # cache hit
    # ... HEAD request, store result
    self._download_urls[file_id] = url
    return url

Pre-Signed URL Expiry

Pre-signed S3 URLs expire (Borealis typically uses 1-hour TTLs). For long-running Dask jobs, create a fresh DataverseFileSystem instance after expiry. The routing table is preserved across instances if you re-use the same storage_options; only the URL cache needs refreshing.
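
When exactly a cached URL goes stale can be computed from the URL itself. A minimal sketch, assuming the standard SigV4 query parameters (X-Amz-Date, X-Amz-Expires); this helper is not part of dataversefs:

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse, parse_qs

def presigned_url_expiry(url: str) -> datetime:
    """Compute the expiry instant of a SigV4 pre-signed URL from its
    X-Amz-Date (signing time) and X-Amz-Expires (TTL in seconds)."""
    qs = parse_qs(urlparse(url).query)
    signed_at = datetime.strptime(
        qs["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(qs["X-Amz-Expires"][0]))

url = ("https://s3.example.org/bucket/chunk"
       "?X-Amz-Date=20240101T120000Z&X-Amz-Expires=3600&X-Amz-Signature=abc")
print(presigned_url_expiry(url))   # 2024-01-01 13:00:00+00:00
```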


3. Async Architecture and Dask Serialization

AsyncFileSystem

dataversefs subclasses fsspec.asyn.AsyncFileSystem. Internal methods (_ls, _info, _cat_file, _build_routing_table) are all async def.

The base class auto-generates synchronous public methods (ls, info, cat_file) that call the async internals via fsspec.asyn.sync(). This means:

  • Synchronous callers (Pandas, plain scripts) get blocking calls that work correctly.
  • Async callers (Dask async workers) can await the _ methods directly.
  • You never need to manage an event loop yourself.
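
A toy version of the pattern (standalone; real fsspec dispatches sync wrappers onto a dedicated background event loop via fsspec.asyn.sync() rather than calling asyncio.run()):

```python
import asyncio

class ToyAsyncFS:
    """Illustrates the async-internals / sync-wrappers split only.
    Not the real AsyncFileSystem machinery."""

    async def _ls(self, path):
        await asyncio.sleep(0)           # stand-in for real async I/O
        return [f"{path}/a", f"{path}/b"]

    def ls(self, path):
        # In fsspec this wrapper is auto-generated; sync callers block here
        return asyncio.run(self._ls(path))

fs = ToyAsyncFS()
print(fs.ls("root"))                     # ['root/a', 'root/b']
```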

Split I/O Strategy

dataversefs uses two separate HTTP mechanisms:

Task                                          Mechanism                                  Why
_build_routing_table                          aiohttp.ClientSession                      Runs once in fsspec's stable background loop; async is fine here
File downloads (_head_sync, _get_bytes_sync)  sync requests.Session via run_in_executor  No event-loop affinity; safe from zarr 3's ProactorEventLoop, Jupyter, and Dask workers

zarr 3 creates its own ProactorEventLoop on Windows. If file downloads used aiohttp, two ProactorEventLoops would try to share TCP connections, causing ConnectionResetError (WinError 10054). A synchronous requests.Session running in a thread executor has no event-loop attachment and works from any calling context.

_head_sem is a threading.Semaphore that limits concurrent HEAD requests to Dataverse. When zarr fires dozens of simultaneous _cat_file calls via asyncio.gather, an unbounded fan-out of HTTPS connections to Borealis triggers server-side rate limiting. S3 GET requests are not throttled; only the Dataverse HEAD redirect resolution is limited.

HEAD Concurrency and the _head_sem

The default limit is 3 simultaneous HEAD requests. This is conservative: it avoids rate-limit errors on shared Borealis instances, at the cost of serialising redirect resolution for the first access to each file.

You can tune this with the head_concurrency constructor parameter:

# Default (safe for shared servers)
fs = fsspec.filesystem("dataverse", host=..., pid=..., head_concurrency=3)

# Higher concurrency — useful if you have a dedicated Borealis instance
# or are opening a dataset with many files for the first time
fs = fsspec.filesystem("dataverse", host=..., pid=..., head_concurrency=10)

Enable logging to measure the actual semaphore wait time and decide whether raising the limit helps:

from loguru import logger
logger.enable("dataversefs")
# Each HEAD log line includes semaphore wait time and total request time

The _head_concurrency value is preserved across Dask serialization so that all workers use the same limit as the driver process.

Dask Serialization

Dask distributes work across workers by pickling the filesystem object. Live HTTP sessions (aiohttp.ClientSession, requests.Session) and threading primitives cannot be pickled safely.

__getstate__ drops them all:

def __getstate__(self):
    state = self.__dict__.copy()
    state["_session"] = None      # aiohttp session (not picklable)
    state["_http"] = None         # requests.Session (recreated lazily)
    state["_http_lock"] = None    # threading.Lock (recreated lazily)
    state["_head_sem"] = None     # rebuilt from the preserved _head_concurrency
    return state

__setstate__ restores everything else. Sessions and locks are recreated lazily on first use by the worker, and the semaphore is rebuilt from the preserved _head_concurrency value, so every worker applies the same limit as the driver.

The _download_urls cache is picklable and is preserved — S3 URLs resolved before serialization are available on the worker without re-issuing HEAD requests.
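
The whole round-trip can be exercised with a standalone stand-in class. The attribute names follow the text above, but this is not the real DataverseFileSystem:

```python
import pickle
import threading

class Carrier:
    """Sketch of the drop-and-recreate serialization pattern."""

    def __init__(self, head_concurrency: int = 3):
        self._head_concurrency = head_concurrency
        self._http = object()                    # stand-in for requests.Session
        self._head_sem = threading.Semaphore(head_concurrency)
        self._download_urls = {42: "https://s3.example.org/42"}

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_http"] = None                    # recreated lazily on the worker
        state["_head_sem"] = None                # rebuilt in __setstate__
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._head_sem = threading.Semaphore(self._head_concurrency)

clone = pickle.loads(pickle.dumps(Carrier(head_concurrency=5)))
print(clone._head_concurrency, clone._download_urls)  # limit and URL cache survive
```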


4. Glob Pattern Semantics and the _find Override

Why the Default _find Broke Dotted Filenames

fsspec's default glob() pipeline is:

glob(pattern) → find(root, maxdepth) → walk(root) → recursive _ls()

_ls() lists direct children of a path. Repeated calls build a tree. The problem: when a path component starts with a dot (.zattrs, .zgroup) or contains an extension (.bin), the path normalization through walk_ls can silently drop or mismatch entries — returning an empty list even when the files exist in the routing table.

The Fix: Routing-Table Scan

dataversefs overrides _find to scan the routing table directly:

async def _find(self, path, maxdepth=None, withdirs=False, **kwargs):
    table = await self._build_routing_table()
    path = path.strip("/")
    prefix = f"{path}/" if path else ""

    out = {}
    for key, info in table.items():
        if not key or (path and key != path and not key.startswith(prefix)):
            continue
        relative = key[len(prefix):]
        if maxdepth is not None and relative.count("/") + 1 > maxdepth:
            continue
        if info["type"] == "directory" and not withdirs:
            continue
        out[key] = info
    ...

This is a single O(N) pass over the in-memory dict. It never calls _ls recursively, so path normalization issues cannot arise. All keys come directly from the routing table, where they were stored as plain "dir/subdir/file" strings without any protocol prefix or leading slash.

Glob Depth Semantics

glob() extracts a fixed prefix and a depth from the pattern, then calls _find(prefix, maxdepth=depth). The depth is the number of wildcard path components remaining after the prefix:

Pattern                 Fixed prefix      Depth passed to _find
*.bin                   "" (root)         1
data/*.csv              "data/"           1
0_raw/gnssa/*/*/*       "0_raw/gnssa/"    3
**/*.bin                "" (root)         None (unlimited)
0_raw/gnssa/**/*.bin    "0_raw/gnssa/"    None (unlimited)

The key consequence is that *.bin only finds files at the root level. The * wildcard does not cross directory boundaries. To find .bin files anywhere in the dataset, use **/*.bin.

This is standard POSIX glob behaviour — the same as ls *.bin in a shell or Python's pathlib.Path.glob("*.bin").
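
The same boundary behaviour can be checked against pathlib directly, using an assumed two-level layout unrelated to any real dataset:

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as root:
    r = pathlib.Path(root)
    (r / "top.bin").touch()                   # root-level file
    (r / "data").mkdir()
    (r / "data" / "deep.bin").touch()         # nested file

    shallow = sorted(p.name for p in r.glob("*.bin"))
    everywhere = sorted(p.name for p in r.glob("**/*.bin"))

print(shallow)     # ['top.bin'] -- '*' stops at directory boundaries
print(everywhere)  # ['deep.bin', 'top.bin']
```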

Enabling Logging to Diagnose Issues

If glob() returns an empty list unexpectedly, enable logging to inspect what the routing table contains:

from loguru import logger
logger.enable("dataversefs")

import fsspec
fs = fsspec.filesystem("dataverse", host=..., pid=...)
fs.ls("")        # INFO: routing table built — N files, M dirs
fs.glob("**/*")  # shows all paths; compare against your pattern