Explanation¶
Design decisions and background context for dataversefs.
1. Flat-to-Hierarchical Tree Reconstruction¶
The Problem¶
Dataverse stores files in a flat list. Each file entry has a directoryLabel field —
a slash-separated string like "data/zarr/.zgroup" — but there is no actual directory
object in the API response.
When Xarray opens a Zarr store, it calls fs.ls("dual_heading.zarr") to discover
variables, then fs.ls("dual_heading.zarr/temperature") to find chunks. Without
explicit directory entries for every path prefix, ls() on a parent directory would
return an empty list even when files live beneath it.
The Solution¶
_build_routing_table() iterates over the flat file list and creates a synthetic
directory entry for every ancestor path. For a file with
directoryLabel = "data/zarr/.zgroup" and filename = "0.0":
Full path: data/zarr/.zgroup/0.0   (type: file)
Parents:   data/zarr/.zgroup       (type: directory, synthetic)
           data/zarr               (type: directory, synthetic)
           data                    (type: directory, synthetic)
           ""                      (type: directory, root)
All entries live in a single in-memory dict (_routing_table). _ls() collects keys
whose path starts with the requested prefix and has no further / after the prefix —
giving exactly the direct children.
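The reconstruction described above can be sketched as follows. This is a minimal illustration of the technique, not the library's actual code: the helper names build_routing_table and list_children, and the simplified file-entry dicts, are stand-ins for the real _build_routing_table and _ls internals.

```python
def build_routing_table(files: list) -> dict:
    """Flat Dataverse file list -> path-keyed table with synthetic dirs."""
    table = {"": {"type": "directory"}}  # root entry
    for f in files:
        label = f.get("directoryLabel", "").strip("/")
        path = f"{label}/{f['filename']}" if label else f["filename"]
        table[path] = {"type": "file"}
        # Synthesize a directory entry for every ancestor prefix.
        parts = path.split("/")[:-1]
        for i in range(1, len(parts) + 1):
            table["/".join(parts[:i])] = {"type": "directory"}
    return table

def list_children(table: dict, path: str) -> list:
    """Direct children: keys under the prefix with no further '/'."""
    stripped = path.strip("/")
    prefix = f"{stripped}/" if stripped else ""
    return sorted(
        key for key in table
        if key and key.startswith(prefix) and "/" not in key[len(prefix):]
    )

files = [
    {"directoryLabel": "data/zarr/.zgroup", "filename": "0.0"},
    {"directoryLabel": "data", "filename": "readme.txt"},
]
table = build_routing_table(files)
print(list_children(table, "data"))  # ['data/readme.txt', 'data/zarr']
```

Note that ls("data") returns the synthetic entry data/zarr even though no directory object exists in the API response; only the final table entries carry type "file".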
Why Zarr Breaks Without This¶
Zarr reads metadata by calling fs.cat(".zgroup"), fs.cat(".zattrs"), etc. at
relative paths within the store root. If the filesystem root is not scoped to the
dataset (e.g. a full Borealis instance), these relative paths would be ambiguous. The
dataset-scoped root ensures "dual_heading.zarr/.zgroup" maps to exactly one file in
the routing table, regardless of other datasets on the same host.
2. S3 Redirect and URL Caching¶
The Borealis Access Pattern¶
When you request GET /api/access/datafile/{id}, Borealis does not return the file
directly. It returns an HTTP 303 redirect to a pre-signed Amazon S3 URL. The S3 URL
carries time-limited credentials embedded as query parameters (X-Amz-Signature, etc.).
Client                         Borealis                                S3
  |--HEAD /api/access/...------->|                                      |
  |<--303 Location: https://s3.../file?X-Amz-...--|                     |
  |--GET https://s3.../file?X-Amz-...  Range: 0-1023------------------->|
  |<--206 Partial Content----------------------------------------------|
Why Naive Redirect-Following is Slow for Zarr¶
A Zarr store with 100 variables × 50 chunks = 5 000 chunk reads. If each read triggers a round-trip to Dataverse for the redirect, that is 5 000 extra HTTP calls to a server that may be in a different region than the S3 bucket. At even 20 ms per round-trip, that is 100 seconds of overhead.
The Cache¶
_resolve_download_url(file_id) issues a single HEAD request with
allow_redirects=False, reads the Location header, and stores the S3 URL in
self._download_urls[file_id]. Every subsequent _cat_file or _fetch_range call for
that file goes directly to S3.
async def _resolve_download_url(self, file_id: int) -> str:
    if file_id in self._download_urls:
        return self._download_urls[file_id]  # cache hit
    # ... HEAD request, store result
    self._download_urls[file_id] = url
    return url
Pre-Signed URL Expiry¶
Pre-signed S3 URLs expire (Borealis typically uses 1-hour TTLs). For long-running Dask
jobs, create a fresh DataverseFileSystem instance after expiry. The routing table is
preserved across instances if you re-use the same storage_options; only the URL cache
needs refreshing.
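Whether a cached URL is still valid can be checked from the URL itself: SigV4 pre-signed URLs carry their signing time and TTL as the X-Amz-Date and X-Amz-Expires query parameters. The helper below is an illustrative sketch of such a staleness check, not part of dataversefs; the name presigned_url_expired and the safety margin are assumptions.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse

def presigned_url_expired(url: str, margin_s: int = 60) -> bool:
    """True if a SigV4 pre-signed URL is past (or within margin_s of) expiry."""
    qs = parse_qs(urlparse(url).query)
    try:
        signed_at = datetime.strptime(qs["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
        ttl = int(qs["X-Amz-Expires"][0])
    except (KeyError, ValueError):
        return True  # cannot tell, so treat as stale and re-resolve
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) >= signed_at + timedelta(seconds=ttl - margin_s)

url = ("https://s3.example.org/bucket/file"
       "?X-Amz-Date=20200101T000000Z&X-Amz-Expires=3600&X-Amz-Signature=abc")
print(presigned_url_expired(url))  # a 2020 signature is long expired -> True
```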
3. Async Architecture and Dask Serialization¶
AsyncFileSystem¶
dataversefs subclasses fsspec.asyn.AsyncFileSystem. Internal methods (_ls,
_info, _cat_file, _build_routing_table) are all async def.
The base class auto-generates synchronous public methods (ls, info, cat_file) that
call the async internals via fsspec.asyn.sync(). This means:
- Synchronous callers (Pandas, plain scripts) get blocking calls that work correctly.
- Async callers (Dask async workers) can await the underscore-prefixed methods directly.
- You never need to manage an event loop yourself.
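The dual sync/async surface can be mimicked in a few lines. This is a simplified stand-in for illustration only: fsspec.asyn.AsyncFileSystem runs coroutines on a dedicated background event loop via fsspec.asyn.sync, whereas the sketch below uses plain asyncio.run for the sync wrapper.

```python
import asyncio

class MiniAsyncFS:
    """Mimics fsspec's pattern: async internals plus a sync wrapper."""

    async def _ls(self, path: str) -> list:
        await asyncio.sleep(0)  # stands in for real async I/O
        return [f"{path}/a", f"{path}/b"]

    def ls(self, path: str) -> list:
        # Synchronous wrapper: blocks the caller until the coroutine finishes.
        return asyncio.run(self._ls(path))

fs = MiniAsyncFS()
print(fs.ls("root"))                 # sync caller gets a blocking call
print(asyncio.run(fs._ls("root")))   # async caller would `await fs._ls(...)`
```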
Split I/O Strategy¶
dataversefs uses two separate HTTP mechanisms:
| Task | Mechanism | Why |
|---|---|---|
| _build_routing_table | aiohttp.ClientSession | Runs once in fsspec's stable background loop; async is fine here |
| File downloads (_head_sync, _get_bytes_sync) | sync requests.Session via run_in_executor | No event-loop affinity — safe from zarr 3's ProactorEventLoop, Jupyter, and Dask workers |
zarr 3 creates its own ProactorEventLoop on Windows. If file downloads used aiohttp,
two ProactorEventLoops would try to share TCP connections, causing
ConnectionResetError (WinError 10054). A sync requests call running in a thread
executor has no event-loop attachment and works from any calling context.
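The executor pattern described above can be sketched as follows. The network call is replaced by a stub so the example is self-contained; in the real library the blocking function would be a requests.Session call, and the names fetch_range and _get_bytes_sync here are illustrative.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def _get_bytes_sync(url: str, start: int, end: int) -> bytes:
    """Stand-in for a blocking requests.Session range GET (no real network)."""
    return f"{url}[{start}:{end}]".encode()

_executor = ThreadPoolExecutor(max_workers=8)

async def fetch_range(url: str, start: int, end: int) -> bytes:
    # The blocking call runs in a worker thread, so it has no attachment to
    # whichever event loop (Proactor, Jupyter, Dask) invoked this coroutine.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_executor, _get_bytes_sync, url, start, end)

async def main() -> list:
    # Many concurrent chunk reads, each backed by a plain sync call in a thread.
    return await asyncio.gather(*(fetch_range("s3://b/chunk", i, i + 1023)
                                  for i in range(0, 3072, 1024)))

print(asyncio.run(main()))
```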
_head_sem is a threading.Semaphore that limits concurrent HEAD requests to
Dataverse. When zarr fires dozens of simultaneous _cat_file calls via
asyncio.gather, the resulting burst of HTTPS connections to Borealis can trigger
server-side rate limiting unless it is capped. S3 GET requests are not throttled —
only the Dataverse HEAD redirect resolution is limited.
HEAD Concurrency and the _head_sem¶
The default limit is 3 simultaneous HEAD requests. This is conservative: it avoids rate-limit errors on shared Borealis instances, at the cost of serialising redirect resolution for the first access to each file.
You can tune this with the head_concurrency constructor parameter:
# Default (safe for shared servers)
fs = fsspec.filesystem("dataverse", host=..., pid=..., head_concurrency=3)
# Higher concurrency — useful if you have a dedicated Borealis instance
# or are opening a dataset with many files for the first time
fs = fsspec.filesystem("dataverse", host=..., pid=..., head_concurrency=10)
Enable logging to measure the actual semaphore wait time and decide whether raising the limit helps:
from loguru import logger
logger.enable("dataversefs")
# Each HEAD log line includes semaphore wait time and total request time
The _head_concurrency value is preserved across Dask serialization so that all
workers use the same limit as the driver process.
Dask Serialization¶
Dask distributes work across workers by pickling the filesystem object. Neither
aiohttp.ClientSession nor requests.Session nor threading primitives are picklable.
__getstate__ drops them all:
def __getstate__(self):
    state = self.__dict__.copy()
    state["_session"] = None    # aiohttp session
    state["_http"] = None       # requests.Session
    state["_http_lock"] = None  # locks are not picklable
    state["_head_sem"] = None   # rebuilt from _head_concurrency in __setstate__
    return state
__setstate__ restores everything else. Sessions and locks are recreated lazily on
first use by the worker.
The _download_urls cache is picklable and is preserved — S3 URLs resolved
before serialization are available on the worker without re-issuing HEAD requests.
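The serialization contract can be demonstrated end-to-end with a toy class. MiniFS below is an illustrative sketch of the same pattern, not the dataversefs class itself: live resources are dropped in __getstate__, and the semaphore is rebuilt from the preserved _head_concurrency in __setstate__, while the picklable _download_urls cache rides along.

```python
import pickle
import threading

class MiniFS:
    """Sketch of the contract: drop live resources, keep picklable caches."""

    def __init__(self, head_concurrency: int = 3):
        self._head_concurrency = head_concurrency
        self._download_urls = {}                                # picklable cache
        self._head_sem = threading.Semaphore(head_concurrency)  # not picklable
        self._session = object()                                # stands in for aiohttp

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_session"] = None
        state["_head_sem"] = None  # rebuilt on the worker in __setstate__
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._head_sem = threading.Semaphore(self._head_concurrency)

fs = MiniFS(head_concurrency=10)
fs._download_urls[42] = "https://s3.example.org/presigned"
clone = pickle.loads(pickle.dumps(fs))
print(clone._download_urls)     # cache survives the round-trip
print(clone._head_concurrency)  # tuned limit survives too
```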
4. Glob Pattern Semantics and the _find Override¶
Why the Default _find Broke Dotted Filenames¶
fsspec's default glob() pipeline is:
glob(pattern) → find(root, maxdepth) → walk(root) → recursive _ls()
_ls() lists direct children of a path. Repeated calls build a tree. The problem: when
a path component starts with a dot (.zattrs, .zgroup) or contains an extension
(.bin), the path normalization through walk → _ls can silently drop or mismatch
entries — returning an empty list even when the files exist in the routing table.
The Fix: Routing-Table Scan¶
dataversefs overrides _find to scan the routing table directly:
async def _find(self, path, maxdepth=None, withdirs=False, **kwargs):
    table = await self._build_routing_table()
    path = path.strip("/")
    prefix = f"{path}/" if path else ""
    out = {}
    for key, info in table.items():
        if not key or (path and key != path and not key.startswith(prefix)):
            continue
        relative = key[len(prefix):]
        if maxdepth is not None and relative.count("/") + 1 > maxdepth:
            continue
        if info["type"] == "directory" and not withdirs:
            continue
        out[key] = info
    ...
This is a single O(N) pass over the in-memory dict. It never calls _ls recursively,
so path normalization issues cannot arise. All keys come directly from the routing table,
where they were stored as plain "dir/subdir/file" strings without any protocol prefix
or leading slash.
Glob Depth Semantics¶
glob() extracts a fixed prefix and a depth from the pattern, then calls
_find(prefix, maxdepth=depth). The depth is the number of wildcard path components
remaining after the prefix:
| Pattern | Fixed prefix | Depth passed to _find |
|---|---|---|
| *.bin | "" (root) | 1 |
| data/*.csv | "data/" | 1 |
| 0_raw/gnssa/*/*/* | "0_raw/gnssa/" | 3 |
| **/*.bin | "" (root) | None (unlimited) |
| 0_raw/gnssa/**/*.bin | "0_raw/gnssa/" | None (unlimited) |
The key consequence is that *.bin only finds files at the root level. The *
wildcard does not cross directory boundaries. To find .bin files anywhere in the
dataset, use **/*.bin.
This is standard POSIX glob behaviour — the same as ls *.bin in a shell or Python's
pathlib.Path.glob("*.bin").
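The prefix/depth extraction can be sketched directly. This is an illustrative re-implementation of the behaviour described above, not fsspec's actual parsing code; the name split_pattern is hypothetical, and the returned prefix omits the trailing slash shown in the table.

```python
from typing import Optional, Tuple

def split_pattern(pattern: str) -> Tuple[str, Optional[int]]:
    """Split a glob pattern into (fixed prefix, maxdepth for _find)."""
    parts = pattern.split("/")
    fixed = []
    for part in parts:
        if "*" in part or "?" in part or "[" in part:
            break  # first wildcard component ends the fixed prefix
        fixed.append(part)
    prefix = "/".join(fixed)
    wildcard_parts = parts[len(fixed):]
    if any("**" in p for p in wildcard_parts):
        return prefix, None             # '**' crosses directories: unlimited depth
    return prefix, len(wildcard_parts)  # one level per remaining '*' component

for pat in ["*.bin", "data/*.csv", "0_raw/gnssa/*/*/*", "**/*.bin"]:
    print(pat, "->", split_pattern(pat))
```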
Enabling Logging to Diagnose Issues¶
If glob() returns an empty list unexpectedly, enable logging to inspect what the
routing table contains:
from loguru import logger
logger.enable("dataversefs")
import fsspec
fs = fsspec.filesystem("dataverse", host=..., pid=...)
fs.ls("") # INFO: routing table built — N files, M dirs
fs.glob("**/*") # shows all paths; compare against your pattern