Cache Files Locally¶
By default, every open() call streams bytes from Borealis over HTTP. If you
reopen the same file in a later session, the data is fetched again. fsspec's caching
layers sit in front of any protocol and serve data from a local directory after the
first download.
Why cache?¶
- Avoid repeated downloads — iterative analysis (re-running a notebook, tuning a plot) fetches the same bytes once instead of every run.
- Work offline — once cached, data is available without a network connection.
- Speed — local disk reads are much faster than HTTP Range requests, so repeated reads stop bottlenecking your analysis.
Whole-file cache (filecache)¶
filecache downloads the entire file on first access and serves it from a local
cache_storage directory on subsequent accesses.
```python
import fsspec
import pandas as pd

fs = fsspec.filesystem(
    "filecache",
    target_protocol="dataverse",
    target_options={
        "host": "borealisdata.ca",
        "pid": "doi:10.5683/SP3/EXAMPLE",
    },
    cache_storage="data/",
    same_names=True,
)

# First open: downloads and caches the file
with fs.open("data/values.csv") as f:
    df = pd.read_csv(f)

# Second open: served from data/ with no network request
with fs.open("data/values.csv") as f:
    df = pd.read_csv(f)
```
same_names=True stores files under their original basename (values.csv)
instead of a SHA-256 hash, making the cache directory human-readable.
Basename collisions with same_names=True
If two files share the same basename but live in different directories
(e.g. run1/data.csv and run2/data.csv), they map to the same cache file
and whichever was accessed most recently wins. Only use same_names=True
when all filenames in the dataset are unique across directories.
Human-readable names without collisions (BasenameCacheMapper)¶
BasenameCacheMapper(directory_levels=N) keeps N path components above the
filename, joined with _@_. This gives readable cache filenames that are still
unique when basenames repeat:
| Remote path | directory_levels=0 (= same_names=True) | directory_levels=1 |
|---|---|---|
| run1/data.csv | data.csv | run1_@_data.csv |
| run2/data.csv | data.csv ← collision | run2_@_data.csv |
| run1/sub/data.csv | data.csv ← collision | sub_@_data.csv |
Pass the mapper via the cache_mapper parameter:

```python
import fsspec
from fsspec.implementations.cache_mapper import BasenameCacheMapper

fs = fsspec.filesystem(
    "filecache",
    target_protocol="dataverse",
    target_options={
        "host": "borealisdata.ca",
        "pid": "doi:10.5683/SP3/EXAMPLE",
    },
    cache_storage="data/",
    cache_mapper=BasenameCacheMapper(directory_levels=1),
)
```
Increase directory_levels if your dataset has deeper repeated structure
(e.g. site/sensor/data.csv needs directory_levels=2).
Note that the cache remains flat — all files land directly in cache_storage
regardless of directory_levels. The path components are encoded into the filename,
not reconstructed as subdirectories.
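The naming rule in the table above can be sketched in a few lines of plain Python. This is a hypothetical illustration of the documented behaviour (keep the filename plus the last directory_levels path components, joined with _@_), not fsspec's actual implementation:

```python
def basename_cache_name(path: str, directory_levels: int = 0) -> str:
    """Illustrative sketch of the naming rule: keep the filename plus the
    last `directory_levels` path components, joined with "_@_".
    Not fsspec's actual implementation."""
    parts = path.strip("/").split("/")
    kept = parts[-(directory_levels + 1):]
    return "_@_".join(kept)

basename_cache_name("run1/data.csv", directory_levels=1)      # "run1_@_data.csv"
basename_cache_name("run1/sub/data.csv", directory_levels=1)  # "sub_@_data.csv"
basename_cache_name("run1/data.csv", directory_levels=0)      # "data.csv"
```

With directory_levels=1 the two colliding run1/data.csv and run2/data.csv now map to distinct cache names, which is exactly what the table shows.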
Block cache (blockcache)¶
blockcache (alias cached) caches only the byte ranges that were actually read.
This is better than filecache for Zarr/Xarray workflows where you compute a
subset of chunks and do not need the full file on disk.
```python
fs = fsspec.filesystem(
    "blockcache",
    target_protocol="dataverse",
    target_options={
        "host": "borealisdata.ca",
        "pid": "doi:10.5683/SP3/EXAMPLE",
    },
    cache_storage="data/",
)
```
The API is identical; swap "filecache" for "blockcache" (or "cached").
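To see why this saves disk space, consider which fixed-size blocks a given read touches. The helper below is a hypothetical sketch of that bookkeeping, assuming fixed-size blocks (fsspec's default block size is 5 MiB); only the touched blocks would be fetched and cached, not the whole file:

```python
def blocks_touched(start: int, end: int, blocksize: int = 5 * 2**20) -> list[int]:
    """Indices of the fixed-size blocks covering the byte range [start, end).
    Hypothetical sketch of blockcache's bookkeeping, not fsspec's real code."""
    if end <= start:
        return []
    return list(range(start // blocksize, (end - 1) // blocksize + 1))

# Reading 1 KiB at an offset of 100 MiB touches a single 5 MiB block:
blocks_touched(100 * 2**20, 100 * 2**20 + 1024)  # [20]
```

A Zarr workload that reads a handful of chunks from a multi-gigabyte store therefore caches only a few blocks' worth of bytes.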
The data/ directory pattern¶
Store caches in a data/ directory at the project root — this keeps downloaded data
out of src/ and docs/ and is easy to exclude from version control:
```
project/
├── data/   ← cached files live here
├── src/
└── docs/
```
Add data/ to .gitignore:
```
# .gitignore
data/
```
Checking what is cached¶
```python
from pathlib import Path

list(Path("data/").glob("*"))
```
With same_names=True you see the original basenames (flat — no subdirectories).
Without it, filenames are SHA-256 hashes of the full remote path, with a .json
metadata sidecar alongside each hash file.
Cache expiry¶
Cached files expire after one week by default. Override with expiry_time (seconds):
```python
fs = fsspec.filesystem(
    "filecache",
    target_protocol="dataverse",
    target_options={"host": "borealisdata.ca", "pid": "doi:..."},
    cache_storage="data/",
    expiry_time=0,  # 0 = never expire
)
```
The .json sidecar stores the fetch timestamp; fsspec re-downloads when the entry
is older than expiry_time.
filecache vs blockcache¶
| | filecache | blockcache |
|---|---|---|
| Unit cached | Whole file | Individual byte ranges |
| Good for | CSV, NetCDF, small files | Large Zarr stores, partial reads |
| Disk usage | Full file size | Only the blocks you read |
| Simplicity | Simpler setup | Same API, slightly more metadata |