
Cache Files Locally

By default, every open() call streams bytes from Borealis over HTTP. If you reopen the same file in a later session the data is fetched again. fsspec's caching layers sit in front of any protocol and serve data from a local directory after the first download.


Why cache?

  • Avoid repeated downloads — iterative analysis (re-running a notebook, tuning a plot) fetches the same bytes once instead of every run.
  • Work offline — once cached, data is available without a network connection.
  • Speed — local disk reads are much faster than repeated HTTP Range requests, so I/O stops dominating iterative workloads.

Whole-file cache (filecache)

filecache downloads the entire file on first access and serves it from a local cache_storage directory on subsequent accesses.

import fsspec
import pandas as pd

fs = fsspec.filesystem(
    "filecache",
    target_protocol="dataverse",
    target_options={
        "host": "borealisdata.ca",
        "pid": "doi:10.5683/SP3/EXAMPLE",
    },
    cache_storage="data/",
    same_names=True,
)

# First open: downloads and caches the file
with fs.open("data/values.csv") as f:
    df = pd.read_csv(f)

# Second open: served from data/ with no network request
with fs.open("data/values.csv") as f:
    df = pd.read_csv(f)

same_names=True stores files under their original basename (values.csv) instead of a SHA-256 hash, making the cache directory human-readable.

Basename collisions with same_names=True

If two files share the same basename but live in different directories (e.g. run1/data.csv and run2/data.csv), they map to the same cache file and whichever was accessed most recently wins. Only use same_names=True when all filenames in the dataset are unique across directories.
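The collision is easy to reproduce without touching Borealis. The sketch below substitutes an in-memory target filesystem (the paths and contents are made up for illustration) for the dataverse protocol; the caching layer behaves the same way regardless of the target:

```python
import os
import tempfile

import fsspec

# Hypothetical stand-in for the remote dataset: two files sharing a basename.
mem = fsspec.filesystem("memory")
mem.pipe("/run1/data.csv", b"run1 bytes")
mem.pipe("/run2/data.csv", b"run2 bytes")

cache_dir = tempfile.mkdtemp()
fs = fsspec.filesystem(
    "filecache",
    target_protocol="memory",
    cache_storage=cache_dir,
    same_names=True,
)

with fs.open("/run1/data.csv") as f:
    f.read()
with fs.open("/run2/data.csv") as f:
    f.read()

# Both remote files map to the single cache entry "data.csv".
names = [n for n in os.listdir(cache_dir) if n.endswith(".csv")]
print(names)  # ['data.csv'] — one cache file for two remote files
```

Two remote files, one cache slot: whichever copy is in `data.csv` at any moment depends on access order, which is exactly why same_names=True is only safe with unique basenames.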

Human-readable names without collisions (BasenameCacheMapper)

BasenameCacheMapper(directory_levels=N) keeps N path components above the filename, joined with _@_. This gives readable cache filenames that are still unique when basenames repeat:

Remote path        directory_levels=0 (= same_names=True)  directory_levels=1
run1/data.csv      data.csv                                run1_@_data.csv
run2/data.csv      data.csv (collision)                    run2_@_data.csv
run1/sub/data.csv  data.csv (collision)                    sub_@_data.csv
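The mapper is a plain callable, so the table above can be checked directly (a quick sketch; the example paths mirror the table):

```python
from fsspec.implementations.cache_mapper import BasenameCacheMapper

flat = BasenameCacheMapper(directory_levels=0)  # equivalent to same_names=True
one = BasenameCacheMapper(directory_levels=1)

print(flat("run1/data.csv"))     # data.csv
print(one("run1/data.csv"))      # run1_@_data.csv
print(one("run1/sub/data.csv"))  # sub_@_data.csv — no "/", so the cache stays flat
```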

Pass the mapper via the cache_mapper parameter:

import fsspec
from fsspec.implementations.cache_mapper import BasenameCacheMapper

fs = fsspec.filesystem(
    "filecache",
    target_protocol="dataverse",
    target_options={
        "host": "borealisdata.ca",
        "pid": "doi:10.5683/SP3/EXAMPLE",
    },
    cache_storage="data/",
    cache_mapper=BasenameCacheMapper(directory_levels=1),
)

Increase directory_levels if your dataset has deeper repeated structure (e.g. site/sensor/data.csv needs directory_levels=2).

Note that the cache remains flat — all files land directly in cache_storage regardless of directory_levels. The path components are encoded into the filename, not reconstructed as subdirectories.


Block cache (blockcache)

blockcache (alias cached) caches only the byte ranges that were actually read. This is better than filecache for Zarr/Xarray workflows where you compute a subset of chunks and do not need the full file on disk.

fs = fsspec.filesystem(
    "blockcache",
    target_protocol="dataverse",
    target_options={
        "host": "borealisdata.ca",
        "pid": "doi:10.5683/SP3/EXAMPLE",
    },
    cache_storage="data/",
)

The API is identical; swap "filecache" for "blockcache" (or "cached").


The data/ directory pattern

Store caches in a data/ directory at the project root — this keeps downloaded data out of src/ and docs/ and is easy to exclude from version control:

project/
├── data/          ← cached files live here
├── src/
└── docs/

Add data/ to .gitignore:

# .gitignore
data/

Checking what is cached

from pathlib import Path

list(Path("data/").glob("*"))

With same_names=True you see the original basenames (flat — no subdirectories). Without it, filenames are SHA-256 hashes of the full remote path, and fsspec keeps a metadata file named cache in the same directory that maps each hash back to its original path.
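Without same_names=True, the hex names can be seen with the same one-liner. The sketch below uses an in-memory target and an invented path so it runs offline; each cache entry is named by the 64-character SHA-256 hex digest of its remote path:

```python
import os
import tempfile

import fsspec

# Hypothetical remote file on an in-memory target filesystem.
mem = fsspec.filesystem("memory")
mem.pipe("/values.csv", b"x\n1\n")

cache_dir = tempfile.mkdtemp()
fs = fsspec.filesystem(
    "filecache",
    target_protocol="memory",
    cache_storage=cache_dir,
)
with fs.open("/values.csv") as f:
    f.read()

# One entry, named by the SHA-256 hex digest (64 characters) of the remote path.
hashed = [n for n in os.listdir(cache_dir) if len(n) == 64]
print(hashed)
```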


Cache expiry

Cached files expire after one week by default. Override with expiry_time (seconds):

fs = fsspec.filesystem(
    "filecache",
    target_protocol="dataverse",
    target_options={"host": "borealisdata.ca", "pid": "doi:..."},
    cache_storage="data/",
    expiry_time=0,        # 0 = never expire
)

The cache metadata records each entry's fetch timestamp; fsspec re-downloads a file when its entry is older than expiry_time.
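A consequence worth knowing: within expiry_time, the cached copy is served even if the remote bytes change. The sketch below demonstrates this with an in-memory target (paths and contents invented) standing in for Borealis:

```python
import tempfile

import fsspec

# Hypothetical remote file on an in-memory target filesystem.
mem = fsspec.filesystem("memory")
mem.pipe("/values.csv", b"v1")

fs = fsspec.filesystem(
    "filecache",
    target_protocol="memory",
    cache_storage=tempfile.mkdtemp(),
    expiry_time=3600,  # entries younger than an hour are trusted as-is
)

with fs.open("/values.csv") as f:
    first = f.read()  # b"v1" — downloaded and cached

mem.pipe("/values.csv", b"v2")  # remote content changes...

with fs.open("/values.csv") as f:
    second = f.read()  # still b"v1": the entry is younger than expiry_time

print(first, second)
```

To pick up upstream changes sooner, lower expiry_time (or delete the cache directory).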


filecache vs blockcache

             filecache                 blockcache
Unit cached  Whole file                Individual byte ranges
Good for     CSV, NetCDF, small files  Large Zarr stores, partial reads
Disk usage   Full file size            Only the blocks you read
Simplicity   Simpler setup             Same API, slightly more metadata