
Work with Zip Archives

Some Dataverse datasets bundle many files into zip archives — for example, large waveform collections where individual files number in the thousands. This guide covers three strategies for accessing them via dataversefs.

Strategies at a Glance

Strategy | Best for | Downloads
Layered ZipFileSystem | A few files from a moderate-size archive | Central directory + selected entries only
In-memory (io.BytesIO) | All files from a small archive (< ~50 MB) | Full archive
Download to local disk | Many files or multi-GB archives | Full archive

Layered ZipFileSystem — Range Requests Into the Zip

Because DataverseFile is seekable (it translates seek() calls into HTTP Range requests), fsspec's ZipFileSystem can be layered on top. Python's zipfile engine then:

  1. Fetches the ZIP central directory from the end of the file (~1–2 Range requests)
  2. Seeks directly to each requested entry and fetches only its compressed bytes

This is far more efficient than downloading the full archive when you need only a handful of files.

import fsspec
from fsspec.implementations.zip import ZipFileSystem

import dataversefs  # noqa: F401

fs = fsspec.filesystem("dataverse", host="borealisdata.ca", pid="doi:...", token=TOKEN)

zip_path = "path/to/archive.zip"
zip_fo = fs.open(zip_path, "rb")
zip_fs = ZipFileSystem(zip_fo)

# List the top-level entries in the archive
entries = zip_fs.ls("", detail=False)
print(entries[:5])

# Find all files of a specific type
all_files = zip_fs.find("", detail=False)
mseed_files = [f for f in all_files if f.endswith(".mseed")]

# Open and read one entry — only its compressed bytes are transferred
with zip_fs.open(mseed_files[0], "rb") as f:
    data = f.read()  # or pass f directly to obspy.read(), pandas, etc.

zip_fo.close()
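The partial-read behaviour can be seen with the standard library alone. In this self-contained sketch, CountingFile is an illustrative helper (not part of dataversefs) that records how many bytes zipfile actually reads when one small entry is pulled from an archive that also contains a ~1 MB entry:

```python
import io
import zipfile

class CountingFile(io.BytesIO):
    """Seekable buffer that records how many bytes are actually read."""
    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk

# Build an archive with one tiny entry and one ~1 MB entry
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("small.txt", b"hello")
    zf.writestr("big.bin", b"\x00" * 1_000_000)
data = buf.getvalue()

cf = CountingFile(data)
with zipfile.ZipFile(cf) as zf:
    payload = zf.read("small.txt")  # seeks past big.bin entirely

print(f"read {cf.bytes_read:,} of {len(data):,} bytes")
```

With DataverseFile in place of CountingFile, those few reads become HTTP Range requests, so only a small fraction of the archive travels over the network.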

Blocksize tuning

DataverseFile uses a 4 MB read-ahead buffer by default. When reading many small, scattered entries from a zip, a smaller blocksize reduces over-fetching: fs.open(zip_path, "rb", blocksize=2**16) caps each read-ahead at 64 KB.

When this approach is less efficient

If you need most or all of the files in a large archive, repeated Range requests add overhead. Use local download instead.
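A rough back-of-envelope calculation shows why (all figures here are assumed, illustrative values; real numbers depend on the server and your connection):

```python
# Hypothetical figures for a bulk read through the layered approach
n_entries = 2000        # files inside the archive
requests_per_entry = 1  # at least one Range request per entry body
latency_s = 0.1         # assumed round-trip time per request

overhead_s = n_entries * requests_per_entry * latency_s
print(f"~{overhead_s:.0f} s spent on request latency alone")  # ~200 s
```

A single bulk download pays that latency cost once instead of per entry.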


In-Memory Extraction — Small Zips

For archives under ~50 MB, the simplest approach is to download the whole zip into an io.BytesIO buffer and extract from memory.

import io
import zipfile

zip_bytes = fs.cat("path/to/small_archive.zip")
print(f"Downloaded {len(zip_bytes):,} bytes")

with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    # Inspect contents
    print(zf.namelist()[:10])

    # Read a specific entry
    content = zf.read("subdir/file.txt").decode()
    print(content)

This is ideal for metadata or parameter archives where you want all entries and the total size is manageable.
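For example, the pattern below loads every entry into a dict in one pass. The archive is built in memory here so the snippet is self-contained; in practice zip_bytes would come from fs.cat() as shown above, and the entry names are made up:

```python
import io
import zipfile

# Stand-in for the bytes returned by fs.cat(...)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("params/a.txt", "alpha")
    zf.writestr("params/b.txt", "beta")
zip_bytes = buf.getvalue()

# Decode every entry into a {name: text} mapping
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    contents = {name: zf.read(name).decode() for name in zf.namelist()}

print(contents)  # {'params/a.txt': 'alpha', 'params/b.txt': 'beta'}
```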


Download to Local Disk — Bulk Access

For multi-GB archives, or when you need to process most entries, download the zip once and work locally.

import zipfile

local_zip = "local_archive.zip"
fs.get("path/to/large_archive.zip", local_zip)

with zipfile.ZipFile(local_zip) as zf:
    for name in zf.namelist():
        if name.endswith(".mseed"):
            zf.extract(name, path="output/")

You can also extract directly to a directory without iterating manually:

with zipfile.ZipFile(local_zip) as zf:
    zf.extractall("output/")
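zipfile.ZipFile.extractall also accepts a members argument, so the filtering loop above can be written as a single call. A self-contained sketch (the demo archive and entry names are invented for illustration):

```python
import io
import os
import tempfile
import zipfile

# Build a demo archive with mixed entry types
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("event_000/station.mseed", b"\x00" * 16)
    zf.writestr("event_000/notes.txt", b"skip me")

outdir = tempfile.mkdtemp()
with zipfile.ZipFile(buf) as zf:
    # Extract only the .mseed entries in one call
    wanted = [n for n in zf.namelist() if n.endswith(".mseed")]
    zf.extractall(outdir, members=wanted)

extracted = sorted(
    os.path.join(root, f)
    for root, _, files in os.walk(outdir)
    for f in files
)
print(extracted)
```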

Reading MiniSEED Files with ObsPy

obspy.read() accepts any seekable file-like object, so it works directly with files opened from a ZipFileSystem:

import obspy
from fsspec.implementations.zip import ZipFileSystem

zip_fo = fs.open("path/to/waveforms.zip", "rb")
zip_fs = ZipFileSystem(zip_fo)

with zip_fs.open("event_000/station_HNZ.mseed", "rb") as f:
    st = obspy.read(f)

print(st)
st[0].plot()  # requires matplotlib

zip_fo.close()

After downloading locally, use the standard file path:

st = obspy.read("output/event_000/station_HNZ.mseed")