Work with Zip Archives¶
Some Dataverse datasets bundle many files into zip archives — for example, large waveform collections where individual files number in the thousands. This guide covers three strategies for accessing them via dataversefs.
Strategies at a Glance¶
| Strategy | Best for | Downloads |
|---|---|---|
| Layered ZipFileSystem | A few files from a moderate-size archive | Central directory + selected entries only |
| In-memory (`io.BytesIO`) | All files from a small archive (< ~50 MB) | Full archive |
| Download to local disk | Many files or multi-GB archives | Full archive |
Layered ZipFileSystem — Range Requests Into the Zip¶
Because `DataverseFile` is seekable (it translates `seek()` calls into HTTP
Range requests), fsspec's `ZipFileSystem` can be layered on top. Python's
`zipfile` engine then:
- Fetches the ZIP central directory from the end of the file (~1–2 Range requests)
- Seeks directly to each requested entry and fetches only its compressed bytes
This is far more efficient than downloading the full archive when you need only a handful of files.
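The effect can be demonstrated with the standard library alone. The sketch below (entry names and sizes are invented for the demo) wraps an in-memory archive in a file object that counts reads, then pulls a single entry — only the central directory, the entry's local header, and its compressed bytes are touched:

```python
import io
import os
import zipfile

# Build a throwaway archive with 50 entries of ~20 KB each.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for i in range(50):
        zf.writestr(f"entry_{i:03d}.bin", os.urandom(20_000))
archive = buf.getvalue()


class CountingFile(io.BytesIO):
    """Seekable wrapper that records how many bytes are actually read."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


cf = CountingFile(archive)
with zipfile.ZipFile(cf) as zf:
    data = zf.read("entry_007.bin")  # seeks to this entry; reads only its bytes

print(f"archive: {len(archive):,} bytes; actually read: {cf.bytes_read:,}")
```

Over HTTP, each of those reads becomes a Range request, so the same small fraction of the archive crosses the network.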
```python
import fsspec
from fsspec.implementations.zip import ZipFileSystem

import dataversefs  # noqa: F401  (registers the "dataverse" protocol)

fs = fsspec.filesystem("dataverse", host="borealisdata.ca", pid="doi:...", token=TOKEN)

zip_path = "path/to/archive.zip"
zip_fo = fs.open(zip_path, "rb")
zip_fs = ZipFileSystem(zip_fo)

# List the top-level entries in the archive
entries = zip_fs.ls("", detail=False)
print(entries[:5])

# Find all files of a specific type
all_files = zip_fs.find("", detail=False)
mseed_files = [f for f in all_files if f.endswith(".mseed")]

# Open and read one entry — only its compressed bytes are transferred
with zip_fs.open(mseed_files[0], "rb") as f:
    data = f.read()  # or pass f directly to obspy.read(), pandas, etc.

zip_fo.close()
```
**Blocksize tuning.** `DataverseFile` uses a 4 MB read-ahead buffer by default. When reading many
small, scattered entries from a zip, a smaller blocksize reduces
over-fetching: `fs.open(zip_path, "rb", blocksize=2**16)`.
**When this approach is less efficient.** If you need most or all of the files in a large archive, repeated Range requests add overhead. Use local download instead.
In-Memory Extraction — Small Zips¶
For archives under ~50 MB, the simplest approach is to download the whole zip
into an `io.BytesIO` buffer and extract from memory.
```python
import io
import zipfile

zip_bytes = fs.cat("path/to/small_archive.zip")
print(f"Downloaded {len(zip_bytes):,} bytes")

with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    # Inspect contents
    print(zf.namelist()[:10])

    # Read a specific entry
    content = zf.read("subdir/file.txt").decode()
    print(content)
```
This is ideal for metadata or parameter archives where you want all entries and the total size is manageable.
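When every entry is wanted, a convenient pattern is to read the whole archive into a dict keyed by entry name. A minimal stdlib sketch — the archive built here is invented for the demo:

```python
import io
import zipfile


def zip_to_dict(zip_bytes: bytes) -> dict[str, bytes]:
    """Extract every entry of an in-memory zip into a {name: bytes} dict."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


# Demo with a tiny archive constructed on the fly
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("params/config.txt", "threshold=0.5")
    zf.writestr("params/notes.md", "# Notes")

contents = zip_to_dict(buf.getvalue())
print(sorted(contents))
print(contents["params/config.txt"].decode())
```

The same `zip_to_dict` works directly on the bytes returned by `fs.cat(...)`.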
Download to Local Disk — Bulk Access¶
For multi-GB archives, or when you need to process most entries, download the zip once and work locally.
```python
import zipfile

local_zip = "local_archive.zip"
fs.get("path/to/large_archive.zip", local_zip)

with zipfile.ZipFile(local_zip) as zf:
    for name in zf.namelist():
        if name.endswith(".mseed"):
            zf.extract(name, path="output/")
```
You can also extract directly to a directory without iterating manually:
```python
with zipfile.ZipFile(local_zip) as zf:
    zf.extractall("output/")
```
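If you only need each entry's bytes for processing, not extracted files on disk, `zf.open` streams an entry and decompresses it on the fly. A stdlib sketch; the stand-in archive and its entry names are invented for the demo:

```python
import os
import tempfile
import zipfile

# Build a stand-in for the downloaded archive.
local_zip = os.path.join(tempfile.mkdtemp(), "local_archive.zip")
with zipfile.ZipFile(local_zip, "w") as zf:
    zf.writestr("event_000/a.mseed", b"\x00" * 512)
    zf.writestr("event_000/readme.txt", "not a waveform")

# Stream each matching entry without writing extracted files.
sizes = {}
with zipfile.ZipFile(local_zip) as zf:
    for name in zf.namelist():
        if name.endswith(".mseed"):
            with zf.open(name) as f:  # file-like; decompressed as you read
                sizes[name] = len(f.read())

print(sizes)
```

This avoids doubling disk usage when the archive itself must stay on disk anyway.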
Reading MiniSEED Files with ObsPy¶
`obspy.read()` accepts any seekable file-like object, so it works directly
with files opened from a `ZipFileSystem`:
```python
import obspy
from fsspec.implementations.zip import ZipFileSystem

zip_fo = fs.open("path/to/waveforms.zip", "rb")
zip_fs = ZipFileSystem(zip_fo)

with zip_fs.open("event_000/station_HNZ.mseed", "rb") as f:
    st = obspy.read(f)

print(st)
st[0].plot()  # requires matplotlib

zip_fo.close()
```
After downloading locally, use the standard file path:
```python
st = obspy.read("output/event_000/station_HNZ.mseed")
```