Emissions store · interactive

What actually gets read: netCDF vs Zarr v3

The same emissions array, the same query — but a very different amount of data crosses the wire. Pick a real workload below (or drag your own region and months) and watch how much of the store each layout has to touch. The point isn't that one format is "better": it's that a build-once, read-on-every-step store lives or dies on the access pattern, and that's exactly what chunking changes.

1. The dataset is an array: space × time

One Tier-2 product — say CAMS-GLOB-ANT black carbon, remapped onto your mesh for a year — is a single gridded array with a space axis (your mesh cells) and a time axis (the months). About 415 MB for that one species-year, the same figure the other docs use.

The schematic below stands in for that array: a flattened map (longitude across, latitude down) on one axis, the twelve months on the other. A chunk is a small rectangular block of it that is stored — and compressed, and fetched — as one unit. netCDF read whole is one big object; Zarr v3 is a grid of independently addressable chunks. Everything below follows from that one difference.
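As a minimal sketch of what "independently addressable" means, the snippet below maps one array element (a mesh cell and a month) to the chunk that holds it, and to a key in the style of Zarr v3's default chunk key encoding (`c/<i>/<j>`). The mesh size and chunk shape are assumptions chosen for illustration; the text does not state the real ones.

```python
# Hypothetical sizes for illustration only; the real mesh and chunk
# shape are not given in the text.
N_CELLS = 1_000_000           # mesh cells on the space axis (assumed)
N_MONTHS = 12                 # time axis: one year of monthly fields
CHUNK_SHAPE = (250_000, 3)    # (cells per chunk, months per chunk), assumed


def chunk_index(cell, month, chunk_shape=CHUNK_SHAPE):
    """Return the (space, time) grid position of the chunk holding one element."""
    return (cell // chunk_shape[0], month // chunk_shape[1])


def chunk_key(cell, month, chunk_shape=CHUNK_SHAPE):
    """Build a key shaped like Zarr v3's default chunk key encoding: c/<i>/<j>."""
    i, j = chunk_index(cell, month, chunk_shape)
    return f"c/{i}/{j}"


def grid_shape(n_cells=N_CELLS, n_months=N_MONTHS, chunk_shape=CHUNK_SHAPE):
    """Number of chunks along each axis, rounding up for a partial last chunk."""
    return (-(-n_cells // chunk_shape[0]), -(-n_months // chunk_shape[1]))


if __name__ == "__main__":
    print(grid_shape())            # chunks along (space, time)
    print(chunk_key(260_000, 4))   # chunk holding cell 260000 in May
```

With these assumed numbers the store is a 4 × 4 grid of chunks, and any element resolves to exactly one key that can be fetched without touching the others.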

2. Try it: pick a workload, see what's transferred

Each workload sets a region (on the map) and a month window (on the timeline). Or drag directly on the "Your query" map and timeline to make your own.

[Interactive demo. Controls: a workload selector and a netCDF access-path toggle. Query editor: a map of your mesh (longitude across, 180°W to 180°E; latitude down) and a month timeline; drag on either to adjust. Legend: what you asked for, what was read off disk or wire, not touched, chunk boundary. Two result panels, netCDF and Zarr v3 (chunk-addressed), each show what is read off the map and across months, out of 415 MB. The netCDF whole-file path reads 415 / 415 MB (100%).]

Reading it fairly. Zarr fetches whole chunks, so a query that straddles a chunk edge rounds up to the chunks it touches — that's why "one month, global" still pulls a quarter of the store, not a twelfth. Chunk shape is a design choice you tune to how the data will be read. And flip the netCDF toggle to netCDF-4 hyperslab: a chunked netCDF-4/HDF5 file can read a sub-region too, and the byte counts converge — see the next section for what then stays different.
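The rounding described above can be sketched for the time axis alone. The three-month time chunk is an assumption inferred from the "a quarter, not a twelfth" remark; the function names are illustrative and not part of any library.

```python
MONTHS = 12
TIME_CHUNK = 3  # months per time chunk (assumed: 12 months / 4 chunks)


def chunk_span(start, stop, chunk_len):
    """Round the half-open range [start, stop) outward to whole chunks.

    Returns (first_index_read, end_index_read, number_of_chunks_touched).
    """
    first = start // chunk_len
    last = (stop - 1) // chunk_len
    return first * chunk_len, (last + 1) * chunk_len, last - first + 1


def fraction_of_year_read(start, stop, chunk_len=TIME_CHUNK):
    """Fraction of the time axis fetched for months [start, stop), 0-based."""
    lo, hi, _ = chunk_span(start, stop, chunk_len)
    hi = min(hi, MONTHS)  # a partial last chunk cannot extend past the axis
    return (hi - lo) / MONTHS


if __name__ == "__main__":
    print(fraction_of_year_read(0, 1))  # January only -> 0.25
    print(fraction_of_year_read(2, 4))  # March-April straddles an edge -> 0.5
```

A two-month query that straddles a chunk edge reads six months of data, which is why aligning chunk boundaries with the common query windows matters as much as chunk size.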

3. The three workloads, and what each one shows

The presets aren't arbitrary — they're the access patterns this store actually has to serve well.

  • Long-running global. Every cell, every step. Here the byte counts are basically equal — you need the whole array eventually. Zarr's win is not fewer bytes; it's that each MPI rank reads its own cells as independent, parallel ranged requests, and you stream one time slice at a time instead of opening a single monolithic file and holding it. The store is built once with a verified global mass budget, so every run and every rank reads byte-identical flux.
  • Regional climatology. A sub-domain, all twelve months (e.g. a multi-year mean over a region). Zarr touches only the column of chunks over your domain across the months — a small fraction of the file. Reading the whole global file to use a corner of it is the cost chunking removes.
  • Regional sub-seasonal. The same sub-domain, but only a season. Now both axes are subset, and the read shrinks again — the smallest footprint of the three, the largest relative win.

The fourth preset, one month, global, is the honest counter-example: a thin slice in time but the whole globe in space. Zarr still beats whole-file, but the read rounds the one month up to the full time chunk that contains it, so the saving is bounded by the chunk shape. It is a concrete reminder that chunking helps in proportion to how well the chunks line up with how you read.
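To put rough numbers on the four presets, here is a sketch that counts touched chunks on a hypothetical chunk grid and scales the 415 MB figure accordingly. The grid (8 × 4 spatial chunks, 4 time chunks of three months), the region sizes, and the equal-bytes-per-chunk assumption are all invented for illustration; the demo's real chunk shape may differ.

```python
STORE_MB = 415.0                       # species-year size quoted in the text
LON_CHUNKS, LAT_CHUNKS = 8, 4          # hypothetical spatial chunk grid
MONTHS_PER_CHUNK, TIME_CHUNKS = 3, 4   # hypothetical time chunking


def zarr_mb(lon_hit, lat_hit, first_month, last_month):
    """MB fetched for a query touching lon_hit x lat_hit spatial chunks
    over months first_month..last_month (0-based, inclusive).

    Assumes every chunk holds the same number of bytes.
    """
    t_hit = last_month // MONTHS_PER_CHUNK - first_month // MONTHS_PER_CHUNK + 1
    touched = lon_hit * lat_hit * t_hit
    total = LON_CHUNKS * LAT_CHUNKS * TIME_CHUNKS
    return STORE_MB * touched / total


# (lon chunks hit, lat chunks hit, first month, last month); regions are made up.
PRESETS = {
    "long-running global": (8, 4, 0, 11),
    "regional climatology": (2, 2, 0, 11),
    "regional sub-seasonal": (2, 2, 3, 5),
    "one month, global": (8, 4, 0, 0),
}

if __name__ == "__main__":
    for name, query in PRESETS.items():
        mb = zarr_mb(*query)
        print(f"{name:24s} {mb:8.2f} MB  ({100 * mb / STORE_MB:5.1f}% of store)")
```

Under these assumptions the ordering matches the text: the global run reads everything, the regional climatology an eighth, the sub-seasonal query the least, and "one month, global" a quarter.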

4. "What about different-resolution data?"

A real question from a colleague. It splits into two — and the store answers them in different places.

Inputs at different native resolutions. Inventories don't share a grid: CAMS-GLOB-ANT is 0.1°, fire and biogenic products are coarser, some arrive on entirely different projections. That's normal and fully handled — Tier 1 keeps each inventory at its own native resolution, with a chunk shape chosen to fit it (a fine 0.1° array uses spatial chunks that cover more degrees so each chunk stays a sensible size; a coarse array uses smaller ones). The conservative remap is computed per source grid → your mesh, and Tier 2 is where they all land on one common mesh. So "mixed resolutions in, one resolution out" is the normal path, not a special case.

[Figure. Tier 1 (native grids): CAMS-GLOB-ANT, fine (0.1°), many cells per chunk; FINN fire, coarse, fewer cells per chunk, same byte size. A conservative remap (per grid) feeds Tier 2 (one mesh): every inventory on your grid, per sector and species, with resolution unified here. Your run reads one resolution: the mesh it was built for.]
Different native resolutions in, one mesh out. Each Tier-1 array is chunked to fit its own resolution; the remap is what reconciles them, and Tier 2 holds the unified result.
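What "conservative" means can be shown in one dimension: each destination cell receives the overlap-weighted share of every source cell, so the integrated mass is unchanged. This is a deliberately simplified sketch (flat 1D cells, made-up edges and fluxes), not the store's actual remapping code, which would work with cell areas on the sphere.

```python
def overlap(a0, a1, b0, b1):
    """Length of the intersection of intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))


def conservative_remap_1d(src_edges, src_flux, dst_edges):
    """Overlap-weighted remap of a per-unit-length flux from source to destination cells."""
    src_cells = list(zip(src_edges[:-1], src_edges[1:]))
    dst_flux = []
    for d0, d1 in zip(dst_edges[:-1], dst_edges[1:]):
        # Mass landing in this destination cell: flux times overlapping length.
        mass = sum(f * overlap(s0, s1, d0, d1)
                   for (s0, s1), f in zip(src_cells, src_flux))
        dst_flux.append(mass / (d1 - d0))
    return dst_flux


def total_mass(edges, flux):
    """Integrate flux over all cells."""
    return sum(f * (e1 - e0) for f, e0, e1 in zip(flux, edges[:-1], edges[1:]))


if __name__ == "__main__":
    src_edges = [0.0, 1.0, 2.0, 3.0, 4.0]   # fine source grid (illustrative)
    src_flux = [1.0, 3.0, 5.0, 7.0]
    dst_edges = [0.0, 1.5, 4.0]             # coarser, non-aligned target grid
    dst_flux = conservative_remap_1d(src_edges, src_flux, dst_edges)
    print(dst_flux)
    print(total_mass(src_edges, src_flux), total_mass(dst_edges, dst_flux))
```

The two totals printed at the end agree, which is the property a verified mass budget checks after every remap.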

Wanting the data at a different resolution than it's stored. This is the honest part. Zarr v3 has no automatic multi-resolution pyramids the way a Cloud-Optimized GeoTIFF does — asking for a coarse view of a fine store still reads the fine chunks and averages them; chunking gives you cheap subsetting, not free decimation. So the design's answer to "I need it at resolution X" is the same answer as everything else: conservatively remap to grid X, once, as another Tier-2 product. If a few target resolutions are used often, they're simply stored as separate Tier-2 stores (a multiscale group), each chunked and each carrying its own provenance — rather than resampled on the fly and risking a non-conservative result.
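The "no free decimation" point can be made concrete with a toy coarsening: producing a 2× coarser view by block averaging still reads every fine cell. The grid is a small made-up example with equal-area cells and even dimensions assumed.

```python
def coarsen_2x(fine):
    """Average 2x2 blocks of a fine grid and count how many fine cells were read.

    Assumes equal-area cells and even row and column counts.
    """
    rows, cols = len(fine), len(fine[0])
    cells_read = 0
    coarse = []
    for r in range(0, rows, 2):
        out_row = []
        for c in range(0, cols, 2):
            block = [fine[r + dr][c + dc] for dr in (0, 1) for dc in (0, 1)]
            cells_read += len(block)
            out_row.append(sum(block) / len(block))
        coarse.append(out_row)
    return coarse, cells_read


if __name__ == "__main__":
    fine = [[4 * r + c for c in range(4)] for r in range(4)]  # values 0..15
    coarse, cells_read = coarsen_2x(fine)
    print(coarse)                          # 2x2 coarse view
    print(cells_read, "of", 16, "fine cells read")
```

The coarse result has a quarter of the cells, but the read cost is that of the full fine array, which is why frequently used target resolutions are built once and stored rather than derived per query.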

The rule of thumb

Chunking changes which bytes you read at a fixed resolution — that's what the demo above shows. Changing the resolution itself is a remap, and remapping once into Tier 2 is the whole premise. The store never silently resamples; a different resolution is always a deliberate, conservative, provenance-tracked build.

5. So when is plain netCDF still the right call?

Often. If the deliverable is a single file a colleague opens in ncview, or a slice you hand to a downstream tool, netCDF is exactly right — and the store exports to netCDF for precisely that. A chunked netCDF-4/HDF5 file on a local disk can also hyperslab a sub-region efficiently (flip the toggle and the byte counts match Zarr). What the chunk-addressed layout adds is specific to the build-once / serve-many job:

  • Independent parallel writes. Many ranks write different chunks at once with no central file lock — so remapping a fine mesh, or many inventories, scales on HPC.
  • Append without rewrite. A new year, sector, or inventory is new chunks beside the old ones — not a rewritten file.
  • Object-store-native random access. Each chunk is its own key, fetched by an independent ranged GET; a plain netCDF file over the network is fetched as one object unless a server (THREDDS / OPeNDAP) or a kerchunk sidecar fronts it.
  • Parallel reads at run time. Each rank pulls only its partition's chunks, concurrently — the count in the Zarr panel is the number of independent requests, not one serial file traversal.
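The read and append behavior in the list above can be sketched with an in-memory dictionary standing in for an object store. The key prefix (`bc/2020`), the 4 × 4 chunk grid, and the helper names are hypothetical; against a real object store each `fetch` would be one independent GET.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for an object store: one key per chunk (hypothetical layout).
store = {f"bc/2020/c/{s}/{t}": bytes([s, t]) for s in range(4) for t in range(4)}


def keys_for_rank(rank, n_ranks, n_space=4, n_time=4, prefix="bc/2020"):
    """Chunk keys for one rank's share of the space axis, across every time chunk.

    Assumes n_space is divisible by n_ranks.
    """
    per_rank = n_space // n_ranks
    mine = range(rank * per_rank, (rank + 1) * per_rank)
    return [f"{prefix}/c/{s}/{t}" for s in mine for t in range(n_time)]


def fetch(key):
    """One independent request for one chunk."""
    return key, store[key]


def read_partition(rank, n_ranks):
    """Fetch a rank's chunks concurrently; no other rank's chunks are touched."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return dict(pool.map(fetch, keys_for_rank(rank, n_ranks)))


def append_year(year, n_space=4, n_time=4):
    """Add a new year as new keys beside the old ones; nothing is rewritten."""
    for s in range(n_space):
        for t in range(n_time):
            store[f"bc/{year}/c/{s}/{t}"] = bytes([s, t])


if __name__ == "__main__":
    part = read_partition(rank=0, n_ranks=2)
    print(len(part), "chunks fetched by rank 0")
    append_year(2021)
    print(len(store), "keys in the store after appending 2021")
```

Each rank's key set is disjoint from every other rank's, so reads need no coordination, and appending a year only adds keys.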

In other words: with the hyperslab path the bytes can match, but the concurrency, append, and object-store behavior is what makes the chunked layout fit a store that's written by many ranks and read on every model step. That's the trade the demo is really illustrating.

Full design and trade-offs: the technical concept →  ·  the same story from the emissions side: for scientists →