Emissions tooling · the case

Why a build-once emissions store — and what's still open

Across the MUSICA ecosystem we've worked to abstract setup away from the scientist — emissions are one of the last manual steps left. This page makes the case for why we're considering a build-once store, lays out the open questions for the next phase of MIEM, and sketches what a Zarr v3 layer over the ECCAD catalogue could look like. The full design and trade-offs live in the linked docs.

1. Why this is even on the table

The through-line of the MUSICA ecosystem has been abstraction — taking as much of the setup off the scientist as we reasonably can (configuration, coupling, mechanism handling) so the work is the science, not the plumbing. Emissions are one of the last places that promise still breaks down.

Getting inventory data onto a model's grid is still a manual, per-user, per-project chore: find the data, sign in to a portal, download it species-by-species and year-by-year, harmonize units and calendars, build remap weights, remap, and wire the result into the model — then do it again for the next mesh, the next project, the next colleague. The next phase of MIEM is where we decide how much of that we can absorb. A build-once emissions store is one candidate answer; this page exists to weigh it before we commit.

Figure: What sourcing emissions looks like today (top lane) versus the build-once proposition (bottom lane). The questions below are about how far down that second lane we can actually get, and who does the work.

2. The questions the next phase has to answer

These are genuinely open. They decide whether a central store is even the right shape — or whether we ship something smaller.

1. Who is responsible for sourcing the emissions data?
   Today it's the user. Should it stay that way, or does MIEM take it on?
2. If it stays with the user, do we standardize on a source?
   A specified, well-described portal (e.g. ECCAD) makes a store tractable. An anything-goes mix of formats and native grids does not.
3. If we abstract sourcing away from the user, how, and how do we handle sign-in?
   This is the crux. ECCAD data sits behind Aeris SSO, and per-inventory redistribution terms may not let us simply re-serve it. See the fork below.
4. What is the current (pre-)workflow to retrieve and remap everything a user wants?
   The top lane above is a first sketch. It is worth pinning down precisely, because it's the baseline any store has to beat.
5. How could an emissions store change that workflow?
   The bottom lane is the hypothesis: request → build-once → serve, with speciation and source-combination still owned by the run.
6. …and the questions we haven't asked yet.
   Per-inventory licensing, who hosts and pays for storage, update cadence, and version stability across a multi-year study. These are to be filled in as the phase opens.
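
The request → build-once → serve hypothesis in question 5 can be sketched as a keyed cache: the first request for a product pays for the build, and every later request is served from the store. This is only an illustration of the control flow, not MIEM's design. The class `BuildOnceStore`, the key layout, and `fake_build` are all hypothetical names introduced here.

```python
from typing import Callable, Dict, Tuple

# Hypothetical key: (sub_inventory, species, mesh)
Key = Tuple[str, str, str]


class BuildOnceStore:
    """Serve a product from the store, building it only on the first request."""

    def __init__(self, build_fn: Callable[[Key], bytes]) -> None:
        self._build_fn = build_fn
        self._products: Dict[Key, bytes] = {}
        self.builds = 0  # how many times the expensive path actually ran

    def request(self, key: Key) -> bytes:
        if key not in self._products:
            # First request for this key: run the expensive build and keep it
            self._products[key] = self._build_fn(key)
            self.builds += 1
        return self._products[key]


def fake_build(key: Key) -> bytes:
    # Stand-in for the real work (fetch, harmonize, remap); assumption only
    return "/".join(key).encode()


store = BuildOnceStore(fake_build)
first = store.request(("CAMS-GLOB-ANT", "NOx", "mesh-a"))
second = store.request(("CAMS-GLOB-ANT", "NOx", "mesh-a"))
```

After both requests `store.builds` is 1: the second call never touches `fake_build`. Where the products live (central store, per-site store, or per-mesh only) is exactly what the fork below leaves open.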

The fork that question 3 forces — not yet decided

Authentication and licensing may make a single, central Tier 1 that everyone reads from impractical. If so, the store doesn't have to be central — there are smaller shapes:

A · Central, pre-built Tier 1. Fastest for everyone — but only viable if we're permitted to re-serve the underlying inventories. Aeris SSO and per-inventory redistribution terms are the blocker to resolve first.
B · Ship the builder, not the data. The user authenticates once with their own Aeris credentials; the tool fetches and ingests into a Tier 1 they own. Respects licensing; each site builds its own.
C · Skip Tier 1 — build Tier 2 directly. If a user only needs one or two meshes, go straight to the remapped per-mesh product and never maintain a full native-grid library at all.

3. What a Zarr v3 Tier 1 for ECCAD could look like

To make it concrete: here's the scale of the ECCAD catalogue today, and a sketch of how it could map onto a single Zarr v3 store. (Stats from the ECCAD catalogue API, counting emission groups only; geographic masks and ancillary fields are support data, not flux.)

12 inventory groups
112 sub-inventories
170 species
13 sectors
16 regions
12 resolutions
3 time intervals
1750–2101 year range

The catalogue's own keys — inventory group → sub-inventory → sector → species — map almost directly onto a Zarr v3 group hierarchy. Each sub-inventory node embeds its native grid descriptor (resolution, extent), temporal coverage, units, calendar, and provenance (version / DOI); each leaf is one array per species per sector, chunked along time and space and sharded so a fine catalogue doesn't explode the file count.
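
As a minimal sketch of that mapping, the four catalogue keys can be joined into a node path, and the sub-inventory's descriptors can ride in the `attributes` of a Zarr v3 group metadata document (`zarr.json` with `zarr_format: 3` and `node_type: "group"`). The helper names and the attribute keys (`grid`, `temporal_coverage`, `time_interval`) are assumptions made for illustration; the values are taken from the CAMS-GLOB-ANT entry in the tree sketch.

```python
import json
from typing import Any, Dict


def array_path(group: str, sub_inventory: str, sector: str, species: str) -> str:
    """Join the catalogue keys into a store-relative node path."""
    return "/".join([group, sub_inventory, sector, species])


def sub_inventory_metadata(attrs: Dict[str, Any]) -> str:
    """Return the text of a Zarr v3 group zarr.json carrying `attrs`."""
    doc = {"zarr_format": 3, "node_type": "group", "attributes": attrs}
    return json.dumps(doc, indent=2)


path = array_path("CAMS", "CAMS-GLOB-ANT", "power", "NOx")
meta = sub_inventory_metadata({
    # Attribute names are hypothetical, not an agreed schema
    "grid": {"kind": "lat/lon", "resolution_deg": 0.1},
    "temporal_coverage": "2000-2026",
    "time_interval": "monthly",
})
```

Here `path` is `CAMS/CAMS-GLOB-ANT/power/NOx`, and `meta` is what would sit at `CAMS/CAMS-GLOB-ANT/zarr.json`. Units, calendar, and provenance would be further attributes on the same node.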

eccad-tier1.zarr/                  Zarr v3 store · 12 inventory groups
│  zarr.json                       store + group metadata
│
├─ CAMS/                           15 sub-inventories · Europe + Global
│  ├─ CAMS-GLOB-ANT/               grid: 0.1° lat/lon · 2000–2026 · monthly · DOI
│  │  ├─ power/     NOx CO SO2 NMVOC …  [time, lat, lon]  chunk (12,256,256) · sharded
│  │  ├─ industry/  …  [time, lat, lon]
│  │  ├─ shipping/  …  [time, lat, lon]
│  │  ├─ aircraft/  …  [time, lev, lat, lon]  ← keeps a lev axis
│  │  └─ …          (one group per sector)
│  ├─ GFASv1.3/       fire/ …      [time, lat, lon]
│  └─ CAMS-GLOB-BIO/  biogenic/ …  [time, lat, lon]
│
├─ EDGAR/             EDGARv7/ anthropogenic/ …  [time, lat, lon]  global · 0.1° · monthly
├─ CEDS/              CEDS/ anthropogenic/ …  [time, lat, lon]  0.1° / 0.25° / 0.5°
├─ GEIA/              volcanic/ lightning/ soil/ …  [time, lev?, lat, lon]
├─ Future-Scenarios/  SSPs/ RCP*/ …  [time, lat, lon]  2000–2101
└─ …                  ECLIPSE · Inverse-Modelling · REGIONAL · GLOBAL-more-datasets

One store, the catalogue's hierarchy preserved. Sectors stay separate (combined HEMCO-style at read, never pre-summed); arrays keep a lev axis where the inventory has one (aircraft, volcanic).
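
To illustrate "combined at read, never pre-summed", here is a minimal sketch in which the store keeps one field per sector and the reader sums only the sectors it asks for. It is a plain cell-by-cell sum, not HEMCO's actual combination logic; `combine_sectors`, the toy 2×2 grids, and the flux values are made up for the example.

```python
from typing import Dict, Iterable, List

Field = List[List[float]]  # a tiny [lat][lon] grid of fluxes


def combine_sectors(sectors: Dict[str, Field], wanted: Iterable[str]) -> Field:
    """Sum the requested sector fields cell by cell, leaving the inputs untouched."""
    names = list(wanted)
    if not names:
        raise ValueError("at least one sector is required")
    nlat = len(sectors[names[0]])
    nlon = len(sectors[names[0]][0])
    total = [[0.0] * nlon for _ in range(nlat)]
    for name in names:
        for j in range(nlat):
            for i in range(nlon):
                total[j][i] += sectors[name][j][i]
    return total


# Stored per sector, as in the tree above (values are illustrative only)
stored = {
    "power": [[1.0, 2.0], [0.0, 0.5]],
    "industry": [[0.5, 0.0], [1.0, 1.5]],
    "shipping": [[0.0, 0.25], [0.0, 0.0]],
}
land_only = combine_sectors(stored, ["power", "industry"])
```

Because shipping was never folded into a stored total, a run that wants to exclude or rescale it can do so without rebuilding anything in the store.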

Why this shape

Native grids stay native. Resolution and region vary per sub-inventory (0.0044°→1°, regional→global), so the grid descriptor lives on the sub-inventory node and Tier 1 never remaps. Remapping to a model mesh is Tier 2's job.
Sparse, not 112×13×170. Most sub-inventories carry only a few sectors and species; Zarr only materializes arrays that exist, and sharding keeps even a fine catalogue to a sane file count on Lustre/GPFS.
Appendable by construction. A new inventory, sector, year, or species is a new group or array — nothing already written is touched.
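
The file-count argument can be checked with a little arithmetic. The sketch below counts how many chunks, and how many shards, tile one array. The array shape assumes a global 0.1° grid (1800 × 3600) with monthly data for 2000–2026 inclusive (27 years), and the chunk shape is the one shown in the tree sketch. The shard shape is a hypothetical choice made here, since the source does not specify one.

```python
from math import ceil, prod
from typing import Sequence


def grid_count(shape: Sequence[int], block: Sequence[int]) -> int:
    """Number of blocks needed to tile `shape`, counting partial edge blocks."""
    return prod(ceil(n / b) for n, b in zip(shape, block))


# Assumed array: 27 years of monthly data on a global 0.1 degree grid
shape = (27 * 12, 1800, 3600)   # [time, lat, lon]
chunk = (12, 256, 256)          # chunk shape from the tree sketch
shard = (12, 1024, 1024)        # hypothetical: 4 x 4 chunks per shard in space

n_chunks = grid_count(shape, chunk)   # stored objects if every chunk is a file
n_shards = grid_count(shape, shard)   # stored objects with sharding
```

Under these assumptions one species in one sector needs 3240 chunks but only 216 shard objects, a 15-fold reduction in what the filesystem has to track. The right shard shape would depend on the access pattern and is left open here.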

Next — the design itself

Convinced it's worth exploring? The full design and trade-offs are in the concept (technical) and the scientist overview; the interactive demo shows what each storage layout actually reads.