Why a build-once emissions store — and what's still open
Across the MUSICA ecosystem we've worked to abstract setup away from the scientist, but emissions remain one of the last manual steps. This page makes the case for why we're considering a build-once store, lays out the open questions for the next phase of MIEM, and sketches what a Zarr v3 layer over the ECCAD catalogue could look like. The full design and trade-offs live in the linked docs.
1. Why this is even on the table
The through-line of the MUSICA ecosystem has been abstraction — taking as much of the setup off the scientist as we reasonably can (configuration, coupling, mechanism handling) so the work is the science, not the plumbing. Emissions are one of the last places that promise still breaks down.
Getting inventory data onto a model's grid is still a manual, per-user, per-project chore: find the data, sign in to a portal, download it species-by-species and year-by-year, harmonize units and calendars, build remap weights, remap, and wire the result into the model — then do it again for the next mesh, the next project, the next colleague. The next phase of MIEM is where we decide how much of that we can absorb. A build-once emissions store is one candidate answer; this page exists to weigh it before we commit.
2. The questions the next phase has to answer
These are genuinely open. They decide whether a central store is even the right shape — or whether we ship something smaller.
- 1. Who is responsible for sourcing the emissions data?
Today it's the user. Should it stay that way, or does MIEM take it on?
- 2. If it stays with the user, do we standardize on a source?
A specified, well-described portal — e.g. ECCAD — makes a store tractable. An anything-goes mix of formats and native grids does not.
- 3. If we abstract sourcing away from the user, how do we do it, and how do we handle sign-in?
This is the crux. ECCAD data sits behind Aeris SSO, and per-inventory redistribution terms may not let us simply re-serve it. See the fork below.
- 4. What is the current (pre-)workflow to retrieve and remap everything a user wants?
The top lane above is a first sketch — worth pinning down precisely, because it's the baseline any store has to beat.
- 5. How could an emissions store change that workflow?
The bottom lane is the hypothesis: request → build-once → serve, with speciation and source-combination still owned by the run.
- 6. …and the questions we haven't asked yet.
Per-inventory licensing, who hosts and pays for storage, update cadence, and version stability across a multi-year study — to be filled in as the phase opens.
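To make the request → build-once → serve hypothesis from question 5 concrete, here is a minimal sketch under stated assumptions. The class `BuildOnceStore`, the `fake_builder` stand-in, and the inventory and mesh names are all hypothetical; nothing here is MIEM's actual API. It only shows the caching behaviour: the expensive build runs on the first request for a given (inventory, mesh) pair, and later requests are served from what was already built.

```python
from typing import Callable, Dict, Tuple


class BuildOnceStore:
    """Hypothetical sketch: build a product on first request, then serve the stored copy."""

    def __init__(self, builder: Callable[[str, str], dict]):
        self._builder = builder                       # the expensive fetch/harmonize/remap step
        self._built: Dict[Tuple[str, str], dict] = {}
        self.build_count = 0                          # how many times the builder actually ran

    def request(self, inventory: str, mesh: str) -> dict:
        key = (inventory, mesh)
        if key not in self._built:                    # first request: build and keep
            self._built[key] = self._builder(inventory, mesh)
            self.build_count += 1
        return self._built[key]                       # later requests: serve only


def fake_builder(inventory: str, mesh: str) -> dict:
    # Stand-in for the real work (download, unit/calendar harmonization, remap).
    return {"inventory": inventory, "mesh": mesh}


store = BuildOnceStore(fake_builder)
first = store.request("example-inventory", "example-mesh")
second = store.request("example-inventory", "example-mesh")
print(store.build_count)  # → 1
```

Speciation and source-combination are deliberately absent from the sketch, since the hypothesis leaves them with the run.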
Authentication and licensing may make a single, central Tier 1 (the native-grid library that everyone reads from) impractical. If so, the store doesn't have to be central; there are smaller shapes:
- A · Central, pre-built Tier 1. Fastest for everyone — but only viable if we're permitted to re-serve the underlying inventories. Aeris SSO and per-inventory redistribution terms are the blocker to resolve first.
- B · Ship the builder, not the data. The user authenticates once with their own Aeris credentials; the tool fetches and ingests into a Tier 1 they own. Respects licensing; each site builds its own.
- C · Skip Tier 1 — build Tier 2 directly. If a user only needs one or two meshes, go straight to the remapped per-mesh product and never maintain a full native-grid library at all.
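One possible reading of the fork above, written as a small decision function. This is only an illustration of how the three shapes relate, not a settled rule: the function name `choose_store_shape` and its two inputs are hypothetical, and the "one or two meshes" threshold is taken directly from option C.

```python
def choose_store_shape(may_redistribute: bool, meshes_needed: int) -> str:
    """Hypothetical sketch of the A/B/C fork; returns the option letter."""
    if may_redistribute:
        # A: a central, pre-built Tier 1 is only viable if re-serving is permitted.
        return "A"
    if meshes_needed <= 2:
        # C: for one or two meshes, skip Tier 1 and build Tier 2 directly.
        return "C"
    # B: ship the builder; each site ingests into a Tier 1 it owns.
    return "B"


print(choose_store_shape(may_redistribute=False, meshes_needed=1))  # → C
print(choose_store_shape(may_redistribute=False, meshes_needed=6))  # → B
```

In practice the answer is likely to vary per inventory, since redistribution terms are per-inventory.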
3. What a Zarr v3 Tier 1 for ECCAD could look like
To make it concrete: here's the scale of the ECCAD catalogue today, and a sketch of how it could map onto a single Zarr v3 store. (Stats come from the ECCAD catalogue API and count emission groups only; geographic masks and ancillary fields are support data, not flux.)
The catalogue's own keys — inventory group → sub-inventory → sector → species — map almost directly onto a Zarr v3 group hierarchy. Each sub-inventory node embeds its native grid descriptor (resolution, extent), temporal coverage, units, calendar, and provenance (version / DOI); each leaf is one array per species per sector, chunked along time and space and sharded so a fine catalogue doesn't explode the file count.
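The key-to-hierarchy mapping described above can be sketched in plain Python. This is an illustration under stated assumptions, not the store's actual schema: the names `LeafKey` and `sub_inventory_attrs`, the attribute field names, and every example value are hypothetical, and the code does not call any Zarr library. It only shows how the four catalogue keys could become a group path and which descriptors would sit on the sub-inventory node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeafKey:
    """Catalogue keys for one emissions array (example values are hypothetical)."""
    inventory_group: str
    sub_inventory: str
    sector: str
    species: str

    def group_path(self) -> str:
        # Sub-inventory node: where grid, calendar, units and provenance would live.
        return f"{self.inventory_group}/{self.sub_inventory}"

    def array_path(self) -> str:
        # One array per species per sector, nested under the sub-inventory node.
        return f"{self.group_path()}/{self.sector}/{self.species}"


def sub_inventory_attrs(resolution_deg, extent, time_range, units, calendar,
                        version, doi=None):
    """Assemble the descriptor the text says each sub-inventory node embeds."""
    return {
        "grid": {"resolution_deg": resolution_deg, "extent": list(extent)},
        "temporal_coverage": list(time_range),
        "units": units,
        "calendar": calendar,
        "provenance": {"version": version, "doi": doi},
    }


key = LeafKey("EXAMPLE-GROUP", "example-v1", "energy", "CO")
attrs = sub_inventory_attrs(0.1, (-180.0, -90.0, 180.0, 90.0),
                            ("2000-01", "2020-12"), "kg m-2 s-1", "standard", "v1")
print(key.array_path())  # → EXAMPLE-GROUP/example-v1/energy/CO
```

Keeping the descriptor on the sub-inventory node rather than on each leaf means every species and sector array under it shares one grid and calendar definition.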
Arrays carry a lev axis where the inventory has one (aircraft, volcanic).
- Native grids stay native. Resolution and region vary per sub-inventory (0.0044°→1°, regional→global), so the grid descriptor lives on the sub-inventory node, and Tier 1 never remaps. Remapping to a model mesh is Tier 2's job.
- Sparse, not 112×13×170. Most sub-inventories carry only a few sectors and species; Zarr only materializes arrays that exist, and sharding keeps even a fine catalogue to a sane file count on Lustre/GPFS.
- Appendable by construction. A new inventory, sector, year, or species is a new group or array — nothing already written is touched.
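The file-count argument in the list above comes down to simple arithmetic, sketched here with made-up numbers. The array shape, chunk shape, and shard shape are hypothetical and chosen only for illustration; the one real constraint reflected is that each shard dimension is a whole multiple of the chunk dimension. The dense product is the 112×13×170 quoted in the text.

```python
import math


def count_pieces(shape, piece_shape):
    """Number of pieces needed to tile `shape` with `piece_shape` (ceil per axis)."""
    return math.prod(math.ceil(n / p) for n, p in zip(shape, piece_shape))


# Hypothetical leaf array: 20 years of monthly data on a 0.1 degree global grid.
shape = (240, 1800, 3600)        # (time, lat, lon)
chunk_shape = (12, 180, 360)     # read unit: one year on one of 10 x 10 spatial tiles
shard_shape = (120, 1800, 3600)  # storage unit: ten years over the whole grid

n_chunks = count_pieces(shape, chunk_shape)  # stored objects if every chunk is its own file
n_shards = count_pieces(shape, shard_shape)  # stored objects once chunks are packed into shards
print(n_chunks, n_shards)  # → 2000 2

# Upper bound on leaf arrays if every combination existed (product quoted in the text).
dense_upper_bound = 112 * 13 * 170
print(dense_upper_bound)  # → 247520
```

Because only arrays that exist are materialized, the real array count sits well below the dense bound, and sharding then cuts the per-array object count on top of that.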
Convinced it's worth exploring? The full design and trade-offs are in the concept (technical) and the scientist overview; the interactive demo shows what each storage layout actually reads.