A build-once emissions store
Prepare offline inventories once — conservatively remapped to a model's grid and kept behind a provenance manifest — so the runtime just serves flux. A separate, model-agnostic tool; MIEM (or any host) is simply a consumer.
1 The concept
Two tiers and a manifest. Tier 1 is the canonical library of inventories on their native grids — the appendable source of truth. Tier 2 is per-grid data, conservatively remapped once and materialized. The manifest keys derived data to a provenance hash, so the runtime checks-or-builds and then only ever reads.
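The check-or-build step can be sketched as follows. This is a minimal illustration, not the tool's implementation: the function names, the in-memory dict standing in for the manifest, and all input values are hypothetical, and sorting the checksums is an assumption made here so that listing order does not change the key. The hashed fields follow the provenance-hash inputs this section lists (schema version, Tier-1 checksums, grid-descriptor hash, remap method, engine version).

```python
import hashlib
import json


def provenance_hash(schema_version, tier1_checksums, grid_hash, remap_method, engine_version):
    """Hash the build inputs only (never the float weights or the outputs)."""
    payload = {
        "schema_version": schema_version,
        "tier1_checksums": sorted(tier1_checksums),  # assumption: order-independent
        "grid_hash": grid_hash,
        "remap_method": remap_method,
        "engine_version": engine_version,
    }
    # Canonical JSON (sorted keys, fixed separators) gives a stable byte string.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def check_or_build(manifest, key, build):
    """Return the entry stored under `key`, building it only if it is absent."""
    if key not in manifest:
        manifest[key] = build()
    return manifest[key]


# Hypothetical inputs, for illustration only.
key = provenance_hash(1, ["sha256:bbb", "sha256:aaa"], "grid:1234", "conservative", "engine-x.y")

manifest = {}  # stand-in for the on-disk manifest
calls = []     # counts how often the build actually runs


def build():
    calls.append(1)
    return {"store": "tier2/" + key[:12]}


first = check_or_build(manifest, key, build)
second = check_or_build(manifest, key, build)  # served from the manifest, no rebuild
```

Because the key depends only on inputs, changing any one of them (for example the grid-descriptor hash) yields a different key and therefore a fresh build, while repeated runs with identical inputs only read.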
Scope. Static, offline gridded inventories — anthropogenic, fire, and offline biogenic flux. Online emission schemes (online biogenic, dust, sea-salt, ocean DMS, lightning NOx) are computed by the host model at runtime from its own meteorology and are out of scope here.
- Store the data; don't fit profiles. Zarr codecs compress losslessly; reserve a base×temporal-profile form only for inventories natively given that way. Episodic sources (fire, volcanic) have no stable cycle — keep them stored.
- Sectors are preserved, not pre-summed. CAMS-GLOB-ANT ships per sector; remap each sector once (weights are shared across sectors on a grid) and keep them separate — so each sector can be its own MIEM source (its own species map, scaling, category/hierarchy, vertical injection) with per-sector diagnostics. Sources are combined HEMCO-style (categories sum; highest hierarchy wins per cell) at read, not pre-summed at ingest.
- Vertical structure is preserved. Inventories with an injection dimension (aircraft by flight level, volcanic by altitude) keep a lev coordinate; surface sources stay 2-D. Plume-rise / injection allocation is the consumer's job.
- Time is a dimension, not a partition. Slice any window at read; the manifest carries a time-range only as coverage metadata for reuse validity. Units and calendar (e.g. 365-day no-leap vs. real, with a leap-day policy) are fixed per array; temporal application stays with the consumer.
- Conservation is explicit and verified. The build conserves the global mass integral (Σ flux×area) and records the area definition, edge/partial-cell normalization, and missing-value / negative-clip policy. Before/after mass-budget diagnostics go in the manifest; a mismatch beyond tolerance fails the build.
- Generate weights once, apply many — own the Zarr write. Remapping splits in two: a proven engine (ESMF / TempestRemap / UPTEMPO) generates the conservative weight file — a stored, hashed artifact reused across every sector, species, and time step on that grid pair — and the store applies the weights (a sparse mat-vec, per chunk) and writes Zarr directly. No netCDF→Zarr round-trip: external remappers only emit netCDF, so the store owns the cheap application-and-write half.
- Centralize Tier 1 so Tier-2 builds read only what they need. Today a remap reads a whole monolithic inventory file end-to-end — ≈415 MB for one CAMS-GLOB-ANT black-carbon species-year — to emit a ~90 MB remapped file, repeated per species / sector / year. A chunked, centralized Tier 1 lets the build pull only the chunks in scope (time window, species/sectors, and for a limited-area mesh just the domain), compute weights once per source-grid/mesh pair and reuse them across that inventory's species and sectors, stream in parallel, and rebuild incrementally when one piece changes.
- Mesh is the build-once unit: descriptor embedded, not referenced. Each target grid gets its own Tier-2 store that embeds the resolved SCRIP/ESMF grid descriptor (cell centers, corners, areas, mask), sourced from the model's grid file (MPAS static.nc, WRF geo_em, CAM-FV, …) but stored model-agnostically. Its content hash feeds the provenance hash, so a changed mesh forces a rebuild.
- The provenance hash is input-based. It keys on schema version + Tier-1 content checksums + grid-descriptor hash + remap method + pinned remap-engine version, not on the float weights (which are not bitwise-reproducible across decompositions). Reuse validity tracks inputs, not outputs. Store and manifest carry a format version so readers fail loudly on an incompatible store.
- Build to temp, atomic rename. On a POSIX filesystem (incl. Lustre / GPFS) a same-filesystem rename is atomic, so readers never see a half-built store. (Object stores have no atomic rename — revisit with a manifest-pointer flip if/when S3 comes into scope.)
- Windowed reads compose for regional targets. A lat/lon bbox (+buffer) reads only the chunks overlapping a limited-area domain (e.g. WRF) — a read-side saving classic HEMCO doesn't do. For a global mesh every cell is in-domain, so the saving is moot; on an unstructured mesh it depends on a locality-preserving cell ordering (see open questions).
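The "apply weights" and "conservation is verified" bullets above can be illustrated with a small sketch. It assumes weights arrive as (target, source, weight) triplets with each weight equal to overlap area divided by target-cell area; the function names, the toy 1-D grids, and the tolerance are hypothetical, and a real build would apply the same product per chunk with an array library.

```python
import math


def apply_weights(triplets, src_flux, n_dst):
    """Sparse mat-vec: dst[i] = sum over j of w[i, j] * src[j]."""
    dst = [0.0] * n_dst
    for i, j, w in triplets:
        dst[i] += w * src_flux[j]
    return dst


def mass(flux, area):
    """Global mass integral: sum of flux * cell area."""
    return math.fsum(f * a for f, a in zip(flux, area))


def check_conservation(src_flux, src_area, dst_flux, dst_area, rel_tol=1e-12):
    """Return the before/after budget; raise if the mismatch exceeds the tolerance."""
    before = mass(src_flux, src_area)
    after = mass(dst_flux, dst_area)
    if not math.isclose(before, after, rel_tol=rel_tol):
        raise ValueError(f"mass budget mismatch: {before} vs {after}")
    return {"mass_before": before, "mass_after": after}


# Toy 1-D grids: three source cells of area 2, two target cells of area 3.
src_area = [2.0, 2.0, 2.0]
dst_area = [3.0, 3.0]
# Assumed convention: weight = overlap area / target-cell area.
weights = [(0, 0, 2 / 3), (0, 1, 1 / 3), (1, 1, 1 / 3), (1, 2, 2 / 3)]

src_flux = [1.0, 4.0, 7.0]
dst_flux = apply_weights(weights, src_flux, n_dst=2)
budget = check_conservation(src_flux, src_area, dst_flux, dst_area)
```

The same triplet list serves every sector, species, and time step on this grid pair, and the returned budget is the kind of before/after diagnostic that would be written to the manifest.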
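The windowed read for a limited-area target can be sketched as a bbox-to-chunk lookup. This is an illustration under stated assumptions: a regular lat/lon source grid described by an origin, a uniform step, and a square chunk size; no dateline wrap; and hypothetical function and key names.

```python
import math


def chunk_range(lo, hi, origin, step, chunk, n):
    """Chunk indices along one axis whose cells overlap the interval [lo, hi]."""
    first_cell = max(0, math.floor((lo - origin) / step))
    last_cell = min(n - 1, math.ceil((hi - origin) / step) - 1)
    if last_cell < first_cell:
        return range(0)
    return range(first_cell // chunk, last_cell // chunk + 1)


def chunks_for_bbox(bbox, buffer_deg, grid):
    """(lat_chunk, lon_chunk) pairs that overlap the buffered bounding box."""
    lat0, lat1, lon0, lon1 = bbox
    lat_chunks = chunk_range(lat0 - buffer_deg, lat1 + buffer_deg,
                             grid["lat_origin"], grid["step"], grid["chunk"], grid["nlat"])
    lon_chunks = chunk_range(lon0 - buffer_deg, lon1 + buffer_deg,
                             grid["lon_origin"], grid["step"], grid["chunk"], grid["nlon"])
    return [(i, j) for i in lat_chunks for j in lon_chunks]


# Hypothetical 0.5-degree global grid stored in 60 x 60 cell chunks (6 x 12 = 72 chunks).
grid = {"lat_origin": -90.0, "lon_origin": -180.0, "step": 0.5,
        "chunk": 60, "nlat": 360, "nlon": 720}

# Limited-area domain: 30N to 45N, 110W to 95W, with a 1-degree buffer.
selected = chunks_for_bbox((30.0, 45.0, -110.0, -95.0), 1.0, grid)
```

For this domain the lookup selects 2 of the 72 chunks; a global target would select all of them, which is why the saving is moot there.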
Tier-2 flux keeps a lev axis where the inventory has one (aircraft, volcanic). The consumer (MIEM) interpolates in time, applies the inventory→mechanism species map, and combines sources HEMCO-style at read; sources are never pre-summed at ingest.
2 Why a derived store, not just netCDF
netCDF is the right interchange format — but a build-once store that's appended to, partially read, and written in parallel wants a chunked, multi-array layout. The two aren't rivals: the store is Zarr internally and can still export netCDF slices for consumers that need them.
| Concern | Monolithic netCDF | Zarr v3 derived store |
|---|---|---|
| Add an inventory | rewrite / unlimited-dim hacks | add an array to a group |
| Many inventories + meshes | one grid per file → many files | groups in one logical store |
| Partial read | chunked hyperslab (netCDF-4 / HDF5), but whole-file open | windowed per-chunk, object-store-friendly |
| Parallel writes | mature via PnetCDF / HDF5 collective MPI-IO | lock-free, sharded, library-light |
| Object store (S3) | weak | native |
| Interchange ubiquity | lingua franca CF, every tool reads it | newer — export netCDF slices for legacy consumers |
3 Making it model-agnostic
Nothing in the store is MPAS- or MIEM-specific. Tier 2 is keyed on an abstract target-grid
descriptor (a SCRIP/ESMF grid + hash) that each per-grid store embeds — sourced from the model's own
grid file (MPAS static.nc, WRF geo_em, CAM-FV, …) but stored model-agnostically; the remap engine
handles structured, unstructured, and cubed-sphere grids alike. Speciation stays with the consumer, so the store is
mechanism-agnostic too.
In: an inventory set (per sector), a target-grid descriptor (SCRIP/ESMF), a time window, a remap method,
and the declared units + calendar. Out: a derived store of remapped per-cell flux — kept per sector and
per species, with a lev dimension where the inventory has one — plus a manifest carrying provenance and
mass-budget diagnostics. The store holds inventory species; mapping to a chemical mechanism is the consumer's job —
MIEM applies the configured inventory→mechanism species map at runtime and combines sources HEMCO-style. MIEM is one
consumer of many — WRF-Chem, CAM-chem, or a GOCART-2G host could read
the same store, directly or via a netCDF export.
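The read-side combination (categories sum; highest hierarchy wins per cell) might look like the sketch below. It illustrates the rule as stated in this document, not MIEM's or HEMCO's actual code: the dict layout, the source names, and the use of None to mark cells a source does not cover are all assumptions.

```python
from collections import defaultdict


def combine(sources, n_cells):
    """Within a category the highest hierarchy wins per cell; categories are summed."""
    by_category = defaultdict(list)
    for src in sources:
        by_category[src["category"]].append(src)

    total = [0.0] * n_cells
    for members in by_category.values():
        # Visit the highest hierarchy first so the first covered value wins.
        ranked = sorted(members, key=lambda s: s["hierarchy"], reverse=True)
        for cell in range(n_cells):
            for src in ranked:
                value = src["flux"][cell]
                if value is not None:  # None marks a cell this source does not cover
                    total[cell] += value
                    break
    return total


# Hypothetical sources on a three-cell grid.
sources = [
    {"name": "global_anthro", "category": 1, "hierarchy": 1, "flux": [1.0, 1.0, 1.0]},
    {"name": "regional_anthro", "category": 1, "hierarchy": 2, "flux": [None, 5.0, None]},
    {"name": "fire", "category": 2, "hierarchy": 1, "flux": [0.5, 0.5, 0.0]},
]
combined = combine(sources, n_cells=3)
```

In the middle cell the regional inventory overrides the global one within category 1, and the fire category is added on top. Keeping sectors separate in the store is what makes this per-source choice possible at read time.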
4 Open questions: build-time decisions
Conceptually settled above; these are implementation choices deferred to the build tool, noted here so they aren't mistaken for solved.
- Single-builder election (thundering herd). "Check-or-build" must keep N concurrent ranks from all rebuilding the same store. Options: rank-0-builds + barrier within an MPI job; an O_CREAT|O_EXCL lockfile / lease with stale-lock recovery across independent jobs. Atomic rename gives isolation, not coordination.
- Unstructured cell ordering for windowed reads. To make a bbox read selective on a 1-D nCells array, Tier-2 cells would be permuted into a locality-preserving (space-filling-curve) order and chunked along it, with a bbox→chunk index built from the embedded cell coordinates. Moot for global runs.
- Schema migration. When the store layout or manifest schema changes, the format version drives a documented migration path rather than an undefined read failure.
- Temporal profiles — store-provided or host-owned? The consumer (MIEM) interpolates in time but does not apply diurnal / weekly / seasonal scaling profiles. Since the store owns the time dimension, it could provide them directly — either carry a base flux + normalized profile arrays (the consumer multiplies) or bake higher-resolution flux at build time. Local-time alignment is cheap: the store already carries each cell's longitude, so the UTC→local offset is just a per-cell attribute (≈ lon/15 h) and diurnal profiles apply in local solar time by convention (no time-zone or DST database). The open part is the policy: store-supplied profiles vs. host-owned.
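One of the single-builder options listed above, an O_CREAT|O_EXCL lockfile combined with build-to-temp and atomic rename, can be sketched as follows. It is a minimal illustration for a POSIX filesystem, not a settled design: the function name and paths are hypothetical, and stale-lock recovery, waiting, and cleanup of a failed build are deliberately left out.

```python
import os
import tempfile


def build_once(store_dir, build):
    """Build `store_dir` at most once; return True if this caller did the build."""
    if os.path.isdir(store_dir):
        return False  # already published
    lock_path = store_dir + ".lock"
    try:
        # O_CREAT | O_EXCL: creation fails if the lockfile already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another builder holds the lock; caller should wait and re-check
    try:
        if os.path.isdir(store_dir):
            return False  # published between the first check and taking the lock
        parent = os.path.dirname(store_dir) or "."
        # Build beside the target so the rename stays on one filesystem.
        tmp_dir = tempfile.mkdtemp(prefix=".build-", dir=parent)
        build(tmp_dir)
        os.rename(tmp_dir, store_dir)  # readers see nothing or the full store
        return True
    finally:
        os.close(fd)
        os.unlink(lock_path)


def fake_build(path):
    """Stand-in for the real Tier-2 build: writes one placeholder file."""
    with open(os.path.join(path, "manifest.json"), "w") as f:
        f.write("{}")


root = tempfile.mkdtemp()
target = os.path.join(root, "tier2_store")
built_first = build_once(target, fake_build)
built_again = build_once(target, fake_build)  # store exists, so no rebuild
```

The lockfile supplies the coordination and the rename supplies the isolation; neither replaces the other, which is the distinction the bullet draws.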
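The local-time alignment described in the temporal-profiles question is simple arithmetic. The sketch below assumes a 24-value hourly profile normalized to mean 1.0 and held constant within each hour; the profile values and function names are hypothetical, and whether the store or the host applies such a factor is exactly the open policy question.

```python
def local_offset_hours(lon_deg):
    """UTC to local-solar-time offset in hours, approximately lon / 15."""
    lon = (lon_deg + 180.0) % 360.0 - 180.0  # wrap longitude to [-180, 180)
    return lon / 15.0


def diurnal_factor(profile, utc_hour, lon_deg):
    """Look up a 24-value hourly profile at the cell's local solar hour."""
    local_hour = (utc_hour + local_offset_hours(lon_deg)) % 24.0
    return profile[int(local_hour) % 24]  # extra modulo guards a rounded-up 24.0


# Hypothetical normalized profile (mean 1.0): higher by day, lower at night.
profile = [0.5] * 6 + [1.5] * 12 + [0.5] * 6

base_flux = 2.0  # arbitrary units, illustration only
# At 00 UTC a cell at 90E is at local hour 6 and one at 90W is at local hour 18.
flux_east = base_flux * diurnal_factor(profile, utc_hour=0, lon_deg=90.0)
flux_west = base_flux * diurnal_factor(profile, utc_hour=0, lon_deg=-90.0)
```

Only the cell longitude is needed, so no time-zone or DST database enters the calculation.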