Emissions tooling · concept

A build-once emissions store

Prepare offline inventories once — conservatively remapped to a model's grid and kept behind a provenance manifest — so the runtime just serves flux. A separate, model-agnostic tool; MIEM (or any host) is simply a consumer.

1 The concept

Two tiers and a manifest. Tier 1 is the canonical library of inventories on their native grids — the appendable source of truth. Tier 2 is per-grid data, conservatively remapped once and materialized. The manifest keys derived data to a provenance hash, so the runtime checks-or-builds and then only ever reads.
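A minimal sketch of how such a provenance key could be derived, assuming the input-based fields named under the design decisions (schema version, Tier-1 content checksums, grid-descriptor hash, remap method, pinned engine version). The function name `provenance_key`, the JSON encoding, and the choice of SHA-256 are illustrative assumptions, not the tool's actual format.

```python
import hashlib
import json


def provenance_key(schema_version, tier1_checksums, grid_hash,
                   remap_method, engine_version):
    """Return a hex digest that identifies one Tier-2 build by its inputs."""
    record = {
        "schema_version": schema_version,
        # Sorted so the key does not depend on the order inventories are listed.
        "tier1_checksums": sorted(tier1_checksums),
        "grid_hash": grid_hash,
        "remap_method": remap_method,
        "engine_version": engine_version,
    }
    # Canonical JSON (sorted keys, fixed separators) gives a stable byte string.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical inputs: same content in a different order, then a changed mesh.
key_a = provenance_key(1, ["sha256:aaa", "sha256:bbb"], "grid:123", "conserve", "engine-1.0")
key_b = provenance_key(1, ["sha256:bbb", "sha256:aaa"], "grid:123", "conserve", "engine-1.0")
key_c = provenance_key(1, ["sha256:aaa", "sha256:bbb"], "grid:456", "conserve", "engine-1.0")
```

Because only inputs enter the digest, `key_a` and `key_b` match while the changed grid hash in `key_c` yields a different key, which is what forces a rebuild.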

Scope. Static, offline gridded inventories — anthropogenic, fire, and offline biogenic flux. Online emission schemes (online biogenic, dust, sea-salt, ocean DMS, lightning NOx) are computed by the host model at runtime from its own meteorology and are out of scope here.

[Figure: pipeline from native inventories to runtime consumers. Tier 1 is the native-grid inventory library (CAMS-GLOB-ANT anthropogenic, FINN fire, CAMS-GLOB-BIO offline biogenic flux, all on lat/lon; Zarr v3, one array per inventory, append-friendly). A target-grid descriptor (SCRIP / ESMF, e.g. from MPAS static.nc) feeds the Build step, a conservative remap run once per (inventory set × grid × time window). The manifest / index does check-or-build on the provenance hash. Tier 2 holds one store per mesh (e.g. MPAS x1.163842, MPAS x1.10242) with remapped flux, provenance, and the embedded grid descriptor. Runtime consumers (MIEM: read slice → time-interp → map species → HEMCO combine → flux) do no remapping.]
Build once (offline or on first call), serve many (every model step). The remap — the expensive part — happens exactly once per (inventory set × grid × time window).
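The build-once, serve-many pattern can be sketched as a check-or-build helper that writes to a temporary directory and publishes it with a same-filesystem rename. This is a sketch under stated assumptions: `check_or_build`, the directory-per-key layout, and the `build` callback are hypothetical names, and the helper gives isolation only (two concurrent callers may both build, but only one result is published).

```python
import os
import shutil
import tempfile


def check_or_build(store_root, key, build):
    """Return the path of the store for `key`, building it only if absent."""
    final_path = os.path.join(store_root, key)
    if os.path.isdir(final_path):
        return final_path  # already built: readers only ever read from here

    os.makedirs(store_root, exist_ok=True)
    # Temp directory under the same root, so the rename stays on one filesystem.
    tmp_path = tempfile.mkdtemp(prefix=key + ".tmp-", dir=store_root)
    build(tmp_path)  # the expensive remap-and-write step (caller-supplied)
    try:
        os.rename(tmp_path, final_path)  # publish in one step
    except OSError:
        # Another builder published first; discard our copy and use theirs.
        shutil.rmtree(tmp_path, ignore_errors=True)
    return final_path


# Usage with a stand-in build function that records how often it runs.
calls = []


def fake_build(path):
    calls.append(path)
    with open(os.path.join(path, "flux.txt"), "w") as handle:
        handle.write("remapped flux placeholder")


root = tempfile.mkdtemp()
first = check_or_build(root, "abc123", fake_build)
second = check_or_build(root, "abc123", fake_build)
```

The second call finds the published directory and returns without invoking the build again.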
Design decisions we settled on
  • Store the data; don't fit profiles. Zarr codecs compress losslessly; reserve a base×temporal-profile form only for inventories natively given that way. Episodic sources (fire, volcanic) have no stable cycle — keep them stored.
  • Sectors are preserved, not pre-summed. CAMS-GLOB-ANT ships per sector; remap each sector once (weights are shared across sectors on a grid) and keep them separate — so each sector can be its own MIEM source (its own species map, scaling, category/hierarchy, vertical injection) with per-sector diagnostics. Sources are combined HEMCO-style (categories sum; highest hierarchy wins per cell) at read, not pre-summed at ingest.
  • Vertical structure is preserved. Inventories with an injection dimension — aircraft by flight level, volcanic by altitude — keep a lev coordinate; surface sources stay 2-D. Plume-rise / injection allocation is the consumer's job.
  • Time is a dimension, not a partition. Slice any window at read; the manifest carries a time range only as coverage metadata for reuse validity. Units and calendar (e.g. 365-day no-leap vs. standard Gregorian, with a leap-day policy) are fixed per array; temporal application stays with the consumer.
  • Conservation is explicit and verified. The build conserves the global mass integral (Σ flux×area) and records the area definition, edge/partial-cell normalization, and missing-value / negative-clip policy. Before/after mass-budget diagnostics go in the manifest; a mismatch beyond tolerance fails the build.
  • Generate weights once, apply many — own the Zarr write. Remapping splits in two: a proven engine (ESMF / TempestRemap / UPTEMPO) generates the conservative weight file — a stored, hashed artifact reused across every sector, species, and time step on that grid pair — and the store applies the weights (a sparse mat-vec, per chunk) and writes Zarr directly. No netCDF→Zarr round-trip: external remappers only emit netCDF, so the store owns the cheap application-and-write half.
  • Centralize Tier 1 so Tier-2 builds read only what they need. Today a remap reads a whole monolithic inventory file end-to-end — ≈415 MB for one CAMS-GLOB-ANT black-carbon species-year — to emit a ~90 MB remapped file, repeated per species / sector / year. A chunked, centralized Tier 1 lets the build pull only the chunks in scope (time window, species/sectors, and for a limited-area mesh just the domain), compute weights once per source-grid/mesh pair and reuse them across that inventory's species and sectors, stream in parallel, and rebuild incrementally when one piece changes.
  • Mesh is the build-once unit — descriptor embedded, not referenced. Each target grid gets its own Tier-2 store that embeds the resolved SCRIP/ESMF grid descriptor (cell centers, corners, areas, mask) — sourced from the model's grid file (MPAS static.nc, WRF geo_em, CAM-FV, …) but stored model-agnostically. Its content hash feeds the provenance hash, so a changed mesh forces a rebuild.
  • The provenance hash is input-based. It keys on schema version + Tier-1 content checksums + grid-descriptor hash + remap method + pinned remap-engine version — not on the float weights (not bitwise-reproducible across decompositions). Reuse validity tracks inputs, not outputs. Store and manifest carry a format version so readers fail loudly on an incompatible store.
  • Build to temp, atomic rename. On a POSIX filesystem (incl. Lustre / GPFS) a same-filesystem rename is atomic, so readers never see a half-built store. (Object stores have no atomic rename — revisit with a manifest-pointer flip if/when S3 comes into scope.)
  • Windowed reads compose for regional targets. A lat/lon bbox (+buffer) reads only the chunks overlapping a limited-area domain (e.g. WRF), a read-side saving that classic HEMCO does not provide. For a global mesh every cell is in-domain, so the saving is moot; on an unstructured mesh it depends on a locality-preserving cell ordering (see open questions).
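The "apply weights" and "verify conservation" decisions above can be illustrated together. The sketch below is pure Python for readability and assumes weights in coordinate (row, column, value) form with the usual conservative definition, weight = overlap area / destination cell area; a real build would apply the engine-generated weight file per chunk with a sparse-matrix library. All function names are hypothetical.

```python
import math


def apply_weights(rows, cols, vals, src_flux, n_dst):
    """Sparse mat-vec in coordinate form: dst[row] += weight * src[col]."""
    dst = [0.0] * n_dst
    for r, c, w in zip(rows, cols, vals):
        dst[r] += w * src_flux[c]
    return dst


def mass(flux, area):
    """Global mass integral: sum over cells of flux times cell area."""
    return math.fsum(f * a for f, a in zip(flux, area))


def check_conservation(src_flux, src_area, dst_flux, dst_area, rel_tol=1e-9):
    """Return before/after budget diagnostics, or fail beyond tolerance."""
    before = mass(src_flux, src_area)
    after = mass(dst_flux, dst_area)
    if not math.isclose(before, after, rel_tol=rel_tol, abs_tol=0.0):
        raise ValueError(f"mass budget mismatch: {before} vs {after}")
    return {"mass_before": before, "mass_after": after}


# Toy grid pair: four source cells of area 1, two destination cells of area 2.
# Each destination cell fully covers two source cells, so every weight is 1/2.
rows = [0, 0, 1, 1]
cols = [0, 1, 2, 3]
vals = [0.5, 0.5, 0.5, 0.5]
src_flux = [1.0, 3.0, 5.0, 7.0]
src_area = [1.0, 1.0, 1.0, 1.0]
dst_area = [2.0, 2.0]

dst_flux = apply_weights(rows, cols, vals, src_flux, n_dst=2)
budget = check_conservation(src_flux, src_area, dst_flux, dst_area)
```

The same `rows`, `cols`, `vals` triple would be reused for every sector, species, and time step on that grid pair; only `src_flux` changes.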
[Figure: contents of a Tier-2 store (mesh A). CAMS-GLOB-ANT, one array per sector: power, industry, transport, plus residential, shipping, agriculture, each [time × cell]. Vertically resolved arrays keep a lev axis: aircraft and volcanic SO₂, [time × lev × cell]. Surface arrays are 2-D: FINN fire and CAMS-GLOB-BIO, [time × cell]. At read, in the consumer (MIEM): × sector-specific temporal profile (HEMCO-style), × sector-specific speciation to mechanism species, then aggregate over sectors to per-cell flux per mechanism species on surface and elevated (lev) levels.]
Flux is kept per sector and per species, with a vertical lev axis where the inventory has one (aircraft, volcanic). The consumer (MIEM) interpolates in time, applies the inventory→mechanism species map, and combines sources HEMCO-style at read — never pre-summed at ingest.
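The HEMCO-style combine rule (categories sum; highest hierarchy wins per cell) can be sketched as follows. This is an illustration only: the dictionary layout, the name `combine_sources`, and the use of `None` to mark cells where a source has no data are assumptions made for the example, not MIEM's or HEMCO's actual data model.

```python
from collections import defaultdict


def combine_sources(sources, n_cells):
    """Combine per-cell fluxes: within a category the highest hierarchy
    with data wins; the winners of all categories are summed."""
    by_category = defaultdict(list)
    for src in sources:
        by_category[src["category"]].append(src)

    total = [0.0] * n_cells
    for members in by_category.values():
        # Highest hierarchy first, so the first source with data wins the cell.
        ranked = sorted(members, key=lambda s: s["hierarchy"], reverse=True)
        for cell in range(n_cells):
            for src in ranked:
                value = src["flux"][cell]
                if value is not None:  # None = no data in this cell (assumption)
                    total[cell] += value
                    break
    return total


# Hypothetical sources on a three-cell grid.
sources = [
    {"name": "global_anthro", "category": 1, "hierarchy": 1,
     "flux": [1.0, 1.0, 1.0]},
    {"name": "regional_anthro", "category": 1, "hierarchy": 2,
     "flux": [None, 5.0, None]},  # covers only the middle cell
    {"name": "fire", "category": 2, "hierarchy": 1,
     "flux": [0.5, 0.5, 0.0]},
]
combined = combine_sources(sources, n_cells=3)
```

In the middle cell the regional inventory replaces the global one within category 1, and the fire category is added on top. Keeping sectors separate in the store is what makes this per-source choice possible at read time.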

2 Why a derived store, not just netCDF

netCDF is the right interchange format — but a build-once store that's appended to, partially read, and written in parallel wants a chunked, multi-array layout. The two aren't rivals: the store is Zarr internally and can still export netCDF slices for consumers that need them.

[Figure: a monolithic netCDF file (one .nc file, single grid; a read pulls the whole field for the time slice) next to a chunked Zarr v3 store (a read pulls only the chunks in the requested window).]
| Concern | Monolithic netCDF | Zarr v3 derived store |
|---|---|---|
| Add an inventory | rewrite / unlimited-dim hacks | add an array to a group |
| Many inventories + meshes | one grid per file → many files | groups in one logical store |
| Partial read | chunked hyperslab (netCDF-4 / HDF5), but whole-file open | windowed per-chunk, object-store-friendly |
| Parallel writes | mature via PnetCDF / HDF5 collective MPI-IO | lock-free, sharded, library-light |
| Object store (S3) | weak | native |
| Interchange ubiquity | lingua franca (CF conventions); every tool reads it | newer; export netCDF slices for legacy consumers |
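To make the partial-read comparison concrete, the sketch below computes which chunks of a chunked lat/lon array a bounding-box read would touch. It assumes a global regular grid whose first cell edges sit at lat -90 and lon -180, a bbox that does not cross the dateline, and hypothetical resolution and chunk sizes; `chunks_for_bbox` is not part of any Zarr API.

```python
import math


def chunks_for_bbox(lat_min, lat_max, lon_min, lon_max,
                    res_deg, chunk_lat, chunk_lon):
    """Return the (row, col) chunk indices that overlap a lat/lon bbox."""
    def chunk_range(lo, hi, origin, n_cells, chunk):
        # First and last grid cells touched by [lo, hi), clamped to the grid.
        first_cell = max(0, math.floor((lo - origin) / res_deg))
        last_cell = min(n_cells - 1, math.ceil((hi - origin) / res_deg) - 1)
        return range(first_cell // chunk, last_cell // chunk + 1)

    n_lat = round(180 / res_deg)
    n_lon = round(360 / res_deg)
    rows = chunk_range(lat_min, lat_max, -90.0, n_lat, chunk_lat)
    cols = chunk_range(lon_min, lon_max, -180.0, n_lon, chunk_lon)
    return [(r, c) for r in rows for c in cols]


# Hypothetical 0.5-degree grid (360 x 720 cells) in 90 x 90 chunks: 32 chunks.
selected = chunks_for_bbox(lat_min=30.0, lat_max=50.0,
                           lon_min=-10.0, lon_max=20.0,
                           res_deg=0.5, chunk_lat=90, chunk_lon=90)
total_chunks = (360 // 90) * (720 // 90)
```

For this regional window the read touches 4 of the 32 chunks; for a global target every chunk is selected and the saving disappears.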

3 Making it model-agnostic

Nothing in the store is MPAS- or MIEM-specific. Tier 2 is keyed on an abstract target-grid descriptor (a SCRIP/ESMF grid + hash) that each per-grid store embeds — sourced from the model's own grid file (MPAS static.nc, WRF geo_em, CAM-FV, …) but stored model-agnostically; the remap engine handles structured, unstructured, and cubed-sphere grids alike. Speciation stays with the consumer, so the store is mechanism-agnostic too.

[Figure: the inventory library (native lat/lon, Tier 1) and target-grid descriptors (SCRIP / ESMF; from MPAS static.nc, WRF, CAM-FV, FV3 cubed-sphere) feed the emissions store, which remaps in two parts (weights from ESMF / TempestRemap / UPTEMPO; apply + write Zarr + manifest index) and serves MPAS-A + MIEM (x1.* mesh), WRF-Chem (structured grid), CAM-chem (FV / SE grid), and FV3 / GOCART-2G (cubed-sphere).]
Any model that can describe its grid (SCRIP) and read per-cell flux uses the same store.
The contract

In: an inventory set (per sector), a target-grid descriptor (SCRIP/ESMF), a time window, a remap method, and the declared units + calendar. Out: a derived store of remapped per-cell flux — kept per sector and per species, with a lev dimension where the inventory has one — plus a manifest carrying provenance and mass-budget diagnostics. The store holds inventory species; mapping to a chemical mechanism is the consumer's job — MIEM applies the configured inventory→mechanism species map at runtime and combines sources HEMCO-style. MIEM is one consumer of many — WRF-Chem, CAM-chem, or a GOCART-2G host could read the same store, directly or via a netCDF export.

4 Open questions: build-time decisions

Conceptually settled above; these are implementation choices deferred to the build tool, noted here so they aren't mistaken for solved.

Deferred
  • Single-builder election (thundering herd). "Check-or-build" must keep N concurrent ranks from all rebuilding the same store. Options: rank-0-builds + barrier within an MPI job; an O_CREAT|O_EXCL lockfile / lease with stale-lock recovery across independent jobs. Atomic rename gives isolation, not coordination.
  • Unstructured cell ordering for windowed reads. To make a bbox read selective on a 1-D nCells array, Tier-2 cells would be permuted into a locality-preserving (space-filling-curve) order and chunked along it, with a bbox→chunk index built from the embedded cell coordinates. Moot for global runs.
  • Schema migration. When the store layout or manifest schema changes, the format version drives a documented migration path rather than an undefined read failure.
  • Temporal profiles — store-provided or host-owned? The consumer (MIEM) interpolates in time but does not apply diurnal / weekly / seasonal scaling profiles. Since the store owns the time dimension, it could provide them directly — either carry a base flux + normalized profile arrays (the consumer multiplies) or bake higher-resolution flux at build time. Local-time alignment is cheap: the store already carries each cell's longitude, so the UTC→local offset is just a per-cell attribute (≈ lon/15 h) and diurnal profiles apply in local solar time by convention (no time-zone or DST database). The open part is the policy: store-supplied profiles vs. host-owned.
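The local-time arithmetic in the last item is small enough to show directly. The sketch below assumes longitude in degrees east, a 24-entry hourly profile normalized to a mean of 1, and hypothetical function names; it illustrates only the lon/15 h offset and does not settle whether the store or the host applies the profile.

```python
def local_solar_hour(utc_hour, lon_deg):
    """Local solar hour in [0, 24) from UTC hour and longitude (degrees east)."""
    return (utc_hour + lon_deg / 15.0) % 24.0


def apply_diurnal(base_flux, profile_24, utc_hour, lon_deg):
    """Scale a base flux by the profile factor of the local-solar hour bin."""
    hour_bin = int(local_solar_hour(utc_hour, lon_deg)) % 24
    return base_flux * profile_24[hour_bin]


# Hypothetical profile: 0.5 at night, 1.5 from 06 to 18 local time (mean 1.0).
profile = [0.5] * 6 + [1.5] * 12 + [0.5] * 6

noon_greenwich = apply_diurnal(2.0, profile, utc_hour=12, lon_deg=0.0)
midnight_dateline = apply_diurnal(2.0, profile, utc_hour=12, lon_deg=180.0)
early_morning = local_solar_hour(12, -105.0)
```

At 12 UTC a cell at 0° sits at local noon and gets the daytime factor, while a cell at 180° sits at local midnight and gets the night factor, with no time-zone or DST lookup involved.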