Emissions tooling · concept

A build-once emissions store

Prepare offline inventories once — conservatively remapped to a model's grid and kept behind a provenance manifest — so the runtime just serves flux. A separate, model-agnostic tool; MIEM (or any host) is simply a consumer.

1 The concept

Two tiers and a manifest. Tier 1 is the canonical library of inventories on their native grids — the appendable source of truth. Tier 2 is per-grid data, conservatively remapped once and materialized. The manifest keys derived data to a provenance hash, so the runtime checks-or-builds and then only ever reads.
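As a minimal sketch of how a manifest might key derived data to a provenance hash, the snippet below hashes a canonical JSON description of the build inputs and looks the result up in a manifest dictionary. The function names, the manifest layout (a `stores` mapping), and the example checksum strings are all hypothetical and chosen only for illustration; the design fixes which inputs feed the hash, not this encoding.

```python
import hashlib
import json


def provenance_hash(schema_version, tier1_checksums, grid_hash,
                    remap_method, engine_version):
    """Hash the build inputs (never the float weights) into a reuse key."""
    payload = {
        "schema_version": schema_version,
        # Sorted so the key does not depend on the order inventories are listed.
        "tier1_checksums": sorted(tier1_checksums),
        "grid_hash": grid_hash,
        "remap_method": remap_method,
        "engine_version": engine_version,
    }
    # Canonical JSON: stable key order and separators give a stable digest.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_build(manifest, key):
    """True when the manifest has no derived store recorded for this key."""
    return key not in manifest.get("stores", {})


# Hypothetical inputs: same inventories in a different order, then a new grid.
key_a = provenance_hash(1, ["sha256:aaa", "sha256:bbb"], "grid:123",
                        "conservative", "engine-1.0")
key_b = provenance_hash(1, ["sha256:bbb", "sha256:aaa"], "grid:123",
                        "conservative", "engine-1.0")
key_c = provenance_hash(1, ["sha256:aaa", "sha256:bbb"], "grid:456",
                        "conservative", "engine-1.0")

manifest = {"stores": {key_a: {"path": "tier2/example"}}}
```

With these inputs, `key_a` and `key_b` are identical (listing order does not matter), while `key_c` differs because the grid descriptor changed, so `needs_build(manifest, key_c)` reports that a rebuild is required.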

Scope. Static, offline gridded inventories — anthropogenic, fire, and offline biogenic flux. Online emission schemes (online biogenic, dust, sea-salt, ocean DMS, lightning NO_x) are computed by the host model at runtime from its own meteorology and are out of scope here.

Flow: native inventories (Tier 1) → remapped / build (Tier 2) → manifest → runtime consumers

Build once (offline or on first call), serve many (every model step). The remap — the expensive part — happens exactly once per (inventory set × grid × time window).
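One way to picture "build once, serve many" is a check-or-build helper that builds into a temporary directory and publishes it with a same-filesystem rename. This is a sketch under stated assumptions: `check_or_build`, `fake_build`, and the directory-per-key layout are hypothetical, and the helper does not coordinate concurrent builders.

```python
import os
import tempfile


def check_or_build(store_root, key, build):
    """Return the store path for `key`, running `build` only if it is absent."""
    final = os.path.join(store_root, key)
    if os.path.isdir(final):
        return final  # already built: callers only read from here on
    os.makedirs(store_root, exist_ok=True)
    # The temp dir lives inside store_root so the rename stays on one filesystem.
    tmp = tempfile.mkdtemp(prefix=key + ".tmp.", dir=store_root)
    build(tmp)
    os.rename(tmp, final)  # readers see either no store or a complete one
    return final


calls = []


def fake_build(path):
    """Stand-in for the expensive remap; records that it ran."""
    calls.append(path)
    with open(os.path.join(path, "flux.txt"), "w") as f:
        f.write("remapped flux placeholder\n")


root = tempfile.mkdtemp()
first = check_or_build(root, "abc123", fake_build)
second = check_or_build(root, "abc123", fake_build)
```

The second call returns the same path without invoking the builder again, which is the property the runtime relies on at every model step.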

Design decisions we settled on

Store the data; don't fit profiles. Zarr codecs compress losslessly; reserve a base×temporal-profile form only for inventories natively given that way. Episodic sources (fire, volcanic) have no stable cycle — keep them stored.
Sectors are preserved, not pre-summed. CAMS-GLOB-ANT ships per sector; remap each sector once (weights are shared across sectors on a grid) and keep them separate — so each sector can be its own MIEM source (its own species map, scaling, category/hierarchy, vertical injection) with per-sector diagnostics. Sources are combined HEMCO-style (categories sum; highest hierarchy wins per cell) at read, not pre-summed at ingest.
Vertical structure is preserved. Inventories with an injection dimension — aircraft by flight level, volcanic by altitude — keep a lev coordinate; surface sources stay 2-D. Plume-rise / injection allocation is the consumer's job.
Time is a dimension, not a partition. Slice any window at read; the manifest carries a time range only as coverage metadata for reuse validity. Units and calendar (e.g. 365-day no-leap vs. standard Gregorian, with a leap-day policy) are fixed per array; temporal application stays with the consumer.
Conservation is explicit and verified. The build conserves the global mass integral (Σ flux×area) and records the area definition, edge/partial-cell normalization, and missing-value / negative-clip policy. Before/after mass-budget diagnostics go in the manifest; a mismatch beyond tolerance fails the build.
Generate weights once, apply many — own the Zarr write. Remapping splits in two: a proven engine (ESMF / TempestRemap / UPTEMPO) generates the conservative weight file — a stored, hashed artifact reused across every sector, species, and time step on that grid pair — and the store applies the weights (a sparse mat-vec, per chunk) and writes Zarr directly. No netCDF→Zarr round-trip: external remappers only emit netCDF, so the store owns the cheap application-and-write half.
Centralize Tier 1 so Tier-2 builds read only what they need. Today a remap reads a whole monolithic inventory file end-to-end — ≈415 MB for one CAMS-GLOB-ANT black-carbon species-year — to emit a ~90 MB remapped file, repeated per species / sector / year. A chunked, centralized Tier 1 lets the build pull only the chunks in scope (time window, species/sectors, and for a limited-area mesh just the domain), compute weights once per source-grid/mesh pair and reuse them across that inventory's species and sectors, stream in parallel, and rebuild incrementally when one piece changes.
Mesh is the build-once unit — descriptor embedded, not referenced. Each target grid gets its own Tier-2 store that embeds the resolved SCRIP/ESMF grid descriptor (cell centers, corners, areas, mask) — sourced from the model's grid file (MPAS static.nc, WRF geo_em, CAM-FV, …) but stored model-agnostically. Its content hash feeds the provenance hash, so a changed mesh forces a rebuild.
The provenance hash is input-based. It keys on schema version + Tier-1 content checksums + grid-descriptor hash + remap method + pinned remap-engine version, and deliberately not on the float weights, which are not bitwise-reproducible across parallel decompositions. Reuse validity tracks inputs, not outputs. Store and manifest carry a format version so readers fail loudly on an incompatible store.
Build to temp, atomic rename. On a POSIX filesystem (incl. Lustre / GPFS) a same-filesystem rename is atomic, so readers never see a half-built store. (Object stores have no atomic rename — revisit with a manifest-pointer flip if/when S3 comes into scope.)
Windowed reads compose for regional targets. A lat/lon bbox (+buffer) reads only the chunks overlapping a limited-area domain (e.g. WRF) — a read-side saving classic HEMCO doesn't do. For a global mesh every cell is in-domain, so the saving is moot; on an unstructured mesh it depends on a locality-preserving cell ordering (see open questions).
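The "generate weights once, apply many" and "conservation is explicit and verified" decisions can be illustrated together. The sketch below applies a sparse weight list as a mat-vec and then compares Σ flux×area before and after. It assumes weights of the form overlap area divided by destination-cell area, stored as (destination, source, weight) triples; the function names, the tolerance, and the toy grids are hypothetical.

```python
def apply_weights(weights, src_flux, n_dst):
    """Sparse mat-vec: dst[i] = sum over j of w[i][j] * src[j]."""
    dst = [0.0] * n_dst
    for row, col, w in weights:
        dst[row] += w * src_flux[col]
    return dst


def mass(flux, area):
    """Area-weighted integral, i.e. sum of flux * cell area."""
    return sum(f * a for f, a in zip(flux, area))


def check_conservation(src_mass, dst_mass, rel_tol=1e-9):
    """Return before/after diagnostics, or raise if the budgets disagree."""
    if abs(dst_mass - src_mass) > rel_tol * abs(src_mass):
        raise ValueError(
            f"mass budget mismatch: before={src_mass}, after={dst_mass}")
    return {"mass_before": src_mass, "mass_after": dst_mass}


# Toy 1-D example: two source cells (area 2 each) onto three destination cells.
src_flux = [1.0, 3.0]
src_area = [2.0, 2.0]
dst_area = [1.0, 2.0, 1.0]

# (dst index, src index, weight) with weight = overlap area / dst cell area.
weights = [
    (0, 0, 1.0),  # dst 0 lies entirely inside src 0
    (1, 0, 0.5),  # dst 1 straddles the boundary: half from src 0 ...
    (1, 1, 0.5),  # ... and half from src 1
    (2, 1, 1.0),  # dst 2 lies entirely inside src 1
]

dst_flux = apply_weights(weights, src_flux, n_dst=len(dst_area))
budget = check_conservation(mass(src_flux, src_area), mass(dst_flux, dst_area))
```

The same `weights` list would be reused for every sector, species, and time step on that grid pair; only `src_flux` changes. The `budget` dictionary is the kind of before/after diagnostic that would be written into the manifest.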

Flux is kept per sector and per species, with a vertical lev axis where the inventory has one (aircraft, volcanic). The consumer (MIEM) interpolates in time, applies the inventory→mechanism species map, and combines sources HEMCO-style at read — never pre-summed at ingest.
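The read-time combination rule (categories sum; highest hierarchy wins per cell) might look like the sketch below. It is an illustration rather than HEMCO's or MIEM's actual code: the source dictionaries, the use of `None` to mark cells a source does not cover, and the example numbers are assumptions.

```python
def combine(sources, n_cells):
    """Within a category the highest hierarchy wins per cell; categories sum."""
    total = [0.0] * n_cells
    for cell in range(n_cells):
        winners = {}  # category -> (hierarchy, flux) of the current winner
        for src in sources:
            value = src["flux"][cell]
            if value is None:
                continue  # this source has no data for the cell
            best = winners.get(src["category"])
            if best is None or src["hierarchy"] > best[0]:
                winners[src["category"]] = (src["hierarchy"], value)
        total[cell] = sum(flux for _, flux in winners.values())
    return total


sources = [
    # Global anthropogenic inventory: covers every cell, low hierarchy.
    {"category": 1, "hierarchy": 1, "flux": [1.0, 1.0, 1.0]},
    # Regional anthropogenic inventory: overrides the global one where present.
    {"category": 1, "hierarchy": 2, "flux": [None, 5.0, None]},
    # Fire emissions: a different category, so they add on top.
    {"category": 2, "hierarchy": 1, "flux": [0.5, 0.5, 0.0]},
]

combined = combine(sources, n_cells=3)
```

Here `combined` is `[1.5, 5.5, 1.0]`: the regional inventory replaces the global one only in the middle cell, and fire is added everywhere. Because sectors stay separate in the store, changing a hierarchy or scaling is a read-side choice and needs no rebuild.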

2 Why a derived store, not just netCDF

netCDF is the right interchange format — but a build-once store that's appended to, partially read, and written in parallel wants a chunked, multi-array layout. The two aren't rivals: the store is Zarr internally and can still export netCDF slices for consumers that need them.

Concern	Monolithic netCDF	Zarr v3 derived store
Add an inventory	rewrite / unlimited-dim hacks	add an array to a group
Many inventories + meshes	one grid per file → many files	groups in one logical store
Partial read	chunked hyperslab (netCDF-4 / HDF5), but whole-file open	windowed per-chunk, object-store-friendly
Parallel writes	mature via PnetCDF / HDF5 collective MPI-IO	lock-free when writers own disjoint chunks; sharded, library-light
Object store (S3)	weak	native
Interchange ubiquity	lingua franca (CF conventions); nearly every tool reads it	newer; export netCDF slices for legacy consumers

3 Making it model-agnostic

Nothing in the store is MPAS- or MIEM-specific. Tier 2 is keyed on an abstract target-grid descriptor (a SCRIP/ESMF grid + hash) that each per-grid store embeds — sourced from the model's own grid file (MPAS static.nc, WRF geo_em, CAM-FV, …) but stored model-agnostically; the remap engine handles structured, unstructured, and cubed-sphere grids alike. Speciation stays with the consumer, so the store is mechanism-agnostic too.

The contract

In: an inventory set (per sector), a target-grid descriptor (SCRIP/ESMF), a time window, a remap method, and the declared units + calendar. Out: a derived store of remapped per-cell flux — kept per sector and per species, with a lev dimension where the inventory has one — plus a manifest carrying provenance and mass-budget diagnostics. The store holds inventory species; mapping to a chemical mechanism is the consumer's job — MIEM applies the configured inventory→mechanism species map at runtime and combines sources HEMCO-style. MIEM is one consumer of many — WRF-Chem, CAM-chem, or a GOCART-2G host could read the same store, directly or via a netCDF export.

4 Open questions: build-time decisions

Conceptually settled above; these are implementation choices deferred to the build tool, noted here so they aren't mistaken for solved.

Deferred

Single-builder election (thundering herd). "Check-or-build" must keep N concurrent ranks from all rebuilding the same store. Options: rank-0-builds + barrier within an MPI job; an O_CREAT|O_EXCL lockfile / lease with stale-lock recovery across independent jobs. Atomic rename gives isolation, not coordination.
Unstructured cell ordering for windowed reads. To make a bbox read selective on a 1-D nCells array, Tier-2 cells would be permuted into a locality-preserving (space-filling-curve) order and chunked along it, with a bbox→chunk index built from the embedded cell coordinates. Moot for global runs.
Schema migration. When the store layout or manifest schema changes, the format version drives a documented migration path rather than an undefined read failure.
Temporal profiles — store-provided or host-owned? The consumer (MIEM) interpolates in time but does not apply diurnal / weekly / seasonal scaling profiles. Since the store owns the time dimension, it could provide them directly — either carry a base flux + normalized profile arrays (the consumer multiplies) or bake higher-resolution flux at build time. Local-time alignment is cheap: the store already carries each cell's longitude, so the UTC→local offset is just a per-cell attribute (≈ lon/15 h) and diurnal profiles apply in local solar time by convention (no time-zone or DST database). The open part is the policy: store-supplied profiles vs. host-owned.
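For the single-builder election across independent jobs, the O_CREAT|O_EXCL option can be sketched as follows. This is a minimal illustration, not a settled design: `try_acquire`, `release`, and the lockfile name are hypothetical, and stale-lock recovery (the hard part) is deliberately left out.

```python
import os
import tempfile


def try_acquire(lock_path):
    """Return True if this process created the lockfile, False if it exists."""
    try:
        # O_EXCL makes creation fail if the file is already there, so at most
        # one caller wins the election.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # owner id, a hook for stale-lock recovery
    return True


def release(lock_path):
    """Drop the lock once the build has been published."""
    os.remove(lock_path)


lock_dir = tempfile.mkdtemp()
lock = os.path.join(lock_dir, "abc123.lock")

first = try_acquire(lock)   # this caller becomes the builder
second = try_acquire(lock)  # a concurrent caller would wait or poll instead
release(lock)
third = try_acquire(lock)   # the lock is free again after release
```

A caller that loses the election would wait for the published store to appear rather than rebuild; how long to wait before declaring the lock stale is the policy question that remains open.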