The emissions store, explained for scientists

You know inventories, conservative remapping, and how to run chemistry models, and you know netCDF and GRIB. This page covers the parts that are new: what the store's two tiers actually hold, what Zarr v3 adds over the formats you already use, and how the design bears on practical concerns like global runs and reproducibility. It doesn't change what you control: at run time MIEM still applies the emissions by interpolating in time, applying your inventory→mechanism species map, and combining sources HEMCO-style.

1. What "Tier 1" and "Tier 2" actually are

Two layers and a record. The names are just before remap and after remap — plus the bookkeeping that connects them.

Tier 1: The inventory library, on native grids

The inventories you already use — CAMS-GLOB-ANT, FINN, CAMS-GLOB-BIO — stored as received, on their native lat/lon, per sector or category, and appendable. Nothing is remapped here. It's the canonical, versioned source of truth: one centralized place that holds the raw inventories — and, as the next section shows, that centralization is exactly what makes building Tier 2 fast.

Tier 2: Those inventories conservatively remapped onto your mesh, once

When you run on a given mesh (say a specific MPAS mesh), the store produces — the first time only — a Tier-2 copy: each inventory conservatively remapped to that mesh and saved as per-cell flux, kept per sector and species (and per level where the inventory has one, e.g. aircraft or volcanic). There is one Tier-2 store per target grid. This is what your run reads, every step — with no remap at run time.

Manifest: The record that ties Tier 2 back to Tier 1

A small record noting exactly which Tier-1 inventories (and versions), which grid, and which remap method produced a Tier-2 product. It's how the store knows whether an existing remap can be reused or must be rebuilt — and it's the provenance you can cite in a paper.
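As a rough illustration of how such a record could double as a reuse key, here is a minimal sketch. The field names (`inventories`, `grid`, `remap_method`), the placeholder version strings, and the hashing scheme are all assumptions made for illustration, not the store's actual manifest schema.

```python
import hashlib
import json

def reuse_key(manifest: dict) -> str:
    """Hash the manifest's canonical JSON form into a short, stable key."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Hypothetical manifest: names and version strings are placeholders.
manifest = {
    "inventories": [
        {"name": "CAMS-GLOB-ANT", "version": "vX"},
        {"name": "FINN", "version": "vY"},
    ],
    "grid": "example-mpas-mesh",
    "remap_method": "conservative",
}

existing_key = reuse_key(manifest)

# Same inputs give the same key, so the existing Tier-2 product can be reused.
same_again = reuse_key(json.loads(json.dumps(manifest)))

# A changed inventory version gives a different key, so Tier 2 is rebuilt.
bumped = json.loads(json.dumps(manifest))
bumped["inventories"][0]["version"] = "vZ"
bumped_key = reuse_key(bumped)
```

In this sketch the key depends on the order of the inventory list, so a real implementation would probably sort or otherwise normalize it first.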

[Figure: Tier 1 (inventory library: CAMS-GLOB-ANT, FINN, CAMS-GLOB-BIO; native lat/lon, per sector, as received, append-only) feeds a one-time conservative remap that produces Tier 2 (per-cell flux on your mesh, per sector / species; built once, served many). Your run, via MIEM, reads that flux every step. The manifest carries provenance and the reuse key.]
First run on a new mesh: the store remaps the Tier-1 inventories onto it and records how. Every later run on that mesh — yours or a colleague's — just reads the Tier-2 result.

2. Why a centralized Tier 1

Before it speeds anything up, Tier 1 fixes a more basic problem: where the inventories live, and whether you can trust which version you're using.

Today the same inventories tend to live as scattered files — copies on scratch, per project and per user, each processed by hand, with the provenance (which version, what scaling, which year) carried in scripts and people's memory. Tier 1 replaces that with one centralized, versioned library on native grids — the canonical source every build draws from.

[Figure: Today, scattered copies (CAMS-GLOB-ANT_2012.nc, CAMS…_2012_copy.nc, FINN_2012*.nc under /scratch/userA/emis/… and /scratch/userB/emis_v2/…): duplicated, processed ad hoc, provenance in scripts and memory. With Tier 1, a centralized inventory library: one source of truth on native grids, versioned with DOI, provenance attached, append-only, and the starting point for every build.]
Tier 1 turns scattered, hand-managed copies into one canonical, versioned library — the same starting point for every run and every build.
  • One source of truth. Everyone builds from the same inventories — no quietly divergent copies.
  • Provenance you can cite. Each inventory carries its version (and DOI), so a result traces back to exactly what produced it.
  • Append-only. Add a new inventory or year without touching or rewriting what's already there.
  • The precondition for fast builds. Because it's one chunked store, building Tier 2 can read just what it needs — the next section.

3. What creating Tier 2 looks like: today vs. a centralized store

This is where centralization pays off. Tier 1 being one chunked store — rather than scattered, whole-file inventories — is what makes building Tier 2 fast.

Today, each remapped product is produced by reading an inventory file end to end. For a single species-year of CAMS-GLOB-ANT black carbon that's about 415 MB read to write a ~90 MB remapped file — and every other species, sector, and year is another full pass, with the remap set up again each time. The cost scales with the number of files, not with how much actually changed.

[Figure: Today (UPTEMPO): CAMS BC, full year, global; a 415 MB whole-file read, a remap, and 90 MB on the mesh, repeated for every species, sector, and year with the remap set up again each time. With centralized Tier 1: read only the needed chunks from the chunked store, compute the weights once, and apply them to every species, in parallel.]
The expensive part of building Tier 2 — reading source data and setting up the remap — shrinks when Tier 1 is one chunked store: read just the chunks in scope, compute the conservative weights once, and apply them to every species and sector.

With a centralized, chunked Tier 1 the build reads only the chunks it needs, computes the conservative weights once per source-grid/mesh pair, and reuses them across every species and sector on that native grid:

| Building Tier 2 | Today (UPTEMPO) | Centralized Zarr v3 store |
| --- | --- | --- |
| Source | Separate whole-file inventories, per species / year | One centralized, chunked Tier-1 store |
| Read to build | Entire file end-to-end (≈415 MB for one BC species-year) | Only the chunks needed: time window, species / sectors, and (regional mesh) just the cells near your domain |
| Remap setup | Repeated per product | Weights computed once per source-grid → mesh pair (per method), reused across every species & sector on that grid (all of CAMS-GLOB-ANT); a different native grid (FINN, CAMS-GLOB-BIO) gets its own |
| Execution | Largely serial, whole arrays in memory | Streamed and parallel |
| One piece changes | Re-run the file | Rebuild just that piece (incremental) |
| Output | A 90 MB remapped file | Chunked Tier 2; the model later reads only its slice |

Why "centralized" matters

Pulling from one consistent, chunked source — instead of reassembling scattered monolithic files — is exactly what makes partial, parallel, and incremental builds possible. For a global mesh you still cover every source cell once, but you read each array a single time, weight it once, and reuse that across all of CAMS-GLOB-ANT's species and sectors — rather than a separate full-file pass and remap setup for each.
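The "weights once, apply many" pattern can be sketched on a one-dimensional analogue, where the overlap between a source cell and a target cell is just an interval length. This is a toy illustration under stated assumptions: the grids, species values, and function names are hypothetical, and a real build would compute polygon overlaps on the sphere rather than interval overlaps.

```python
def overlap_weights(src_edges, dst_edges):
    """Sparse weights: (dst_index, src_index) -> overlap length."""
    weights = {}
    for j in range(len(dst_edges) - 1):
        for i in range(len(src_edges) - 1):
            lo = max(src_edges[i], dst_edges[j])
            hi = min(src_edges[i + 1], dst_edges[j + 1])
            if hi > lo:
                weights[(j, i)] = hi - lo
    return weights

def apply_weights(weights, src_flux, dst_edges):
    """Overlap-weighted mean of the source flux over each target cell."""
    n_dst = len(dst_edges) - 1
    acc = [0.0] * n_dst
    for (j, i), w in weights.items():
        acc[j] += w * src_flux[i]
    return [acc[j] / (dst_edges[j + 1] - dst_edges[j]) for j in range(n_dst)]

src_edges = [0.0, 1.0, 2.0, 3.0, 4.0]   # four source cells of width 1
dst_edges = [0.0, 1.5, 4.0]             # two target cells, widths 1.5 and 2.5

# Computed once for this source-grid / target-grid pair.
weights = overlap_weights(src_edges, dst_edges)

# Hypothetical per-cell fluxes for two species on the same source grid.
species_flux = {
    "BC": [2.0, 4.0, 6.0, 8.0],
    "CO": [10.0, 10.0, 0.0, 0.0],
}

# The same weights are reused for every species.
remapped = {name: apply_weights(weights, flux, dst_edges)
            for name, flux in species_flux.items()}
```

The weights depend only on the two grids, which is why an inventory on a different native grid needs its own set while every species and sector on the same grid can share one.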

4. What Zarr v3 is, starting from what you already know

First, what it isn't: Zarr v3 is not another single-file binary format competing with netCDF or GRIB. It's a storage layout: a directory (or cloud-bucket prefix) of independently compressed array chunks plus small JSON metadata documents (a zarr.json for each array and group), not a single binary blob you open.

What it stores is the same thing netCDF does: named, dimensioned arrays with CF-style attributes. Only the on-disk organization differs. If you've used netCDF-4/HDF5 chunking, it's that idea taken to its conclusion: each chunk (or, with sharding, each bundle of chunks) is a separate object on disk.

That layout is the whole reason it's used here — a store that's built once and read on every model step wants exactly the things separate chunks give you, and that a single big file makes awkward:

  • Parallel build. Many ranks write different chunks at the same time, with no central library or file lock — so remapping a fine mesh, or many inventories, parallelizes on HPC.
  • Appendable. Add a new inventory, sector, or year by writing new arrays/chunks — without rewriting what's already there.
  • Read only what you need. A run pulls the time slice it's interpolating to, and each rank reads its own cells, without opening a whole monolithic file.
  • Object-store native. It works directly on S3-style storage when you want that — not required, but available.
  • v3 adds sharding. Many small chunks get bundled into a few files. That matters because a fine MPAS mesh can otherwise produce an enormous number of tiny chunks, which strains HPC/Lustre file counts.
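To make "read only what you need" and the sharding point concrete, the sketch below counts chunks and shards for a hypothetical array and lists the chunks one rank would touch. The array shape, chunk shape, and shard size are assumptions chosen for round numbers. The `c/<i>/<j>` key form follows the Zarr v3 default chunk-key encoding for an unsharded array; with sharding, the stored objects are the shards and the chunks are addressed inside them.

```python
import math

def chunks_for_request(chunk_shape, time_index, cell_start, cell_stop):
    """Chunk keys touched by one time index and a half-open cell range."""
    t_chunk = time_index // chunk_shape[0]
    first = cell_start // chunk_shape[1]
    last = (cell_stop - 1) // chunk_shape[1]
    return [f"c/{t_chunk}/{c}" for c in range(first, last + 1)]

# Hypothetical array: 12 monthly steps x 1,000,000 mesh cells.
shape = (12, 1_000_000)
chunk_shape = (1, 10_000)       # one time step, 10,000 cells per chunk
chunks_per_shard = (12, 10)     # assumed: each shard bundles 12 x 10 chunks

# Objects on disk without sharding (one per chunk) and with it (one per shard).
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunk_shape))
n_shards = math.prod(
    math.ceil(s / (c * k))
    for s, c, k in zip(shape, chunk_shape, chunks_per_shard)
)

# A rank that owns cells 250,000..274,999 and needs time step 6:
keys = chunks_for_request(chunk_shape, time_index=6,
                          cell_start=250_000, cell_stop=275_000)
```

Here the rank touches 3 of 1,200 chunks, and sharding reduces the object count from 1,200 to 10 for this one array.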
What it does not change

It's still CF-style metadata and named dimensions, still lossless, and still exports to netCDF for any tool or colleague that wants a file. Zarr isn't a new science format — it's a storage layout chosen for the build-once / serve-many access pattern. The numbers are identical to what a netCDF version would hold.

5. Where netCDF and GRIB fit (and why the store is neither)

You already work in both — the point isn't that one is better, it's that each was built for a different job, and the build-once store is a third job.

netCDF stays the interchange and export format: the store can emit netCDF slices for anything downstream, so nothing you hand off changes. It just isn't the working store, because a single growing, parallel-written, partially-read file is awkward to maintain. GRIB is built for compact, standardized operational meteorology — one message per field, mostly structured grids, commonly lossy bit-packing, code-table metadata. Excellent for moving and archiving weather; a poor fit for an appendable, lossless, unstructured-grid, random-access emissions store. If an inventory arrives as GRIB you read it in — you just wouldn't store it that way.

|  | netCDF (CF) | GRIB2 | Zarr v3 |
| --- | --- | --- | --- |
| Primary job | Interchange & archive | Operational met transport | Build-once store (here) |
| Layout | One file (HDF5) | Stream of messages, one field each | Chunked objects + sharding |
| Grids | Any (CF / UGRID) | Mainly structured; unstructured awkward | Any |
| Precision | Lossless | Commonly lossy (packed) | Lossless |
| Partial access | Hyperslab within a file | Per message | Per chunk (time / rank) |
| Parallel writes | HDF5 / PnetCDF collective I/O | n/a (it's a stream) | Chunk-parallel, no central lock |
| Metadata | Rich, self-describing | Numeric code tables | Rich, self-describing |
| Role in this system | Export format | Read on input if needed | The store itself |

Useful to know: cfgrib lets xarray open GRIB directly, and kerchunk can index a pile of GRIB (or netCDF) files so that xarray reads them as a Zarr-like virtual dataset. That is handy for pulling from legacy weather archives without converting them. It's a read-side bridge for inputs, not a reason to store the data in GRIB.

6. How it helps, and what stays in your hands

Global runs. A global mesh needs every cell, throughout the run, so this isn't about reading a spatial subset. The win is that the conservative remap onto your global mesh is done once, with a verified global mass budget behind it, and is then identical for every run and every rank — instead of being re-derived (and potentially re-derived differently) per run. At run time, each rank reads its own partition's cells in parallel, and only the time window it needs.

  • Reproducibility. A multi-year run, a re-run six months later, and a colleague's run on the same mesh all read byte-identical emissions — with the manifest recording exactly what produced them.
  • New inventories, meshes, sectors. Adding one is cheap and doesn't disturb existing data; a new mesh triggers one remap, then is stored like the rest.
  • Across models. MPAS-A, WRF-Chem, and CAM-chem can read from the same conservatively remapped source, rather than each maintaining its own emissions prep.
  • Conservation is checked. The build records a before/after mass budget; a remap that doesn't conserve to tolerance fails rather than silently biasing your totals.
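The conservation check in the last bullet could look roughly like the following. It is a minimal sketch: the function names, the exception type, and the relative tolerance are assumptions, and the actual build may define its budget and tolerance differently.

```python
import math

class ConservationError(RuntimeError):
    """Raised when a remap does not conserve total mass to tolerance."""

def total_mass(flux, area):
    """Sum of per-cell flux times cell area (fsum limits round-off)."""
    return math.fsum(f * a for f, a in zip(flux, area))

def check_mass_budget(src_flux, src_area, dst_flux, dst_area, rtol=1e-9):
    """Return (before, after) totals, or raise if they disagree."""
    before = total_mass(src_flux, src_area)
    after = total_mass(dst_flux, dst_area)
    if not math.isclose(before, after, rel_tol=rtol, abs_tol=0.0):
        raise ConservationError(f"mass before={before!r}, after={after!r}")
    return before, after

# Hypothetical numbers: four source cells merged pairwise into two target cells.
src_flux, src_area = [2.0, 4.0, 6.0, 8.0], [1.0, 1.0, 1.0, 1.0]
dst_area = [2.0, 2.0]
good_flux = [3.0, 7.0]    # area-weighted means, so mass is conserved
bad_flux = [3.0, 6.5]     # loses mass in the second target cell

budget = check_mass_budget(src_flux, src_area, good_flux, dst_area)

try:
    check_mass_budget(src_flux, src_area, bad_flux, dst_area)
    failed = False
except ConservationError:
    failed = True          # the build would stop here instead of writing Tier 2
```

Raising rather than warning matches the behaviour described above: a non-conserving remap fails the build instead of producing a silently biased product.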
What does not change

The store only owns the offline remap, the storage, and the provenance. At run time MIEM applies the emissions — it interpolates in time to the model step, applies the configured inventory→mechanism species map (with per-source scaling), and combines sources HEMCO-style (categories sum; within a category the highest hierarchy wins per cell). It does not remap, and it is not a chemistry solver. The store hands MIEM remapped inventory species per sector; the inventory→mechanism mapping is authored upstream (in MechanismConfiguration, resolved by musica) and applied by MIEM — still yours to control.
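The combination rule in parentheses (categories sum; within a category the highest hierarchy wins per cell) can be sketched as below. This is an illustration of the rule as stated here, not MIEM's code: the dictionary layout, the source names, and the use of `None` to mean "this source has no data in this cell" are assumptions.

```python
def combine_sources(sources, n_cells):
    """Per cell: highest hierarchy wins within a category; categories sum."""
    by_category = {}
    for src in sources:
        by_category.setdefault(src["category"], []).append(src)

    total = [0.0] * n_cells
    for members in by_category.values():
        # Visit higher hierarchy first so the first defined value wins.
        ranked = sorted(members, key=lambda s: s["hierarchy"], reverse=True)
        for cell in range(n_cells):
            for src in ranked:
                value = src["flux"][cell]
                if value is not None:      # None = no data in this cell
                    total[cell] += value
                    break
    return total

# Hypothetical sources on a three-cell mesh.
sources = [
    {"name": "global_anthro", "category": 1, "hierarchy": 1,
     "flux": [1.0, 1.0, 1.0]},
    {"name": "regional_anthro", "category": 1, "hierarchy": 2,
     "flux": [None, 5.0, None]},           # covers only the middle cell
    {"name": "fires", "category": 2, "hierarchy": 1,
     "flux": [0.5, 0.5, 0.0]},
]

combined = combine_sources(sources, n_cells=3)
```

In the middle cell the regional source replaces the global one within category 1, and the fire category is then added on top; elsewhere the global source is used.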

One thing this makes possible: temporal profiles. MIEM interpolates in time but doesn't apply diurnal / weekly / seasonal scaling. Because the store owns the time dimension — and already carries each cell's longitude — it could supply those directly: a base flux × a normalized profile, applied in local solar time (just a per-cell offset, ≈ lon/15 h — no time-zone database). It's a possibility the design enables, not yet decided (see the concept's open questions).

[Figure: monthly base flux (flat in time) × diurnal profile (in local solar time) = time-resolved flux. On a mesh, local solar time is just a per-cell offset (≈ longitude/15 h), not a time-zone lookup. A possibility the design enables; not yet decided.]
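A minimal sketch of the "base flux × normalized profile in local solar time" idea follows. Since this feature is undecided, everything here is hypothetical: the function names, the hourly binning, and the step-shaped profile are assumptions chosen only to show the longitude/15 h offset at work.

```python
def local_solar_hour(utc_hour, lon_deg):
    """Local solar hour in [0, 24): UTC plus a longitude/15 h offset."""
    return (utc_hour + lon_deg / 15.0) % 24.0

def time_resolved_flux(base_flux, lon_deg, utc_hour, profile):
    """Base flux scaled by the profile value for the cell's local solar hour."""
    # Hourly bins, no interpolation; the extra % 24 guards float round-off.
    hour = int(local_solar_hour(utc_hour, lon_deg)) % 24
    return base_flux * profile[hour]

# Hypothetical diurnal profile: 1.5 from 06:00 to 18:00 local, 0.5 otherwise.
# The 24 hourly factors average to 1, so the daily total is preserved.
profile = [1.5 if 6 <= h < 18 else 0.5 for h in range(24)]

base = 2.0        # monthly-mean flux, arbitrary units
noon_utc = 12.0

at_greenwich = time_resolved_flux(base, 0.0, noon_utc, profile)    # local 12:00
at_dateline = time_resolved_flux(base, 180.0, noon_utc, profile)   # local 00:00
at_west = time_resolved_flux(base, -90.0, noon_utc, profile)       # local 06:00
```

Because the offset is a function of longitude alone, it can be precomputed per cell alongside the flux, which is the reason no time-zone database is needed.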

7. Bottom line

Tier 1 is your inventories as they come; Tier 2 is those same inventories conservatively remapped onto your mesh, computed once and reused; the manifest is the provenance that links them. Zarr v3 is just the storage layout that makes build-once / serve-many cheap — same CF arrays, lossless, still exportable to netCDF. netCDF stays the way you share data and GRIB stays a weather-side input format; neither is the store. The payoff is remap-once economics, reproducible global runs, and one emissions source across the models that read it — without changing the chemistry you already control.

Full design, trade-offs, and the open build-time questions: A build-once emissions store (technical) →