The emissions store, explained for scientists
You know inventories, conservative remapping, and how to run chemistry models — and you know netCDF and GRIB. This covers the parts that are new: what the store's two tiers actually hold, what Zarr v3 adds over the formats you already use, and how it bears on real concerns like global runs and reproducibility. It doesn't change what you control: at run time MIEM still applies the emissions — interpolating in time, applying your inventory→mechanism species map, and combining sources HEMCO-style.
1. What "Tier 1" and "Tier 2" actually are
Two layers and a record. The names are just before remap and after remap — plus the bookkeeping that connects them.
Tier 1: The inventory library, on native grids
The inventories you already use — CAMS-GLOB-ANT, FINN, CAMS-GLOB-BIO — stored as received, on their native lat/lon, per sector or category, and appendable. Nothing is remapped here. It's the canonical, versioned source of truth: one centralized place that holds the raw inventories — and, as the next section shows, that centralization is exactly what makes building Tier 2 fast.
Tier 2: Those inventories conservatively remapped onto your mesh, once
When you run on a given mesh (say a specific MPAS mesh), the store produces — the first time only — a Tier-2 copy: each inventory conservatively remapped to that mesh and saved as per-cell flux, kept per sector and species (and per level where the inventory has one, e.g. aircraft or volcanic). There is one Tier-2 store per target grid. This is what your run reads, every step — with no remap at run time.
Manifest: The record that ties Tier 2 back to Tier 1
A small record noting exactly which Tier-1 inventories (and versions), which grid, and which remap method produced a Tier-2 product. It's how the store knows whether an existing remap can be reused or must be rebuilt — and it's the provenance you can cite in a paper.
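A minimal sketch of how such a record could drive the reuse-or-rebuild decision. The `Manifest` class, its field names, the `can_reuse` helper, and the version strings are all hypothetical placeholders, not the store's actual schema.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Manifest:
    """Hypothetical manifest record; field names are illustrative only."""
    inventories: tuple  # (name, version) pairs drawn from Tier 1
    grid_id: str        # identifier of the target mesh
    remap_method: str   # e.g. "conservative"

    def fingerprint(self) -> str:
        # Sort the inventories so the fingerprint ignores listing order.
        payload = asdict(self)
        payload["inventories"] = sorted(list(pair) for pair in self.inventories)
        blob = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()


def can_reuse(existing: Manifest, requested: Manifest) -> bool:
    """An existing Tier-2 product is reusable only if every input matches."""
    return existing.fingerprint() == requested.fingerprint()


# Placeholder versions ("v1", "v2") and mesh name, for illustration.
built = Manifest((("CAMS-GLOB-ANT", "v1"), ("FINN", "v1")), "mesh-A", "conservative")
same = Manifest((("FINN", "v1"), ("CAMS-GLOB-ANT", "v1")), "mesh-A", "conservative")
newer = Manifest((("CAMS-GLOB-ANT", "v2"), ("FINN", "v1")), "mesh-A", "conservative")

print(can_reuse(built, same))   # same inputs, different listing order
print(can_reuse(built, newer))  # one inventory version changed
```

A changed inventory version, grid, or remap method yields a different fingerprint, which is the signal to rebuild instead of reusing.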
2. Why a centralized Tier 1
Before it speeds anything up, Tier 1 fixes a more basic problem: where the inventories live, and whether you can trust which version you're using.
Today the same inventories tend to live as scattered files — copies on scratch, per project and per user, each processed by hand, with the provenance (which version, what scaling, which year) carried in scripts and people's memory. Tier 1 replaces that with one centralized, versioned library on native grids — the canonical source every build draws from.
- One source of truth. Everyone builds from the same inventories — no quietly divergent copies.
- Provenance you can cite. Each inventory carries its version (and DOI), so a result traces back to exactly what produced it.
- Append-only. Add a new inventory or year without touching or rewriting what's already there.
- The precondition for fast builds. Because it's one chunked store, building Tier 2 can read just what it needs — the next section.
3. What creating Tier 2 looks like: today vs. a centralized store
This is where centralization pays off. Tier 1 being one chunked store — rather than scattered, whole-file inventories — is what makes building Tier 2 fast.
Today, each remapped product is produced by reading an inventory file end to end. For a single species-year of CAMS-GLOB-ANT black carbon, that means reading about 415 MB to write a ~90 MB remapped file, and every other species, sector, and year is another full pass, with the remap set up again each time. The cost scales with the number of files, not with how much actually changed.
With a centralized, chunked Tier 1 the build reads only the chunks it needs, computes the conservative weights once per source-grid/mesh pair, and reuses them across every species and sector on that native grid:
| Building Tier 2 | Today (UPTEMPO) | Centralized Zarr v3 store |
|---|---|---|
| Source | Separate whole-file inventories, per species / year | One centralized, chunked Tier-1 store |
| Read to build | Entire file end-to-end (≈415 MB for one BC species-year) | Only the chunks needed — time window, species / sectors, and (regional mesh) just the cells near your domain |
| Remap setup | Repeated per product | Weights computed once per source-grid → mesh pair (per method), reused across every species & sector on that grid (all of CAMS-GLOB-ANT); a different native grid (FINN, CAMS-GLOB-BIO) gets its own |
| Execution | Largely serial, whole arrays in memory | Streamed and parallel |
| One piece changes | Re-run the file | Rebuild just that piece (incremental) |
| Output | A 90 MB remapped file | Chunked Tier 2 — the model later reads only its slice |
Pulling from one consistent, chunked source — instead of reassembling scattered monolithic files — is exactly what makes partial, parallel, and incremental builds possible. For a global mesh you still cover every source cell once, but you read each array a single time, weight it once, and reuse that across all of CAMS-GLOB-ANT's species and sectors — rather than a separate full-file pass and remap setup for each.
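The "weights once, reused across species" idea can be sketched in one dimension. This is a toy stand-in, not the store's remapper: cells are intervals, so the weights are overlap lengths, where a real conservative remap would use intersection areas between the inventory grid and the mesh. The function names and flux values are made up for illustration.

```python
def overlap_weights(src_edges, dst_edges):
    """Sparse weights w[(dst, src)] = overlap length of a src cell with a dst cell."""
    weights = {}
    for j in range(len(dst_edges) - 1):
        for i in range(len(src_edges) - 1):
            lo = max(src_edges[i], dst_edges[j])
            hi = min(src_edges[i + 1], dst_edges[j + 1])
            if hi > lo:
                weights[(j, i)] = hi - lo
    return weights


def remap_flux(weights, src_flux, dst_edges):
    """Overlap-weighted mean flux per destination cell (preserves total mass)."""
    n_dst = len(dst_edges) - 1
    mass = [0.0] * n_dst
    for (j, i), w in weights.items():
        mass[j] += w * src_flux[i]
    return [mass[j] / (dst_edges[j + 1] - dst_edges[j]) for j in range(n_dst)]


src_edges = [0.0, 1.0, 2.0, 3.0, 4.0]  # four native-grid cells
dst_edges = [0.0, 1.5, 4.0]            # two mesh cells

# Computed once for this source-grid / mesh pair ...
weights = overlap_weights(src_edges, dst_edges)

# ... then reused for every species on that native grid (illustrative values).
species = {"BC": [1.0, 2.0, 3.0, 4.0], "CO": [10.0, 0.0, 5.0, 5.0]}
remapped = {name: remap_flux(weights, flux, dst_edges) for name, flux in species.items()}
```

The expensive geometric step (`overlap_weights`) runs once; each additional species or sector costs only the cheap weighted sum.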
4. What Zarr v3 is, starting from what you already know
First, what it isn't: Zarr v3 is not another binary file format competing with netCDF or GRIB. It's a storage layout — a directory (or cloud-bucket prefix) of independently compressed array chunks plus a small JSON metadata file, not a single binary blob you open. The arrays inside are the same kind netCDF holds; only the on-disk organization differs.
Zarr stores the same thing netCDF does: named, dimensioned, CF-described arrays. If you've used netCDF-4/HDF5 chunking, it's that idea taken to its conclusion: each chunk becomes a separate object on disk, instead of a region inside one monolithic file.
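To make "each chunk is a separate object" concrete, here is a small sketch that lists which chunk objects a given read would touch. It assumes Zarr v3's default chunk key style (`c/i/j`) with no sharding; the array shape, chunk shape, and the `chunk_keys` helper are hypothetical.

```python
def chunk_keys(shape, chunk_shape, selection):
    """Chunk keys touched by a selection, in the default "c/i/j" key style.

    `selection` holds one (start, stop) pair per dimension, stop exclusive.
    """
    ranges = []
    for size, chunk, (start, stop) in zip(shape, chunk_shape, selection):
        if not 0 <= start < stop <= size:
            raise ValueError("selection out of bounds")
        ranges.append(range(start // chunk, (stop - 1) // chunk + 1))
    keys = ["c"]
    for r in ranges:
        keys = [f"{key}/{idx}" for key in keys for idx in r]
    return keys


# Hypothetical Tier-2 array: 12 monthly steps x 1,000,000 mesh cells,
# chunked as one month x 250,000 cells (48 chunks in total).
shape = (12, 1_000_000)
chunk_shape = (1, 250_000)

# One rank needs time indices 5 and 6 for cells 300,000..499,999.
needed = chunk_keys(shape, chunk_shape, [(5, 7), (300_000, 500_000)])
print(needed)
```

Only the listed objects are opened and decompressed; the other chunks of the array are never read.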
That layout is the whole reason it's used here — a store that's built once and read on every model step wants exactly the things separate chunks give you, and that a single big file makes awkward:
- Parallel build. Many ranks write different chunks at the same time, with no central library or file lock — so remapping a fine mesh, or many inventories, parallelizes on HPC.
- Appendable. Add a new inventory, sector, or year by writing new arrays/chunks — without rewriting what's already there.
- Read only what you need. A run pulls the time slice it's interpolating to, and each rank reads its own cells, without opening a whole monolithic file.
- Object-store native. It works directly on S3-style storage when you want that — not required, but available.
- v3 adds sharding. Many small chunks get bundled into a few files. That matters because a fine MPAS mesh can otherwise produce an enormous number of tiny chunks, which strains HPC/Lustre file counts.
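A back-of-the-envelope sketch of why sharding matters for file counts. The array shape, chunk shape, and shard grouping below are invented for illustration; real values depend on the mesh and on how the store is configured.

```python
import math


def object_counts(shape, chunk_shape, chunks_per_shard):
    """Return (number of chunks, number of shard objects) for a regular grid."""
    chunks_per_dim = [math.ceil(s / c) for s, c in zip(shape, chunk_shape)]
    shards_per_dim = [
        math.ceil(n / k) for n, k in zip(chunks_per_dim, chunks_per_shard)
    ]
    return math.prod(chunks_per_dim), math.prod(shards_per_dim)


# Hypothetical: one year of hourly data on a 2,000,000-cell mesh,
# chunked as 24 hours x 10,000 cells, with 73 x 20 chunks per shard.
n_chunks, n_shards = object_counts(
    shape=(8760, 2_000_000),
    chunk_shape=(24, 10_000),
    chunks_per_shard=(73, 20),
)
print(n_chunks, n_shards)
```

Readers still fetch individual chunks, but the filesystem sees the much smaller number of shard files.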
It's still CF-style metadata and named dimensions, still lossless, and still exports to netCDF for any tool or colleague that wants a file. Zarr isn't a new science format — it's a storage layout chosen for the build-once / serve-many access pattern. The numbers are identical to what a netCDF version would hold.
5. Where netCDF and GRIB fit (and why the store is neither)
You already work in both — the point isn't that one is better, it's that each was built for a different job, and the build-once store is a third job.
netCDF stays the interchange and export format: the store can emit netCDF slices for anything downstream, so nothing you hand off changes. It just isn't the working store, because a single growing, parallel-written, partially-read file is awkward to maintain.

GRIB is built for compact, standardized operational meteorology: one message per field, mostly structured grids, commonly lossy bit-packing, code-table metadata. Excellent for moving and archiving weather; a poor fit for an appendable, lossless, unstructured-grid, random-access emissions store. If an inventory arrives as GRIB you read it in; you just wouldn't store it that way.
| | netCDF (CF) | GRIB2 | Zarr v3 |
|---|---|---|---|
| Primary job | Interchange & archive | Operational met transport | Build-once store (here) |
| Layout | One file (HDF5) | Stream of messages, one field each | Chunked objects + sharding |
| Grids | Any (CF / UGRID) | Mainly structured; unstructured awkward | Any |
| Precision | Lossless | Commonly lossy (packed) | Lossless |
| Partial access | Hyperslab within a file | Per message | Per chunk (time / rank) |
| Parallel writes | HDF5 / PnetCDF collective I/O | n/a (it's a stream) | Chunk-parallel, no central lock |
| Metadata | Rich, self-describing | Numeric code tables | Rich, self-describing |
| Role in this system | Export format | Read on input if needed | The store itself |
Useful to know: tools like cfgrib and kerchunk can index a pile of GRIB (or netCDF) so that xarray reads it as a Zarr-like virtual dataset, which is handy for pulling from legacy weather archives without converting them. That's a read-side bridge for inputs, not a reason to store the data in GRIB.
6. How it helps, and what stays in your hands
- Global runs. A global mesh needs every cell, throughout the run, so this isn't about reading a spatial subset. The win is that the conservative remap onto your global mesh is done once, with a verified global mass budget behind it, and is then identical for every run and every rank, instead of being re-derived (and potentially re-derived differently) per run. At run time, each rank reads its own partition's cells in parallel, and only the time window it needs.
- Reproducibility. A multi-year run, a re-run six months later, and a colleague's run on the same mesh all read byte-identical emissions — with the manifest recording exactly what produced them.
- New inventories, meshes, sectors. Adding one is cheap and doesn't disturb existing data; a new mesh triggers one remap, then is stored like the rest.
- Across models. MPAS-A, WRF-Chem, and CAM-chem can read from the same conservatively remapped source, rather than each maintaining its own emissions prep.
- Conservation is checked. The build records a before/after mass budget; a remap that doesn't conserve to tolerance fails rather than silently biasing your totals.
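The conservation check in the last point could look roughly like this. It is a sketch under assumptions: the `check_mass_budget` name, the relative tolerance of 1e-9, and the numbers are illustrative, and a real build would integrate over actual cell areas on the sphere.

```python
import math


def check_mass_budget(src_flux, src_area, dst_flux, dst_area, rel_tol=1e-9):
    """Compare area-integrated totals before and after a remap.

    Raises ValueError when the relative difference exceeds `rel_tol`
    (an assumed tolerance), so a bad remap fails loudly.
    """
    before = math.fsum(f * a for f, a in zip(src_flux, src_area))
    after = math.fsum(f * a for f, a in zip(dst_flux, dst_area))
    scale = max(abs(before), abs(after))
    rel_err = abs(after - before) / scale if scale else 0.0
    if rel_err > rel_tol:
        raise ValueError(f"remap not conservative: relative error {rel_err:.3e}")
    return {"before": before, "after": after, "rel_err": rel_err}


src_flux, src_area = [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0]
dst_area = [1.5, 2.5]

# A conservative result: totals agree, and the budget is returned for the record.
budget = check_mass_budget(src_flux, src_area, [2.0 / 1.5, 8.0 / 2.5], dst_area)

# A slightly biased result: the check refuses it.
try:
    check_mass_budget(src_flux, src_area, [1.3, 3.2], dst_area)
    rejected = False
except ValueError:
    rejected = True
```

Returning the before/after totals lets the build store them alongside the manifest, while the exception stops a non-conserving product from ever being written.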
The store only owns the offline remap, the storage, and the provenance. At run time MIEM applies the emissions — it interpolates in time to the model step, applies the configured inventory→mechanism species map (with per-source scaling), and combines sources HEMCO-style (categories sum; within a category the highest hierarchy wins per cell). It does not remap, and it is not a chemistry solver. The store hands MIEM remapped inventory species per sector; the inventory→mechanism mapping is authored upstream (in MechanismConfiguration, resolved by musica) and applied by MIEM — still yours to control.
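The combination rule (categories sum; within a category the highest hierarchy wins per cell) can be illustrated with a small sketch. This is not MIEM's or HEMCO's code: the dictionary layout, the use of `None` to mean "this source has no coverage in this cell", and the source names are assumptions made for the example.

```python
def combine_sources(sources, n_cells):
    """Sum across categories; inside a category, highest hierarchy wins per cell."""
    total = [0.0] * n_cells
    for category in sorted({s["category"] for s in sources}):
        members = sorted(
            (s for s in sources if s["category"] == category),
            key=lambda s: s["hierarchy"],
        )
        winner = [0.0] * n_cells
        # Ascending hierarchy: a later (higher) source overwrites earlier ones
        # wherever it has coverage.
        for src in members:
            for cell, value in enumerate(src["flux"]):
                if value is not None:
                    winner[cell] = value
        for cell in range(n_cells):
            total[cell] += winner[cell]
    return total


sources = [
    {"name": "global_anthro", "category": 1, "hierarchy": 1, "flux": [1.0, 1.0, 1.0]},
    {"name": "regional_anthro", "category": 1, "hierarchy": 2, "flux": [None, 5.0, None]},
    {"name": "fires", "category": 2, "hierarchy": 1, "flux": [0.5, 0.5, 0.0]},
]

combined = combine_sources(sources, n_cells=3)
print(combined)
```

In the middle cell the regional inventory replaces the global one within category 1, and the fire emissions from category 2 are then added on top.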
One thing this makes possible: temporal profiles. MIEM interpolates in time but doesn't apply diurnal / weekly / seasonal scaling. Because the store owns the time dimension — and already carries each cell's longitude — it could supply those directly: a base flux × a normalized profile, applied in local solar time (just a per-cell offset, ≈ lon/15 h — no time-zone database). It's a possibility the design enables, not yet decided (see the concept's open questions).
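A sketch of that possibility, assuming longitude in degrees east and an hourly profile normalized to a mean of 1. The profile values and function names are made up; nothing here reflects a decided design.

```python
def local_solar_hour(utc_hour, lon_deg):
    """Local solar time in hours, using the lon/15 h offset (degrees east)."""
    return (utc_hour + lon_deg / 15.0) % 24.0


def scaled_flux(base_flux, profile, utc_hour, lon_deg):
    """Base flux times the profile factor for the cell's local solar hour."""
    hour = int(local_solar_hour(utc_hour, lon_deg)) % 24  # guard the 24.0 edge
    return base_flux * profile[hour]


# Made-up diurnal shape: higher from 06 to 18 local solar time.
raw = [0.5] * 6 + [1.5] * 12 + [0.5] * 6
mean = sum(raw) / len(raw)
profile = [value / mean for value in raw]  # normalized so the daily mean is 1

# Same UTC instant, two cells at different longitudes.
at_greenwich = scaled_flux(2.0, profile, utc_hour=12, lon_deg=0.0)
at_150_east = scaled_flux(2.0, profile, utc_hour=12, lon_deg=150.0)
print(at_greenwich, at_150_east)
```

Because the profile averages to 1 over the day, the daily total of the base flux is unchanged; only its distribution across hours shifts with longitude.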
7. Bottom line
Tier 1 is your inventories as they come; Tier 2 is those same inventories conservatively remapped onto your mesh, computed once and reused; the manifest is the provenance that links them. Zarr v3 is just the storage layout that makes build-once / serve-many cheap — same CF arrays, lossless, still exportable to netCDF. netCDF stays the way you share data and GRIB stays a weather-side input format; neither is the store. The payoff is remap-once economics, reproducible global runs, and one emissions source across the models that read it — without changing the chemistry you already control.
Full design, trade-offs, and the open build-time questions: A build-once emissions store (technical) →