Data sources

Three sources are fetched into the same schema. When the same observation appears in more than one, a deterministic client-side dedup tiebreak picks one survivor. The tiebreak runs in your process (there is no hosted, pre-ranked record) and prefers sources in this order, highest priority first:

priority  source  origin                              when it's rawest
────────  ──────  ──────────────────────────────────  ──────────────────────
   3      AWC     aviationweather.gov                 Live feed, sensor output
   2      IEM     Iowa Environmental Mesonet          Near-real-time archive
   1      GHCNh   NOAA hourly archive                 Historical, quality-flagged

For any given (station, observed_at), the rawest source that has the observation (the highest priority number) is the dedup survivor. Within a single source, the first-reported row survives. The result is one row per observation after the tiebreak: no duplicates, no silent overwrites.

01 · Why this order

AWC is the sensor. When an airport publishes a METAR, AWC has it within seconds. The string stored in raw_metar is the exact text the station broadcast. Nothing between the thermistor and you.

IEM is a trustworthy mirror. The Iowa Environmental Mesonet re-publishes METARs from NOAA’s feed. Timing is near-real-time (minutes, not seconds) and the metar column contains the same raw string — we parse it ourselves rather than trusting IEM’s pre-decoded columns, so the transform is identical to AWC’s.

GHCNh is the quality-flagged archive. NOAA’s Global Historical Climatology Network – hourly is authoritative for historical analysis, but it’s downstream of both AWC and IEM. It carries quality codes (0/1/4/5 good; I/P/R/U flagged) — we filter to raw-only. If there’s a disagreement between GHCNh and AWC, GHCNh is the one that’s been through someone’s quality pipeline. We prefer the rawer number.
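Parsing the raw string ourselves, as described for IEM above, means one parser can serve both AWC and IEM. The sketch below is a minimal illustration of that idea, not the SDK's parser: the function names are hypothetical, the sample METAR in the usage note is made up, and it only extracts the temperature/dew point group, which the METAR body reports in whole degrees Celsius.

```python
import re

# Temperature/dew point group such as "29/18" or "M05/M10" (M means minus),
# matched only as a whitespace-delimited token.
_TEMP_GROUP = re.compile(r"(?<!\S)(M?\d{2})/(M?\d{2})?(?!\S)")


def _to_celsius(token):
    """Convert a METAR temperature token like '29' or 'M05' to an int."""
    return -int(token[1:]) if token.startswith("M") else int(token)


def parse_temp_c(raw_metar):
    """Return (temp_c, dewpoint_c) from a raw METAR string, or None if absent.

    Hypothetical helper for illustration; the dew point may be missing,
    in which case it is returned as None.
    """
    match = _TEMP_GROUP.search(raw_metar)
    if match is None:
        return None
    temp, dew = match.groups()
    return _to_celsius(temp), (_to_celsius(dew) if dew else None)
```

For example, `parse_temp_c("KJFK 151451Z 18010KT 10SM FEW050 29/18 A3001")` yields the pair 29 and 18, regardless of whether the string came from the raw_metar field or IEM's metar column.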

02 · Raw-as-reported, no corrections

The first observation wins. Forever.

If a station publishes a METAR at 14:51Z saying the temperature is 52°F, and at 15:03Z publishes a correction saying it was actually 51°F — the 52°F observation is what the SDK returns. It does not substitute the later correction.

This is counterintuitive for anyone coming from traditional data engineering, where data quality means “fix errors as soon as you find them.” In settlement, data quality means “reproduce the record the market was trading on.”

When a Kalshi weather contract settled based on a 14:51Z METAR, it settled on 52°F — not 51°F. If your backtest data contains 51°F, you are modeling a market that never existed.

The rule: dedup comparison uses strict >, not >=. A later report never overwrites an earlier one with the same observation timestamp.
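A minimal sketch of how such a tiebreak could be written, assuming the comparison is made on the source priority from the table above and that rows are processed in arrival order. The `Obs` type, `SOURCE_PRIORITY` map, and `dedup` function are hypothetical names for illustration, not the SDK's internals.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Assumed priority map mirroring the source table: higher number = rawer source.
SOURCE_PRIORITY = {"AWC": 3, "IEM": 2, "GHCNh": 1}


@dataclass(frozen=True)
class Obs:
    station: str
    observed_at: datetime
    source: str
    temp_f: float


def dedup(rows):
    """Keep one survivor per (station, observed_at); rows are in arrival order."""
    survivors = {}
    for row in rows:
        key = (row.station, row.observed_at)
        kept = survivors.get(key)
        # Strict ">": a later row of equal priority (for example a correction
        # from the same source) never replaces the row that arrived first.
        if kept is None or SOURCE_PRIORITY[row.source] > SOURCE_PRIORITY[kept.source]:
            survivors[key] = row
    return list(survivors.values())
```

With `>=` in place of `>`, the 15:03Z correction in the example above would replace the 52°F row. The strict comparison is what keeps the first report in place.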

03 · The three-layer dedup tiebreak in practice

When fetched feeds overlap, the dedup runs in your process. Imagine NYC at 2024-07-15 14:51Z, temperature 85°F.

AWC (tiebreak 3) publishes it at 14:51:12Z, within seconds of the station broadcast. Survives.
IEM (tiebreak 2) publishes the same METAR at 14:53:00Z. Deduped — AWC already covers it.
GHCNh (tiebreak 1) eventually lands in the monthly upstream archive, possibly with a quality flag. Deduped.

Now imagine a network hiccup where AWC never receives that METAR.

AWC has nothing for 14:51Z.
IEM publishes at 14:53:00Z. Survives by default.
GHCNh lands later. Deduped.

Or the worst case — the station’s outbound link was down for an hour.

Only GHCNh carries the record, weeks later.
GHCNh survives by default, and the row's source: "GHCNh" field tells callers where it came from.

You can always tell which source produced a row by reading its source field. Never assume that all rows for a station come from the same source.
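One way to check the source mix before relying on it is to count rows per source for each station. This is a small sketch that assumes rows are plain dicts with station and source keys; the `sources_by_station` helper is hypothetical and not part of the SDK.

```python
from collections import Counter, defaultdict


def sources_by_station(rows):
    """Count how many surviving rows each source contributed, per station."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["station"]][row["source"]] += 1
    # Plain dicts are easier to print and compare than nested Counters.
    return {station: dict(per_source) for station, per_source in counts.items()}


rows = [
    {"station": "NYC", "observed_at": "2024-07-15T14:51Z", "source": "AWC"},
    {"station": "NYC", "observed_at": "2024-07-15T15:51Z", "source": "IEM"},
    {"station": "NYC", "observed_at": "2024-07-15T16:51Z", "source": "AWC"},
]
mix = sources_by_station(rows)
```

A station whose mix shows more than one source is the normal case after a feed gap, not an error.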

04 · Downstream implications

as_of queries are safe. When you filter observed_at <= as_of, you are asking “what did the market know at this time?” — and because we don’t overwrite with later corrections, the answer is stable.
Caching is safe. A row with (station, observed_at) never changes. You can cache forever.
Deterministic versioning works. client.data_version() returns a token derived from the observation set. Same token → same data, across SDK versions, across machines.
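The three properties above can be illustrated with a short sketch. It assumes rows are dicts with station, observed_at, source, and temp_f keys, and it derives a token by hashing a canonical ordering of the rows. Both helpers are hypothetical; how client.data_version() actually computes its token is not specified here.

```python
import hashlib
import json
from datetime import datetime, timezone


def as_of_view(rows, as_of):
    """Return only the observations made at or before the as_of cutoff."""
    return [r for r in rows if r["observed_at"] <= as_of]


def version_token(rows):
    """Hash a canonical, order-independent form of the observation set."""
    canonical = sorted(
        (r["station"], r["observed_at"].isoformat(), r["source"], r["temp_f"])
        for r in rows
    )
    payload = json.dumps(canonical, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Because rows are never rewritten after the tiebreak, the same as_of cutoff selects the same rows on every run, and a token built this way stays the same regardless of fetch order.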

05 · Forecast sources (a separate layer)

Observations are what happened; forecasts are what was predicted, and each forecast row carries its own source identity, separate from observations. Settlement never touches a forecast — but training and feature engineering do, which is exactly where leakage hides.

Two forecast sources ship today:

source      origin                          coverage           leakage model
──────────  ──────────────────────────────  ─────────────────  ─────────────────────
IEM MOS     Iowa Environmental Mesonet      US MOS stations    run-stamped per row
Open-Meteo  open-meteo.com (36 NWP models)  global, any point  per-cycle issued_at

06 · Open-Meteo — 36 models, leakage-safe by construction

Open-Meteo aggregates 36 numerical-weather-prediction models across providers — NCEP (GFS/HRRR), ECMWF (IFS/AIFS), DWD (ICON), Météo-France (ARPEGE/AROME), JMA / KMA / CMA / BoM, UKMO, MET Norway, and GEM Canada. You pick the model; the SDK fetches it directly, no proxy.

Training vs. live modes. The SDK never blends model cycles silently:

Training (previous_runs, single_run) — fetches a specific historical model cycle. Every row carries the issued_at of the cycle that produced it, so a feature built at decision-time T can be filtered to issued_at ≤ T with no future leakage.
Live (live) — the latest available cycle, with issued_at derived by a conservative cycle-math lower bound (floor_to_cycle(now − publish_lag(model))). When in doubt it rounds earlier, never later.
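The live-mode lower bound can be sketched as follows. The names floor_to_cycle and publish_lag come from the formula above, but the bodies here are assumptions: the cycle length defaults to 6 hours purely for illustration (models differ), and the publish lag is passed in as a timedelta rather than looked up per model.

```python
from datetime import datetime, timedelta, timezone


def floor_to_cycle(t, cycle_hours=6):
    """Round t down to the most recent cycle boundary (00Z, 06Z, ... for 6h)."""
    if 24 % cycle_hours != 0:
        raise ValueError("cycle_hours must divide 24")
    hour = (t.hour // cycle_hours) * cycle_hours
    return t.replace(hour=hour, minute=0, second=0, microsecond=0)


def live_issued_at(now, publish_lag, cycle_hours=6):
    """Conservative issued_at: subtract the lag first, then floor to a cycle.

    Subtracting before flooring can only move the result earlier, which is
    the safe direction for leakage checks.
    """
    return floor_to_cycle(now - publish_lag, cycle_hours)
```

With a 4-hour lag, a call at 14:30Z resolves to the 06Z cycle of the same day, and a call at 02:00Z resolves to the 18Z cycle of the previous day.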

The issued_at guarantee. Every forecast row records the model run that produced it. The one endpoint that cannot honor this — Open-Meteo’s Historical Forecast (seamless) feed, which stitches multiple cycles into one continuous series — is banned for training data. Requesting it raises OpenMeteoSeamlessLeakageError unless you explicitly pass allow_leakage=True, and even then the leakage detector rejects the rows once an as_of cutoff is asserted. This traces back to a real incident (Tarabcak/mostlyright#70) where a seamless feed silently fed post-snapshot model runs into training data and manufactured an apparent — but non-reproducible — edge.
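A minimal sketch of the kind of check this guarantee enables, assuming forecast rows are dicts with an issued_at key. The `LeakageError` class and `filter_to_cutoff` function are hypothetical stand-ins for illustration; they are not the SDK's OpenMeteoSeamlessLeakageError or its leakage detector.

```python
from datetime import datetime, timezone


class LeakageError(ValueError):
    """Hypothetical error type for this sketch only."""


def filter_to_cutoff(forecast_rows, as_of):
    """Keep rows issued at or before as_of; refuse rows with no issued_at.

    A stitched series cannot name a single model run per row, so a row
    without issued_at is treated as unverifiable and rejected outright.
    """
    kept = []
    for row in forecast_rows:
        issued_at = row.get("issued_at")
        if issued_at is None:
            raise LeakageError("row has no issued_at; cannot prove it predates as_of")
        if issued_at <= as_of:
            kept.append(row)
    return kept
```

Rows issued after the cutoff are dropped silently because they are legitimate data for a later decision time, while rows with no run identity raise because no cutoff can ever be verified against them.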

Source identity is explicit on every row: open_meteo.previous_runs, open_meteo.single_run, or open_meteo.live. You always know which cycle a number came from.

Raw-as-reported — the rule in depth.
Observation schema — the 30-field contract.
Forecast sources reference — IEM MOS vs. NWP gridded vs. Open-Meteo, model-by-model, with quickstart examples.