Source identity

Three providers carry the same METAR: AWC for the live feed, IEM ASOS for the indexed archive, GHCNh for the long historical tail. They publish on different cadences with different corrections, so two strategies that look identical on paper can diverge in production because one trained on IEM-as-of-yesterday and the other infers from AWC-as-of-right-now.

mostlyrightmd makes the choice explicit. Mode 1 is the v0.14.1-compatible fused output. Mode 2 pins observations to a single source. Both stamp the source identity on every row so a Mode 1 training set cannot silently feed a Mode 2 inference path — or vice versa — without raising.

The default research() call returns Mode 1 output: one row per LST settlement date, with obs_high_f and obs_low_f computed across the merged METAR feed. When more than one source reports the same observation, the merge keeps the row from the highest-priority source (AWC > IEM > GHCNh); at equal priority, the first-seen row wins. The output is byte-equivalent to mostlyright==0.14.1’s client.pairs() and is the right choice for settlement reproduction.

import mostlyright
df = mostlyright.research("KNYC", "2025-01-06", "2025-01-12")
# obs_high_f, obs_low_f come from the AWC>IEM>GHCNh fused merge
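
The priority rule can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the row shape, the fuse() helper, and the timestamps and temperatures are all made up for the example. Only the ordering AWC > IEM > GHCNh and the first-seen tie-break come from the description above.

```python
from __future__ import annotations

from typing import Iterable

# Priority order stated above: AWC > IEM > GHCNh (lower rank wins).
PRIORITY = {"awc": 0, "iem": 1, "ghcnh": 2}


def fuse(rows: Iterable[dict]) -> list[dict]:
    """Keep one row per (station, observed_at), preferring higher-priority sources.

    Hypothetical row shape: {"station", "observed_at", "source", "temp_f"}.
    """
    best: dict[tuple, dict] = {}
    for row in rows:
        key = (row["station"], row["observed_at"])
        kept = best.get(key)
        # Strict '<' means an equal-priority duplicate never replaces
        # the row that was seen first.
        if kept is None or PRIORITY[row["source"]] < PRIORITY[kept["source"]]:
            best[key] = row
    return sorted(best.values(), key=lambda r: (r["station"], r["observed_at"]))


# Toy values, invented for illustration only.
rows = [
    {"station": "KNYC", "observed_at": "2025-01-06T12:51Z", "source": "iem", "temp_f": 31.0},
    {"station": "KNYC", "observed_at": "2025-01-06T12:51Z", "source": "awc", "temp_f": 32.0},
    {"station": "KNYC", "observed_at": "2025-01-06T13:51Z", "source": "ghcnh", "temp_f": 33.0},
    {"station": "KNYC", "observed_at": "2025-01-06T13:51Z", "source": "ghcnh", "temp_f": 34.0},
]

fused = fuse(rows)
obs_high_f = max(r["temp_f"] for r in fused)
```

In this toy tape the AWC row supersedes the IEM row at 12:51Z, and the first of the two GHCNh rows at 13:51Z survives, so the daily high is computed from the surviving rows only.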

Mode 1 is the right choice when:

  • You are reproducing a historical Kalshi settlement.
  • You want maximum coverage (GHCNh fills gaps where IEM ASOS dropped a tick).
  • You are training a model on the same fused tape every other production model trained on.

For strategies where source-identity matters at trade time, pin the source. Mode 2 returns raw observation rows from a single named feed and stamps df.attrs["source"] so the validator catches downstream mismatches.

from mostlyright.mode2 import research_by_source
df = research_by_source(
    station="KNYC",
    source="iem.archive",
    from_date="2025-01-06",
    to_date="2025-01-12",
)
# Every row's `source` is "iem.archive" or "iem"
# df.attrs["source"] == "iem.archive"

The source vocabulary in v0.1.0:

| Source | Description |
| --- | --- |
| iem | IEM ASOS (parser-emitted tag, archive + live both) |
| iem.archive | IEM ASOS archive (canonical identity for backfilled rows) |
| iem.live | IEM ASOS live tick |
| awc | AWC METAR (parser-emitted tag) |
| awc.live | AWC live tick |
| ghcnh | NCEI GHCNh |
| ghcnh.archive | NCEI GHCNh archive (canonical identity) |

Mode 2 accepts either form on input. The per-row source overlay column carries the parser-emitted tag (iem, awc, ghcnh) so downstream validators see the truthful provenance, not a rewritten alias.
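
One way to picture the relationship between the two forms is a small lookup that maps any accepted identity to its parser-emitted tag. The helper names below are hypothetical, and the rule "the tag is the part before the first dot" is an assumption read off the vocabulary table, not a documented guarantee.

```python
# Vocabulary copied from the source table; everything else here is illustrative.
KNOWN_SOURCES = {
    "iem", "iem.archive", "iem.live",
    "awc", "awc.live",
    "ghcnh", "ghcnh.archive",
}


def parser_tag(source: str) -> str:
    """Return the parser-emitted tag for a source identity (assumed: text before the first dot)."""
    if source not in KNOWN_SOURCES:
        raise ValueError(f"unknown source: {source!r}")
    return source.split(".", 1)[0]


def same_provider(a: str, b: str) -> bool:
    """True when two identities resolve to the same parser-emitted tag."""
    return parser_tag(a) == parser_tag(b)


tag = parser_tag("iem.archive")  # the per-row tag a caller would expect to see
```

Under that assumption, a request pinned to iem.archive would carry "iem" in the per-row column while df.attrs["source"] keeps the canonical identity.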

The validator stamps the schema’s expected source at registration time and compares every DataFrame’s df.attrs["source"] against it. If you build a training set from iem.archive and then accidentally route the result through a schema whose registered source is ghcnh.archive, the validator raises SourceMismatchError instead of silently joining rows that should never have met.

from mostlyright.core import validate_dataframe, SourceMismatchError
from mostlyright.mode2 import research_by_source
df = research_by_source("KNYC", "iem.archive", "2025-01-06", "2025-01-12")
# Validator pins the expected source from the schema:
try:
    result = validate_dataframe(df, schema_id="schema.observation.v1")
except SourceMismatchError as err:
    print(err.schema_source)  # what the schema expected
    print(err.data_source)    # what the DataFrame actually carries
    print(err.role)           # "observations" | "forecasts" | "settlement"
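
To make the comparison concrete, here is a self-contained toy version of such a check. It is a sketch under stated assumptions, not mostlyrightmd's validator: the registry dict, check_source(), and ToySourceMismatchError are hypothetical, and a SimpleNamespace stands in for a DataFrame because only its attrs mapping matters here. The three error attributes mirror the ones shown above.

```python
from types import SimpleNamespace


class ToySourceMismatchError(Exception):
    """Illustrative stand-in carrying the same three fields as the real error."""

    def __init__(self, schema_source: str, data_source, role: str):
        super().__init__(
            f"{role}: schema expects {schema_source!r}, data carries {data_source!r}"
        )
        self.schema_source = schema_source
        self.data_source = data_source
        self.role = role


# Hypothetical registry: schema id -> (expected source, role).
REGISTERED = {"schema.observation.v1": ("ghcnh.archive", "observations")}


def check_source(frame, schema_id: str) -> None:
    """Raise if the frame's stamped source differs from the registered one."""
    schema_source, role = REGISTERED[schema_id]
    data_source = frame.attrs.get("source")
    if data_source != schema_source:
        raise ToySourceMismatchError(schema_source, data_source, role)


# Stand-in for a DataFrame stamped by a Mode 2 fetch.
frame = SimpleNamespace(attrs={"source": "iem.archive"})

try:
    check_source(frame, "schema.observation.v1")
    caught = None
except ToySourceMismatchError as err:
    caught = err
```

The point of the design is that the comparison needs nothing but the stamped metadata, so a mismatch surfaces before any rows are joined.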

To opt out (e.g. for a one-time manual backfill that intentionally crosses sources), pass allow_source_drift="<reason>":

result = validate_dataframe(
    df,
    schema_id="schema.observation.v1",
    allow_source_drift="manual backfill from GHCNh into the IEM schema",
)
# audit_log carries both "registered" + "source_drift_allowed" entries

The opt-out is intentionally verbose: the reason string is required, and it shows up in the audit log so a code review can spot it.
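
A minimal sketch of how a reason-bearing opt-out might feed an audit log follows. The function name and the shape of the log entries are assumptions for illustration; only the two event names, "registered" and "source_drift_allowed", come from the example above.

```python
from __future__ import annotations


def check_with_opt_out(
    schema_source: str,
    data_source: str,
    allow_source_drift: str | None = None,
) -> list[dict]:
    """Return a toy audit log, or raise when sources differ without a reason."""
    audit_log = [{"event": "registered", "source": schema_source}]
    if data_source == schema_source:
        return audit_log
    if allow_source_drift is None:
        raise ValueError(
            f"source mismatch: expected {schema_source!r}, got {data_source!r}"
        )
    if not allow_source_drift.strip():
        # An empty string would defeat the purpose of a reviewable reason.
        raise ValueError("allow_source_drift needs a non-empty reason")
    audit_log.append(
        {
            "event": "source_drift_allowed",
            "reason": allow_source_drift,
            "data_source": data_source,
        }
    )
    return audit_log


log = check_with_opt_out(
    "iem.archive",
    "ghcnh.archive",
    allow_source_drift="manual backfill from GHCNh into the IEM schema",
)
```

Because the reason travels with the log entry, a reviewer can search the audit trail for "source_drift_allowed" and read why each crossing was permitted.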

v0.1.0 limitation — Mode 2 reads through the Mode 1 merge

In v0.1.0, research_by_source() calls the same observation fetcher Mode 1 uses, which applies the AWC > IEM > GHCNh merge before returning rows. A Mode 2 caller asking for iem.archive over a window where AWC also has data sees only the IEM rows that AWC did not supersede — not the full IEM coverage of the upstream feed.
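
The effect is easy to reproduce with toy data. The sketch below is not the library's fetcher: the timestamps are invented and the merge is a simplified stand-in for the AWC > IEM > GHCNh rule. It only contrasts filtering for IEM after the merge with filtering before it.

```python
# Priority order from the merge rule: lower rank wins.
PRIORITY = {"awc": 0, "iem": 1, "ghcnh": 2}

# (observed_at, source) pairs; made-up timestamps for illustration.
raw = [
    ("2025-01-06T12:51Z", "awc"),
    ("2025-01-06T12:51Z", "iem"),
    ("2025-01-06T13:51Z", "iem"),
    ("2025-01-06T14:51Z", "awc"),
    ("2025-01-06T14:51Z", "iem"),
]

# Simplified merge: one surviving source per timestamp.
merged = {}
for observed_at, source in raw:
    kept = merged.get(observed_at)
    if kept is None or PRIORITY[source] < PRIORITY[kept]:
        merged[observed_at] = source

# What a source filter sees when it runs after the merge (the v0.1.0 behavior described above).
post_merge_iem = sorted(t for t, s in merged.items() if s == "iem")

# What full single-source coverage would look like (filter before merging).
pre_merge_iem = sorted(t for t, s in raw if s == "iem")
```

Here the post-merge filter keeps only the 13:51Z observation, while the pre-merge filter keeps all three IEM observations, which is the gap an isolated pre-merge path would close.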

If you need full single-source coverage today (before v0.2 adds the pre-merge isolated path), call the per-source fetchers in mostlyright.weather._fetchers directly and skip the catalog layer.

  • Temporal safety: KnowledgeView and assert_no_leakage.
  • Legacy migration — the legacy mostlyright package on PyPI had no source-identity contract; switching to mostlyrightmd is the first time these guards become available.