Source identity

Three providers carry the same METAR: AWC for the live feed, IEM ASOS for the indexed archive, GHCNh for the long historical tail. They publish on different cadences with different corrections, so two strategies that look identical on paper can diverge in production because one trained on IEM-as-of-yesterday and the other infers from AWC-as-of-right-now.

mostlyrightmd makes the choice explicit through one kwarg contract. source= always means provenance (WHO produced the row) and nothing else. The default source-blind path fuses across providers for a single settlement-reproduction view; a source= pin selects a single named provenance. Both paths stamp the source identity on every row, so feeding a fused training set into a source-pinned inference path (or the reverse) raises instead of passing silently.

The fused default — source-blind

The default dataset() call returns fused output: one row per LST settlement date with obs_high_f / obs_low_f computed across the merged METAR feed. When the fetched feeds overlap, a deterministic client-side dedup tiebreak runs in your process to keep one row, with order AWC > IEM > GHCNh and first-seen kept at equal rank. This is the right choice for settlement reproduction. source=None (the default) means best available.

Python

```python
import mostlyright

df = mostlyright.dataset("KNYC", "2025-01-06", "2025-01-12")
# obs_high_f, obs_low_f come from the AWC > IEM > GHCNh fused merge
```

TypeScript

```typescript
import { dataset } from "mostlyright";

const rows = await dataset("KNYC", "2025-01-06", "2025-01-12");
// obs_high_f, obs_low_f come from the AWC > IEM > GHCNh fused merge
```

The fused path is the right choice when:

- You are reproducing a historical Kalshi settlement.
- You want maximum coverage (GHCNh fills gaps where IEM ASOS dropped a tick).
- You are training a model on the same fused tape every other production model trained on.
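
The dedup tiebreak in the fused path can be pictured with a few lines of plain Python. This is a minimal sketch, not the library's implementation: the row keys `station`, `observed_at`, and `temp_f`, the `SOURCE_RANK` table, and the function name `dedup_by_source_rank` are all hypothetical, chosen only to illustrate the AWC > IEM > GHCNh order with first-seen kept at equal rank.

```python
# Assumed precedence, lower number wins: AWC > IEM > GHCNh.
SOURCE_RANK = {"awc": 0, "iem": 1, "ghcnh": 2}


def dedup_by_source_rank(rows):
    """Keep one row per (station, observed_at), preferring the best-ranked source."""
    best = {}
    for row in rows:
        key = (row["station"], row["observed_at"])
        kept = best.get(key)
        # Strict "<" means an equal-rank duplicate never displaces the first-seen row.
        if kept is None or SOURCE_RANK[row["source"]] < SOURCE_RANK[kept["source"]]:
            best[key] = row
    return list(best.values())


# Hand-written stand-in rows (hypothetical values, not real observations).
rows = [
    {"station": "KNYC", "observed_at": "2025-01-06T12:51Z", "source": "iem", "temp_f": 31.0},
    {"station": "KNYC", "observed_at": "2025-01-06T12:51Z", "source": "awc", "temp_f": 30.9},
    {"station": "KNYC", "observed_at": "2025-01-06T13:51Z", "source": "ghcnh", "temp_f": 32.0},
    {"station": "KNYC", "observed_at": "2025-01-06T13:51Z", "source": "ghcnh", "temp_f": 32.5},
]
deduped = dedup_by_source_rank(rows)
```

Because the comparison is deterministic and depends only on the rows themselves, the same overlapping feeds should always collapse to the same kept row, which is the property settlement reproduction relies on.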

The `source=` pin — single provenance

For strategies where source-identity matters at trade time, pin the source. A source= pin selects raw observation rows from a single named feed. On obs(), pass granularity="observation" for the per-report rows.

Python

```python
from mostlyright.weather import obs

df = obs(
    "KNYC", "2025-01-06", "2025-01-12",
    source="iem", granularity="observation",
)
# Every row's `source` is the truthful parser-emitted tag "iem"
```

TypeScript

```typescript
import { obs } from "@mostlyrightmd/weather";

const rows = await obs("KNYC", "2025-01-06", "2025-01-12", {
  source: "iem",
  granularity: "observation", // TS default is already "observation"
});
// Every row's `source` is the truthful parser-emitted tag "iem"
```

The source vocabulary:

| Source | Description |
| --- | --- |
| `iem` | IEM ASOS (parser-emitted tag, archive + live both) |
| `iem.archive` | IEM ASOS archive (canonical identity for backfilled rows) |
| `iem.live` | IEM ASOS live tick |
| `awc` | AWC METAR (parser-emitted tag) |
| `awc.live` | AWC live tick |
| `ghcnh` | NCEI GHCNh |
| `ghcnh.archive` | NCEI GHCNh archive (canonical identity) |

A pin accepts either form on input. The per-row source overlay column carries the parser-emitted tag (iem, awc, ghcnh) so downstream validators see the truthful provenance, not a rewritten alias.

Passing a source the vertical does not recognize raises a loud typed error immediately, before any network call — never a silent fallback to a different provider.
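
The fail-fast behaviour above can be sketched as a pure check that runs before any fetch. Everything here is illustrative: `KNOWN_SOURCES` simply copies the vocabulary table, while `UnknownSourceError`, `check_source_pin`, and `parser_tag` are hypothetical names, and the dotted-identity-to-tag mapping is an assumption drawn from the table rather than confirmed library behaviour.

```python
# Vocabulary copied from the table above.
KNOWN_SOURCES = frozenset({
    "iem", "iem.archive", "iem.live",
    "awc", "awc.live",
    "ghcnh", "ghcnh.archive",
})


class UnknownSourceError(ValueError):
    """Hypothetical typed error; the library's real exception name may differ."""


def check_source_pin(source):
    """Validate a pin up front, so a typo fails before any fetch is attempted."""
    if source is None:
        return None  # source-blind default: there is no pin to check
    if source not in KNOWN_SOURCES:
        raise UnknownSourceError(
            f"unknown source {source!r}; expected one of {sorted(KNOWN_SOURCES)}"
        )
    return source


def parser_tag(source):
    """Assumed mapping: a dotted identity such as "iem.archive" carries the tag "iem"."""
    return source.split(".", 1)[0]
```

Raising on an unknown value, rather than substituting a default provider, keeps a typo such as `source="ieem"` from quietly producing rows with a different provenance than the caller asked for.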

`delivery=` is a separate axis

source= (provenance) and delivery= (where the computation runs — local default, or the opt-in hosted seam) are strictly orthogonal. Hosted rows are byte-identical to the local path. A vertical that overloads source= to mean transport is a bug.

| Axis | Question it answers | Example values |
| --- | --- | --- |
| `source=` | WHO produced it | `None`, `"awc"`, `"iem"`, `"noaa_goes"` |
| `delivery=` | WHERE it is computed | `"local"`, `"hosted"` |
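
One way to convince yourself that the two axes are independent is a parity check: fetch the same request with each delivery value and compare a digest of the rows. The sketch below uses hand-written stand-in rows and a hypothetical helper `rows_digest`; it does not call the library, and the canonical-JSON encoding is an illustrative choice, not the library's wire format.

```python
import hashlib
import json


def rows_digest(rows):
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra whitespace)."""
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Stand-in rows; in practice these would come from two calls that differ only in delivery=.
local_rows = [{"station": "KNYC", "source": "awc", "temp_f": 30.9}]
hosted_rows = [{"temp_f": 30.9, "source": "awc", "station": "KNYC"}]

same = rows_digest(local_rows) == rows_digest(hosted_rows)
```

If changing delivery= ever changed the digest, or changed the per-row source, that would indicate the transport axis leaking into provenance.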

Frame-vs-row identity

There are two source tags, and they mean different things:

- Per-row source: the truthful, durable parser/provider tag on each row (awc / iem / ghcnh). It survives every DataFrame operation because it is a real column.
- Frame-level df.attrs["source"]: a pandas-experimental, fragile tag on the whole frame. attrs is dropped by concat / merge / groupby, so treat it as advisory, not load-bearing. (TypeScript carries the equivalent identity as a non-enumerable frameSource property on the returned array.)

The fused observation frame carries the frame tag merged.live_v1 — a distinct merged-policy identity, not any single provider’s name. merged.live_v1 is rejected by every single-source pinned schema (schema.observation.v1 raises SourceMismatchError) and is accepted only by schema.observation.merged.v1, which validates the per-row source as membership in {awc, iem, ghcnh}. This is what stops a fused frame from silently passing through a pinned-source validator.
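
The acceptance rules in the paragraph above can be written down as a small decision function. This is a sketch under stated assumptions: `SourceMismatch` is a local stand-in for the library's SourceMismatchError, `check_frame` is a hypothetical name, and only the two rules the text states are encoded (the pinned schema rejects the merged frame tag; the merged schema checks per-row membership).

```python
SINGLE_SOURCES = frozenset({"awc", "iem", "ghcnh"})
MERGED_FRAME_TAG = "merged.live_v1"


class SourceMismatch(Exception):
    """Local stand-in for the library's SourceMismatchError."""


def check_frame(schema_id, frame_source, row_sources):
    """Return True if the frame passes the schema's source rule, else raise."""
    if schema_id == "schema.observation.merged.v1":
        # Merged schema: every per-row source must be one of the single providers.
        stray = set(row_sources) - SINGLE_SOURCES
        if stray:
            raise SourceMismatch(f"rows carry unexpected sources: {sorted(stray)}")
        return True
    if schema_id == "schema.observation.v1":
        # Pinned schema: a fused frame must never pass as a single-source frame.
        if frame_source == MERGED_FRAME_TAG:
            raise SourceMismatch(f"{MERGED_FRAME_TAG} is not a single-source identity")
        return True
    raise KeyError(f"unknown schema {schema_id!r}")
```

Keeping the merged identity distinct from every provider name is what makes the second branch possible: the pinned validator can reject the frame without inspecting a single row.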

SourceMismatchError — the loud guard

The validator stamps the schema’s expected source at registration time and compares every DataFrame’s carried identity against it. If you build a training set from iem and then accidentally route the result through a schema whose registered source is ghcnh, the validator raises SourceMismatchError instead of silently joining rows that should never have met.

Python

```python
from mostlyright.core import validate_dataframe, SourceMismatchError
from mostlyright.weather import obs

df = obs("KNYC", "2025-01-06", "2025-01-12",
         source="iem", granularity="observation")

# Validator pins the expected source from the schema:
try:
    result = validate_dataframe(df, schema_id="schema.observation.v1")
except SourceMismatchError as err:
    print(err.schema_source)   # what the schema expected
    print(err.data_source)     # what the DataFrame actually carries
    print(err.role)            # "observations" | "forecasts" | "settlement"
```

TypeScript

```typescript
import { validateRows } from "@mostlyrightmd/core/validator";
import { SourceMismatchError } from "@mostlyrightmd/core";
import { obs } from "@mostlyrightmd/weather";

const rows = await obs("KNYC", "2025-01-06", "2025-01-12", { source: "iem" });

try {
  const result = validateRows(rows, "schema.observation.v1");
} catch (err) {
  if (err instanceof SourceMismatchError) {
    console.log(err.schemaSource);  // what the schema expected
    console.log(err.dataSource);    // what the rows actually carry
    console.log(err.role);          // "observations" | "forecasts" | "settlement"
  }
}
```

To opt out (e.g. for a one-time manual backfill that intentionally crosses sources), pass allow_source_drift="<reason>":

```python
result = validate_dataframe(
    df,
    schema_id="schema.observation.v1",
    allow_source_drift="manual backfill from GHCNh into the IEM schema",
)
# audit_log carries both "registered" + "source_drift_allowed" entries
```

The opt-out is intentionally verbose: the reason string is required, and it shows up in the audit log so a code review can spot it.
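
To make the opt-out semantics concrete, here is a minimal sketch of a guard that insists on a non-empty reason and records it. It is not the library's validator: `validate_source`, the `SourceMismatch` stand-in, and the audit entry layout are hypothetical, with only the two event names ("registered" and "source_drift_allowed") taken from the example above.

```python
class SourceMismatch(Exception):
    """Local stand-in for the library's SourceMismatchError."""


def validate_source(data_source, schema_source, allow_source_drift=None):
    """Compare the carried source with the schema's; return the audit log."""
    audit_log = [{"event": "registered", "schema_source": schema_source}]
    if data_source == schema_source:
        return audit_log
    # A mismatch is only tolerated when the caller supplies a real reason string.
    if not (isinstance(allow_source_drift, str) and allow_source_drift.strip()):
        raise SourceMismatch(
            f"schema expects {schema_source!r}, data carries {data_source!r}"
        )
    audit_log.append({
        "event": "source_drift_allowed",
        "reason": allow_source_drift,
        "data_source": data_source,
    })
    return audit_log


log = validate_source(
    "ghcnh", "iem",
    allow_source_drift="manual backfill from GHCNh into the IEM schema",
)
```

Taking a string rather than a boolean means a bare `True` cannot silence the guard, and the recorded reason travels with the audit trail.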

Train / trade symmetry

The contract exists so that a feature backtests the same way it trades. Pin the same provenance on both sides of the seam:

```python
import mostlyright
from mostlyright.weather import obs
from mostlyright.core import validate_dataframe

# TRAIN — pinned historical, per-report grain
train = obs("KNYC", "2024-01-01", "2024-12-31",
            source="awc", granularity="observation")
validate_dataframe(train, schema_id="schema.observation.v1")

# TRADE — the SAME provenance, live
tick = mostlyright.live.stream("KNYC", source="awc")
```

Because both sides pin source="awc", the source identity is consistent across the train/infer seam. If you accidentally route a fused (merged.live_v1) frame through a single-source pinned schema, validate_dataframe raises SourceMismatchError at load time — loud, before the mismatch can silently corrupt a model.
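
A cheap extra check at the seam is to compare the per-row source tags on both sides before any feature is computed. The helper names `provenance` and `assert_same_provenance` below are hypothetical, and the rows are hand-written stand-ins rather than library output; the sketch relies only on the per-row source column described earlier.

```python
def provenance(rows):
    """Set of per-row source tags present in a batch of rows."""
    return frozenset(row["source"] for row in rows)


def assert_same_provenance(train_rows, live_rows):
    """Raise if the training and live batches do not share the same source tags."""
    train, live = provenance(train_rows), provenance(live_rows)
    if train != live:
        raise ValueError(
            f"train sources {sorted(train)} != live sources {sorted(live)}"
        )


# Stand-in rows: both sides pinned to "awc", so the check passes silently.
train_rows = [{"source": "awc", "temp_f": 30.9}, {"source": "awc", "temp_f": 31.4}]
live_rows = [{"source": "awc", "temp_f": 29.8}]
assert_same_provenance(train_rows, live_rows)
```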
