Distance Matching

Distance matching uses spatial proximity to produce matches. It consists of four predicate classes contributing six sequential predicate runs, with a strict trio-first stage followed by progressively looser stages. Each predicate receives the same MatchingContext after it has been mutated by previous predicates via ctx.commit().

flowchart LR C0["Context start"] --> T["Trio"] --> C1["C1"] C1 --> G["Group proximity"] --> C2["C2"] C2 --> L["Local ref"] --> C3["C3"] C3 --> N1["Nearest 3a"] --> C4["C4"] C4 --> N2["Nearest 3b"] --> C5["C5"] C5 --> N3["Nearest 3a retry"] --> C6["Context end"]

Overview

Predicate Class match_type Matches
TrioDistanceMatchingPredicate distance_matching_trio 2,168
GroupProximityPredicate distance_matching_1_* 23,445
LocalRefDistancePredicate distance_matching_2 121
NearestDistancePredicate (single pass 1) distance_matching_3a 405
NearestDistancePredicate (ratio) distance_matching_3b 77
NearestDistancePredicate (single pass 2) distance_matching_3a_second_pass 56

Total: 26,272 distance matches

Predicates

Distance matching contributes six sequential predicate executions in the pipeline: trio, group proximity, local-ref, nearest 3a, nearest 3b, and nearest 3a retry. These runs do not execute in parallel. Each run operates on AtlasNode and OsmNode domain entities and records matches via ctx.commit(). The runner refreshes unmatched records between calls, so every stage sees the updated post-commit state.

TrioDistanceMatchingPredicate - Trio Side Matching First

flowchart TB A["Trio representative"] --> B{"2 side nodes free\nand same UIC?"} B -->|No| X["Skip trio"] B -->|Yes| C{"Exactly 2 unmatched\nATLAS for that UIC?"} C -->|No| X C -->|Yes| D["Evaluate both 2x2 assignments"] D --> E["Pick lower total distance"] E --> F["Commit side 1 + side 2\nmatch_type=distance_matching_trio"] F --> G["Leave middle unmatched"]

This stage runs before all other distance predicates.

For each pre-grouped trio (osm_trio):

Before this predicate even runs, trio registration already enforces one spatial locality guard: both non-middle side nodes must be within 15 m of the middle stop_position. If that guard fails, the UIC is not treated as a trio and the nodes remain available for the later pair-grouping paths.

  1. Identify the two side nodes and the explicit middle node from trio metadata.
  2. Read unmatched ATLAS rows for the trio UIC.
  3. Continue only if exactly two ATLAS rows are available.
  4. Evaluate both possible 2x2 assignments and choose the one with lower total haversine distance.
  5. Keep the middle node intentionally unmatched in this stage.

All matches created here use match_type = distance_matching_trio.

If a trio does not satisfy the cardinality constraints at execution time, the predicate skips it and leaves later stages to decide.

GroupProximityPredicate - Group-Based Conflict-Free Assignment

flowchart TB K["Key in order:\nuic_ref -> uic_name -> name"] --> A["Build ATLAS groups"] A --> B["Fetch unmatched grouped OSM"] B --> C["bipartite_match_max_cardinality()"] C --> D{"Any valid pair\n(distance <= 50m)?"} D -->|No| X["Skip group"] D -->|Yes| F["Commit conflict-free pairs\nmatch_type=distance_matching_1_*"]

Tries three grouping strategies in order:

  1. UIC reference: AtlasNode.uic_refOsmNode.uic_ref
  2. UIC name: AtlasNode.designation_officialOsmNode.uic_name
  3. Official name: AtlasNode.designation_officialOsmNode.name

For each group, calls the shared bipartite_match_max_cardinality() helper.

Matching objective inside each group is deterministic and conflict-free:

  1. Maximize number of matched pairs with distance <= 50m.
  2. Among those maximum-cardinality solutions, minimize total matched distance.

This allows partial group matching: if a group contains outliers, matchable rows are still committed and only rows without valid edges remain unmatched.

The helper builds a full pairwise haversine distance matrix using numpy broadcasting and solves a one-to-one assignment with dummy unmatched slots using scipy.optimize.linear_sum_assignment.

Grouped lookup is representative-aware: when an OSM representative has siblings, the representative is indexed by key values found on itself and on its siblings. This preserves UIC/name anchor visibility after pre-grouping.

The current implementation does not perform a separate stop-position-only retry. It groups whatever unmatched, non-station OSM entities are available in OsmState for the active key and attempts a single conflict-free bipartite assignment.

LocalRefDistancePredicate - Exact local_ref Within Radius

flowchart TB A["Unmatched ATLAS"] --> B["batch_query_radius(50m)"] B --> C["Per ATLAS: re-filter used OSM"] C --> D{"designation == local_ref\n(case-insensitive)?"} D -->|No| X["Skip"] D -->|Yes| E["Pick closest"] E --> F["Commit\nmatch_type=distance_matching_2"]

Matches where AtlasNode.designation exactly equals OsmNode.local_ref, case-insensitively, within 50m. Uses KDTree spatial index via batch_query_radius() to find nearby candidates. When multiple candidates match, the closest wins.

Before picking a winner, the current implementation now re-validates the batched candidate list against the live used_ids set. This is necessary because earlier rows in the same predicate run can consume an OsmNode after the original batch query was computed.

NearestDistancePredicate - Single Candidate / Ratio Test / Single Retry

flowchart TB A["Unmatched ATLAS"] --> B["batch_query_radius(50m)"] B --> C["Remove used + contradicting local_ref"] C --> D{"Pass mode?"} D -->|single| E{"Exactly 1 candidate?"} E -->|Yes| F["Commit 3a or 3a_second_pass"] E -->|No| X["Skip"] D -->|ratio| G{"At least 2 candidates?"} G -->|No| X G -->|Yes| H{"d2 >= 10 and d2/d1 >= 4?"} H -->|No| X H -->|Yes| I["Commit 3b using closest"]

local_ref contradiction filter: Before evaluating candidates, any OsmNode whose local_ref contradicts the AtlasNode.designation is excluded (case-insensitive comparison). If either value is empty, no contradiction exists and the candidate passes through. This prevents false matches where a nearby OSM node clearly belongs to a different stop.

The predicate is executed three times in the pipeline:

  1. Pass 1 (3a): single-candidate only
  2. Pass 2 (3b): ratio-test only
  3. Pass 3 (3a retry): single-candidate only

Pass 3 exists to capture ATLAS rows that become unambiguous only after pass 2 consumes a competing OSM candidate.

Two decision sub-cases:

Case 3a — Single candidate: If exactly one non-contradicting, unused OsmNode that passes OsmNode.is_station filtering exists within 50m of an AtlasNode → match (distance_matching_3a on pass 1, distance_matching_3a_second_pass on pass 3).

Case 3b — Ratio test: When multiple non-contradicting candidates exist within 50m, applies a ratio test to determine if the closest is sufficiently closer than the second-closest:

Match if:  d₂ ≥ 10m  AND  d₂ / d₁ ≥ 4
Candidate Distance Decision
OSM A 5m Matched (d₂/d₁ = 25/5 = 5 ≥ 4)
OSM B 25m
OSM C 40m

Constants: RATIO_TEST_MIN_D2 = 10 metres, RATIO_TEST_FACTOR = 4.

Commit-Safe Candidate Revalidation

LocalRefDistancePredicate and NearestDistancePredicate both use a batched KDTree query for performance, but they no longer trust that batch blindly until commit time.

The important sequence is:

  1. batch_query_radius() returns candidate snapshots for all unmatched ATLAS rows.
  2. The predicate loop processes those rows one by one.
  3. ctx.commit() immediately marks the chosen OSM representative as used.
  4. Later rows re-filter their old candidate snapshot against the current used_ids set before deciding whether they have 0, 1, or multiple candidates.

Without this revalidation step, a later row could incorrectly treat an already-consumed OSM representative as its only candidate and produce a bad distance_matching_3a match.

Spatial Index: KDTree

All distance predicates use a KDTree spatial index for efficient nearest-neighbor queries, built on unit-sphere coordinates (3D xyz) rather than raw lat/lon.

The LocalRefDistancePredicate and NearestDistancePredicate use batched queries: all unmatched AtlasNode coordinates are collected into a list and passed to OsmState.batch_query_radius(), which calls query_ball_point(points, workers=-1) once — delegating the entire loop to SciPy's multithreaded C backend.

# Batched query (used by LocalRefDistancePredicate, NearestDistancePredicate)
coords = [(e.lat, e.lon) for e in unmatched]
batch_candidates = ctx.osm.batch_query_radius(coords, ctx.max_distance)

The spatial index is built once from available nodes that pass the current include_stations / OsmNode.is_station filter and reused across queries. used_ids and sibling nodes are filtered at query time, not at build time. Implementation is in utils/spatial_index.py and state.py.

No-Nearby-OSM Detection

After all predicates complete, the importer flags unmatched ATLAS entries that have no OSM node at all within 50m. This does not create matches — it identifies entries for problem detection.

Used-Node Prevention

No single OSM representative is matched more than once within the distance predicates — ctx.commit() immediately locks nodes via mark_used(), and later rows re-check that live state before committing. However, this does not imply a strict 1:1 relationship at the station level: since OSM often includes multiple nodes for a single stop (e.g. platform and stop_position), the distance matching process can validly match these distinct OsmNode entities to separate AtlasNode entries.

Code Reference

Class / Function Description
TrioDistanceMatchingPredicate Trio-first distance matching for side nodes with explicit middle-node exclusion
GroupProximityPredicate Conflict-free maximum-cardinality proximity assignment within UIC/name groups
LocalRefDistancePredicate Exact local_ref matching within 50m radius (batched KDTree queries)
NearestDistancePredicate Multi-pass nearest matching: 3a pass 1, 3b ratio, 3a pass 2 (batched KDTree queries)
bipartite_match_max_cardinality(atlas, osm, max_distance) Shared helper: vectorized distance matrix + max-cardinality min-distance assignment

Distance predicates are in predicates/distance_matching.py plus predicates/trio_distance_matching.py, with spatial utilities in utils/spatial_index.py and state.py.

Data update running in background
Preparing update... | Phase: initializing
Data update in progress
Core data is being refreshed. Use this time to read the documentation.
Elapsed: -- ETA: -- Phase: idle