4.1 Import Process

The database import process transforms processed CSV files and match outputs into the PostgreSQL database that powers the web application. We use PostgreSQL 15+ with PostGIS 3.x for spatial queries.

High-Level Architecture

```mermaid
flowchart TB
    subgraph Input["Input Files"]
        CSV["Processed CSVs"]
        Match["Match Results<br/>(matched, unmatched_atlas, unmatched_osm)"]
    end
    subgraph Import["Import Pipeline (matching_and_import_db/database/importer.py)"]
        M["1. Run Alembic Migrations"]
        C["2. TRUNCATE Import Tables"]
        R["3. Load Route Mappings"]
        I["4. Import OSM, Stop Units, Stops, Routes"]
        P["5. Detect Problems During Stop Import"]
        M --> C --> R --> I --> P
    end
    subgraph ImportDB["Import DB (import_db)"]
        Op["stops_matched, atlas_stops, osm_nodes,<br/>osm_stops, osm_stop_members,<br/>atlas_routes, atlas_route_directions,<br/>osm_routes, osm_route_tags,<br/>route_atlas_stops, route_osm_stops,<br/>routes_matched, route_problems, problems"]
    end
    CSV --> R
    Match --> I
```

Table Behavior

| Category | Database | Tables | Behavior |
|---|---|---|---|
| Import | import_db | stops_matched, atlas_stops, osm_nodes, osm_stops, osm_stop_members, atlas_routes, atlas_route_directions, osm_routes, osm_route_tags, route_atlas_stops, route_osm_stops, routes_matched, route_problems, problems | Truncated and repopulated on every run |

The database is intentionally reproducible from source files. Every import run wipes current import data to ensure consistency with the latest processed pipeline output.

Import Steps

The import_to_database() function in matching_and_import_db/database/importer.py orchestrates the import.

1. Schema Update

ensure_schema_updated()  # Runs Alembic migrations to HEAD

Pending Alembic migrations are applied first so the database schema matches the current ORM models.

2. Truncate Import Tables

session.execute(text("TRUNCATE TABLE atlas_stops, osm_nodes, osm_stops, osm_stop_members, route_atlas_stops, route_osm_stops CASCADE"))
session.execute(text("TRUNCATE TABLE routes_matched CASCADE"))
session.execute(text("TRUNCATE TABLE problems, stops_matched CASCADE"))

The importer clears the import tables up front to guarantee a fresh state for the new run.

3. Load Route Mappings

The route loader preloads all route CSVs in one pass:

  • atlas_routes.csv
  • atlas_route_directions.csv
  • atlas_route_stops.csv
  • osm_routes.csv
  • osm_route_tags.csv
  • osm_route_members.csv

This reduces repeated file I/O and prepares entity-first payloads for route insertion.
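As a rough illustration, a one-pass loader can read each route CSV exactly once and index the rows by route identifier so later insertion steps never re-open the files. This is a minimal sketch, not the actual loader: the `load_route_mappings` function, the `open_csv` callback, and the `route_id` column name are all assumptions.

```python
import csv
import io
from collections import defaultdict

# The six route CSVs named in the docs above.
ROUTE_CSVS = [
    "atlas_routes.csv",
    "atlas_route_directions.csv",
    "atlas_route_stops.csv",
    "osm_routes.csv",
    "osm_route_tags.csv",
    "osm_route_members.csv",
]

def load_route_mappings(open_csv):
    """Read every route CSV once and index rows by route id.

    `open_csv(name)` returns a file-like object for that CSV (an
    assumption of this sketch); the result maps
    csv_name -> {route_id: [rows]}.
    """
    mappings = {}
    for name in ROUTE_CSVS:
        by_route = defaultdict(list)
        with open_csv(name) as fh:
            for row in csv.DictReader(fh):
                by_route[row["route_id"]].append(row)
        mappings[name] = dict(by_route)
    return mappings
```

Because the loader only depends on an `open_csv` callback, it can be exercised with in-memory `io.StringIO` buffers in tests.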

4. Insert Entity Data

The importer writes data in this order:

  1. Import all OSM nodes into osm_nodes so dependent route-stop rows can resolve their stop members.
  2. Import OSM stop units into osm_stops.
  3. Import OSM stop membership into osm_stop_members.
  4. Import matched stops into stops_matched, creating atlas_stops rows on demand and evaluating matched-stop problems.
  5. Import unmatched ATLAS stops into stops_matched, creating atlas_stops rows on demand and evaluating unmatched-stop problems.
  6. Import unmatched OSM stops into stops_matched and evaluate unmatched-stop problems.
  7. Import route entities and route membership into atlas_routes, atlas_route_directions, osm_routes, osm_route_tags, route_atlas_stops, route_osm_stops, routes_matched, and route_problems.

Records are accumulated in in-memory lists and committed in batches. The importer uses batched persistence plus periodic session.commit() checkpoints to keep imports fast without holding the entire run in a single transaction. The batch size defaults to 5000 and can be tuned through an environment variable:

export DB_IMPORT_BATCH_SIZE=5000

Progress is logged with timing information during long imports.
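The batching pattern described above can be sketched as a small helper that buffers records and commits every N rows. This is an illustrative sketch, not the importer's actual code; `commit_in_batches` is a hypothetical name, and only `add_all()` and `commit()` are assumed on the session.

```python
def commit_in_batches(session, records, batch_size=5000):
    """Buffer records and flush+commit every batch_size rows.

    `session` only needs add_all() and commit(), so any SQLAlchemy
    session (or a test double) works. Returns the number of commits.
    """
    commits = 0
    buffer = []
    for record in records:
        buffer.append(record)
        if len(buffer) >= batch_size:
            session.add_all(buffer)
            session.commit()  # checkpoint: keeps each transaction small
            buffer.clear()
            commits += 1
    if buffer:  # flush the final partial batch
        session.add_all(buffer)
        session.commit()
        commits += 1
    return commits
```

Periodic commits trade strict all-or-nothing atomicity for bounded memory and transaction size, which is acceptable here because every run starts from truncated tables anyway.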

5. Detect Problems

A ProblemContext is built once from the pipeline output, precomputing structures such as KDTrees, UIC counts, and duplicate maps. Then run_problem_pipeline(STOP_PROBLEM_PIPELINE, ctx, stop) is called for each stop.

The stop-level predicates currently include:

  • distance_problem
  • attributes_problem
  • duplicates_problem
  • unmatched_problem

Each predicate yields zero or more ProblemResult objects with a priority score, which are then converted into ORM Problem rows. See 3.1 Stop Problems for the stop-problem rules.
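The predicate pipeline can be sketched as follows. The names `ProblemResult`, `STOP_PROBLEM_PIPELINE`, and `run_problem_pipeline` follow the text above, but the dataclass fields, the predicate signatures, and the thresholds are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class ProblemResult:
    problem_type: str
    priority: int  # higher = more urgent (assumed convention)

# Each predicate inspects one stop against the precomputed context
# and yields zero or more results.
Predicate = Callable[[dict, dict], Iterable[ProblemResult]]

def distance_problem(ctx, stop):
    # Flag matches whose ATLAS/OSM coordinates are too far apart.
    if stop.get("distance_m", 0) > ctx.get("max_distance_m", 100):
        yield ProblemResult("distance", priority=2)

def unmatched_problem(ctx, stop):
    if stop.get("match_type") == "unmatched":
        yield ProblemResult("unmatched", priority=1)

STOP_PROBLEM_PIPELINE: List[Predicate] = [distance_problem, unmatched_problem]

def run_problem_pipeline(pipeline, ctx, stop):
    """Run every predicate over one stop and collect all results."""
    results = []
    for predicate in pipeline:
        results.extend(predicate(ctx, stop))
    return results
```

Keeping predicates as independent generators makes it cheap to add or reorder rules: the pipeline list is the single place where the active checks are declared.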

Route-level problems are computed separately during route payload construction and persisted to route_problems.

Running the Import

Environment Variables

| Variable | Description | Default |
|---|---|---|
| DATABASE_URI | Import database connection string | postgresql+psycopg://stops_user:1234@localhost:5432/import_db |
| DB_IMPORT_BATCH_SIZE | Records per batch commit | 5000 |
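A minimal sketch of reading these variables with their documented defaults (the exact handling inside importer.py is an assumption):

```python
import os

# Defaults mirror the table above.
DATABASE_URI = os.environ.get(
    "DATABASE_URI",
    "postgresql+psycopg://stops_user:1234@localhost:5432/import_db",
)
DB_IMPORT_BATCH_SIZE = int(os.environ.get("DB_IMPORT_BATCH_SIZE", "5000"))
```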

Command

export DATABASE_URI="postgresql://user:pass@localhost/stopdb"
python matching_and_import_db/database/importer.py

# Or skip certain phases to focus on selected matching stages
python matching_and_import_db/database/importer.py --skip-phase1 --skip-phase2

Persistence Behavior

The current importer does not preserve user modifications between runs. Every import rebuilds the import database from source files.

Related Documentation

  • 4. Database — Database architecture, schema, and indexing