6.3 Background Scheduler

The Background Scheduler is a dedicated service responsible for orchestrating the end-to-end data pipeline. It ensures that the ATLAS and OSM datasets are periodically synchronized, matched, and imported into the database without manual intervention.

Core Role

The scheduler automatically advances through the four main phases of the pipeline:

  1. Download and Process Data: Fetching the official ATLAS exports and OSM Overpass data.
  2. Matching Process: Running the multi-stage geospatial association logic.
  3. Problem Detection: Identifying data quality issues.
  4. Import Process: Rebuilding the import_db with fresh results.
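
Conceptually, the job runner reduces to calling these phases in strict order, so a failure early on means the import phase (and hence the database) is never touched. A minimal sketch, with stub functions standing in for the real phase implementations:

    # Illustrative sketch of the phase orchestration (not the project's actual code).
    import logging

    logger = logging.getLogger(__name__)

    # Stubs standing in for the real phase implementations.
    def run_download() -> None: ...           # 1. Download and Process Data
    def run_matching() -> None: ...           # 2. Matching Process
    def run_problem_detection() -> None: ...  # 3. Problem Detection
    def run_import() -> None: ...             # 4. Import Process

    def run_pipeline() -> None:
        """Run the phases strictly in order; an exception in any phase aborts
        the run, so the import only happens after a successful match."""
        for name, phase in [
            ("downloading", run_download),
            ("matching", run_matching),
            ("problem_detection", run_problem_detection),
            ("importing", run_import),
        ]:
            logger.info("starting phase: %s", name)
            phase()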

Implementation Details

Service Architecture

The scheduler is implemented as an APScheduler (BlockingScheduler) instance running within a dedicated Docker container (scheduler).

  • Entrypoint: matching_and_import_db/scheduler/service.py
  • Logic Runner: matching_and_import_db/scheduler/job_runner.py
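
A condensed sketch of what the entrypoint might look like, assuming APScheduler 3.x; it also shows where the environment variables described under Configuration below would be consumed (run_pipeline stands in for the job runner's entry function):

    # Illustrative sketch of service.py (assumes APScheduler 3.x).
    import logging
    import os

    from apscheduler.schedulers.blocking import BlockingScheduler

    def run_pipeline() -> None: ...  # stand-in for the job runner's entry function

    def main() -> None:
        logging.basicConfig(level=os.getenv("PIPELINE_LOG_LEVEL", "INFO"))
        interval_hours = int(os.getenv("PIPELINE_SCHEDULE_INTERVAL_HOURS", "24"))
        timezone = os.getenv("PIPELINE_TIMEZONE", "Europe/Zurich")

        # BlockingScheduler occupies the container's main process, which is
        # exactly what a single-purpose scheduler container needs.
        scheduler = BlockingScheduler(timezone=timezone)
        scheduler.add_job(run_pipeline, "interval", hours=interval_hours)
        scheduler.start()  # blocks until the container is stopped

    if __name__ == "__main__":
        main()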

Redis Integration & Locking

To ensure system stability, the scheduler interacts with Redis for two critical functions:

  1. Distributed Lock: Before starting a run, the scheduler attempts to acquire a pipeline_lock in Redis. This prevents multiple triggers (e.g., a scheduled task and a manual docker exec) from running simultaneously and corrupting the data.
  2. Status Reporting: The scheduler publishes its current state (e.g., downloading, matching, importing) to Redis. The Flask web application consumes this data to display a real-time progress bar and status message to users.
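
Both mechanisms map onto a handful of Redis commands. The sketch below assumes redis-py; pipeline_lock is the key named above, while the pipeline_status key and its JSON schema are illustrative assumptions:

    # Illustrative sketch of the locking and status-reporting helpers.
    import json
    import uuid

    import redis

    r = redis.Redis(host="redis", port=6379, decode_responses=True)

    def acquire_pipeline_lock(ttl_seconds: int = 6 * 3600) -> str | None:
        """Atomically acquire pipeline_lock; returns a token on success or
        None if another run holds it. The TTL guards against stale locks."""
        token = uuid.uuid4().hex
        if r.set("pipeline_lock", token, nx=True, ex=ttl_seconds):
            return token
        return None

    def publish_status(state: str, message: str = "") -> None:
        """Publish the current pipeline state (key name and JSON schema are
        assumptions); the Flask app reads this to render the progress bar."""
        r.set("pipeline_status", json.dumps({"state": state, "message": message}))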

Configuration

The scheduler's behavior is controlled via environment variables in the scheduler service:

Variable                            Description                                            Default
PIPELINE_SCHEDULE_INTERVAL_HOURS    Interval between automatic runs, in hours              24
PIPELINE_TIMEZONE                   Timezone used when computing the next run timestamp    Europe/Zurich
PIPELINE_LOG_LEVEL                  Verbosity of the pipeline logs                         INFO

Operational Commands

Manual Trigger

You can force a pipeline run immediately by executing the job runner inside the running scheduler container:

docker compose exec scheduler python -m matching_and_import_db.scheduler.job_runner --mode full --trigger manual
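
Given the --mode and --trigger flags shown above, the runner's CLI plausibly looks like the following sketch (the default values and any additional modes are assumptions):

    # Illustrative sketch of the job runner's CLI entrypoint.
    import argparse

    def run_pipeline(mode: str, trigger: str) -> None: ...  # stand-in for the real runner

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Run the data pipeline once.")
        parser.add_argument("--mode", default="full")       # "full" as shown above; other modes assumed
        parser.add_argument("--trigger", default="manual")  # recorded for observability
        args = parser.parse_args()
        run_pipeline(mode=args.mode, trigger=args.trigger)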

Checking Status

The current status can be queried via the API endpoint:

    GET /api/system/pipeline_status
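
On the web side, the endpoint can simply relay whatever the scheduler last published. A minimal Flask sketch, assuming the pipeline_status key and JSON schema from the locking sketch above:

    # Illustrative sketch of the status endpoint in the Flask app.
    import json

    import redis
    from flask import Flask, jsonify

    app = Flask(__name__)
    r = redis.Redis(host="redis", port=6379, decode_responses=True)

    @app.get("/api/system/pipeline_status")
    def pipeline_status():
        raw = r.get("pipeline_status")
        status = json.loads(raw) if raw else {"state": "idle", "message": ""}
        return jsonify(status)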


Error Handling

If a phase fails (e.g., a network timeout during OSM download), the scheduler:

  1. Logs the traceback to stdout.
  2. Updates the Redis status to failure with the error message.
  3. Releases the distributed lock so subsequent runs can still execute.
  4. Retains the old database state (since the import phase is only reached after successful matching).
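
Putting the pieces together, the failure path might be wrapped like this (a sketch reusing the hypothetical helpers from the earlier sketches):

    # Illustrative sketch of the failure path; acquire_pipeline_lock,
    # publish_status and the Redis client r are the helpers from the locking
    # sketch above, and run_pipeline stands in for the actual job runner.
    import logging
    import traceback

    logger = logging.getLogger(__name__)

    def guarded_run() -> None:
        token = acquire_pipeline_lock()
        if token is None:
            logger.info("pipeline_lock is held by another run; skipping")
            return
        try:
            run_pipeline()
            publish_status("success")
        except Exception as exc:
            logger.error("pipeline failed:\n%s", traceback.format_exc())  # 1. log the traceback
            publish_status("failure", message=str(exc))                   # 2. report the failure
            # 4. the import phase was never reached, so the old database state is retained
        finally:
            r.delete("pipeline_lock")  # 3. release the lock so later runs can execute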