Chapter 18

Observability and Operator Workflow

A migration pipeline that is running but not observable is not a pipeline — it is a process with an unknown state. Observability is the property that allows an operator to answer "what is this system doing right now" and "what did it do in the last hour" from external signals, without instrumenting the code or reading internal state.

This chapter covers what metrics a migration pipeline must emit, how dashboards structure those metrics for operator consumption, how alerts are designed to be actionable rather than noisy, and how the operator workflow integrates with runbooks during parallel running and cutover.

18.1 The Three Observability Signals

Observability in the context of a migration pipeline requires three types of signals:

Metrics: Numeric measurements sampled over time. Latency, throughput, row counts, lag, error rates. These answer "how fast" and "how many."

Events: Discrete occurrences with a timestamp and payload. A chunk completing, a CDC batch applying, a reconciliation discrepancy detected. These answer "what happened and when."

Logs: Free-text records of pipeline activity, typically used for debugging. Less structured than events; more verbose. These answer "why did this happen."

The monitoring dashboard (pulsaride-monitoring module) surfaces metrics and events. Logs are accessible via the standard application logging framework (logback/slf4j, forwarded to whatever log aggregation the team uses).
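
For concreteness, an event in this model is just a timestamped record with a small payload. A hypothetical chunk-completion event, rendered as YAML (the fields mirror the timestamp, type, table, and detail columns the dashboard's event stream displays; the values are invented for illustration):

type: CHUNK_COMPLETE
timestamp: 2026-04-15T23:41:07Z
table: account
detail:
  chunk: 612
  rows: 50000
  duration_ms: 4030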

18.2 Core Metrics

Full Load Metrics

pulsaride_full_load_rows_total{table}          — total rows loaded (counter)
pulsaride_full_load_rows_per_second{table}     — current load rate (gauge)
pulsaride_full_load_chunks_complete{table}     — chunks completed (counter)
pulsaride_full_load_chunks_total{table}        — total chunks in plan (gauge)
pulsaride_full_load_pct_complete{table}        — percent complete (gauge, 0–100)
pulsaride_full_load_eta_seconds{table}         — estimated time to completion (gauge)
pulsaride_full_load_rejected_rows{table}       — rows rejected (counter)
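
The rate and ETA gauges are computed by the pipeline itself, but teams scraping these metrics into Prometheus can cross-check them from the raw counters. A sketch as Prometheus recording rules, assuming the metric names above and roughly steady chunk durations (the rule names are illustrative):

groups:
  - name: pulsaride-full-load-derived
    rules:
      # Load rate over the last five minutes, derived from the row counter.
      - record: pulsaride:full_load_rate5m
        expr: rate(pulsaride_full_load_rows_total[5m])
      # ETA derived from chunk counters; goes to +Inf while the rate is zero.
      - record: pulsaride:full_load_eta_from_chunks
        expr: >
          (pulsaride_full_load_chunks_total - pulsaride_full_load_chunks_complete)
          / rate(pulsaride_full_load_chunks_complete[5m])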

CDC Metrics

pulsaride_cdc_lag_seconds{table}               — current CDC lag in seconds (gauge)
pulsaride_cdc_events_applied{table}            — events applied (counter)
pulsaride_cdc_events_skipped{table}            — events skipped as idempotent replays (counter)
pulsaride_cdc_events_pending{table}            — events waiting for FK resolution (gauge)
pulsaride_cdc_apply_rate{table}                — events applied per second (gauge)
pulsaride_cdc_scn_current                      — current Oracle SCN (gauge)
pulsaride_cdc_scn_applied                      — last applied Oracle SCN (gauge)
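
The two SCN gauges are most useful as a difference: a widening gap between the current and last-applied SCN means the applier is falling behind even before wall-clock lag visibly moves. In the same recording-rule style as above:

groups:
  - name: pulsaride-cdc-derived
    rules:
      # SCN backlog: how far the applier trails the source, in SCN units.
      - record: pulsaride:cdc_scn_backlog
        expr: pulsaride_cdc_scn_current - pulsaride_cdc_scn_applied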

Reconciliation Metrics

pulsaride_recon_discrepancies{table, class}    — discrepancy count by classification
pulsaride_recon_tables_passing                 — count of tables passing all modes
pulsaride_recon_tables_failing                 — count of tables with FAIL status
pulsaride_recon_last_run_duration_seconds      — duration of last reconciliation cycle
pulsaride_recon_false_positive_rate            — percent of discrepancies that are false positives

Product note: All Pulsaride Transform metrics are exposed via a Prometheus-compatible /pulsaride/metrics endpoint. The pulsaride-monitoring Spring Boot auto-configuration registers this endpoint automatically when the dependency is included in the Spring Boot application. Metrics can be scraped by Prometheus and visualized in Grafana, or consumed directly via the built-in dashboard at /pulsaride/dashboard.
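
Pointing Prometheus at the endpoint needs nothing beyond a standard scrape job. A minimal sketch (the host and port are illustrative; adjust to your deployment):

scrape_configs:
  - job_name: 'pulsaride-transform'
    metrics_path: '/pulsaride/metrics'
    static_configs:
      - targets: ['migration-host:8080']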

18.3 The Monitoring Dashboard

The built-in monitoring dashboard at /pulsaride/dashboard is designed for operators who do not want to set up a separate metrics stack. It provides a real-time view of the migration state with no configuration beyond including the pulsaride-monitoring dependency.

Dashboard sections:

Pipeline Status Banner

At the top: overall pipeline status (FULL_LOAD / CDC_CATCHUP / PARALLEL_RUNNING / CUTOVER_READY) and the current phase duration.

Full Load Progress Table (visible during full load)

Table          Progress              Rate          ETA        Status
account        84.2% (714K/847K)     12,400 r/s    11m 40s    RUNNING
transaction    93.1% (44.7M/48M)     9,800 r/s     5m 20s     RUNNING
order_header   100%                  —             Done       COMPLETE

Color coding: green (complete), blue (running), yellow (slow — < 50% of expected rate), red (failed or stalled).

CDC Lag Chart

A 24-hour rolling time-series chart of CDC lag per table. Shows the catch-up trajectory. Threshold lines at 5 minutes (warning) and 30 seconds (target). A table that crosses the 30-second threshold and stays below it for 4 hours is eligible for reconciliation in parallel-running mode.
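
For teams on Prometheus, this eligibility condition translates directly into PromQL, so it can be computed rather than eyeballed from the chart. A sketch (the rule name is illustrative):

groups:
  - name: pulsaride-recon-eligibility
    rules:
      # 1 for tables whose lag stayed under 30s for the entire 4h window, else 0.
      - record: pulsaride:recon_eligible
        expr: max_over_time(pulsaride_cdc_lag_seconds[4h]) < bool 30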

Reconciliation Summary (visible during parallel running)

Table          Mode     Status   Last Run   Discrepancies
account        count    GREEN    2m ago     0
ledger_entry   hash     GREEN    14m ago    0
order_header   count    YELLOW   2m ago     6 (CDC lag)
customer       sample   GREEN    8m ago     0

Event Stream

A scrollable event log showing the last 100 pipeline events, each with timestamp, type, table, and detail. Operators can filter by table or event type.

18.4 Alert Design

Alerts must be actionable. An alert that fires but requires no immediate action trains operators to ignore alerts. Every alert should answer three questions:

  1. What is happening? (The symptom)
  2. Why does it matter? (The impact)
  3. What should I do? (The runbook step)

Alert: CDC Lag High

Alert: CDC_LAG_HIGH
Table: transaction
Severity: WARNING (lag > 5 minutes) / CRITICAL (lag > 30 minutes)
Message: CDC lag for 'transaction' is 18m 32s. 
         At current apply rate, will reach real-time in ~2 hours.
Runbook: https://docs.pulsaride.com/runbooks/cdc-lag-high
Action: Check CDC worker count (current: 4). 
        Consider increasing to 8 if PostgreSQL write throughput allows.
        Check for Oracle batch jobs running (may cause lag spike).

Alert: Reconciliation Fail

Alert: RECON_FAIL
Table: ledger_entry
Severity: CRITICAL
Message: ledger_entry hash reconciliation failed. 3 rows with VALUE_DRIFT 
         in segment 73. Financial table — blocking for cutover.
Runbook: https://docs.pulsaride.com/runbooks/recon-fail
Action: Review discrepancy detail in dashboard. 
        For VALUE_DRIFT: fetch Oracle values and generate compensating write.
        Compensating write template: 
        UPDATE ledger_entry SET amount = :oracle_amount 
        WHERE ledger_id = :ledger_id AND amount != :oracle_amount;

Alert: Full Load Stalled

Alert: FULL_LOAD_STALLED
Table: customer_document
Severity: CRITICAL
Message: customer_document has not progressed in 12 minutes.
         Last chunk: 4,301/9,600. Expected rate: 8,200 r/s. Current rate: 0.
Runbook: https://docs.pulsaride.com/runbooks/full-load-stalled
Action: Check Oracle undo availability (ORA-01555 "snapshot too old" risk).
        Check PostgreSQL connection status.
        Check staging ledger for IN_PROGRESS chunks.

Product note: pulsaride-monitoring includes a webhook-based notification system that sends alerts to Slack, PagerDuty, or any HTTP endpoint. Alert routing is configured in the monitoring YAML:

monitoring:
  alerts:
    - type: CDC_LAG_HIGH
      severity: WARNING
      webhook: https://hooks.slack.com/services/T00/B00/xyz
      channel: '#migration-ops'
    - type: RECON_FAIL
      severity: CRITICAL
      webhook: https://api.pagerduty.com/events/v2
      routingKey: '${PAGERDUTY_ROUTING_KEY}'
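
Teams that already route alerts through Prometheus Alertmanager can express the same conditions as standard alerting rules instead of, or alongside, the built-in webhooks. A sketch of the CDC_LAG_HIGH warning tier, using the 5-minute threshold from above (the for: debounce and labels are illustrative):

groups:
  - name: pulsaride-alerts
    rules:
      - alert: CDC_LAG_HIGH
        expr: pulsaride_cdc_lag_seconds > 300
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: 'CDC lag for table {{ $labels.table }} has exceeded 5 minutes'
          runbook_url: 'https://docs.pulsaride.com/runbooks/cdc-lag-high'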

18.5 Runbook Integration

Runbooks are the documents that tell an operator what to do when an alert fires. They should be specific, actionable, and kept current (no runbook that says "contact John" when John left six months ago).

The Pulsaride Transform runbook structure:

Runbook: CDC Lag High
---
Trigger: pulsaride_cdc_lag_seconds > 300 (5 minutes)

Step 1: Check Oracle batch activity
  Query Oracle:
    SELECT sid, serial#, username, program, sql_text, last_call_et
    FROM v$session s JOIN v$sql q ON s.sql_id = q.sql_id
    WHERE s.status = 'ACTIVE' AND s.username = 'MYAPP'
    AND last_call_et > 60;
  If a long-running batch is visible: wait for it to complete (lag will spike, then recover).
  If no batch: proceed to Step 2.

Step 2: Check CDC worker health
  In monitoring dashboard → Event Stream → filter by 'CDC_WORKER'
  If workers are running but slow: increase worker count in config.
  If workers are stopped: restart CDC component.

Step 3: Check PostgreSQL write latency
  Query PostgreSQL:
    SELECT pid, query, wait_event_type, wait_event,
           pg_blocking_pids(pid) AS blocked_by
    FROM pg_stat_activity
    WHERE state = 'active' AND query NOT LIKE 'SELECT%';
  If writes are blocked: identify the blocking session via blocked_by and,
  if safe, terminate it with pg_terminate_backend(pid).

Step 4: If lag > 60 minutes and not recovering:
  Escalate to migration lead.
  Evaluate whether cutover date must be adjusted.

18.6 The Operator Shift Model

A migration that runs for 3–7 days requires operators in shifts. The handoff between shifts must preserve context: what happened during the last shift, which alerts fired and how they were resolved, what the current status is, and what actions are planned for the next shift.

The monitoring dashboard's shift summary report (exported at the end of each shift):

SHIFT SUMMARY — April 15, 2026
Shift: 22:00–06:00 UTC (Operator: lsalihi)

FULL LOAD: Complete (100% as of 23:42 UTC)

CDC STATUS at 06:00:
  Average lag: 4m 12s
  Tables at < 30s lag: 298/340
  Tables still catching up: 42

ALERTS FIRED:
  [23:12] CDC_LAG_HIGH: transaction (lag 28m) — batch job ran, resolved at 23:41
  [03:12] RECON_FAIL: ledger_entry — VALUE_DRIFT 3 rows — compensating write applied at 03:15

RECONCILIATION SUMMARY at 06:00:
  GREEN: 338 tables
  YELLOW: 2 tables (order_header, order_line — CDC lag delta, expected)
  RED: 0 tables

ACTIONS FOR NEXT SHIFT:
  1. Monitor CDC catch-up for 42 remaining tables — expected by ~08:30
  2. Run hash reconciliation on ledger_entry after CDC settles
  3. Check SLA gate 'financial_tables_clean' — enters 24h window at 06:00 if clean

18.7 Example: The 3 AM Alert and the 8-Minute Resolution

At 03:12 AM, the PagerDuty alert fires: RECON_FAIL — ledger_entry — VALUE_DRIFT.

The on-call operator opens the monitoring dashboard from their phone. The discrepancy detail shows 3 rows with a $0.01 difference in amount.

The operator follows the runbook:

  1. Classify discrepancy: VALUE_DRIFT (not NORMALIZATION_RULE — no normalization applies to amount).

  2. Fetch Oracle values: Run the provided SQL template against Oracle to get the authoritative values.

  3. Generate compensating write: The dashboard's "Generate Fix" button creates:

    UPDATE ledger_entry SET amount = 1500.0000 WHERE ledger_id = 9201441 AND amount != 1500.0000;
    UPDATE ledger_entry SET amount = 750.0000  WHERE ledger_id = 9201442 AND amount != 750.0000;
    UPDATE ledger_entry SET amount = 2250.0000 WHERE ledger_id = 9201889 AND amount != 2250.0000;
    
  4. Apply the compensating write: Approved in the dashboard. Applies in 0.3 seconds.

  5. Re-run reconciliation: Triggered manually from the dashboard, scoped to segment 73. Duration: 3 minutes. Result: PASS.

  6. Close the alert: Mark as resolved. Alert closed at 03:20 AM.

Total time from alert to resolution: 8 minutes. The on-call operator did not need to SSH into any server, write any SQL manually, or understand the migration pipeline internals. The observability system surfaced the exact problem and the exact fix. The operator's role was review and approval.

This is the operational model that pulsaride-monitoring enables: operators who act on information, not operators who gather it.
