Chapter 10

CDC Design for Migration

Change Data Capture keeps the PostgreSQL target synchronized with the Oracle source from the moment the full load snapshot was taken until the moment of cutover. During this window — which may last days or weeks as CDC catches up and runs in parallel — every INSERT, UPDATE, and DELETE on Oracle must be replicated to PostgreSQL in order, without gaps, and with enough operational visibility to detect when it is falling behind.

This chapter covers the Oracle CDC mechanisms available, how SCN anchoring connects the full load to CDC, event deduplication, lag tracking, and the failure modes specific to CDC-based migration pipelines.

10.1 Oracle CDC Mechanisms

Three mechanisms exist for capturing Oracle change data:

Oracle LogMiner

LogMiner is Oracle's native log mining facility. It reads the redo log (and archived logs) and reconstructs the SQL that produced each change. Available in all Oracle editions without additional licensing.

-- LogMiner: add log files and start analysis
EXECUTE DBMS_LOGMNR.ADD_LOGFILE(LOGFILENAME => '/oracle/redo/redo01.log');
EXECUTE DBMS_LOGMNR.START_LOGMNR(
  STARTSCN => :anchor_scn,
  OPTIONS  => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG +
              DBMS_LOGMNR.COMMITTED_DATA_ONLY +
              DBMS_LOGMNR.SKIP_CORRUPTION
);

-- Read change events
SELECT scn, operation, seg_owner, seg_name, sql_redo, sql_undo
FROM v$logmnr_contents
WHERE seg_owner = 'MYAPP'
AND operation IN ('INSERT', 'UPDATE', 'DELETE');

LogMiner produces SQL_REDO strings — the SQL that would reproduce the change. For INSERT: INSERT INTO account .... For UPDATE: UPDATE account SET ... WHERE .... For DELETE: DELETE FROM account WHERE ....
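A consumer of V$LOGMNR_CONTENTS typically normalizes rows into ordered change events before applying them. The following is a minimal sketch of that step; the ChangeEvent type and routing logic are illustrative, not part of any Oracle API, and the row shape mirrors the SELECT above:

```python
# Sketch: normalizing rows read from V$LOGMNR_CONTENTS into change events.
# Row shape mirrors the SELECT above; names here are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class ChangeEvent:
    scn: int          # position in the redo stream; orders events globally
    operation: str    # 'INSERT' | 'UPDATE' | 'DELETE'
    table: str        # qualified name, e.g. 'MYAPP.ACCOUNT'
    sql_redo: str     # SQL that reproduces the change on a compatible target

def to_events(rows, schema="MYAPP"):
    """Filter and normalize LogMiner rows, preserving SCN order."""
    events = []
    for scn, operation, seg_owner, seg_name, sql_redo, _sql_undo in rows:
        # Skip other schemas and non-DML operations (COMMIT, DDL, internal SQL)
        if seg_owner != schema or operation not in ("INSERT", "UPDATE", "DELETE"):
            continue
        events.append(ChangeEvent(scn, operation, f"{seg_owner}.{seg_name}", sql_redo))
    return sorted(events, key=lambda e: e.scn)
```

Sorting by SCN matters: downstream deduplication and lag tracking both assume events arrive in commit order.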

Limitations:

  • COMMITTED_DATA_ONLY means LogMiner waits until a transaction commits before delivering its events, which introduces lag equal to the commit latency of long-running transactions.
  • LogMiner is slow for high-volume schemas — it reads the log sequentially and requires the Oracle data dictionary to decode column names.
  • V$LOGMNR_CONTENTS is a session-level view; only one LogMiner session can run per Oracle session.

Debezium Oracle Connector

Debezium is an open-source CDC platform that uses LogMiner internally but wraps it in a reliable, resumable pipeline with Kafka as the event bus. The Debezium Oracle connector reads redo logs via LogMiner, converts events to Avro or JSON, and publishes them to Kafka topics (one topic per table).

{
  "op": "u",
  "before": { "account_id": 847331, "status": "ACTIVE" },
  "after":  { "account_id": 847331, "status": "SUSPENDED" },
  "source": {
    "scn": "24891044",
    "table": "MYAPP.ACCOUNT",
    "ts_ms": 1714176000000
  }
}

Debezium provides:

  • Resumable state (Kafka offset = SCN position)
  • Per-table parallelism via Kafka partitioning
  • Schema registry integration (Avro schema evolution tracking)
  • Out-of-the-box lag monitoring via Kafka consumer group lag
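On the consuming side, the Debezium envelope maps directly to target statements. A minimal sketch, assuming the documented op codes ("c" create, "r" snapshot read, "u" update, "d" delete); the SQL templates and the pk parameter are illustrative, not part of Debezium:

```python
# Sketch: translating a Debezium change envelope into a PostgreSQL statement.
# Envelope fields ("op", "before", "after", "source") follow Debezium's event
# format; the SQL templates and pk handling here are illustrative.
def to_statement(event, pk="account_id"):
    op, after, before = event["op"], event.get("after"), event.get("before")
    table = event["source"]["table"].split(".")[-1].lower()
    if op in ("c", "r", "u"):          # create / snapshot read / update -> upsert
        cols = list(after)
        sql = (f"INSERT INTO {table} ({', '.join(cols)}) "
               f"VALUES ({', '.join('%(' + c + ')s' for c in cols)}) "
               f"ON CONFLICT ({pk}) DO UPDATE SET "
               + ", ".join(f"{c} = EXCLUDED.{c}" for c in cols if c != pk))
        return sql, after
    if op == "d":                      # delete -> remove row by primary key
        return f"DELETE FROM {table} WHERE {pk} = %({pk})s", before
    raise ValueError(f"unknown op {op!r}")
```

Treating inserts and updates uniformly as upserts is what makes the apply path idempotent under at-least-once delivery (Section 10.3).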

Limitations:

  • Requires Kafka and a Kafka Connect cluster
  • Inherits LogMiner's constraints (Debezium reads changes through LogMiner under the hood, so supplemental logging must be enabled on the source)
  • Higher operational complexity than direct LogMiner

Oracle GoldenGate

GoldenGate is Oracle's commercial CDC product. It provides the most reliable, highest-throughput Oracle CDC available and supports active-active replication (bidirectional). However:

  • Requires separate licensing (substantial cost)
  • Requires Oracle DBA expertise to operate
  • Overkill for most Oracle → PostgreSQL migrations where GoldenGate would be decommissioned post-cutover

Product note: Pulsaride Transform supports LogMiner (direct) and Debezium (via Kafka) as CDC sources. GoldenGate integration is available as a custom connector option. For most migrations, the choice is:

  • Direct LogMiner for simpler deployments without existing Kafka infrastructure
  • Debezium/Kafka for high-volume schemas (>10,000 changes/second) or when Kafka is already in use

10.2 Supplemental Logging

Oracle CDC requires supplemental logging to be enabled. Without it, the redo log records only the changed values, not the full before-image or the primary key values needed to identify which row changed.

Minimum requirement: supplemental log for primary keys

-- Enable minimal supplemental logging at the database level (prerequisite)
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;

-- Enable primary key logging for all tables at once
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;

-- Or per-table (more granular, less redo volume)
ALTER TABLE account ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;
ALTER TABLE order_header ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;

For CDC consumers that need before-images (e.g., Debezium with before field), all-column supplemental logging is required:

ALTER TABLE account ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

All-column supplemental logging increases redo volume significantly — every UPDATE writes the entire row before-image to redo, not just the changed columns. For wide tables (>50 columns), this can double redo volume.
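The overhead is easy to bound with back-of-envelope arithmetic. A sketch with illustrative figures (the rates and row widths are assumptions, not measurements from the text):

```python
# Rough estimate of extra redo from all-column supplemental logging.
# Minimal logging records only the changed columns; (ALL) COLUMNS adds the
# full row before-image to every UPDATE. All figures are illustrative.
def extra_redo_bytes_per_hour(updates_per_sec, avg_row_bytes, changed_bytes):
    baseline = updates_per_sec * changed_bytes * 3600
    with_all = updates_per_sec * (changed_bytes + avg_row_bytes) * 3600
    return with_all - baseline

# 500 updates/s on a 2 KB-wide table where a typical UPDATE touches 100 bytes:
extra = extra_redo_bytes_per_hour(500, 2048, 100)
# ~3.7 GB/hour of additional redo just for the before-images
```

Running this estimate per table before enabling (ALL) COLUMNS logging helps decide whether archive log retention and I/O headroom need adjusting first.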

Product note: Pulsaride Transform checks supplemental logging status during assessment using V$DATABASE (database level) and DBA_LOG_GROUPS (per table). Tables without primary key supplemental logging are flagged as SUPPLEMENTAL_LOGGING_MISSING and the DDL to enable it is included in the assessment report. The migration cannot start until supplemental logging is confirmed for all tables in scope.

10.3 SCN Anchoring: Connecting Full Load to CDC

The anchor SCN is the point at which the full load snapshot was taken (Chapter 9). CDC starts from this SCN. Because the loaded rows may already reflect some changes committed shortly after the anchor was recorded, replaying the redo stream from the anchor can re-deliver changes that are already present in the target. Starting CDC exactly at the anchor SCN, combined with idempotent apply, ensures no change is missed and none is applied twice.

Timeline:
  t=0: anchor_scn = 24,891,044. Full load begins.
  t=4h: Full load completes. 847,000 rows in PostgreSQL.
  t=4h: CDC starts from SCN 24,891,044.
  t=4h+X: CDC catches up to real-time. Gap = 4 hours of changes.
  t=4h+X: CDC is running real-time. System is in "parallel running" state.

During parallel running, Oracle is still the source of truth. CDC keeps PostgreSQL synchronized. Reconciliation runs continuously (Chapter 15) to verify that synchronization is accurate.

Duplicate Event Handling

CDC events are delivered at-least-once. Network interruptions, LogMiner restarts, or pipeline restarts may re-deliver events that were already applied. The staging layer must deduplicate:

-- Staging deduplication: apply only if the event SCN is newer than the last applied
UPDATE account
SET status           = :new_status,
    updated_at       = :updated_at,
    last_applied_scn = :event_scn
WHERE account_id = :account_id
AND (last_applied_scn IS NULL OR last_applied_scn < :event_scn);

This requires a last_applied_scn column in each target table — or a separate tracking table. The tracking table approach avoids schema changes on the target tables:

-- Staging SCN tracking table
CREATE TABLE _pulsaride_row_scn (
  table_name  TEXT NOT NULL,
  pk_value    TEXT NOT NULL,
  last_scn    NUMERIC NOT NULL,
  PRIMARY KEY (table_name, pk_value)
);

For high-volume tables, per-row SCN tracking is expensive. Pulsaride Transform uses a segment-level SCN tracker: for each commit in the CDC stream, record the maximum SCN applied per table. At restart, re-read all events from that SCN forward and apply idempotently using UPSERT:

-- Idempotent CDC apply via UPSERT
INSERT INTO account (account_id, user_name, email, status, updated_at)
VALUES (:account_id, :user_name, :email, :status, :updated_at)
ON CONFLICT (account_id) DO UPDATE SET
  user_name  = EXCLUDED.user_name,
  email      = EXCLUDED.email,
  status     = EXCLUDED.status,
  updated_at = EXCLUDED.updated_at
WHERE account.updated_at <= EXCLUDED.updated_at;

The WHERE account.updated_at <= EXCLUDED.updated_at clause ensures that a re-delivered older event does not overwrite a newer one. This requires updated_at to be a reliable event timestamp — which for Oracle CDC comes from the SCN-to-timestamp mapping.
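The segment-level tracker described above can be sketched as follows. The class name and in-memory state are illustrative (the real checkpoint would be persisted); the invariant is the point: record the maximum applied SCN per table at each commit, and resume from the minimum across tables:

```python
# Sketch of a segment-level SCN checkpoint: one max-applied SCN per table,
# updated at each CDC commit. On restart, replay from the minimum checkpoint
# and rely on the idempotent UPSERT above to absorb re-delivered events.
class ScnCheckpoint:
    def __init__(self):
        self.max_scn = {}  # table name -> highest SCN applied and committed

    def commit(self, applied):
        """Record a committed batch: {table: max SCN applied in this batch}."""
        for table, scn in applied.items():
            self.max_scn[table] = max(self.max_scn.get(table, 0), scn)

    def restart_scn(self):
        """SCN to resume reading from: the lowest checkpoint across tables.
        Tables with higher checkpoints will see duplicate events, which the
        idempotent apply filters out."""
        return min(self.max_scn.values(), default=0)
```

Resuming from the minimum trades a bounded amount of duplicate work at restart for O(tables) checkpoint state instead of O(rows).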

10.4 Lag Tracking

CDC lag is the delay between when a change is committed in Oracle and when it is applied in PostgreSQL. Lag accumulates when the Oracle change rate exceeds the CDC apply rate.

During the catch-up phase (from the anchor SCN to real-time), lag starts high (equal to the full load duration) and decreases as CDC processes the backlog. A full load that took 4 hours means CDC starts with 4 hours of lag.

Lag Metrics

-- Oracle: current SCN
SELECT CURRENT_SCN FROM V$DATABASE;
-- Returns: 25,104,221

-- CDC pipeline: last applied SCN
SELECT MAX(last_applied_scn) FROM _pulsaride_cdc_checkpoint;
-- Returns: 24,891,044

-- SCN gap
-- 25,104,221 - 24,891,044 = 213,177 SCNs

-- Convert the SCN gap to approximate time using timestamp correlation
SELECT SYSTIMESTAMP - SCN_TO_TIMESTAMP(24891044) FROM DUAL;
-- If the last applied SCN maps to a timestamp ~30 minutes old, lag ≈ 30 minutes

Product note: pulsaride-monitoring exposes a cdc_lag_seconds metric updated every 10 seconds. The metric is calculated by converting the SCN gap to a time delta using Oracle's SCN_TO_TIMESTAMP function. The dashboard shows a lag trend graph. An alert fires when cdc_lag_seconds > cdc_lag_alert_threshold (configurable, default: 300 seconds = 5 minutes).
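The metric computation reduces to a timestamp subtraction once the SCN-to-timestamp mapping is available. A minimal sketch; the scn_to_timestamp callable stands in for a query against Oracle's SCN_TO_TIMESTAMP function:

```python
# Sketch of the cdc_lag_seconds computation: map the last applied SCN to its
# source-side commit timestamp and subtract from "now". The scn_to_timestamp
# callable is a placeholder for a query against Oracle.
from datetime import datetime, timezone

def cdc_lag_seconds(last_applied_scn, scn_to_timestamp, now=None):
    """scn_to_timestamp: callable mapping an SCN to its commit-time datetime."""
    now = now or datetime.now(timezone.utc)
    return (now - scn_to_timestamp(last_applied_scn)).total_seconds()
```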

Lag Growth Detection

Lag that grows continuously is a sign of CDC apply rate below Oracle commit rate. Causes:

  1. Oracle is generating more changes than the CDC pipeline can apply. Increase CDC worker count or PostgreSQL write concurrency.
  2. Large transactions in Oracle: A single large batch job that updates 10 million rows produces 10 million CDC events. LogMiner's COMMITTED_DATA_ONLY flag means none of these events are delivered until the batch commits — then they arrive in a burst.
  3. PostgreSQL write contention: CDC applies cause lock contention on frequently-updated tables, slowing apply rate.
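Distinguishing sustained growth from a one-off burst matters for alerting. One simple approach (illustrative thresholds, not the product's actual algorithm) is to require that lag rises across most of a sampling window:

```python
# Sketch: flag sustained lag growth from periodic samples. A single spike
# (e.g. a large batch commit arriving as a burst) should not alert; lag
# that rises across most of the window should. Thresholds are illustrative.
def lag_growing(samples, min_rising_fraction=0.8):
    """samples: lag_seconds readings in time order, e.g. one per 10 s."""
    if len(samples) < 2:
        return False
    rising = sum(1 for a, b in zip(samples, samples[1:]) if b > a)
    return rising / (len(samples) - 1) >= min_rising_fraction
```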

10.5 CDC Gap Detection

A gap in the CDC stream is a range of SCNs for which no events are delivered, despite changes having occurred in that range. Gaps can arise from:

  • Archived log deletion: Oracle archives redo logs periodically. If an archived log is deleted before the CDC consumer reads it, the events in that log are lost.
  • LogMiner corruption skip: SKIP_CORRUPTION in LogMiner silently skips corrupted log blocks. Events in corrupted blocks are lost.
  • Out-of-order log switch: In some Oracle configurations, log switches during heavy I/O can cause brief reordering that LogMiner misses.

Gap detection compares the expected SCN range against the observed event stream:

-- Expected: events from SCN 24,891,044 to current
-- Observed: events from SCN 24,891,044 to 24,944,882, then jump to 25,011,100

-- Gap: SCNs 24,944,883 to 25,011,099 — no events observed
-- This range spans approximately 66,217 SCNs
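The comparison above can be sketched as a scan over the contiguous SCN ranges the pipeline has actually observed; the function and its input shape are illustrative:

```python
# Sketch: find gaps in the applied SCN stream. Input is the list of
# contiguous SCN ranges actually observed, in order; output is the
# missing ranges between them.
def find_gaps(observed_ranges):
    """observed_ranges: [(first_scn, last_scn), ...] sorted by first_scn."""
    gaps = []
    for (_, prev_last), (next_first, _) in zip(observed_ranges, observed_ranges[1:]):
        if next_first > prev_last + 1:
            gaps.append((prev_last + 1, next_first - 1))
    return gaps
```

Run against the example stream above, this yields the single gap (24944883, 25011099).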

Closing a gap requires:

  1. Determining whether the gap contains relevant changes (changes to tables in scope)
  2. If yes: re-reading the Oracle source for those rows to find their current state and reconcile
  3. If no: marking the gap as known-empty and proceeding

Product note: Pulsaride Transform maintains a _pulsaride_scn_gaps table that records all detected SCN gaps. For each gap, the reconciliation step runs a targeted row-level comparison between Oracle and PostgreSQL for all tables that had activity in that SCN range (determined from LogMiner's table-level statistics). If discrepancies are found, they are applied as compensating writes.

10.6 Example: Catching Up 42 Hours of Lag

A migration project starts the full load on Sunday morning. The load completes Tuesday at 2 AM (42 hours). CDC starts from the Sunday morning SCN. By Tuesday at 2 AM, CDC has 42 hours of lag.

CDC catch-up rate: the pipeline can apply 8,000 changes/second. Oracle generates 3,000 changes/second on average.

Net catch-up rate: 8,000 - 3,000 = 5,000 changes per second of backlog drained.

42 hours of lag at 3,000 changes/second = 42 × 3,600 × 3,000 = 453,600,000 events.

Time to catch up: 453,600,000 / 5,000 = 90,720 seconds ≈ 25.2 hours.
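The arithmetic above generalizes to a small function (assuming both rates stay constant over the window):

```python
# The catch-up arithmetic as a function. Rates are events/second; the
# result assumes both rates stay constant over the catch-up window.
def catchup_seconds(lag_hours, source_rate, apply_rate):
    backlog = lag_hours * 3600 * source_rate       # events accumulated so far
    net_drain = apply_rate - source_rate           # backlog drained per second
    if net_drain <= 0:
        raise ValueError("apply rate must exceed source rate to catch up")
    return backlog / net_drain

catchup_seconds(42, 3000, 8000)  # 90720.0 seconds ≈ 25.2 hours
```

The guard clause captures the hard constraint from Section 10.4: if the apply rate does not exceed the source change rate, the pipeline never catches up.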

CDC reaches real-time at approximately 3:15 AM on Wednesday — roughly 25 hours after the catch-up began.

During Wednesday:

  • CDC lag drops to < 30 seconds
  • The reconciliation dashboard shows green for all 500 tables
  • Cutover window is scheduled for Saturday night (3 days of parallel running to build operator confidence)

The lag trend chart (from pulsaride-monitoring) shows a smooth decrease from 42 hours to < 30 seconds over the 25-hour catch-up window, with a brief plateau on Tuesday afternoon when a batch job committed 180,000 changes that arrived in a single burst (the extra backlog was absorbed within a couple of minutes).

This timeline is the standard migration trajectory: full load takes 1–3 days, CDC catch-up takes 1–2 days, parallel running 1–7 days before cutover.
