Chapter 13

Transactional Safety and Rollback

The migration pipeline writes to PostgreSQL continuously from the moment the full load starts. Every write is an opportunity for inconsistency: a partial write that leaves the target in a state that neither fully represents Oracle nor represents a clean PostgreSQL baseline. Transactional safety is the set of design decisions that ensure every write is either completely applied or completely absent — and that the pipeline can return to a known-good state if anything goes wrong.

Rollback is what you do when something goes wrong at the worst possible moment — during cutover. The ability to roll back a migration in progress, cleanly, without data loss, without extended downtime, is not a feature. It is the contract with the business that makes the cutover a controlled operation rather than a gamble.

13.1 The Atomicity Requirements of Migration

A production migration involves three distinct atomicity domains:

Domain 1: Individual row writes. Each row loaded during the full load or applied during CDC must either be fully written or not written at all. A row that is half-written (e.g., a crash during a multi-column UPDATE) is corrupted data.

This is provided by PostgreSQL's transaction semantics: a row write that does not commit (due to crash or rollback) is not visible to readers. The pipeline achieves this by writing rows within transactions that are committed after each chunk or CDC batch.

Domain 2: Chunk-level atomicity. The full load processes tables in chunks. A chunk must either be fully loaded (all rows in the chunk are written and committed) or not loaded at all (so it can be safely retried). Partial chunks are detected on restart by comparing the loaded row count against the expected row count and re-loading if they differ.
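A delete-then-insert chunk load, committed as one transaction, makes a chunk reload idempotent: a retry after a crash first removes any leftover rows from the earlier partial attempt. A minimal sketch of this pattern, using SQLite as a stand-in for PostgreSQL (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY, user_name TEXT)")

def load_chunk(conn, lo, hi, rows):
    """Load one chunk atomically: purge the key range, insert, commit.

    Deleting first makes a retry idempotent -- rows left behind by a
    partial earlier attempt are removed before the chunk is reloaded.
    """
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM account WHERE account_id BETWEEN ? AND ?", (lo, hi))
        conn.executemany("INSERT INTO account (account_id, user_name) VALUES (?, ?)", rows)

chunk = [(1001, "Alice"), (1002, "Bob"), (1003, "Carol")]
load_chunk(conn, 1001, 1003, chunk)
load_chunk(conn, 1001, 1003, chunk)  # retrying the same chunk is safe
count = conn.execute("SELECT COUNT(*) FROM account").fetchone()[0]
```

Because the delete and the inserts commit together, a crash at any point leaves either the old state or the fully loaded chunk, never a mixture.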

Domain 3: Cutover atomicity. The cutover — the moment when the application is switched from Oracle to PostgreSQL — must be atomic from the application's perspective. Requests must not be partially served by Oracle and partially by PostgreSQL. The cutover must be a single, hard switch.
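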

13.2 Savepoints and Long Transactions

PostgreSQL supports savepoints within a transaction. Savepoints allow partial rollback within a multi-statement transaction:

BEGIN;
  INSERT INTO account (account_id, user_name) VALUES (1001, 'Alice');
  SAVEPOINT sp_account_1001;
  
  INSERT INTO account_profile (profile_id, account_id) VALUES (5001, 1002);
  -- This insert fails: FK violation (no account row with account_id 1002)

  ROLLBACK TO SAVEPOINT sp_account_1001;
  -- account row is preserved; the failed profile insert is rolled back

  -- Retry with the correct account_id
  INSERT INTO account_profile (profile_id, account_id) VALUES (5001, 1001);
COMMIT;

The migration pipeline uses savepoints for CDC event batches where individual events may fail without requiring the entire batch to be rolled back:

BEGIN
  Apply event 1 (INSERT account_id=9901) → SAVEPOINT s1
  Apply event 2 (UPDATE account_id=9901) → SAVEPOINT s2
  Apply event 3 (INSERT order_id=4401, fk to account_id=9902 that doesn't exist yet)
    → ROLLBACK TO s2
    → Queue event 3 for retry (account_id=9902 will arrive later)
  Apply event 4 (INSERT account_id=9902) → SAVEPOINT s3
  Retry event 3 (INSERT order_id=4401, fk to account_id=9902) → SAVEPOINT s4
COMMIT

Product note: Pulsaride Transform uses savepoints for FK-violation retries during CDC. When a CDC event fails due to FK violation (the referenced row has not been applied yet), the event is placed in a pending queue. After the current batch completes, the queue is re-tried. If the referenced row arrives in the same transaction batch (just out of order), the retry succeeds. If the referenced row is in a future transaction, the event waits for the next batch that includes it. Pending events that wait more than 60 seconds are escalated to ORPHANED_EVENT findings.
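The batch-apply trace above can be sketched in code. SQLite also supports savepoints and foreign keys, so it serves as a stand-in for PostgreSQL here; the sketch takes a savepoint before each event rather than after (which is equivalent), and all table names and events are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # manage transactions explicitly
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY,"
    " account_id INTEGER NOT NULL REFERENCES account)"
)

# CDC events, deliberately out of order: the order row references an
# account that arrives later in the same batch.
events = [
    "INSERT INTO account (account_id) VALUES (9901)",
    "INSERT INTO orders (order_id, account_id) VALUES (4401, 9902)",
    "INSERT INTO account (account_id) VALUES (9902)",
]

def try_apply(conn, sql):
    """Apply one event under a savepoint; on FK failure, undo only that event."""
    conn.execute("SAVEPOINT ev")
    try:
        conn.execute(sql)
        conn.execute("RELEASE ev")
        return True
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK TO ev")  # rolls back just this event
        conn.execute("RELEASE ev")
        return False

def apply_batch(conn, events):
    conn.execute("BEGIN")
    pending = [sql for sql in events if not try_apply(conn, sql)]
    # Retry: the referenced rows may have arrived later in the same batch.
    orphaned = [sql for sql in pending if not try_apply(conn, sql)]
    conn.execute("COMMIT")
    return orphaned  # still-failing events wait for a future batch

orphaned = apply_batch(conn, events)
```

Event 2 fails on the first pass (account 9902 has not been applied yet), is rolled back to its savepoint without disturbing event 1, and succeeds on the retry after event 3 inserts the account row.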

13.3 Compensating Writes

A compensating write is a write that undoes the effect of a previous write. Compensating writes are used when:

  1. A CDC event is applied incorrectly (wrong value, applied out of order)
  2. A full load chunk loaded data that conflicts with a post-load constraint
  3. A reconciliation run detects a discrepancy that requires correction

Compensating writes must be idempotent: applying the same compensating write twice has the same effect as applying it once.

-- Example: compensating write for an incorrectly applied UPDATE
-- Original Oracle row (from reconciliation): status = 'ACTIVE'
-- Incorrect PostgreSQL state: status = 'SUSPENDED'
-- Compensating write:
UPDATE account
SET status = 'ACTIVE',
    last_compensated_at = NOW(),
    last_compensated_by = 'recon_run_4421'
WHERE account_id = 847291
AND status = 'SUSPENDED';  -- only compensate if the wrong value is present

The WHERE status = 'SUSPENDED' guard makes the write idempotent: if the account has already been compensated back to 'ACTIVE', the UPDATE matches no rows and is harmless.
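The effect of the guard can be demonstrated directly. A sketch using SQLite, with the schema reduced to the two relevant columns (the audit columns are omitted):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO account VALUES (847291, 'SUSPENDED')")  # the wrong value

# Guarded compensating write: only fires if the wrong value is present.
COMPENSATE = (
    "UPDATE account SET status = 'ACTIVE' "
    "WHERE account_id = 847291 AND status = 'SUSPENDED'"
)

first = conn.execute(COMPENSATE).rowcount   # corrects the wrong value
second = conn.execute(COMPENSATE).rowcount  # guard matches nothing; harmless
```

The first application modifies one row; the second modifies zero, so re-running a compensating write (for example, after a reconciliation job is retried) cannot overwrite a legitimate later change.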

Product note: Pulsaride Transform generates compensating writes from reconciliation output (Chapter 16). Each discrepancy record in the reconciliation report includes the source value, the target value, and the generated compensating SQL. The operator approves compensating writes before they are applied. Applied compensating writes are recorded in _pulsaride.compensating_writes for audit purposes.

13.4 Rollback Design

Rollback is the operation of returning the system to its pre-migration state. True rollback — reverting PostgreSQL to an empty state and restoring Oracle as the sole source of truth — is possible at any point before cutover, because Oracle was never modified. The pipeline only ever reads Oracle; it writes only to PostgreSQL.

Before cutover: Rollback is trivial. Stop the pipeline, drop the PostgreSQL schema, return to Oracle. Oracle is unchanged. No data is lost.

At cutover: This is the critical window. Cutover involves stopping application writes to Oracle and redirecting them to PostgreSQL. The moment application traffic hits PostgreSQL, rollback becomes more complex because PostgreSQL now contains data (post-cutover writes) that does not exist in Oracle.

Rollback Window Design

The rollback window is the period after cutover during which rollback is still feasible. Typical rollback window: 2–24 hours, depending on business tolerance for re-applying post-cutover writes.

To support rollback within the window:

  1. During cutover: Enable PostgreSQL-to-Oracle reverse replication (bidirectional CDC). Pulsaride Transform can configure a reverse CDC pipeline that captures PostgreSQL writes and applies them to Oracle.

  2. Rollback trigger: If a critical problem is detected during the window (data corruption, application failures, performance problems), the rollback procedure is:
     a. Stop application traffic to PostgreSQL
     b. Stop reverse replication
     c. Apply any PostgreSQL post-cutover writes to Oracle (from the reverse CDC log)
     d. Re-enable application traffic to Oracle
     e. Drop or archive the PostgreSQL instance

  3. After the window: Rollback requires accepting data loss (post-cutover writes that cannot be applied to Oracle) or a complex reconciliation effort. Beyond the rollback window, rollback is a business decision, not a technical one.

# pulsaride-transform: rollback window configuration
cutover:
  rollbackWindow:
    enabled: true
    durationHours: 4
    reverseReplication:
      enabled: true
      oracleTarget:
        host: oracle-primary.internal
        port: 1521
        service: MYAPP

Product note: The reverse replication pipeline is configured during the parallel running phase (before cutover) and is started at the moment of cutover. It adds overhead to every post-cutover write, because each write to PostgreSQL must also be queued for potential rollback application. This overhead is typically small (<5ms additional write latency) and is paid only for the 2–4 hour rollback window.

13.5 Partial-Write Detection

A partial write is a write where some rows of a multi-row operation were committed but others were not. Partial writes occur when a worker process dies between commits. The staging ledger detects partial writes at the chunk level:

-- Detect chunks with unexpected row counts
SELECT
  fl.table_name,
  fl.chunk_id,
  fl.rows_loaded AS staging_count,
  fl.rows_expected AS expected_count,
  CASE
    WHEN fl.rows_loaded < fl.rows_expected THEN 'SHORT'
    WHEN fl.rows_loaded > fl.rows_expected THEN 'OVER'
    ELSE 'OK'
  END AS status
FROM _pulsaride.full_load_status fl
WHERE fl.status = 'COMPLETE'
AND fl.rows_loaded != fl.rows_expected;

rows_expected is calculated during the chunk plan step (before loading) using Oracle's COUNT(*) for the key range. A chunk marked COMPLETE with rows_loaded < rows_expected indicates a partial write — not all rows were committed before the completion marker was written.

Product note: Any chunk with rows_loaded != rows_expected is automatically re-queued for reload when the pipeline starts. The old data for that chunk's key range is deleted from the target before reload. This detection runs at every pipeline startup, ensuring that partial writes from any previous run are corrected before proceeding.
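The startup detection and re-queue logic can be sketched as follows. SQLite stands in for PostgreSQL, the `_pulsaride.full_load_status` ledger is flattened to an unqualified table, and the target table is named `txn` here because `transaction` is a keyword in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE full_load_status (
        table_name TEXT, chunk_id INTEGER, status TEXT,
        rows_loaded INTEGER, rows_expected INTEGER,
        key_lo INTEGER, key_hi INTEGER
    );
    CREATE TABLE txn (transaction_id INTEGER PRIMARY KEY);
""")
# Chunk 4432 was marked COMPLETE but only 3847 of 5000 rows committed.
conn.execute(
    "INSERT INTO full_load_status VALUES "
    "('txn', 4432, 'COMPLETE', 3847, 5000, 22155001, 22160000)"
)
conn.executemany("INSERT INTO txn VALUES (?)", [(22155001 + i,) for i in range(3847)])

def requeue_partial_chunks(conn):
    """Find COMPLETE chunks with a row-count mismatch, purge their key
    range from the target, and reset them to PENDING for reload."""
    partial = conn.execute(
        "SELECT table_name, chunk_id, key_lo, key_hi FROM full_load_status "
        "WHERE status = 'COMPLETE' AND rows_loaded != rows_expected"
    ).fetchall()
    with conn:  # delete + ledger reset commit together
        for table, chunk, lo, hi in partial:
            conn.execute(
                f"DELETE FROM {table} WHERE transaction_id BETWEEN ? AND ?",
                (lo, hi),
            )
            conn.execute(
                "UPDATE full_load_status SET status = 'PENDING', rows_loaded = 0 "
                "WHERE table_name = ? AND chunk_id = ?",
                (table, chunk),
            )
    return [row[1] for row in partial]

requeued = requeue_partial_chunks(conn)
```

After the reset, the normal load loop picks the chunk up again as if it had never been attempted.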

13.6 The Cutover Decision Point

The cutover is a one-way door (until rollback, which is itself a decision). The checklist that governs the cutover decision is covered in Chapter 19, but the transactional safety component is:

  • CDC lag < configured threshold (default: 30 seconds)
  • Zero chunks with rows_loaded != rows_expected
  • Zero ORPHANED_EVENT findings
  • Zero SCHEMA_CHANGE findings in _pulsaride.schema_changes that have not been applied
  • Reconciliation passing for all tables in the blocking tier (Chapter 15)
  • Rollback pipeline configured and tested

The pipeline enforces this checklist as a go/no-go gate. The cutover command will not proceed if any item fails.
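The gate can be sketched as a pure function over measured checklist values. The metric names and input shape below are illustrative, not Pulsaride Transform's actual API:

```python
def cutover_gate(metrics, max_lag_seconds=30):
    """Return (go, failures): go is True only if every safety check passes."""
    checks = {
        "cdc_lag": metrics["cdc_lag_seconds"] < max_lag_seconds,
        "partial_chunks": metrics["partial_chunk_count"] == 0,
        "orphaned_events": metrics["orphaned_event_count"] == 0,
        "unapplied_schema_changes": metrics["unapplied_schema_changes"] == 0,
        "blocking_reconciliation": metrics["blocking_recon_failures"] == 0,
        "rollback_tested": metrics["rollback_pipeline_tested"],
    }
    failures = [name for name, ok in checks.items() if not ok]
    return (not failures, failures)

# One pending schema change is enough to block the cutover.
go, failures = cutover_gate({
    "cdc_lag_seconds": 12,
    "partial_chunk_count": 0,
    "orphaned_event_count": 0,
    "unapplied_schema_changes": 1,
    "blocking_recon_failures": 0,
    "rollback_pipeline_tested": True,
})
```

Keeping the gate as a single function of observable metrics makes the go/no-go decision auditable: the same inputs always produce the same decision, and the failure list names exactly which checklist items blocked the cutover.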

13.7 Example: Recovery from a Worker Crash at Chunk 4,432

The full load is processing the transaction table (48 million rows, 9,600 chunks). Worker 3 crashes at 11:22 PM while committing chunk 4,432 (rows 22,155,001–22,160,000).

Pipeline restart at 11:30 PM:

-- Staging state at restart
SELECT chunk_id, status, rows_loaded, rows_expected
FROM _pulsaride.full_load_status
WHERE table_name = 'transaction'
AND chunk_id BETWEEN 4430 AND 4434;
chunk_id | status      | rows_loaded | rows_expected
---------|-------------|-------------|---------------
4430     | COMPLETE    | 5000        | 5000
4431     | COMPLETE    | 5000        | 5000
4432     | COMPLETE    | 3847        | 5000          ← partial write
4433     | PENDING     | 0           | 5000
4434     | PENDING     | 0           | 5000

Chunk 4,432 is marked COMPLETE (the worker wrote the completion marker before the crash) but has only 3,847 of 5,000 expected rows — a partial write.

Restart actions:

  1. Detect chunk 4,432 as partial (rows_loaded != rows_expected)
  2. Delete the 3,847 rows in the range 22,155,001–22,160,000:
    DELETE FROM transaction
    WHERE transaction_id BETWEEN 22155001 AND 22160000;
    
  3. Reset chunk 4,432 to PENDING
  4. Resume loading from chunk 4,432

The restart completes chunk 4,432 correctly and continues with 4,433 and beyond. Total recovery time: 4 minutes (delete + re-load of 5,000 rows).

The event log records the crash, the detection, the delete, and the re-load:

SELECT event_type, detail->>'chunk_id' AS chunk, created_at
FROM _pulsaride.event_log
WHERE table_name = 'transaction'
AND created_at::time BETWEEN '23:20:00' AND '23:40:00'
ORDER BY created_at;
CHUNK_STARTED    | 4432 | 23:20:14
WORKER_CRASH     | 4432 | 23:22:07
PARTIAL_DETECTED | 4432 | 23:30:01
CHUNK_RESET      | 4432 | 23:30:02
CHUNK_STARTED    | 4432 | 23:30:03
CHUNK_COMPLETE   | 4432 | 23:31:44

The full audit trail is in the staging ledger. No engineer intervention was required beyond reviewing the event log at restart.
