Full Load Before CDC
The full load is the act of copying all existing data from the Oracle source to the PostgreSQL target in a controlled, restartable, and auditable way. It is not a database dump. It is not a one-time ETL job. It is the first major milestone in a pipeline that must produce a consistent baseline — a snapshot of the source at a known point in time — that CDC can then continuously update until cutover.
This chapter covers why full load precedes CDC, how a production-grade full load is structured, and the failure modes that make ad-hoc approaches unreliable.
9.1 Why Full Load Before CDC
CDC (Change Data Capture) captures changes to the Oracle database by reading the redo log. It can deliver every INSERT, UPDATE, and DELETE that happens after a given log position (an SCN in Oracle). But it cannot retroactively deliver the rows that existed before that position.
This means the migration pipeline must establish a complete baseline of the source data before CDC begins — and it must record the SCN at which that baseline was taken, so CDC knows where to start replaying changes.
The sequence is:
1. Record the current Oracle SCN: v_start_scn = DBMS_FLASHBACK.GET_SYSTEM_CHANGE_NUMBER
2. Execute the full load, reading the source as it exists at v_start_scn (using Oracle's flashback query or a consistent cursor)
3. Record that the full load is complete
4. Start CDC from v_start_scn, replaying all changes that occurred after the snapshot
Without step 1 (the SCN anchor), the full load and CDC are disconnected — there is no point in the redo log where they join. Changes that happened during the full load would be missed. This gap would appear as corrupted or missing data in the target — rows that exist in Oracle but not in PostgreSQL, or rows in an intermediate state that the full load captured mid-update and CDC did not re-deliver.
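Capturing the anchor is a single query against DBMS_FLASHBACK, issued before any chunk is read. The following is a minimal JDBC sketch for illustration, not Pulsaride Transform's implementation; the method name and connection variable are invented for this example.
// Assumes java.sql.* imports and an open Oracle connection `ora`
// (the session needs EXECUTE on DBMS_FLASHBACK).
static long captureAnchorScn(Connection ora) throws SQLException {
    try (Statement st = ora.createStatement();
         ResultSet rs = st.executeQuery(
             "SELECT DBMS_FLASHBACK.GET_SYSTEM_CHANGE_NUMBER FROM dual")) {
        rs.next();
        return rs.getLong(1);   // this value ties the snapshot to the CDC start position
    }
}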
Product note: Pulsaride Transform records the SCN anchor as the first action of every full load step. The anchor is stored in the staging ledger table alongside the full load job metadata. The CDC step reads this SCN at startup and begins consuming the redo log from that position, not from the current time.
9.2 Chunked Cursor: The Core Load Pattern
A full load of a large table cannot be executed as a single query that reads all rows into memory. A table with 100 million rows and an average row size of 500 bytes contains 50 GB of data. Reading this into application memory would fail, and writing it as a single transaction on the target would produce one enormous, all-or-nothing commit with no way to resume after a failure near the end.
The production-grade approach is a chunked cursor: divide the table into non-overlapping ranges, load each range as a separate bounded query, and commit each range independently.
-- Oracle source: chunked read by primary key range
SELECT /*+ NO_PARALLEL */ account_id, user_name, email, status, created_at
FROM account
WHERE account_id BETWEEN :min_id AND :max_id
ORDER BY account_id;
The chunks are defined by primary key ranges:
Chunk 1: account_id BETWEEN 1 AND 100000
Chunk 2: account_id BETWEEN 100001 AND 200000
Chunk 3: account_id BETWEEN 200001 AND 300000
...
Each chunk is:
- Queried from Oracle using a bounded range
- Streamed to the PostgreSQL target using COPY or batch INSERT
- Committed in PostgreSQL
- Recorded as "loaded" in the staging ledger
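Sketched below is what one iteration of this loop can look like in plain JDBC. It mirrors the bounded query above (the AS OF SCN clause is added in the next subsection) and uses a batch INSERT on the target; COPY is faster but requires driver-specific APIs. The loadChunk name and the connection variables are illustrative, not product APIs.
// Assumes java.sql.* imports and open connections `ora` (Oracle) and `pg` (PostgreSQL).
// One chunk: bounded read from Oracle, batch INSERT into PostgreSQL, one commit.
static void loadChunk(Connection ora, Connection pg,
                      long minId, long maxId) throws SQLException {
    String read =
        "SELECT account_id, user_name, email, status, created_at " +
        "FROM account WHERE account_id BETWEEN ? AND ? ORDER BY account_id";
    String write =
        "INSERT INTO account (account_id, user_name, email, status, created_at) " +
        "VALUES (?, ?, ?, ?, ?)";
    pg.setAutoCommit(false);
    try (PreparedStatement rd = ora.prepareStatement(read);
         PreparedStatement wr = pg.prepareStatement(write)) {
        rd.setFetchSize(10_000);               // stream rows; never buffer the whole chunk
        rd.setLong(1, minId);
        rd.setLong(2, maxId);
        try (ResultSet rs = rd.executeQuery()) {
            while (rs.next()) {
                wr.setLong(1, rs.getLong(1));
                wr.setString(2, rs.getString(2));
                wr.setString(3, rs.getString(3));
                wr.setString(4, rs.getString(4));
                wr.setTimestamp(5, rs.getTimestamp(5));
                wr.addBatch();
            }
            wr.executeBatch();
        }
        pg.commit();                           // each chunk commits independently
        // next: mark the chunk COMPLETE in the staging ledger (section 9.5)
    } catch (SQLException e) {
        pg.rollback();
        throw e;
    }
}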
Flashback Consistency
For the full load to be consistent, all chunks must read the source as it appeared at the same SCN — the anchor SCN recorded in step 1. Without this, each chunk would read account at a different point in time, and concurrent modifications could produce a loaded dataset that never existed in the source at any single moment.
Oracle's flashback query achieves this:
-- Oracle: read as-of a specific SCN
SELECT account_id, user_name, email, status, created_at
FROM account AS OF SCN :anchor_scn
WHERE account_id BETWEEN :min_id AND :max_id
ORDER BY account_id;
The AS OF SCN clause reads the table as it appeared at that SCN, using undo data to reconstruct any rows that were modified after that point. This works as long as Oracle has not yet discarded the undo data for that SCN — typically valid for several hours to a few days depending on UNDO_RETENTION.
Product note: Pulsaride Transform issues all full load queries with AS OF SCN pointing to the recorded anchor SCN. If Oracle cannot serve undo data for the anchor SCN (because UNDO_RETENTION is too short for a slow load), the full load fails with ORA-30052: invalid lower limit snapshot expression. The resolution is either to increase UNDO_RETENTION or to restart the full load with a new anchor SCN — not to continue without the SCN anchor.
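In application code the failure surfaces as a SQLException, and the vendor error code distinguishes undo exhaustion from transient faults. A minimal sketch, reusing the hypothetical loadChunk helper from 9.2 (with AS OF SCN added to its read query):
// Sketch: 30052 is the numeric code behind ORA-30052.
try {
    loadChunk(ora, pg, minId, maxId);
} catch (SQLException e) {
    if (e.getErrorCode() == 30052) {
        // Undo for the anchor SCN is gone. Stop the load: either raise
        // UNDO_RETENTION and restart with a new anchor SCN, or accept
        // non-flashback reads for the remaining chunks (see 9.7).
        throw new IllegalStateException("anchor SCN no longer covered by undo", e);
    }
    // Other errors: mark the chunk FAILED and let the restart logic (9.5) retry it.
    throw e;
}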
9.3 Ordering by Primary Key
Chunked loads must partition the data by a column that is:
- Monotonically increasing (or at least well-distributed)
- Indexed (so the range scan is efficient)
- Stable (the value does not change during the load)
The primary key is almost always the correct choice. For sequence-backed integer primary keys, the distribution is near-uniform and the range scan is a B-tree index range scan.
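For a sequence-backed key, the chunk boundaries can be derived directly from the key span. A minimal sketch, assuming the key is dense enough that fixed-width ranges stay roughly even; planChunks and ChunkRange are names invented for this example:
// Assumes java.sql.* and java.util.* imports and an open Oracle connection `ora`.
record ChunkRange(long chunkId, long minId, long maxId) {}

static List<ChunkRange> planChunks(Connection ora, String table,
                                   String pkColumn, long chunkSize) throws SQLException {
    String sql = "SELECT MIN(" + pkColumn + "), MAX(" + pkColumn + ") FROM " + table;
    try (Statement st = ora.createStatement();
         ResultSet rs = st.executeQuery(sql)) {
        rs.next();
        long min = rs.getLong(1);
        long max = rs.getLong(2);
        List<ChunkRange> chunks = new ArrayList<>();
        long chunkId = 1;
        for (long lo = min; lo <= max; lo += chunkSize) {
            chunks.add(new ChunkRange(chunkId++, lo, Math.min(lo + chunkSize - 1, max)));
        }
        return chunks;
    }
}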
For tables without an integer primary key, alternatives:
- Composite PK: Choose the leading column as the partition key
- UUID PK: Partition by the first byte(s) of the UUID for roughly uniform distribution
- No single-column partition key: Use Oracle's ROWNUM or ROWID-based partitioning (using DBMS_PARALLEL_EXECUTE chunk ranges)
-- ROWID-based chunking for tables without a suitable partition key
BEGIN
  DBMS_PARALLEL_EXECUTE.CREATE_TASK(task_name => 'full_load_target_table');
  DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_ROWID(
    task_name   => 'full_load_target_table',
    table_owner => USER,
    table_name  => 'TARGET_TABLE',
    by_row      => TRUE,
    chunk_size  => 10000
  );
END;
/

-- Read back the generated chunk boundaries
SELECT chunk_id, start_rowid, end_rowid
FROM user_parallel_execute_chunks
WHERE task_name = 'full_load_target_table'
ORDER BY chunk_id;
Product note: During assessment, Pulsaride Transform analyzes each table to determine the partition strategy. For tables with no suitable partition key, the assessment report flags the table as PARTITION_STRATEGY_MANUAL and requires the operator to specify the partitioning approach in the pipeline configuration before the full load can proceed.
9.4 LOB Streaming
Large Object (LOB) columns — CLOB, NCLOB, BLOB — cannot be read in the same way as scalar values. Oracle LOBs have a LOB locator (a pointer stored in the table row) and the LOB data stored in a LOB segment. Reading a LOB requires fetching the locator and then reading the data through the locator.
For migration, LOBs are read row-by-row in a streaming fashion:
// Java: streaming Oracle CLOB to PostgreSQL TEXT
PreparedStatement oraStmt = oraConn.prepareStatement(
    "SELECT account_id, document_clob FROM account_document " +
    "WHERE account_id BETWEEN ? AND ?");
oraStmt.setLong(1, minId);
oraStmt.setLong(2, maxId);

PreparedStatement pgStmt = pgConn.prepareStatement(
    "INSERT INTO account_document (account_id, document_clob) VALUES (?, ?)");

try (ResultSet rs = oraStmt.executeQuery()) {
    while (rs.next()) {
        long accountId = rs.getLong(1);
        Clob clob = rs.getClob(2);
        // Stream CLOB content to PostgreSQL. Execute per row so the character
        // stream is fully consumed before try-with-resources closes it.
        try (Reader reader = clob.getCharacterStream()) {
            pgStmt.setLong(1, accountId);
            pgStmt.setCharacterStream(2, reader, clob.length());
            pgStmt.executeUpdate();
        }
        clob.free();   // release the LOB locator promptly
    }
}
For BLOBs, replace getClob/getCharacterStream with getBlob/getBinaryStream.
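Under the same per-row execution pattern, the BLOB variant of the loop body (targeting a PostgreSQL BYTEA column) would look roughly like this:
// BLOB variant of the loop body: Oracle BLOB streamed to PostgreSQL BYTEA.
Blob blob = rs.getBlob(2);
try (InputStream in = blob.getBinaryStream()) {
    pgStmt.setLong(1, accountId);
    pgStmt.setBinaryStream(2, in, blob.length());
    pgStmt.executeUpdate();
}
blob.free();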
LOB Inline vs Out-of-Line Storage
Oracle stores small LOB values inline in the table row (up to roughly 4000 bytes, provided the LOB column uses ENABLE STORAGE IN ROW, the default). Larger values are stored out-of-line in a LOB segment. For migration, the inline vs out-of-line distinction is invisible — Oracle's JDBC driver handles both through the same LOB locator API.
In PostgreSQL, text values are stored inline in the heap tuple as long as the row stays below the TOAST threshold (about 2 KB by default). Larger values are compressed and, if still too large, moved out of line by TOAST (The Oversized-Attribute Storage Technique), which is also transparent to the application. No special handling is needed for large text values in PostgreSQL.
Product note: Pulsaride Transform detects LOB columns during assessment using DBA_TAB_COLUMNS.DATA_TYPE IN ('CLOB', 'NCLOB', 'BLOB') and configures a streaming LOB reader for those columns. LOB chunks are sized independently from scalar column chunks (typically 1,000 rows per chunk for LOB-heavy tables, vs 100,000 for scalar-only tables). The chunk size configuration is exposed per-table in the pipeline YAML.
9.5 Restart and Checkpoint Design
A full load of a 500-table schema with 2 billion total rows will take hours, and something is likely to go wrong before it finishes: a network interruption, undo exhaustion on the Oracle side, a PostgreSQL connection timeout, or a deployment restart. The full load must be restartable without re-reading data that was already successfully written.
The checkpoint design:
-- PostgreSQL: staging ledger table (created by pulsaride-transform)
CREATE TABLE _pulsaride_full_load_status (
table_name TEXT NOT NULL,
chunk_id BIGINT NOT NULL,
min_key TEXT,
max_key TEXT,
rows_loaded BIGINT NOT NULL DEFAULT 0,
status TEXT NOT NULL DEFAULT 'PENDING', -- PENDING, IN_PROGRESS, COMPLETE, FAILED
started_at TIMESTAMP WITH TIME ZONE,
completed_at TIMESTAMP WITH TIME ZONE,
error_message TEXT,
PRIMARY KEY (table_name, chunk_id)
);
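Before the first chunk runs, the ledger is seeded with one PENDING row per planned chunk. A minimal sketch, reusing the hypothetical planChunks helper from 9.3; the ON CONFLICT clause makes re-seeding idempotent on restart:
// Sketch: seed the ledger with PENDING rows (status and rows_loaded use their defaults).
String seed =
    "INSERT INTO _pulsaride_full_load_status (table_name, chunk_id, min_key, max_key) " +
    "VALUES (?, ?, ?, ?) ON CONFLICT (table_name, chunk_id) DO NOTHING";
try (PreparedStatement ps = pg.prepareStatement(seed)) {
    for (ChunkRange c : planChunks(ora, "account", "account_id", 100_000)) {
        ps.setString(1, "account");
        ps.setLong(2, c.chunkId());
        ps.setString(3, Long.toString(c.minId()));
        ps.setString(4, Long.toString(c.maxId()));
        ps.addBatch();
    }
    ps.executeBatch();
    pg.commit();   // assumes autocommit is off on the PostgreSQL connection
}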
When the full load starts (or restarts):
- For each table, query _pulsaride_full_load_status for chunks with status COMPLETE
- Skip those chunks — their data is already in the target
- Process remaining chunks (status PENDING, IN_PROGRESS, or FAILED)
- Mark each chunk as IN_PROGRESS at start, COMPLETE at end
This design means a restart from failure picks up exactly where it left off, at chunk granularity. A table with 1,000 chunks that failed at chunk 732 restarts from chunk 733.
-- Restart query: find pending chunks for a table
SELECT chunk_id, min_key, max_key
FROM _pulsaride_full_load_status
WHERE table_name = 'account'
AND status IN ('PENDING', 'IN_PROGRESS', 'FAILED')
ORDER BY chunk_id;
Note: IN_PROGRESS chunks are retried on restart. A chunk in IN_PROGRESS state when a restart occurs means the process died while loading that chunk. The partial data for that chunk in the target must be deleted before re-loading:
-- Before re-loading a chunk: delete any partial data
-- (chunk key range is stored in min_key/max_key)
DELETE FROM account
WHERE account_id >= :min_key::BIGINT
AND account_id <= :max_key::BIGINT;
This delete-before-retry is idempotent: if the chunk is empty (because the partial write was rolled back), the delete does nothing. If partial data exists, it is cleared before the re-load.
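Putting the restart query and the delete-before-retry together, one restart pass for a single table might look like the sketch below. loadChunk is the hypothetical helper from 9.2, and markChunk stands in for an UPDATE of the ledger row's status; both names are invented for this example.
// Sketch: one restart pass for a single table, at chunk granularity.
String pendingSql =
    "SELECT chunk_id, min_key, max_key FROM _pulsaride_full_load_status " +
    "WHERE table_name = ? AND status IN ('PENDING', 'IN_PROGRESS', 'FAILED') " +
    "ORDER BY chunk_id";
try (PreparedStatement ps = pg.prepareStatement(pendingSql)) {
    ps.setString(1, "account");
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            long chunkId = rs.getLong(1);
            long minId = Long.parseLong(rs.getString(2));
            long maxId = Long.parseLong(rs.getString(3));
            markChunk(pg, "account", chunkId, "IN_PROGRESS");
            // Clear any partial data left by a crashed attempt (a no-op if none exists),
            // ideally in the same target transaction as the re-load.
            try (PreparedStatement del = pg.prepareStatement(
                    "DELETE FROM account WHERE account_id BETWEEN ? AND ?")) {
                del.setLong(1, minId);
                del.setLong(2, maxId);
                del.executeUpdate();
            }
            loadChunk(ora, pg, minId, maxId);
            markChunk(pg, "account", chunkId, "COMPLETE");
        }
    }
}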
9.6 Full Load Monitoring
A full load running overnight must be observable. The staging ledger provides the monitoring surface:
-- Full load progress: rows loaded vs estimated total
SELECT
t.table_name,
COUNT(CASE WHEN s.status = 'COMPLETE' THEN 1 END) AS chunks_done,
COUNT(*) AS chunks_total,
SUM(CASE WHEN s.status = 'COMPLETE' THEN s.rows_loaded ELSE 0 END) AS rows_done,
ROUND(
100.0 * COUNT(CASE WHEN s.status = 'COMPLETE' THEN 1 END) / COUNT(*),
1
) AS pct_complete
FROM _pulsaride_full_load_status s
JOIN _pulsaride_table_metadata t ON t.table_name = s.table_name
GROUP BY t.table_name
ORDER BY pct_complete DESC;
Product note: The pulsaride-monitoring module exposes a /pulsaride/dashboard endpoint that renders a live full load progress view. Tables are color-coded: green (complete), blue (in-progress), yellow (pending), red (failed). The dashboard auto-refreshes every 10 seconds. Each table row shows estimated completion time based on the average rows-per-second rate of completed chunks.
9.7 Example: The transaction Table Full Load
The transaction table has 48 million rows, a sequence-backed integer primary key, and two CLOB columns (memo_text, statement_description). It is the largest table in the schema.
Full load configuration:
fullLoad:
  tables:
    - name: transaction
      partitionColumn: transaction_id
      chunkSize: 5000          # reduced from default 50000 due to CLOBs
      parallelWorkers: 4
      lobColumns:
        - column: memo_text
          targetType: TEXT
        - column: statement_description
          targetType: TEXT
      excludeColumns: []
The load runs with 4 parallel workers, each handling a disjoint range of transaction IDs. Each worker loads chunks of 5,000 rows. Total chunks: 48,000,000 / 5,000 = 9,600. With 4 workers and an observed rate of 2,500 rows/second per worker (10,000 rows/second total), the estimated completion time is:
48,000,000 / 10,000 = 4,800 seconds ≈ 1 hour 20 minutes.
At hour 1:05, worker 2 fails with ORA-30052: invalid lower limit snapshot expression — the undo for the anchor SCN has been discarded. The other three workers are still running.
Resolution:
- The operator stops the full load
- Checks the staging ledger: 8,932 of 9,600 chunks are COMPLETE (93.0%)
- The anchor SCN is invalid, so flashback queries cannot resume
- Option A: Restart with a new anchor SCN. Completed chunks (93%) are cleared and re-loaded from scratch — this restarts the entire load.
- Option B: For the remaining 7% of chunks, run without AS OF SCN (accepting that these chunks will capture live data that may differ from the original snapshot). CDC replay will overwrite any inconsistencies.
The operator chooses Option B for the 668 remaining chunks, accepting the small consistency risk because CDC lag is only 2 hours and will reconcile these rows quickly. The full load completes 14 minutes later.
Product note: Pulsaride Transform offers a continueWithoutFlashback option for exactly this scenario. When enabled, chunks that cannot use AS OF SCN fall back to live reads. The staging ledger marks these chunks as LOADED_WITHOUT_FLASHBACK, and the reconciliation step applies extra scrutiny to them — comparing their hash against the Oracle source after CDC has caught up.