Chapter 14

Performance and Parallelism

A migration that takes 6 weeks instead of 2 is not just a scheduling problem. It is a business exposure problem: every week the two systems run in parallel is a week of operational complexity, data synchronization risk, and engineer attention. Performance is not an optimization for comfort — it is a risk reduction strategy.

This chapter covers the performance levers available in a production migration pipeline: partition-based parallelism, connection management, the write path from application to PostgreSQL, advisory locks for coordination, and backpressure mechanisms that prevent the target from being overwhelmed.

14.1 The Performance Model

A migration pipeline has three potential bottlenecks:

  1. Oracle read throughput: How fast can Oracle produce rows for the full load? Constrained by: Oracle I/O bandwidth, CPU for undo reconstruction (flashback reads), JDBC connection throughput.

  2. Network throughput: How fast can rows travel from Oracle to the migration host and then to PostgreSQL? Constrained by: network bandwidth between Oracle and the pipeline host, and between the pipeline host and PostgreSQL.

  3. PostgreSQL write throughput: How fast can PostgreSQL commit rows? Constrained by: I/O bandwidth (WAL write speed), CPU, connection overhead, constraint checking.

For most migrations, PostgreSQL write throughput is the binding constraint. Oracle reads are fast (especially with parallelism), networks are fast, but PostgreSQL's WAL must persist every write to disk before the commit returns.

The optimization strategies in this chapter target the PostgreSQL write bottleneck first, then address Oracle read performance.

14.2 Partition-Based Parallelism

The full load divides each table into chunks (Chapter 9). Multiple workers process chunks in parallel. The degree of parallelism — the worker count — is the primary throughput lever.

Worker Count Sizing

Each worker holds:

  • One JDBC connection to Oracle (reading)
  • One JDBC connection to PostgreSQL (writing)

A worker count of W requires W Oracle connections and W PostgreSQL connections. Connection overhead in PostgreSQL is significant — each connection consumes ~5MB of memory and spawns a backend process. For a migration pipeline with 20 workers, the PostgreSQL server needs 20 backend processes and ~100MB of connection overhead.

The optimal worker count depends on the number of tables being loaded in parallel and the I/O characteristics of both databases. Empirically:

For a target with 4 CPU cores and SSD storage:
  - 4 workers per CPU core ≈ 16 total workers
  - Each worker targeting 50MB/s write throughput ≈ 800MB/s total
  - With typical 100-byte average row size ≈ 8M rows/second

Start with 4 workers, measure throughput, and double the count until throughput stops increasing. If throughput plateaus at, say, 12 workers, the PostgreSQL I/O limit has been reached.
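A minimal calibration loop might look like the following sketch. Here loadSample is a hypothetical helper (not part of any library) that runs a fixed-size sample load with the given worker count and returns measured rows/second:

// Sketch: double the worker count until throughput gains flatten.
// loadSample(workers) is a hypothetical helper: it loads a fixed
// sample with that many workers and returns rows/second.
int workers = 4;
double best = loadSample(workers);
while (workers < 64) {
    double next = loadSample(workers * 2);
    if (next < best * 1.10) break;  // under 10% gain: I/O-bound, stop
    workers *= 2;
    best = next;
}
System.out.println("Calibrated worker count: " + workers);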

Product note: Pulsaride Transform includes a benchmarkWorkers command that runs a calibration load on a sample table with varying worker counts and measures throughput. The output is a worker count recommendation and a throughput estimate for the full load.

Cross-Table Parallelism

Tables in the same load wave (Chapter 12) can be loaded in parallel. This is distinct from within-table chunk parallelism. For a wave with 40 tables and 20 workers:

  • 20 workers are assigned to the 40 tables (2 tables per worker)
  • Each worker loads one chunk at a time from its assigned tables, alternating

This is simpler to implement than per-table worker pools and avoids the connection overhead of 20 workers × 40 tables = 800 connections.

Product note: Pulsaride Transform uses a work-stealing scheduler. Workers take chunks from a shared queue, which contains all pending chunks from all tables in the current wave, sorted by dependency-safe order. A worker that finishes its current chunk picks the next pending chunk from the queue. This naturally balances load — a worker processing chunks from a slow (LOB-heavy) table is not pinned to it, and moves on to chunks from faster tables as the shared queue drains.
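A sketch of the shared-queue pattern using java.util.concurrent follows. This is not Pulsaride Transform's actual implementation; Chunk and processChunk are hypothetical stand-ins for the pipeline's own types:

// All pending chunks from every table in the wave, dependency-safe order.
BlockingQueue<Chunk> pending = new LinkedBlockingQueue<>(allChunksInWave);

ExecutorService pool = Executors.newFixedThreadPool(workerCount);
for (int i = 0; i < workerCount; i++) {
    pool.submit(() -> {
        Chunk chunk;
        while ((chunk = pending.poll()) != null) {  // take next pending chunk
            processChunk(chunk);  // read from Oracle, write to PostgreSQL
        }
    });
}
pool.shutdown();  // workers exit on their own once the queue is empty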

14.3 Bulk Insert vs COPY

PostgreSQL provides two primary paths for bulk data insertion:

INSERT ... VALUES (batch)

Standard JDBC batch inserts: collect N rows, add each row to a prepared-statement batch, and execute the batch within the chunk transaction, which commits once per chunk. (With the pgJDBC connection property reWriteBatchedInserts=true, the driver rewrites the batch into multi-row INSERT ... VALUES statements, which substantially improves batch throughput.)

// Java JDBC: batch insert (statement closed via try-with-resources)
try (PreparedStatement ps = conn.prepareStatement(
        "INSERT INTO account (account_id, user_name, email, status, created_at) " +
        "VALUES (?, ?, ?, ?, ?)")) {
    for (Row row : chunk) {
        ps.setLong(1, row.accountId);
        ps.setString(2, row.userName);
        ps.setString(3, row.email);
        ps.setString(4, row.status);
        ps.setTimestamp(5, row.createdAt);
        ps.addBatch();
    }
    ps.executeBatch();  // one round-trip per batch
}
conn.commit();          // chunk transaction commits once

Throughput: typically 50,000–200,000 rows/second depending on row size and batch size.

COPY Protocol

PostgreSQL's COPY protocol is the fastest way to bulk-load data. It bypasses per-statement query parsing and planning and avoids per-row statement overhead by streaming all rows over a single protocol exchange. Data is streamed in text, CSV, or binary format directly into the table.

// Java: PostgreSQL COPY via pgJDBC's CopyManager and PGCopyOutputStream
CopyManager copyManager = new CopyManager((BaseConnection) conn);
String copyCommand = "COPY account (account_id, user_name, email, status, created_at) FROM STDIN WITH (FORMAT CSV)";

try (Writer writer = new OutputStreamWriter(
        new PGCopyOutputStream(copyManager.copyIn(copyCommand)),
        StandardCharsets.UTF_8)) {
    for (Row row : chunk) {
        writer.write(
            row.accountId + "," +
            escapeCsv(row.userName) + "," +
            escapeCsv(row.email) + "," +
            row.status + "," +
            row.createdAt.toString() + "\n"
        );
    }
}

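The escapeCsv helper is not shown above; a minimal sketch of what it must do follows. Note that in COPY's CSV format an unquoted empty field is read as NULL, so empty strings must be quoted:

// Minimal CSV escaping for COPY ... WITH (FORMAT CSV).
static String escapeCsv(String value) {
    if (value == null) return "";  // unquoted empty field = NULL in COPY CSV
    boolean needsQuoting = value.isEmpty()
        || value.indexOf(',') >= 0 || value.indexOf('"') >= 0
        || value.indexOf('\n') >= 0 || value.indexOf('\r') >= 0;
    return needsQuoting
        ? "\"" + value.replace("\"", "\"\"") + "\""  // double embedded quotes
        : value;
}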
Throughput: typically 500,000–1,500,000 rows/second — 5–10× faster than batch INSERT.

Limitations of COPY:

  • A single bad row aborts the entire COPY: there is no per-row error handling and no ON CONFLICT path, so a failure forces the whole chunk to be retried
  • Row-level triggers and constraints still fire during COPY (it bypasses parsing and planning, not integrity checks), so trigger-heavy tables see a smaller speedup — though the pre-load schema has no constraints to check (Chapter 6)
  • Requires superuser or the pg_read_server_files role only for COPY FROM a server-side file (the FROM STDIN form used above works for regular users)

Product note: Pulsaride Transform uses the COPY protocol for all full load writes by default. Tables with BEFORE INSERT triggers (migrated from Oracle trigger patterns — Chapter 8) are loaded using batch INSERT so that load-time behavior matches the post-migration INSERT path exactly, unless the trigger has been eliminated in favor of a column default (the recommended approach).

14.4 WAL Configuration for Load Performance

PostgreSQL's Write-Ahead Log (WAL) persists every change before it is committed. This durability guarantee has a performance cost during bulk loading.

For the full load phase — where the pipeline can re-run any chunk from scratch if something fails — relaxing WAL durability is acceptable:

-- Session-level WAL settings for bulk load (not persistent)
SET synchronous_commit = OFF;
-- Commits return immediately; WAL flush happens asynchronously
-- Risk: the most recent commits (up to ~3x wal_writer_delay, a few
-- hundred milliseconds) may be lost on crash (restartable from staging)

SET wal_compression = ON;
-- Compress full-page images in WAL, reducing I/O during write-heavy loads

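From the pipeline, these are ordinary statements issued on each writer connection as it is opened. A minimal JDBC sketch, assuming conn is a freshly opened PostgreSQL write connection:

// Sketch: apply bulk-load session settings; they last only for the
// lifetime of this connection and are never persisted.
try (Statement stmt = conn.createStatement()) {
    stmt.execute("SET synchronous_commit = OFF");
    // wal_compression may require superuser or a granted SET privilege
    stmt.execute("SET wal_compression = ON");
}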
At the database level for the migration target:

# postgresql.conf during full load phase
checkpoint_completion_target = 0.9  # Spread checkpoints, reduce write spikes
max_wal_size = 4GB                  # Allow WAL to grow before checkpoint
wal_buffers = 64MB                  # Larger WAL buffer reduces I/O

After the full load, before CDC begins, reset synchronous_commit to ON (or the production default). CDC requires durable commits — a lost commit during CDC is a data discrepancy.

14.5 Index and Constraint Deferral

Indexes slow every INSERT because the index must be updated along with the table. For bulk loading, dropping indexes before the load and recreating them after produces dramatically better throughput.

The pre-load schema (Chapter 6) has no indexes — only the table structure. This is the correct approach. The post-load DDL creates indexes using CREATE INDEX (blocking) or CREATE INDEX CONCURRENTLY (non-blocking, safer if CDC has started).

For very large tables, a phased approach to index creation works well:

  1. Load the table without any indexes
  2. After load completes, create the primary key index first (required for FK resolution in other tables)
  3. Create other indexes in parallel (multiple CREATE INDEX CONCURRENTLY sessions)

-- Parallel index creation (run in separate sessions)
-- Session 1:
CREATE INDEX CONCURRENTLY idx_account_email ON account (email);
-- Session 2:
CREATE INDEX CONCURRENTLY idx_account_status ON account (status);
-- Session 3:
CREATE INDEX CONCURRENTLY idx_account_created ON account (created_at);

Each session builds one index independently. PostgreSQL parallelizes within each index build using multiple workers (controlled by max_parallel_maintenance_workers).
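A sketch of driving those three sessions from the pipeline follows; pgUrl is a hypothetical JDBC URL for the target. Each index build gets its own connection because CREATE INDEX CONCURRENTLY cannot run inside a transaction block:

List<String> indexDdl = List.of(
    "CREATE INDEX CONCURRENTLY idx_account_email ON account (email)",
    "CREATE INDEX CONCURRENTLY idx_account_status ON account (status)",
    "CREATE INDEX CONCURRENTLY idx_account_created ON account (created_at)");

ExecutorService pool = Executors.newFixedThreadPool(indexDdl.size());
for (String ddl : indexDdl) {
    pool.submit(() -> {
        // One auto-commit connection (session) per index build
        try (Connection c = DriverManager.getConnection(pgUrl);
             Statement s = c.createStatement()) {
            s.execute(ddl);
        }
        return null;  // Callable, so SQLException may propagate
    });
}
pool.shutdown();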

14.6 Advisory Locks for Coordination

When multiple workers process chunks of the same table, they must not process the same chunk twice. The staging ledger provides the primary coordination mechanism (a chunk marked IN_PROGRESS is not picked up by another worker), but race conditions are possible if two workers read the same chunk as PENDING simultaneously.

PostgreSQL advisory locks provide a lightweight coordination mechanism:

-- Worker: try to acquire lock on chunk (table_hash + chunk_id)
SELECT pg_try_advisory_xact_lock(
  hashtext('account'),  -- table hash
  4432                   -- chunk_id
);
-- Returns TRUE if lock acquired (this worker owns the chunk)
-- Returns FALSE if another worker already holds it (skip this chunk)

Advisory transaction locks (xact_lock) are automatically released when the transaction ends. If a worker crashes mid-chunk, its advisory lock is released when the connection closes, allowing another worker to pick up the chunk on restart.

Product note: Pulsaride Transform combines advisory locks with the staging ledger for chunk coordination:

  1. Try to acquire advisory lock for the chunk → if failed, skip
  2. If acquired: update staging to IN_PROGRESS (within the same transaction)
  3. Process chunk
  4. Update staging to COMPLETE
  5. Commit (releases advisory lock automatically)

The advisory lock prevents the double-processing race condition. The staging ledger provides persistent state across restarts.
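A sketch of that sequence in JDBC follows. The SQL is real PostgreSQL; markChunk and processChunk are hypothetical stand-ins for the staging-ledger update and the chunk load, not Pulsaride Transform's actual code:

// Steps 1-5 run in one transaction on the PostgreSQL connection.
conn.setAutoCommit(false);
try (PreparedStatement tryLock = conn.prepareStatement(
        "SELECT pg_try_advisory_xact_lock(hashtext(?), ?)")) {
    tryLock.setString(1, tableName);
    tryLock.setInt(2, chunkId);
    ResultSet rs = tryLock.executeQuery();
    rs.next();
    if (!rs.getBoolean(1)) {
        conn.rollback();  // another worker owns this chunk: skip it
        return;
    }
}
markChunk(conn, chunkId, "IN_PROGRESS");  // hypothetical staging update
processChunk(conn, chunkId);              // hypothetical: read Oracle, COPY in
markChunk(conn, chunkId, "COMPLETE");
conn.commit();  // releases the advisory xact lock automatically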

14.7 Backpressure

A migration pipeline that reads Oracle faster than it can write to PostgreSQL will accumulate an in-memory queue of rows waiting to be written. If the queue grows without bound, the pipeline runs out of memory and crashes.

Backpressure is the mechanism by which the write side signals the read side to slow down.

Simple Backpressure: Queue Depth Limit

// Java: bounded blocking queue
BlockingQueue<List<Row>> chunkQueue = new ArrayBlockingQueue<>(maxQueueDepth);

// Reader thread: blocks when queue is full
chunkQueue.put(chunk);  // blocks if full

// Writer thread: processes chunks
List<Row> chunk = chunkQueue.take();  // blocks if empty

When the writer is slow (PostgreSQL write bottleneck), chunkQueue.put() blocks, which prevents the reader from reading more from Oracle. The Oracle connection remains open but idle.

Product note: Pulsaride Transform uses a per-table bounded queue with a configurable depth (default: 5 chunks). This limits in-memory buffering to 5 × chunk_size rows per table per worker. For 20 workers with chunk_size=10,000, maximum in-memory rows = 1,000,000 — manageable for most heap sizes.

Adaptive Backpressure

For pipelines that need finer control, adaptive backpressure measures write latency and reduces chunk size when latency exceeds a threshold:

// Java: adapt chunk size to the measured average write latency
static int adaptChunkSize(int chunkSize, double avgWriteLatencySec) {
    if (avgWriteLatencySec > 5.0) chunkSize = (int) (chunkSize * 0.8);  // reduce by 20%
    if (avgWriteLatencySec < 1.0) chunkSize = (int) (chunkSize * 1.1);  // increase by 10%
    return Math.max(1_000, Math.min(100_000, chunkSize));  // bound: 1000 <= chunk_size <= 100000
}

This adapts the load rate to PostgreSQL's actual capacity rather than relying on a fixed chunk size.

14.8 Example: Throughput Tuning for a 2-Billion-Row Schema

A schema has 2 billion total rows across 340 tables. The initial configuration (10 workers, chunk_size=10,000, batch INSERT) produces:

Throughput: 120,000 rows/second
Estimated completion: 2,000,000,000 / 120,000 = 16,667 seconds ≈ 4.6 hours (for full schema)

Tuning steps:

Step 1: Switch to the COPY protocol. Throughput jumps to 450,000 rows/second. Estimated: 1.2 hours.

Step 2: Increase workers from 10 to 20. Throughput: 810,000 rows/second. Estimated: 40 minutes.

Step 3: Increase chunk_size from 10,000 to 50,000 for non-LOB tables. Throughput: 920,000 rows/second. Estimated: 36 minutes.

Step 4: Move the largest tables into their own wave and start them first. The five largest tables (roughly 300 million rows each, about 1.5 billion rows among them) are moved to a pre-wave that starts before the other 335 tables. The 335-table load (the remaining ~500 million rows) completes in 8 minutes. The 5 large tables complete in 36 minutes. Total wall-clock: 36 minutes, since the two groups run in parallel.

Final result: 36 minutes for 2 billion rows, down from the original estimate of 4.6 hours. The optimization levers (COPY, workers, chunk size, wave scheduling) produce a 7.7× speedup.

Product note: This throughput (920,000 rows/second) is achievable on hardware typical for a production migration target: 16-core server, NVMe storage, 10Gbps network. Actual throughput varies with hardware and row size. The Pulsaride Transform benchmark command runs a calibration load that measures achievable throughput for the specific hardware configuration before committing to a timeline estimate.
