Chapter 12

Dependency Resolution at Runtime

The dependency graph that was built during assessment (Chapter 2) serves two purposes: it orders DDL creation during schema migration, and it governs the load order during the full load. A full load that ignores table dependencies will attempt to load a child table before its parent exists, violating foreign key constraints in PostgreSQL and failing with a constraint violation error.

This chapter covers how the dependency graph is used at runtime, how cycles in the graph are detected and resolved, and how load waves enable parallel loading without dependency violations.

12.1 The Dependency Graph at Runtime

During assessment, Pulsaride Transform builds a directed graph where each node is a table and each edge is a foreign key: an edge from A to B means "B has a foreign key that references A" (A must be loaded before B).

product ─────────────────────────────┐
customer ──────────────┐             │
address ──► customer   │             │
                       ▼             ▼
                  order_header ──► order_line
                       │
                       ▼
                  order_payment

In this graph:

product has no dependencies (no FKs) — it can be loaded first
customer has no dependencies — it can be loaded in parallel with product
address depends on customer
order_header depends on customer
order_line depends on both order_header and product
order_payment depends on order_header

A topological sort of this graph produces a valid load order. The most common algorithm is Kahn's algorithm: repeatedly remove nodes with no incoming edges (no unfulfilled dependencies), loading them in the current wave, then re-evaluate the remaining graph.

Wave 1: product, customer     (no dependencies)
Wave 2: address, order_header (depend only on Wave 1 tables)
Wave 3: order_line, order_payment (depend on Wave 2 tables)

Within each wave, tables can be loaded in parallel. Across waves, the pipeline must wait for the previous wave to complete before starting the next.

Product note: Pulsaride Transform computes the wave schedule during assessment and stores it in the pipeline configuration. Each table is assigned a wave number. The full load orchestrator executes waves sequentially, launching all tables in a wave as parallel workers. The YAML configuration shows the computed wave assignment:

fullLoad:
  waves:
    - wave: 1
      tables: [product, customer]
    - wave: 2
      tables: [address, order_header]
    - wave: 3
      tables: [order_line, order_payment]

12.2 Cycle Detection

Not all Oracle FK graphs are acyclic. Cycles appear when two tables reference each other — either directly or through a chain of tables.

Direct Cycles

-- Oracle: mutual FK between account and account_profile
ALTER TABLE account ADD CONSTRAINT fk_account_primary_profile
  FOREIGN KEY (primary_profile_id) REFERENCES account_profile (profile_id);

ALTER TABLE account_profile ADD CONSTRAINT fk_profile_account
  FOREIGN KEY (account_id) REFERENCES account (account_id);

account depends on account_profile (FK for primary_profile_id) and account_profile depends on account (FK for account_id). Neither can be loaded before the other — a true cycle.

Indirect Cycles

A → B → C → A

Table A has a FK to B, B has a FK to C, C has a FK to A. No table can be loaded first.

Detecting Cycles

Cycle detection uses depth-first search on the dependency graph. When DFS encounters a back edge (an edge to a node that is already in the current DFS path), a cycle exists. The cycle path identifies which tables are involved.

# Cycle detection pseudocode
def find_cycles(graph):
    visited = set()
    path = []
    cycles = []
    
    def dfs(node):
        if node in path:
            cycle_start = path.index(node)
            cycles.append(path[cycle_start:] + [node])
            return
        if node in visited:
            return
        path.append(node)
        for neighbor in graph[node]:
            dfs(neighbor)
        path.pop()
        visited.add(node)
    
    for node in graph:
        dfs(node)
    return cycles

Product note: Pulsaride Transform runs cycle detection during assessment and reports all cycles as FK_CYCLE findings. The finding includes the full cycle path (e.g., account → account_profile → account) and the specific FK constraints that form the cycle.

12.3 Resolving Cycles: The Deferred FK Strategy

Cycles cannot be resolved through load ordering alone — no valid topological order exists for a cyclic graph. The resolution is to break the cycle by deferring one FK in the cycle.

The choice of which FK to defer is guided by:

Which FK is more naturally nullable? If account.primary_profile_id allows NULL (a new account before a profile is assigned), defer the account→account_profile FK and load accounts first with primary_profile_id = NULL, then load account_profiles, then update primary_profile_id.
Business rule analysis: Which direction of the relationship is fundamental and which is derived? Defer the derived FK.

Deferred FK Load Strategy

For the account ↔ account_profile cycle:

-- Phase 1: Load account without primary_profile_id
INSERT INTO account (account_id, user_name, email, primary_profile_id)
SELECT account_id, user_name, email, NULL  -- defer FK column
FROM oracle_account_staging;

-- Phase 2: Load account_profile (FK to account now satisfied)
INSERT INTO account_profile (profile_id, account_id, profile_type)
SELECT profile_id, account_id, profile_type
FROM oracle_account_profile_staging;

-- Phase 3: Update primary_profile_id back (now account_profile exists)
UPDATE account a
SET primary_profile_id = (
  SELECT profile_id FROM account_profile ap
  WHERE ap.account_id = a.account_id
  AND ap.is_primary = TRUE
  LIMIT 1
);

This requires that primary_profile_id be NULLable in the target schema during the load. After the update, it can be made NOT NULL (if needed) and the FK constraint applied.

Product note: Pulsaride Transform generates a three-phase load plan for each detected cycle:

Load tables with FK columns set to NULL
Load the referenced tables
Apply a backfill UPDATE to restore the FK values

The backfill UPDATE is generated from the Oracle source data (joining the staging tables) and is applied as a post-load step before the FK constraint is created.

fullLoad:
  cycleResolution:
    - cycle: [account, account_profile]
      deferredFk: account.primary_profile_id
      backfillQuery: |
        UPDATE account a
        SET primary_profile_id = ap.profile_id
        FROM account_profile ap
        WHERE ap.account_id = a.account_id
        AND ap.is_primary = TRUE

12.4 Parallel Load Wave Planning

Wave planning optimizes the load schedule for elapsed time. A naive wave plan might be:

Wave 1: 200 independent tables (loaded in 2 hours with 20 workers)
Wave 2: 100 tables depending on Wave 1 (loaded in 1 hour with 20 workers)
Wave 3: 50 tables depending on Wave 2 (loaded in 45 minutes with 20 workers)

Total: 3 hours 45 minutes.

But if Wave 1 has one table with 500 million rows that takes 4 hours by itself, the entire wave is blocked on that table while the other 199 tables finish in 30 minutes. The wave-level parallelism is wasted waiting.

A smarter wave plan breaks the critical path large table out of the wave and starts it earlier:

Pre-wave: transaction (500M rows, 4 hours) — started immediately
Wave 1: remaining 199 independent tables (30 minutes with 20 workers)
Wave 2: tables depending on Wave 1 (excluding transaction dependents)
Wave 3: transaction dependents (after transaction finishes, ~4 hours)

Product note: Pulsaride Transform includes a wave optimizer that identifies the critical path — the sequence of tables with the largest cumulative load time. Tables on the critical path are scheduled as early as possible, even if this means a table starts before all of its wave-peers are ready. The optimizer accounts for estimated row counts (from Oracle's DBA_TABLES.NUM_ROWS) and measured load throughput from test runs.

12.5 Runtime Dependency Enforcement

Even with a correct wave plan, runtime failures can cause out-of-order loads. If Wave 2 starts before Wave 1 is verified complete (e.g., due to a pipeline bug), FK violations will occur.

The runtime enforcer checks dependency state before starting each table:

-- Before loading order_header (which depends on customer):
-- Verify all dependencies are COMPLETE
SELECT COUNT(*) FROM _pulsaride.full_load_status
WHERE table_name = 'customer'
AND status != 'COMPLETE';
-- Returns 0 → safe to start order_header
-- Returns > 0 → dependency not satisfied, block

This check runs per-table at load start, not just per-wave. It prevents a partially-failed Wave 1 from allowing Wave 2 to start for the tables whose dependencies are satisfied, while blocking Wave 2 for tables whose dependencies failed.

12.6 CDC Dependency Order

During CDC, changes arrive in commit order — the order in which transactions committed in Oracle. This is not the same as the FK dependency order. A CDC event stream might deliver:

SCN 1000: INSERT order_line (order_id=501, line_id=1001)
SCN 1001: INSERT order_header (order_id=501)

The order_line INSERT arrived before the order_header INSERT even though FK says order_header must exist first. This is possible in Oracle because both were part of the same transaction — the INSERT order within a transaction does not matter for FK checking in Oracle (which checks constraints at statement time by default, but deferred constraints check at commit).

In PostgreSQL, applying order_line before order_header would violate the FK constraint unless the constraint is deferred.

Product note: Pulsaride Transform applies CDC events within a transaction in dependency order. When a batch of CDC events arrives from the same Oracle transaction, the pipeline re-sorts them by dependency order before applying: parent rows before child rows. This prevents transient FK violations during CDC apply.

For deferrable FK constraints, the CDC apply session sets SET CONSTRAINTS ALL DEFERRED to allow the entire transaction to be applied before constraint checking.

12.7 Example: The Five-Table Cycle in a Manufacturing Schema

A manufacturing schema has a five-table cycle:

work_order → operation → resource_plan → cost_center → department → work_order

Assessment detection:

FK_CYCLE: work_order → operation → resource_plan → cost_center → department → work_order
Cycle length: 5 tables
FK involved in cycle: department.work_order_id → work_order(work_order_id)
Recommendation: Defer FK department.work_order_id (nullable column)

The department.work_order_id column represents "the current active work order for this department" — it is NULL for departments with no active work order and is updated by the application as work orders are created and closed. It is the most naturally NULLable FK in the cycle.

Cycle resolution:

Wave 1: cost_center (no other dependencies)
Wave 2: department (load with work_order_id = NULL to break cycle)
Wave 3: work_order, operation, resource_plan
Wave 4 (backfill): UPDATE department.work_order_id from Oracle source

The backfill query:

UPDATE department d
SET work_order_id = o.work_order_id
FROM oracle_department_staging o
WHERE d.department_id = o.department_id
AND o.work_order_id IS NOT NULL;

After backfill: 847 of 4,200 departments have a non-NULL work_order_id. These are the departments with an active work order at the time of the snapshot. FK constraint applied. No violations.

← Previous

Chapter 11Staging as Product Memory

Chapter 13Transactional Safety and Rollback