Chapter 2: Reading the Source System Before Touching It

Most migration failures begin before the first row is copied.

They begin when a team decides it already understands the source system.

That assumption usually looks harmless. A few schemas are identified. Some tables are classified as "business-critical." A conversion tool is run against DDL. A proof of concept succeeds on a narrow subset. Then the real migration begins and the hidden shape of the source system starts asserting itself:

  • a trigger writes into an audit table nobody inventoried
  • a sequence is used by three applications, not one
  • a package contains implicit key-generation logic
  • a child table is hotter than its parent
  • a CDC topic is dominated by one table that was classified as secondary
  • two foreign keys form a cycle the team only notices during replay

This chapter argues for a simple rule: before designing the target, read the source as if it were an unknown production system, not as if it were a neat schema model.

Pulsaride Transform treats source discovery as a first-class stage of the migration product because almost every later decision depends on it:

  • data type mapping
  • transform rules
  • load order
  • CDC strategy
  • replay logic
  • validation strategy
  • cutover readiness

The goal is not to collect metadata for its own sake. The goal is to reduce surprise. Every object you fail to inventory early becomes an outage candidate later.

2.1 Inventorying schemas, tables, views, packages, triggers, jobs, and sequences

The first deliverable of a migration program should not be converted DDL. It should be an inventory that answers three questions:

  1. What exists?
  2. What matters operationally?
  3. What hides behavior rather than just structure?

Start with object families, not tables only

Teams often inventory tables and stop there. That is not enough.

For Oracle -> PostgreSQL migration, the minimum source inventory should cover:

  • schemas and owners
  • tables and partitions
  • views and materialized views
  • primary keys, unique constraints, foreign keys, and check constraints
  • indexes
  • sequences
  • triggers
  • packages, procedures, and functions
  • scheduler jobs
  • synonyms
  • database links if they influence runtime behavior

If your inventory does not include packages, triggers, jobs, and sequences, it is not a migration inventory. It is only a storage inventory.

Capture both metadata and operational hints

A useful inventory does not stop at object names. For each object, capture attributes that drive migration design:

  • object type
  • owner
  • creation and last DDL time
  • row count estimate
  • average row length
  • presence of LOBs
  • partitioning
  • dependencies
  • whether the object is read-only, write-heavy, or append-only
  • whether application logic depends on it synchronously

For example, a table with 50,000 rows may still be high-risk if:

  • it sits on a hot transaction path
  • it feeds many child tables
  • it is populated by triggers
  • it is used to allocate keys or statuses

Practical Oracle inventory queries

Start with object counts by schema:

SELECT owner, object_type, COUNT(*) AS object_count
FROM all_objects
WHERE owner IN ('APP_SCHEMA')
GROUP BY owner, object_type
ORDER BY owner, object_type;

Inventory tables with approximate volume:

SELECT owner,
       table_name,
       num_rows,
       blocks,
       avg_row_len,
       partitioned,
       temporary
FROM all_tables
WHERE owner = 'APP_SCHEMA'
ORDER BY num_rows DESC NULLS LAST;

Inventory sequences:

SELECT sequence_owner,
       sequence_name,
       min_value,
       max_value,
       increment_by,
       cache_size,
       last_number
FROM all_sequences
WHERE sequence_owner = 'APP_SCHEMA'
ORDER BY sequence_name;
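The cache_size and last_number columns matter more than they look: with caching, LAST_NUMBER can run ahead of the highest value actually issued, so target sequences should start safely above it. As a minimal sketch of that translation — the function name, dict shape, and safety gap are illustrative assumptions, not a fixed rule:

```python
def pg_sequence_ddl(seq, safety_gap=1000):
    """Translate one row of ALL_SEQUENCES into PostgreSQL DDL.

    `seq` is a dict built from the query above. The safety gap is an
    assumption: it keeps the new sequence clear of values the Oracle
    cache may already have handed out but not yet persisted.
    """
    start = seq["last_number"] + seq["cache_size"] + safety_gap
    return (
        f'CREATE SEQUENCE {seq["sequence_name"].lower()}\n'
        f'    START WITH {start}\n'
        f'    INCREMENT BY {seq["increment_by"]}\n'
        f'    CACHE {max(seq["cache_size"], 1)};'
    )

example = {
    "sequence_name": "INVOICE_SEQ",
    "last_number": 500000,
    "cache_size": 20,
    "increment_by": 1,
}
print(pg_sequence_ddl(example))
```

The exact gap is a design decision per sequence; what matters is that it is computed from captured metadata rather than guessed at cutover.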

Inventory triggers:

SELECT owner,
       trigger_name,
       table_name,
       triggering_event,
       trigger_type,
       status
FROM all_triggers
WHERE owner = 'APP_SCHEMA'
ORDER BY table_name, trigger_name;

Inventory PL/SQL objects:

SELECT owner,
       object_name,
       object_type,
       status
FROM all_objects
WHERE owner = 'APP_SCHEMA'
  AND object_type IN ('PACKAGE', 'PACKAGE BODY', 'PROCEDURE', 'FUNCTION')
ORDER BY object_type, object_name;

Inventory scheduler jobs:

SELECT owner,
       job_name,
       enabled,
       state,
       job_action,
       repeat_interval
FROM all_scheduler_jobs
WHERE owner = 'APP_SCHEMA'
ORDER BY job_name;

What to do with the results

Do not dump this into a spreadsheet and move on. Use it to classify objects into product-relevant buckets:

  • automatically convertible
  • structurally convertible but behaviorally risky
  • requires declarative transform rules
  • requires procedural rewrite
  • requires manual treatment or phased migration

This is where Pulsaride Transform becomes useful in a repeatable way. The inventory is not just documentation. It is the input to a migration pipeline.
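As a sketch of what that classification step can look like in code — the rules, thresholds, and bucket names below are illustrative assumptions, not the product's actual rule set:

```python
def classify(obj):
    """Assign one inventory object to a migration bucket.

    `obj` is a dict built from the inventory queries above. A real
    pipeline would make these rules declarative and configurable;
    this shows only the shape of the decision.
    """
    t = obj["object_type"]
    if t in ("PACKAGE", "PACKAGE BODY", "PROCEDURE", "FUNCTION"):
        return "requires procedural rewrite"
    if t == "TRIGGER":
        return "structurally convertible but behaviorally risky"
    if t == "TABLE":
        if obj.get("has_lobs") or not obj.get("has_pk", True):
            return "requires manual treatment or phased migration"
        if obj.get("has_triggers"):
            return "structurally convertible but behaviorally risky"
        return "automatically convertible"
    return "requires declarative transform rules"

inventory = [
    {"object_type": "TABLE", "name": "INVOICE", "has_pk": True,
     "has_lobs": False, "has_triggers": True},
    {"object_type": "PACKAGE BODY", "name": "BILLING_PKG"},
]
for obj in inventory:
    print(obj["name"], "->", classify(obj))
```

The point of encoding this, even crudely, is that the buckets become machine-readable input for wave planning instead of spreadsheet commentary.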

Why this matters for CDC and reconciliation

The reconciliation-oriented architecture we examined already assumes that some tables will behave differently in migration:

  • some have parent-child dependencies
  • some depend on reference data already present in target
  • some are likely to generate waiting states
  • some will dominate change volume

If the source inventory misses those facts, the CDC design is blind from the first day.

2.2 Detecting unsupported Oracle patterns before migration starts

The worst time to discover an unsupported Oracle pattern is during replay, cutover rehearsal, or production switchover.

The second-worst time is after code generation has already baked bad assumptions into the target pipeline.

Unsupported or hard-to-port patterns must be detected before transform rules are finalized.

Unsupported does not always mean impossible

There are three categories:

  1. Directly portable: straightforward table, index, and constraint patterns with known PostgreSQL equivalents.
  2. Portable with redesign: objects or behaviors that can move, but not mechanically. Sequences, packages, triggers, and some index types usually land here.
  3. Operationally unsupported: patterns that should block the migration plan until the team decides on a replacement strategy.

The real discipline is not technical purity. It is explicitness.

Patterns that should trigger early review

At minimum, flag:

  • packages with business logic
  • triggers that populate columns or cascade writes
  • autonomous transactions
  • dynamic SQL in PL/SQL
  • database jobs that mutate business tables
  • materialized views relied on by the application
  • function-based indexes tied to Oracle-specific functions
  • bitmap indexes on write-heavy tables
  • synonyms that hide ownership and coupling
  • LOB-heavy tables
  • tables without stable primary keys
  • surrogate keys populated outside the application

Each of these patterns increases the odds that a schema-only migration story is incomplete.

Example: a harmless trigger that is not harmless

Suppose a table INVOICE looks simple:

  • primary key
  • a few numeric and date columns
  • normal indexes

A schema conversion tool will happily recreate it in PostgreSQL.

But if Oracle has a BEFORE INSERT trigger that:

  • reads from a sequence
  • writes an audit row
  • normalizes a status field
  • calls a package function

then the table does not behave like its DDL suggests. If you miss that trigger in assessment, the target system may accept data but still fail business invariants.

Practical detection queries

Find invalid or nontrivial PL/SQL objects:

SELECT owner, object_name, object_type, status
FROM all_objects
WHERE owner = 'APP_SCHEMA'
  AND object_type IN ('PACKAGE', 'PACKAGE BODY', 'PROCEDURE', 'FUNCTION', 'TRIGGER')
ORDER BY status, object_type, object_name;

Search for Oracle-specific constructs in source code:

SELECT owner, name, type, line, text
FROM all_source
WHERE owner = 'APP_SCHEMA'
  AND (
       UPPER(text) LIKE '%PRAGMA AUTONOMOUS_TRANSACTION%'
    OR UPPER(text) LIKE '%EXECUTE IMMEDIATE%'
    OR UPPER(text) LIKE '%DBMS_JOB%'
    OR UPPER(text) LIKE '%DBMS_SCHEDULER%'
    OR UPPER(text) LIKE '%ROWNUM%'
    OR UPPER(text) LIKE '%CONNECT BY%'
  )
ORDER BY name, line;

Find tables without primary keys:

SELECT t.owner, t.table_name
FROM all_tables t
LEFT JOIN all_constraints c
  ON c.owner = t.owner
 AND c.table_name = t.table_name
 AND c.constraint_type = 'P'
WHERE t.owner = 'APP_SCHEMA'
  AND c.constraint_name IS NULL
ORDER BY t.table_name;

Product implication

A complete product must surface these findings as migration blockers or design decisions, not as buried assessment notes.

For Pulsaride Transform, this means the assessment stage should produce explicit outcomes such as:

  • "table can be auto-mapped"
  • "table requires sequence handling"
  • "table requires trigger behavior replacement"
  • "object excluded from wave 1"
  • "requires two-pass migration"
  • "requires manual procedural rewrite"

That is how unsupported patterns stop being surprises and become planned work.

2.3 Building a dependency graph from foreign keys and application flows

A source system is not just a set of tables. It is a graph.

This graph matters because migration order, CDC replay order, reconciliation interpretation, and cutover readiness all depend on it.

Foreign keys are necessary, but not sufficient

Most teams start with database foreign keys. That is correct, but incomplete.

The dependency graph should include two layers:

  • structural dependencies: derived from primary key / foreign key relationships
  • application-flow dependencies: derived from how the application creates and uses records in practice

Why both?

Because real systems often contain:

  • missing foreign keys in the physical schema
  • optional FKs that are mandatory in business practice
  • denormalized tables that depend on upstream events without formal constraints
  • reference tables loaded out of band

Why the graph matters to the product

The reconciliation-centered CDC architecture we reviewed already points to this need:

  • rows may enter WAITING_DEP when parent entities are not yet available
  • rows may enter WAITING_REF when reference data is assumed to exist in target
  • replay logic depends on knowing which writes unblock which pending records

Without a dependency graph, these behaviors become ad hoc conditionals. With a dependency graph, they become declarative product behavior.

Building the structural graph

Extract foreign key edges first:

SELECT a.owner,
       a.table_name,
       a.constraint_name,
       a.r_constraint_name,
       b.table_name AS referenced_table
FROM all_constraints a
JOIN all_constraints b
  ON a.r_constraint_name = b.constraint_name
 AND a.owner = b.owner
WHERE a.owner = 'APP_SCHEMA'
  AND a.constraint_type = 'R'
ORDER BY a.table_name, b.table_name;

Then extract column-level edges:

SELECT acc.owner,
       acc.table_name,
       acc.column_name,
       ac_r.table_name AS referenced_table
FROM all_cons_columns acc
JOIN all_constraints ac
  ON ac.owner = acc.owner
 AND ac.constraint_name = acc.constraint_name
JOIN all_constraints ac_r
  ON ac.owner = ac_r.owner
 AND ac.r_constraint_name = ac_r.constraint_name
WHERE ac.owner = 'APP_SCHEMA'
  AND ac.constraint_type = 'R'
ORDER BY acc.table_name, acc.position;

Adding application-flow edges

This is where assessment must leave the comfort of metadata.

Interview the application owners and walk through the transaction flows:

  • Which tables are created in the same user action?
  • Which rows can legitimately arrive before the parent?
  • Which lookups are expected to be preloaded?
  • Which tables are event sinks rather than masters?
  • Which writes are idempotent and which are not?

For example:

  • ORDER_LINE depends structurally on ORDER
  • ORDER may depend structurally on CUSTOMER
  • ORDER may also depend operationally on a reference domain such as ORDER_STATUS
  • an audit table may depend on ORDER by application flow even if no FK exists

Turn the graph into product rules

The product should not stop at drawing the graph. It should derive behavior from it:

  • load ordering for full load
  • dependency rules for CDC resolution
  • replay triggers for child rows
  • validation grouping by business domain

This is exactly where Pulsaride Transform should remain declarative. The graph should feed the engine, not just the documentation.
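One concrete behavior the graph can drive is full-load ordering. A minimal sketch using Kahn's topological sort over (child, parent) FK edges — the function name and edge format are assumptions, and a cycle deliberately fails fast rather than producing a partial order:

```python
from collections import defaultdict, deque

def load_order(edges):
    """Compute a full-load order from FK edges.

    `edges` is a list of (child_table, parent_table) pairs, e.g. rows
    from the FK queries above. Parents sort before children; leftover
    nodes signal a cycle, which should block startup.
    """
    children = defaultdict(set)   # parent -> children waiting on it
    indegree = defaultdict(int)   # how many parents a table waits on
    tables = set()
    for child, parent in edges:
        tables.update((child, parent))
        if child != parent and child not in children[parent]:
            children[parent].add(child)
            indegree[child] += 1
    ready = deque(sorted(t for t in tables if indegree[t] == 0))
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for c in sorted(children[t]):
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(order) != len(tables):
        raise ValueError("dependency cycle detected; refuse to start")
    return order

fk_edges = [("ORDER_LINE", "ORDER"), ("ORDER", "CUSTOMER")]
print(load_order(fk_edges))  # CUSTOMER first, then ORDER, then ORDER_LINE
```

The same traversal, fed with application-flow edges as well as FK edges, becomes the basis for CDC dependency rules rather than a one-off load script.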

2.4 Spotting high-risk tables: hot tables, wide rows, LOB-heavy tables, audit tables

Not all tables deserve equal treatment.

One of the fastest ways to make a migration brittle is to assume that every table fits the same extraction, CDC, replay, and validation pattern.

High-risk tables need special handling early.

Hot tables

A hot table is not necessarily the largest table. It is the table that changes often enough to shape CDC pressure, replay debt, and cutover risk.

Typical examples:

  • order headers
  • order lines
  • payment records
  • session or state tables
  • operational logs tied to business workflows

Hot tables matter because they influence:

  • Kafka partition skew
  • backlog growth
  • replay windows
  • lock contention in target
  • how meaningful lag metrics are

In the CDC design material we reviewed, one of the explicit operational concerns was hot partitions and the possibility of moving from a single topic to multiple topics when one table dominates the stream. That concern only exists if the assessment phase identifies hot tables early.

Wide rows

Wide rows hurt more than storage. They affect:

  • extraction bandwidth
  • staging storage
  • serialization cost
  • WAL volume in PostgreSQL
  • checksum cost
  • replay throughput

An update to one business field can end up rewriting a heavy row image repeatedly if the CDC payload is after-image oriented.

Inventory wide tables with:

SELECT owner,
       table_name,
       avg_row_len,
       num_rows
FROM all_tables
WHERE owner = 'APP_SCHEMA'
ORDER BY avg_row_len DESC NULLS LAST;
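Raw avg_row_len is only half the story; combining it with num_rows gives a rough payload estimate that separates genuinely heavy tables from merely wide ones. A small sketch — the thresholds and function name are illustrative assumptions to tune per environment:

```python
def flag_heavy_tables(stats, row_len_bytes=2000, payload_gb=50):
    """Flag tables whose width or estimated payload needs a custom plan.

    `stats` is a list of dicts from the ALL_TABLES query above.
    NULL statistics (missing num_rows/avg_row_len) are treated as 0
    here; in practice they should trigger a stats refresh instead.
    """
    flagged = []
    for t in stats:
        rows = t.get("num_rows") or 0
        row_len = t.get("avg_row_len") or 0
        payload = rows * row_len / (1024 ** 3)  # rough GB estimate
        reasons = []
        if row_len >= row_len_bytes:
            reasons.append("wide rows")
        if payload >= payload_gb:
            reasons.append(f"~{payload:.0f} GB payload")
        if reasons:
            flagged.append((t["table_name"], reasons))
    return flagged

stats = [
    {"table_name": "DOCUMENT", "num_rows": 2_000_000, "avg_row_len": 45_000},
    {"table_name": "STATUS_REF", "num_rows": 40, "avg_row_len": 60},
]
print(flag_heavy_tables(stats))
```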

LOB-heavy tables

LOB-heavy tables deserve explicit classification because they often break optimistic throughput assumptions.

Watch for:

  • BLOB
  • CLOB
  • NCLOB
  • XML payloads
  • document or attachment tables

Find them with:

SELECT owner, table_name, column_name, data_type
FROM all_tab_columns
WHERE owner = 'APP_SCHEMA'
  AND data_type IN ('BLOB', 'CLOB', 'NCLOB', 'LONG', 'LONG RAW')
ORDER BY table_name, column_name;

These tables may require:

  • separate load strategy
  • lower batch size
  • different validation strategy
  • cutover isolation

Audit and history tables

Audit tables are deceptively tricky. They are often:

  • large
  • append-only
  • low business value for day-one cutover
  • high operational value for compliance and traceability

Treating them like ordinary transactional tables can waste migration effort and distort CDC capacity planning.

A product should be able to classify some tables as:

  • migrate in later wave
  • backfill only
  • summarize rather than replicate
  • exclude from near-real-time cutover path

Why high-risk classification should be productized

This is not just a project checklist. It belongs in the product because the same patterns repeat:

  • hot tables need a different CDC strategy
  • wide tables need different performance defaults
  • LOB-heavy tables need different batching
  • audit tables need different business priority

If Pulsaride Transform encodes none of this, the same analysis will be redone manually in every program.

2.5 Discovering cycles before runtime: fail-fast dependency validation

Cycles are where elegant migration diagrams go to die.

A team may believe it has a clean parent-child ordering, only to discover at runtime that two tables depend on each other directly or indirectly. At that point, retries do not help. The engine simply oscillates between blocked states.

This is why cycle detection belongs in assessment and startup validation, not in operations.

Why runtime discovery is too late

If the first sign of a cycle is:

  • rows piling up in WAITING_DEP
  • repeated replay attempts
  • unexplained cutover backlog

then the product has already failed the operator.

The right behavior is fail-fast:

  • build the dependency graph before consumption starts
  • detect cycles
  • refuse to start until a strategy is chosen

The v2 CDC architecture material already moves in exactly this direction by replacing runtime CYCLE_DETECTED handling with startup validation of dependency rules. That is the right product instinct.

What counts as a cycle

A cycle may be:

  • direct: A -> B and B -> A
  • indirect: A -> B -> C -> A
  • partly physical and partly logical: one edge from an FK, another from application flow

All three matter.

Practical cycle detection query

A quick Oracle query can reveal suspicious mutual foreign keys:

SELECT a.owner,
       a.table_name,
       b.table_name AS referenced_table
FROM all_constraints a
JOIN all_constraints ar
  ON ar.owner = a.owner
 AND ar.constraint_name = a.r_constraint_name
JOIN all_constraints b
  ON b.owner = a.owner
 AND b.table_name = ar.table_name
 AND b.constraint_type = 'R'
JOIN all_constraints br
  ON br.owner = b.owner
 AND br.constraint_name = b.r_constraint_name
WHERE a.owner = 'APP_SCHEMA'
  AND a.constraint_type = 'R'
  AND br.table_name = a.table_name
ORDER BY a.table_name;

This is not a full graph algorithm, but it surfaces obvious mutual dependencies early.

The product should go further and run an actual graph traversal over the complete dependency model, including declarative dependency rules derived from the source schema and business flows.
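A minimal sketch of such a traversal — a depth-first search that returns one concrete cycle path, so a startup check can name the tables involved instead of just refusing. The function name and edge format are assumptions:

```python
def find_cycle(edges):
    """Return one dependency cycle as a list of tables, or None.

    `edges` is a list of (table, depends_on) pairs covering both FK
    edges and application-flow edges. Reporting the actual path is
    what lets a fail-fast check explain precisely why it blocks.
    """
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    path = []

    def dfs(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:          # back edge: cycle found
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = dfs(nxt)
                if cycle:
                    return cycle
        color[node] = BLACK
        path.pop()
        return None

    for node in sorted(graph):
        if color[node] == WHITE:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None

edges = [
    ("ACCOUNT_PROFILE", "ACCOUNT"),   # FK edge
    ("ACCOUNT", "ACCOUNT_PROFILE"),   # a hidden mutual dependency
    ("ORDER_LINE", "ORDER"),
]
print(find_cycle(edges))
```

Because the same function accepts logical edges, it catches the partly-physical, partly-logical cycles that pure FK queries miss.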

What to do when a cycle exists

A cycle is not always a showstopper, but it is always a design decision.

Typical responses are:

  • make one side nullable temporarily
  • split migration into two passes
  • defer one FK population until after initial load
  • reclassify one edge as nonblocking for CDC

What matters is that this choice is deliberate and documented before runtime.

Why this belongs in the product

A migration product that cannot validate its dependency model before execution is incomplete.

Pulsaride Transform should treat dependency validation the way a compiler treats syntax errors:

  • discover early
  • explain precisely
  • block unsafe execution

That is cheaper than operational debugging by an order of magnitude.

2.6 Example: a migration blocked by a hidden mutual foreign key

Consider a simplified case with two tables:

  • ACCOUNT
  • ACCOUNT_PROFILE

At first glance, the migration plan seems obvious:

  • load ACCOUNT
  • then load ACCOUNT_PROFILE
  • replay changes with ACCOUNT as parent

But deeper inspection reveals:

  • ACCOUNT_PROFILE.ACCOUNT_ID references ACCOUNT.ID
  • ACCOUNT.DEFAULT_PROFILE_ID references ACCOUNT_PROFILE.ID

That is a mutual dependency.

What happens if the team misses it

The full load may appear to work if constraints are relaxed or load order is fudged.

But the CDC engine then starts showing a pathological pattern:

  • ACCOUNT_PROFILE changes wait for ACCOUNT
  • ACCOUNT changes wait for ACCOUNT_PROFILE
  • replay backlog grows
  • retries do nothing
  • operators see a healthy consumer but an unhealthy target

If the architecture exposes statuses like WAITING_DEP, the symptom becomes visible. But if the cycle is only discovered there, the damage is already operational.

What the product should have done earlier

During assessment:

  • detect the FK cycle
  • classify the pair as a two-pass migration candidate
  • require an explicit strategy before enabling CDC rules

During transform planning:

  • decide which edge is temporarily nullable or deferred
  • define the second-pass enrichment/backfill logic

During validation:

  • add targeted checks for rows missing the deferred relation

A better migration plan

One safe strategy is:

  1. migrate ACCOUNT without enforcing DEFAULT_PROFILE_ID
  2. migrate ACCOUNT_PROFILE
  3. backfill ACCOUNT.DEFAULT_PROFILE_ID
  4. validate missing links
  5. only then enforce the final constraint posture

This is not as elegant as pretending the cycle does not exist. It is much more reliable.

The lesson

The point of source-system reading is not to gather trivia. It is to discover design-shaping facts while they are still cheap to act on.

A hidden mutual foreign key is a perfect example. If found during assessment, it becomes:

  • a declarative dependency rule
  • a two-pass migration plan
  • a known validation target

If found during cutover week, it becomes:

  • a backlog spike
  • a manual workaround
  • a trust problem

That difference is exactly why Pulsaride Transform must begin with source understanding rather than target enthusiasm.

Why This Matters in Practice

Source-system reading is where a migration product either earns trust or accumulates hidden debt.

Every later design choice depends on whether the source estate has been understood at the right level: not just as tables and columns, but as a live system with dependencies, hidden behaviors, hot paths, and procedural logic. Poor assessment quickly turns into schedule slippage, unplanned exceptions, and expensive late-stage redesign.

In other words, assessment quality determines whether the product looks disciplined or improvisational once the migration leaves the lab.

Where Generic Approaches Fall Short

Generic tools usually start from the metadata they can extract most easily:

  • tables
  • columns
  • types
  • constraints

That is useful, but incomplete. They often miss or underweight:

  • triggers that encode business behavior
  • sequences and implicit key-generation paths
  • packages and scheduler jobs that mutate data outside obvious transaction flows
  • operational hotness and table criticality
  • graph-level risk such as mutual dependencies and hidden cycles

The result is a shallow inventory that is good enough to generate output, but not good enough to govern a real migration program.

How Pulsaride Transform Helps

Pulsaride Transform should treat assessment as executable product input, not as a document-generation step.

That means the product should convert source discovery into structured migration decisions:

  • object classification: auto, risky, manual, phased
  • dependency graph extraction from foreign keys and application flows
  • high-risk table detection for hot, wide, and LOB-heavy tables
  • fail-fast dependency validation before runtime
  • explicit treatment of unsupported Oracle constructs as blockers or redesign items

This is also where the product becomes reusable. Once these patterns are captured declaratively, each new migration wave starts from a stronger baseline instead of repeating the same discovery work from scratch.

What This Means for Migration Teams

Strong source assessment reduces the most damaging category of migration cost: surprise.

For teams running a real migration, it improves outcomes in four ways:

  • better estimation before execution
  • fewer hidden scope expansions after kickoff
  • clearer boundary between product capability and manual work
  • more credible wave planning and cutover preparation

In practical terms, a solution that can say early "these 40 tables are straightforward, these 8 require two-pass treatment, these 3 need procedural redesign" is much easier to trust than a tool that looks generic until the exceptions start arriving.
