Netflix has described an internal automation platform that migrates Amazon RDS for PostgreSQL databases to Amazon Aurora PostgreSQL, reducing operational risk and downtime across nearly 400 production clusters. The system enables service teams to initiate migrations through a self-service workflow while enforcing replication validation, controlled cutover, change data capture coordination, and rollback safeguards.
Netflix routes database access through a platform-managed data access layer built on Envoy, which standardizes mutual TLS and abstracts database endpoints from application code. Because services do not directly manage credentials or connection strings, migrations must occur transparently beneath this layer. The automation therefore coordinates replication, validation, cutover, CDC handling, and rollback entirely at the infrastructure level.
Netflix engineers emphasized:
Our goal was to make RDS to Aurora migrations repeatable and low-touch, while preserving correctness guarantees for both transactional workloads and CDC pipelines.
The workflow begins by creating an Aurora PostgreSQL cluster as a physical read replica of the source RDS for PostgreSQL instance, using AWS's Aurora read replica feature. The replica is initialized from a storage snapshot and continuously replays write-ahead log (WAL) records streamed from the source. During this phase, the system validates replication slot health, WAL generation rates, parameter compatibility, extension parity, and replication lag under production traffic, confirming the replica can keep up with peak write throughput before cutover.

RDS to Aurora PostgreSQL Migration Workflow (Source: Netflix Blog Post)
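The pre-cutover checks described above can be pictured as a validation gate that must pass before the workflow advances. The following sketch is illustrative only: the field names, thresholds, and `ready_for_cutover` function are assumptions for exposition, not Netflix's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class ReplicaHealth:
    slot_active: bool          # physical replication slot is streaming
    wal_rate_mb_per_s: float   # WAL generation rate on the source
    lag_seconds: float         # replication lag observed under load
    params_match: bool         # parameter-group compatibility
    extensions_match: bool     # extension parity between source and replica

def ready_for_cutover(h: ReplicaHealth,
                      max_lag_s: float = 5.0,
                      max_wal_rate: float = 50.0) -> list[str]:
    """Return the list of failed checks; an empty list means the gate passes."""
    failures = []
    if not h.slot_active:
        failures.append("replication slot inactive")
    if h.wal_rate_mb_per_s > max_wal_rate:
        failures.append("WAL generation exceeds replay headroom")
    if h.lag_seconds > max_lag_s:
        failures.append("replication lag too high")
    if not h.params_match:
        failures.append("parameter mismatch")
    if not h.extensions_match:
        failures.append("extension mismatch")
    return failures
```

Modeling the gate as a list of named failures, rather than a single boolean, lets the automation report every blocking condition to the owning team at once.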
For workloads using change data capture, including logical replication slots or downstream stream processors, the automation coordinates slot state before quiescence. CDC consumers are paused to prevent excessive WAL retention, and slot positions are recorded so that equivalent replication slots can be recreated on Aurora at the correct log sequence number after promotion. This preserves downstream consistency while avoiding WAL buildup that could increase replication lag.
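The slot-position bookkeeping can be sketched as follows. PostgreSQL LSNs really are `X/Y` hexadecimal pairs, but the function names and the recorded-position flow here are assumptions about the mechanics, not code from the Netflix post.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '16/B374D848' into a comparable integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def record_slot_positions(slots: dict[str, str]) -> dict[str, int]:
    """Snapshot the confirmed position of each paused CDC slot before quiescence."""
    return {name: parse_lsn(lsn) for name, lsn in slots.items()}

def resume_position(recorded: dict[str, int], slot: str,
                    promoted_lsn: str) -> int:
    """Position at which to recreate the slot on Aurora after promotion.

    The promoted cluster must have replayed past the recorded position,
    otherwise recreating the slot there would skip or duplicate changes.
    """
    pos = recorded[slot]
    if parse_lsn(promoted_lsn) < pos:
        raise RuntimeError("Aurora has not replayed past the slot position")
    return pos
```

Recording positions before pausing consumers, and checking them against the promoted cluster's replay point, is what preserves exactly-the-same-stream semantics for downstream processors.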
An early adopter, Netflix's Enablement Applications team, migrated databases supporting device certification and partner billing workflows. During replication, engineers detected an elevated OldestReplicationSlotLag metric caused by an inactive logical replication slot retaining WAL segments. After removing the stale slot, replication converged, and the migration completed successfully with post-cutover metrics matching pre-migration baselines.

Simplified Enablement Applications Overview (Source: Netflix Blog Post)
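A minimal version of the stale-slot check that surfaced this issue might query `pg_replication_slots` for inactive slots and measure how much WAL each one is pinning. The query uses real PostgreSQL catalog views and functions, but the threshold and the `stale_slots` helper are illustrative assumptions.

```python
# Inactive slots still pin WAL from restart_lsn onward; pg_wal_lsn_diff
# returns the retained span in bytes.
INACTIVE_SLOT_QUERY = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
WHERE NOT active;
"""

def stale_slots(rows: list[tuple[str, int]],
                max_retained_bytes: int = 1 << 30) -> list[str]:
    """Flag inactive slots retaining more WAL than the threshold (default 1 GiB)."""
    return [name for name, retained in rows if retained > max_retained_bytes]
```

Once a flagged slot is confirmed to have no live consumer, it can be removed with `SELECT pg_drop_replication_slot('<slot_name>')`, after which the retained WAL is released and replication lag can converge.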
When replication lag approaches zero, the system enters a controlled quiescence phase. Security group rules are modified, and the source RDS instance is rebooted to block new connections at the infrastructure layer. After confirming that all in-flight transactions have been applied and that the Aurora replica has replayed the final WAL records, the replica is promoted to a writable Aurora cluster, and the data access layer routes traffic to the new endpoint.
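The quiescence and promotion steps above form a strict sequence in which each guard must succeed before the next step runs. This sketch captures only that ordering; the object interfaces are hypothetical stand-ins for the orchestration the article describes, not its actual API.

```python
def cutover(db, replica, router, final_lsn_check):
    """Run the controlled cutover; each step must succeed before the next."""
    db.block_new_connections()         # tighten security group rules
    db.reboot()                        # drop remaining client connections
    db.wait_for_inflight_commits()     # all in-flight transactions applied
    final_lsn_check()                  # Aurora has replayed the final WAL
    replica.promote()                  # replica becomes a writable cluster
    router.point_to(replica.endpoint)  # data access layer shifts traffic
```

Because any raised exception aborts the sequence before promotion, the source RDS instance remains the authoritative copy whenever a guard fails, which is what makes the rollback path described below cheap.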
According to Netflix engineers, rollback was treated as a first-class concern. Until promotion is finalized and traffic is fully shifted, the original RDS instance remains intact as the authoritative source. If validation checks fail during synchronization or if post-promotion health checks detect anomalies, traffic can be redirected back to the RDS cluster through the data access layer. Because applications are decoupled from physical endpoints, reverting the routing configuration restores the prior state without redeployment. CDC consumers can also resume from previously recorded slot positions on the original cluster if required.
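Because applications resolve a logical database name through the data access layer, rollback amounts to rewriting one routing entry. The `Router` class below is a toy stand-in for the Envoy control plane, whose actual configuration model the article does not detail.

```python
class Router:
    """Maps logical database names to physical endpoints, with undo."""

    def __init__(self, routes: dict[str, str]):
        self.routes = dict(routes)             # logical name -> endpoint
        self._previous: dict[str, str] = {}    # prior endpoint per shifted name

    def shift(self, name: str, new_endpoint: str) -> None:
        """Point a logical name at a new endpoint, remembering the old one."""
        self._previous[name] = self.routes[name]
        self.routes[name] = new_endpoint

    def rollback(self, name: str) -> None:
        """Restore the prior endpoint; no application redeploy is needed."""
        self.routes[name] = self._previous.pop(name)
```

Keeping the previous endpoint alongside the live mapping is what lets a failed post-promotion health check revert traffic to the untouched RDS instance in one operation.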