Migration from PostgreSQL v.9.6 to v.12.11 with minimum downtime


About a year and a half ago, we found ourselves in the following situation: we were running PostgreSQL 9.6 as our database, hosted on AWS RDS. Even though version 9.6 was quite old (PostgreSQL 14 had already been released that year), we kept postponing the migration because the process was quite demanding. Everything was further complicated by the size of the database itself: about two terabytes.

At a certain point, there was no more room to postpone the upgrade: AWS RDS dropped support for our database version. We had to upgrade from 9.6 to 12.11 by a set deadline; otherwise, AWS RDS would update the database automatically. We weren’t happy with that option, as an automatic upgrade implies stopping the application, and in our case that could take hours, which was entirely unacceptable to us.

Challenge with PostgreSQL

We faced the following challenge: migrating from PostgreSQL 9.6 to PostgreSQL 12.11 with minimal application downtime. The main difficulty is that the internal on-disk data format changes between major versions of PostgreSQL, so major versions are not backward compatible. This means that when you upgrade a major version of PostgreSQL, all the data must be converted as well. Accordingly, the more data in the database, the more work the upgrade requires.



Keeping that in mind, we started looking for a solution. The easiest way was to use the pg_dump command, which takes a snapshot of the database in text format; the snapshot can then be replayed with the psql command.

For this purpose, a new database with the updated version is created, and the snapshot is applied to it. This approach, however, has a significant disadvantage: it takes a long time on large amounts of data. Since pg_dump captures the state of the database as of the start of the snapshot, the data becomes outdated by the end of the process.

Moreover, once the snapshot is done and the data transfer has started, the application can use neither the original database nor the new one until the transfer completes. Thus, the application must be stopped both while the snapshot is being taken and while it is being applied. As mentioned, pg_dump and psql are fairly slow, which can lead to long downtime.
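In its simplest form, the whole transfer can be piped directly from the old instance into the new one (hostnames, user, and database name here are illustrative):

```shell
# Dump the old 9.6 database as plain SQL and replay it into the new instance.
# Hosts, user, and database name are placeholders.
pg_dump -h old-db.example.com -U app mydb | psql -h new-db.example.com -U app mydb
```

Even in this streaming form, the application must stay stopped for the entire run, which is exactly the downtime problem described above.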

Another, preferable approach is the pg_upgrade command. It was created specifically to migrate data between clusters and upgrade major versions as efficiently as possible. Despite quite good performance in synthetic tests, we decided against it: it also requires stopping the application during the migration, and there is no guarantee the migration will complete in time.

Having dropped these options, we decided to look at data migration using data replication. The general scheme of such migration is the following:

  1. A new database (replica) with the new version is created.
  2. The replica is connected to the source database (the source, or master).
  3. Data is asynchronously migrated from the source to the replica.
  4. When the data in the replica is close to the data in the master, the application is paused until the replica is fully synchronized.
  5. At the end of step 4, the source and replica data are identical, so the application switches to the replica.
  6. After some time, having made sure the new master is stable, you can remove the old database.

Replication is divided into physical and logical:

  • In the case of physical replication, data from master to replica is transferred in the same format in which it is stored on the drive; 
  • In logical replication, data is transferred at a higher level, transferring objects and their changes rather than just physical data from the drive.

Because of this property, logical replication is cross-version: it allows you to link databases of different major versions (in our case, 9.6 and 12.11). Unfortunately, built-in logical replication only appeared in PostgreSQL 10, so we started looking for third-party replication solutions. Initially, we chose Bucardo because it had previously been used on our projects to migrate a database from Heroku to AWS RDS.

However, after spending some time studying Bucardo, we came across the pglogical extension, which is very convenient for logical replication and works on the versions we needed. pglogical works much faster than Bucardo, and it requires no triggers or third-party programs. This is the solution we finally settled on.

Setting up replication


Let’s look at how to set up replication using the pglogical extension. First, create a new AWS RDS database that will serve as the replica. Then configure the pglogical extension on both the master and the replica by setting the following parameters via Parameter groups in AWS RDS.

Both in the master and in the replica:

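The original screenshot with the exact settings is not reproduced here; judging by the pglogical documentation, the shared settings would look roughly like this (values are illustrative):

```ini
# Load the extension on both instances (requires a restart)
shared_preload_libraries = 'pglogical'
# Enough background workers for the apply and sync processes
max_worker_processes = 10
```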

In the master:

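On the master, logical decoding must additionally be enabled; on RDS this is done with a dedicated parameter (values again illustrative):

```ini
# On RDS this enables wal_level = logical under the hood
rds.logical_replication = 1
# Headroom for replication slots and WAL sender processes
max_replication_slots = 10
max_wal_senders = 10
```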

Note that applying a new group of parameters or changing the current one will require restarting the database.

After that, we need to connect to the master and replica and configure replication.

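On the master, this amounts to creating the extension and registering the provider node; the DSN below is a placeholder:

```sql
-- Run on the master (provider)
CREATE EXTENSION pglogical;

SELECT pglogical.create_node(
    node_name := 'provider',
    dsn := 'host=master.example.com port=5432 dbname=mydb user=replicator password=secret'
);
```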

To mark a table for replication, it must be added to a replication set. By default, three replication sets are created along with the extension:

  • default. For tables in this set, all changes are replicated.
  • default_insert_only. Only inserts are replicated for tables in this set; it is meant for tables without primary keys (more on that later).
  • ddl_sql. For tables in this set, only DDL changes are replicated, as specified by `pglogical.replicate_ddl_command`.

Adding to the set is done using a function:

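A sketch of adding a single table (the table name is illustrative):

```sql
-- Replicate all changes to public.users and copy its existing rows
SELECT pglogical.replication_set_add_table(
    set_name := 'default',
    relation := 'public.users',
    synchronize_data := true
);
```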

There is also a convenience function that adds all the tables in a schema:

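For example, to add every table in the public schema to the default set:

```sql
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);
```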

However, after running this function, we discovered that the schema_migrations table created by Ruby on Rails has no primary key. The point is that replication identifies rows by their primary key values in the WAL. For tables without a primary key this is impossible, so for such tables you need to change how row changes are written to the WAL (the table’s replica identity) with the following command:

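For schema_migrations this is an ALTER TABLE that switches the table’s replica identity so that whole rows are written to the WAL:

```sql
ALTER TABLE public.schema_migrations REPLICA IDENTITY FULL;
```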

Since this table has no primary key and its rows are never updated or deleted, it can be added to the default_insert_only set:

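A sketch (if the table was previously picked up by replication_set_add_all_tables, remove it from the default set first):

```sql
SELECT pglogical.replication_set_remove_table('default', 'public.schema_migrations');

SELECT pglogical.replication_set_add_table(
    set_name := 'default_insert_only',
    relation := 'public.schema_migrations'
);
```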

You can use a query like this to check which tables have no primary key:

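One way to write such a query against the system catalogs:

```sql
-- Lists user tables that have no PRIMARY KEY constraint
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
       ON c.table_schema    = t.table_schema
      AND c.table_name      = t.table_name
      AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_type = 'BASE TABLE'
  AND t.table_schema NOT IN ('pg_catalog', 'information_schema')
  AND c.constraint_name IS NULL;
```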


After that, you also need to add sequences to the replication set:

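For the public schema this is a single call:

```sql
SELECT pglogical.replication_set_add_all_sequences('default', ARRAY['public']);
```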

Once the master is set up, the same must be done with the replica. First, make sure the master and the replica have the same schema. This can be done with a schema-only dump:

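A schema-only dump restores the table definitions without any data (hosts and names are placeholders):

```shell
pg_dump --schema-only --no-owner -h master.example.com -U app mydb > schema.sql
psql -h replica.example.com -U app mydb < schema.sql
```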

After that, the pglogical extension and the subscription must be configured on the replica:

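On the replica, this means creating the extension, the subscriber node, and the subscription itself (DSNs and the subscription name are placeholders):

```sql
-- Run on the replica (subscriber)
CREATE EXTENSION pglogical;

SELECT pglogical.create_node(
    node_name := 'subscriber',
    dsn := 'host=replica.example.com port=5432 dbname=mydb user=replicator password=secret'
);

-- synchronize_data := true copies the existing rows before streaming changes
SELECT pglogical.create_subscription(
    subscription_name := 'upgrade_9_6_to_12',
    provider_dsn := 'host=master.example.com port=5432 dbname=mydb user=replicator password=secret',
    replication_sets := ARRAY['default', 'default_insert_only', 'ddl_sql'],
    synchronize_data := true
);
```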

After that, replication should start. You can check it with the following command:

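A sketch of the status check:

```sql
-- On the replica: status should report 'replicating'
SELECT subscription_name, status
FROM pglogical.show_subscription_status();

-- On the master you can also watch the WAL sender
SELECT application_name, state FROM pg_stat_replication;
```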

This scenario worked without any problems on staging, but in production the replication eventually crashed.


We couldn’t find anything meaningful in the logs except for “Connection Timeout.” We tried changing the order of operations: first creating the subscription, then adding tables to the replication set, but that did not help.

Research in forums and manuals suggested that this problem occurs when too much data enters the replication queue at once.

One recommendation is to drop all indexes in the replica schema and recreate them after most of the data has been migrated. This reduces the load on the replica because it no longer has to update indexes on every write to indexed columns. By itself, however, this technique didn’t work. But then we accidentally discovered a method that did.

Previously, we had added all tables to the replication set at once; instead, we tried adding them one at a time, with a 5-10 minute interval between tables. This approach allowed us to replicate all tables successfully.

After most of the data has been replicated, all that remains is to stop the application and switch the database. There are different ways to check that all the data has been replicated; we resorted to one of the simplest: counting the rows of every table and comparing the replica with the master. Since replication applies changes much faster than the application writes new ones, the final synchronization of the master and the replica took almost no time.

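A sketch of the comparison (run on both sides and diff the output; the table name in the exact check is illustrative):

```sql
-- Approximate live row counts for every user table
SELECT relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY relname;

-- Exact count for a specific table
SELECT count(*) FROM public.users;
```

Note that n_live_tup is a statistics-based estimate; for a strict check, compare count(*) per table.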


Despite some effort and complications, the pglogical extension allowed us to upgrade a major version of PostgreSQL with minimal downtime (we had to stop the application twice; total downtime was under 5 minutes). This approach can be handy for legacy systems with large amounts of data.

