---
id: 311f320b-9372-44ba-8c9f-5402be2bc0eb
---
# High replication delay in PostgreSQL service
---

This incident type refers to a high replication delay in a PostgreSQL service. Replication delay (also called replication lag) is the time it takes for a change committed on the primary database to be applied on the standby database. An abnormally high delay usually points to a problem with the replication process, the network between the servers, or the load on one of the databases. While the delay persists, queries against the standby return stale data, and a failover to the standby can lose recent transactions, which affects both the performance and the availability of the service. Resolving the incident means identifying the root cause of the delay and then applying the matching repair step below.

### Parameters
```shell
# Environment Variables

export HOST="PLACEHOLDER"

export USERNAME="PLACEHOLDER"

export DATABASE="PLACEHOLDER"

export STANDBY_SERVER="PLACEHOLDER"

```

## Debug

### Connect to the database server and run psql command
```shell
psql -h ${HOST} -U ${USERNAME} -d ${DATABASE}
```

### Check the replication status on the primary
```sql
SELECT * FROM pg_stat_replication;
```
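
The `sent_lsn` and `replay_lsn` columns of `pg_stat_replication` show how far each standby has progressed through the WAL. As a rough sketch, the gap between two LSN values can be converted to bytes in the shell. The function names `lsn_to_bytes` and `lsn_gap_bytes` are illustrative and not part of PostgreSQL, and the sample LSN values are made up.

```shell
#!/bin/bash

# Convert an LSN such as "16/B374D848" to a byte position.
# An LSN is two hexadecimal numbers: the high and low 32 bits of a 64-bit offset.
lsn_to_bytes() {
  local hi="${1%%/*}" lo="${1##*/}"
  echo $(( (16#$hi << 32) + 16#$lo ))
}

# Print how many bytes the second LSN is behind the first.
lsn_gap_bytes() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Example with made-up values: sent_lsn = 0/3000060, replay_lsn = 0/3000000
lsn_gap_bytes "0/3000060" "0/3000000"
```

Inside the database, `pg_wal_lsn_diff(sent_lsn, replay_lsn)` performs the same subtraction, so the shell version is mainly useful when only the raw LSN strings are at hand, for example in a log excerpt.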

### Check the replication status on the standby
```sql
SELECT * FROM pg_stat_wal_receiver;
```

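### Check the replication lag time on the standby
```sql
-- Run on the standby. If the primary is idle, this value keeps growing
-- even though the standby has replayed everything it received.
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag_time;
```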
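
For scripted checks it can help to reduce the lag to a number of seconds and compare it with thresholds. The sketch below is one possible way to do that, not a definitive check: `replication_lag_seconds` and `lag_state` are hypothetical helper names, the thresholds are chosen only for illustration, and `HOST` is assumed to point at the standby.

```shell
#!/bin/bash

# Print the replay lag rounded to whole seconds; run against the standby.
# Relies on HOST, USERNAME and DATABASE from the Parameters section.
# COALESCE covers the case where no transaction has been replayed yet.
replication_lag_seconds() {
  psql -h "${HOST}" -U "${USERNAME}" -d "${DATABASE}" -At -c \
    "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::bigint, 0);"
}

# Classify a lag value (seconds) against warning and critical thresholds.
lag_state() {
  local lag="$1" warn="$2" crit="$3"
  if [ "$lag" -ge "$crit" ]; then
    echo "CRITICAL"
  elif [ "$lag" -ge "$warn" ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}
```

A typical call would be `lag_state "$(replication_lag_seconds)" 30 300`, which prints `OK`, `WARNING`, or `CRITICAL` for example thresholds of 30 and 300 seconds.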

### Check the PostgreSQL logs for any errors related to replication
```shell
grep -i "replication" /var/lib/pgsql/logs/*.log
```

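### Check the PostgreSQL configuration file for any replication-related settings
```shell
# Replication settings are not all named "replication" (for example wal_level, max_wal_senders, hot_standby)
grep -E "replication|wal_|standby|primary_conninfo" /var/lib/pgsql/conf/postgresql.conf
```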

## Repair

### Restart the PostgreSQL service
```shell
sudo systemctl restart postgresql.service
```
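
After a restart the server may need some time before it accepts connections again. A minimal sketch of a wait loop follows; `retry_until` is a hypothetical helper name, and the attempt count and delay in the usage note are arbitrary.

```shell
#!/bin/bash

# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}
```

For example, `retry_until 30 2 pg_isready -h "${HOST}" -q` polls the server every 2 seconds for up to 30 attempts and returns a non-zero status if it never becomes ready.
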
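
### Restart the replication process by rebuilding the standby server from a fresh base backup of the primary server. This can be done by stopping the standby server, removing all files in the PostgreSQL data directory, taking a new base backup, and starting the server again.
```shell
#!/bin/bash
set -euo pipefail

# Stop the standby server
sudo systemctl stop "${STANDBY_SERVER}"

# Remove all files in the PostgreSQL data directory (destructive: run on the standby only).
# find is used because a glob would be expanded by the calling shell, which usually cannot read the directory.
sudo find /var/lib/pgsql/data -mindepth 1 -delete

# Re-seed the data directory from the primary; an empty data directory cannot start on its own.
# Assumes HOST is the primary, USERNAME has the REPLICATION privilege, and the server runs as the postgres OS user.
sudo -u postgres pg_basebackup -h "${HOST}" -U "${USERNAME}" -D /var/lib/pgsql/data -X stream -R -P

# Start the standby server
sudo systemctl start "${STANDBY_SERVER}"
```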

### Verify that the standby server is up to date with the primary server by comparing the WAL files on both servers. If any files are missing on the standby server, restore them from the primary server.
```shell
#!/bin/bash
set -euo pipefail

# SSH host names of the two servers.
primary_server="${HOST}"
standby_server="PLACEHOLDER"  # assumption: the standby's SSH host is not defined in Parameters

# List the WAL files on each server, sorted so that comm can compare them
standby_wal_files=$(ssh "$standby_server" ls -1 /path/to/wal/files | sort)
primary_wal_files=$(ssh "$primary_server" ls -1 /path/to/wal/files | sort)

# Find the WAL files that exist on the primary server but are missing on the standby server
missing_wal_files=$(comm -13 <(echo "$standby_wal_files") <(echo "$primary_wal_files"))

if [ -n "$missing_wal_files" ]
then
  # Restore the missing WAL files from the primary server
  for file in $missing_wal_files
  do
    # -3 relays the copy through this host, so the two servers need no direct SSH access to each other
    scp -3 "$primary_server:/path/to/wal/files/$file" "$standby_server:/path/to/wal/files/"
  done
  echo "Copied missing WAL files to the standby server."
else
  echo "WAL files on standby server are up to date with primary server."
fi
```




