MySQL Database Service inbound replication channel troubleshooting

Using a MySQL Database Service instance in OCI as an asynchronous replica is very useful. It allows testing the service with always updated data from on-premise production. It can be used in the process of migrating with minimal downtime to OCI and finally, it can also be used between MDS instances on different regions for DR, analytics, etc…

Here are some links to the documentation and other resources about inbound replication channel and how to use it:

In this article, we consider that you already have everything ready to replicate from the source to a MDS instance. This means that the network part (VCNs, eventual VPN, routing, security list) and the replication user and credentials have been configured.

We will consider errors like described in this schema:

But before checking what kind of error we can encounter, let’s see how we can verify the status of the inbound replication channel.

Inbound Replication Channel Status

The first source to see if the replication channel is running, is the OCI MDS console:

We can also use the oci sdk to get that info. For example, from the cloud shell, you can get the status using the channel’s ocid:

$ oci mysql channel get --channel-id <channel_ocid> | grep lifecycle-state
  "lifecycle-state": "NEEDS_ATTENTION"

Let’s see an example:

However, we have no idea of what is wrong, but we know we need to check.

Please note that the status on the console is not updated in real-time and can take up to 10mins before noticing a problem. I encourage you to check the MySQL Replication Channel status directly from the replica instance itself (more on that later).

How to know what is the problem ?

The only way to see what is the problem, is to connect on the MDS replica instance and verify using our client. I use MySQL Shell.

Usually replication problems are reported in 4 different locations:

output of SHOW REPLICA STATUS\G (which is the most commonly used but not recommended)
performance_schema.replication_connection_status
performance_schema.replication_applier_status_by_coordinator and performance_schema.replication_applier_status_by_worker
MySQL’s error log (available using performance_schema.error_log in MDS)

This is an example:

 SQL > select * from performance_schema.error_log 
       where SUBSYSTEM = 'Repl' and PRIO = 'Error' 
       order by 1 desc limit 1\G
*************************** 1. row ***************************
    LOGGED: 2021-09-23 08:37:22.602686
 THREAD_ID: 51
      PRIO: Error
ERROR_CODE: MY-010584
 SUBSYSTEM: Repl
      DATA: Slave SQL for channel 'replication_channel': Worker 1 failed executing transaction 'b545b0e8-139e-11ec-b9ee-c8e0eb374015:35' at master log binlog.000008, end_log_pos 32561; Could not execute Write_rows event on table rss.ttrss_feeds; Duplicate entry '2' for key 'ttrss_feeds.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log binlog.000008, end_log_pos 32561, Error_code: MY-001062

We will cover the following issues:

Network Issues
Credentials Issues
No dataset definition
Missing Binlogs
Other Replication Problems related to statements

Now let’s focus on the first category of Inbound replication issues.

Replication Channel does not start

When you have created the replication channel and it immediately complain, it’s usually caused by one of these two issues:

the replica cannot connect to the source due to a network issue
the replica cannot connect to the source due to wrong credentials

Network Issue

If you have a network issue, meaning that MDS is not able to connect on the host and port you defined when creating the inbound replication channel. The only way to fix this is to control the security list and eventual routing.

If we check the 4 sources of information we will see the following information:

From SHOW REPLICA STATUS\G (last time I use it!):

 SQL > show replica status\G
*************************** 1. row ***************************
             Replica_IO_State: Connecting to source
                  Source_Host: 172.27.240.3
                  Source_User: repl_mds
                  Source_Port: 3306
                Connect_Retry: 60
                           ...
           Replica_IO_Running: Connecting
          Replica_SQL_Running: Yes
                           ...
                Last_IO_Errno: 2003
                Last_IO_Error: error connecting to master 'repl_mds@172.27.240.3:3306' - retry-time: 60 retries: 2 message: Can't connect to MySQL server on '172.27.240.3:3306' (113)

And from Performance_Schema table:

 SQL > select * from performance_schema.replication_connection_status\G
*************************** 1. row ***************************
                                      CHANNEL_NAME: replication_channel
                                        GROUP_NAME: 
                                       SOURCE_UUID: 
                                         THREAD_ID: NULL
                                     SERVICE_STATE: CONNECTING
                         COUNT_RECEIVED_HEARTBEATS: 0
                          LAST_HEARTBEAT_TIMESTAMP: 0000-00-00 00:00:00
                          RECEIVED_TRANSACTION_SET: 
                                 LAST_ERROR_NUMBER: 2003
                                LAST_ERROR_MESSAGE: error connecting to master 'repl_mds@172.27.240.3:3306' - retry-time: 60 retries: 6 message: Can't connect to MySQL server on '172.27.240.3:3306' (113)
                              LAST_ERROR_TIMESTAMP: 2021-09-23 12:13:11.041796
                           LAST_QUEUED_TRANSACTION: 
 LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00
LAST_QUEUED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00
     LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00
       LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP: 0000-00-00 00:00:00
                              QUEUEING_TRANSACTION: 
    QUEUEING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00
   QUEUEING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00
        QUEUEING_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00

This table refers to the IO_Thread. There is no need to check the content of performance_schema.replication_applier_status_% that are in fact related to the SQL_Thread.

Credentials Issues

When the network issue is resolved, we might also encounter authentication issues like wrong user, wrong password… or SSL issues.

In that case you will see the following error:

 SQL > select * from performance_schema.replication_connection_status\G
*************************** 1. row ***************************
                                      CHANNEL_NAME: replication_channel
                                        GROUP_NAME: 
                                       SOURCE_UUID: 
                                         THREAD_ID: NULL
                                     SERVICE_STATE: CONNECTING
                         COUNT_RECEIVED_HEARTBEATS: 0
                          LAST_HEARTBEAT_TIMESTAMP: 0000-00-00 00:00:00
                          RECEIVED_TRANSACTION_SET: 
                                 LAST_ERROR_NUMBER: 1045
                                LAST_ERROR_MESSAGE: error connecting to master 'repl_mds@172.27.240.2:3306' - retry-time: 60 retries: 1 message: Access denied for user 'repl_mds'@'10.0.1.115' (using password: YES)
                              LAST_ERROR_TIMESTAMP: 2021-09-23 12:20:18.405572
                           LAST_QUEUED_TRANSACTION: 
 LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00
LAST_QUEUED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00
     LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00
       LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP: 0000-00-00 00:00:00
                              QUEUEING_TRANSACTION: 
    QUEUEING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00
   QUEUEING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00
        QUEUEING_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00

You need to verify the user and password and eventually recreate them. Also, check that you are using the right SSL mode in the Inboud Replication Channel’s console:

In the previous error, I had just used Disabled but the replication use was created with REQUIRE SSL.

Before checking the next category of errors, I would like to introduce some functions and procedures I created to make my life easier. Those functions and procedures I created them in a dedicated schema that I called mds. Every time you see mds. prefixing a function or a store procedure, this is just something I’ve added. You can add those additions too if you find them useful: mds_functions.sql

If you prefer to not add any functions and procedures to MDS, you can also use MySQL Shell replication plugin.

This is an example:

 SQL > call replication_status;
+---------------------+-----------+------------+------------+
| channel_name        | IO_thread | SQL_thread | lag_in_sec |
+---------------------+-----------+------------+------------+
| replication_channel | ON        | OFF        | 11796      |
+---------------------+-----------+------------+------------+
1 row in set (0.0984 sec

Replication Stops

Now we will see why the replication channel stops and how to fix it.

Once again, we can divide this category in two sub-categories:

stops immediately after first start
stops unexpectedly while is was running

Inbound Replication stops after first start

When Inbound Replication stops after having started for the first time, the 2 common reasons are:

user didn’t tell to MDS instance the status of the initial load
the source doesn’t have the required data anymore in its binary logs

No dataset definition

Usually, when you create a replica in MDS, you need first to perform a dump using MySQL Shell and loading it in MDS.

If you use Object Storage to dump the data, when you create the MDS instance, you can specify the initial data to load. This is the recommended way.

If that method is used, there won’t be any problem regarding the state of the dataset imported (GTID purged). However, if you do that initial import manually using MySQL Shell or any other logical method, you will have to tell MDS what are the GTIDs defining that dataset.

If you don’t, once Inbound Replication will start, it will start to replicate from the first binlog still available on the source and most probably try to apply transactions already present in the dataset and fail with a duplicate entry error or a missing key.

To illustrate this, I imported manually a dump from the MySQL source server and then I added an inbound replication channel.

Replication started and then failed with the following error message:

Slave SQL for channel 'replication_channel': Worker 1 failed executing transaction 'b545b0e8-139e-11ec-b9ee-c8e0eb374015:35' at master log binlog.000008, end_log_pos 32561; Could not execute Write_rows event on table rss.ttrss_feeds; Duplicate entry '2' for key 'ttrss_feeds.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log binlog.000008, end_log_pos 32561, Error_code: MY-001062

We can see that this is indeed a duplicate entry error: HA_ERR_FOUND_DUPP_KEY.

If we use the mds.replication_status_extended procedure, we will see the following:

 SQL > call mds.replication_status_extended \G
*************************** 1. row ***************************
            channel_name: replication_channel
               IO_thread: ON
              SQL_thread: OFF
 last_queued_transaction: b545b0e8-139e-11ec-b9ee-c8e0eb374015:105
last_applied_transaction: b545b0e8-139e-11ec-b9ee-c8e0eb374015:34
         rep delay (sec): 1999483.584932
          transport time: 209.787793
                 time RL: 0.000008
              apply time: 0.002185
              lag_in_sec: 172787

And we can also verify the value of gtid_executed and gtid_purged:

 SQL > select @@gtid_executed, @@gtid_purged\G
*************************** 1. row ***************************
@@gtid_executed: 34337faa-1c79-11ec-a36f-0200170403db:1-88,
b545b0e8-139e-11ec-b9ee-c8e0eb374015:1-34
  @@gtid_purged:

Oups… we realize that we forgot to purge the GTIDs present in the dump. We can now retrieve that info from the dump itself:

So we need to fix that. In MDS, you are allowed to change the value of GTID_PURGED using sys.set_gtid_purged():

 SQL > call sys.set_gtid_purged("b545b0e8-139e-11ec-b9ee-c8e0eb374015:35-101");

Note that I started from :35

You can now resume the replication channel from the OCI console.

As for sys.set_gtid_purgeg(), it will be very nice to have such procedure to restart the Replica_SQL_Running thread.

It’s common to see similar contradiction as the status on the console can take several minutes to be updated… this can lead to huge replication lag in case of a small issue (imagine is there are 2 small consecutive issues… extra 20minutes to restart replication!)

Missing Binlogs

Another common error, is to use a too old dump or not keeping the binary logs long enough on the source. If the next transaction if not present anymore in the source’s binlogs, the replication won’t be possible and the following error will be returned:

 last_error_message: Got fatal error 1236 from master when reading data from binary log: 'Cannot replicate because the master purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the master's binary log expiration period. The GTID set sent by the slave is '4b663fdc-1c7e-11ec-8634-020017076ca2:1-100,
b545b0e8-139e-11ec-b9ee-c8e0eb374015:35-101', and the missing transactions are 'b545b0e8-139e-11ec-b9ee-c8e0eb374015:1-34:102-108''

Nothing can be done if you want to keep data integrity. The only solution is to create the dump again and insure to not purge the binary logs that will be required.

Conclusion

MySQL Database Service Inbound Replication is already great between MDS servers (for DR, etc..) but the DBA must be very attentive when having to deal with external load (on-premise, other clouds, …), as some bad practices won’t be allowed (old engines, FLUSH without LOCAL, adding stuff in system schema like mysql or sys).

What I’m really looking forward for Inbound Replication Channel is:

being able to restart it more quickly from SQL
being able to use filters (would be very useful for migration from RDS and Aurora to MDS)
the possibility to set the replica READ_ONLY
being supported in MDS HA

Thank you for using MySQL and MySQL Database Service.

Cookie	Duration	Description
cookielawinfo-checbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.