MySQL Database Service inbound replication channel troubleshooting

Using a MySQL Database Service instance in OCI as an asynchronous replica is very useful. It allows testing the service with always updated data from on-premise production. It can be used in the process of migrating with minimal downtime to OCI and finally, it can also be used between MDS instances on different regions for DR, analytics, etc…

Here are some links to the documentation and other resources about inbound replication channel and how to use it:

In this article, we consider that you already have everything ready to replicate from the source to a MDS instance. This means that the network part (VCNs, eventual VPN, routing, security list) and the replication user and credentials have been configured.

We will consider errors like described in this schema:

But before checking what kind of error we can encounter, let’s see how we can verify the status of the inbound replication channel.

Inbound Replication Channel Status

The first source to see if the replication channel is running, is the OCI MDS console:

We can also use the oci sdk to get that info. For example, from the cloud shell, you can get the status using the channel’s ocid:

$ oci mysql channel get --channel-id <channel_ocid> | grep lifecycle-state
  "lifecycle-state": "NEEDS_ATTENTION"

Let’s see an example:

However, we have no idea of what is wrong, but we know we need to check.

Please note that the status on the console is not updated in real-time and can take up to 10mins before noticing a problem. I encourage you to check the MySQL Replication Channel status directly from the replica instance itself (more on that later).

How to know what is the problem ?

The only way to see what is the problem, is to connect on the MDS replica instance and verify using our client. I use MySQL Shell.

Usually replication problems are reported in 4 different locations:

  • output of SHOW REPLICA STATUS\G (which is the most commonly used but not recommended)
  • performance_schema.replication_connection_status
  • performance_schema.replication_applier_status_by_coordinator and performance_schema.replication_applier_status_by_worker
  • MySQL’s error log (available using performance_schema.error_log in MDS)

This is an example:

 SQL > select * from performance_schema.error_log 
       where SUBSYSTEM = 'Repl' and PRIO = 'Error' 
       order by 1 desc limit 1\G
*************************** 1. row ***************************
    LOGGED: 2021-09-23 08:37:22.602686
 THREAD_ID: 51
      PRIO: Error
ERROR_CODE: MY-010584
 SUBSYSTEM: Repl
      DATA: Slave SQL for channel 'replication_channel': Worker 1 failed executing transaction 'b545b0e8-139e-11ec-b9ee-c8e0eb374015:35' at master log binlog.000008, end_log_pos 32561; Could not execute Write_rows event on table rss.ttrss_feeds; Duplicate entry '2' for key 'ttrss_feeds.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log binlog.000008, end_log_pos 32561, Error_code: MY-001062

We will cover the following issues:

  1. Network Issues
  2. Credentials Issues
  3. No dataset definition
  4. Missing Binlogs
  5. Other Replication Problems related to statements

Now let’s focus on the first category of Inbound replication issues.

Replication Channel does not start

When you have created the replication channel and it immediately complain, it’s usually caused by one of these two issues:

  1. the replica cannot connect to the source due to a network issue
  2. the replica cannot connect to the source due to wrong credentials

Network Issue

If you have a network issue, meaning that MDS is not able to connect on the host and port you defined when creating the inbound replication channel. The only way to fix this is to control the security list and eventual routing.

If we check the 4 sources of information we will see the following information:

From SHOW REPLICA STATUS\G (last time I use it!):

 SQL > show replica status\G
*************************** 1. row ***************************
             Replica_IO_State: Connecting to source
                  Source_Host: 172.27.240.3
                  Source_User: repl_mds
                  Source_Port: 3306
                Connect_Retry: 60
                           ...
           Replica_IO_Running: Connecting
          Replica_SQL_Running: Yes
                           ...
                Last_IO_Errno: 2003
                Last_IO_Error: error connecting to master 'repl_mds@172.27.240.3:3306' - retry-time: 60 retries: 2 message: Can't connect to MySQL server on '172.27.240.3:3306' (113)

And from Performance_Schema table:

 SQL > select * from performance_schema.replication_connection_status\G
*************************** 1. row ***************************
                                      CHANNEL_NAME: replication_channel
                                        GROUP_NAME: 
                                       SOURCE_UUID: 
                                         THREAD_ID: NULL
                                     SERVICE_STATE: CONNECTING
                         COUNT_RECEIVED_HEARTBEATS: 0
                          LAST_HEARTBEAT_TIMESTAMP: 0000-00-00 00:00:00
                          RECEIVED_TRANSACTION_SET: 
                                 LAST_ERROR_NUMBER: 2003
                                LAST_ERROR_MESSAGE: error connecting to master 'repl_mds@172.27.240.3:3306' - retry-time: 60 retries: 6 message: Can't connect to MySQL server on '172.27.240.3:3306' (113)
                              LAST_ERROR_TIMESTAMP: 2021-09-23 12:13:11.041796
                           LAST_QUEUED_TRANSACTION: 
 LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00
LAST_QUEUED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00
     LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00
       LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP: 0000-00-00 00:00:00
                              QUEUEING_TRANSACTION: 
    QUEUEING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00
   QUEUEING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00
        QUEUEING_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00

This table refers to the IO_Thread. There is no need to check the content of performance_schema.replication_applier_status_% that are in fact related to the SQL_Thread.

Credentials Issues

When the network issue is resolved, we might also encounter authentication issues like wrong user, wrong password… or SSL issues.

In that case you will see the following error:

 SQL > select * from performance_schema.replication_connection_status\G
*************************** 1. row ***************************
                                      CHANNEL_NAME: replication_channel
                                        GROUP_NAME: 
                                       SOURCE_UUID: 
                                         THREAD_ID: NULL
                                     SERVICE_STATE: CONNECTING
                         COUNT_RECEIVED_HEARTBEATS: 0
                          LAST_HEARTBEAT_TIMESTAMP: 0000-00-00 00:00:00
                          RECEIVED_TRANSACTION_SET: 
                                 LAST_ERROR_NUMBER: 1045
                                LAST_ERROR_MESSAGE: error connecting to master 'repl_mds@172.27.240.2:3306' - retry-time: 60 retries: 1 message: Access denied for user 'repl_mds'@'10.0.1.115' (using password: YES)
                              LAST_ERROR_TIMESTAMP: 2021-09-23 12:20:18.405572
                           LAST_QUEUED_TRANSACTION: 
 LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00
LAST_QUEUED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00
     LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00
       LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP: 0000-00-00 00:00:00
                              QUEUEING_TRANSACTION: 
    QUEUEING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00
   QUEUEING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00
        QUEUEING_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00

You need to verify the user and password and eventually recreate them. Also, check that you are using the right SSL mode in the Inboud Replication Channel’s console:

In the previous error, I had just used Disabled but the replication use was created with REQUIRE SSL.

Before checking the next category of errors, I would like to introduce some functions and procedures I created to make my life easier. Those functions and procedures I created them in a dedicated schema that I called mds. Every time you see mds. prefixing a function or a store procedure, this is just something I’ve added. You can add those additions too if you find them useful: mds_functions.sql

If you prefer to not add any functions and procedures to MDS, you can also use MySQL Shell replication plugin.

This is an example:

 SQL > call replication_status;
+---------------------+-----------+------------+------------+
| channel_name        | IO_thread | SQL_thread | lag_in_sec |
+---------------------+-----------+------------+------------+
| replication_channel | ON        | OFF        | 11796      |
+---------------------+-----------+------------+------------+
1 row in set (0.0984 sec

Replication Stops

Now we will see why the replication channel stops and how to fix it.

Once again, we can divide this category in two sub-categories:

  1. stops immediately after first start
  2. stops unexpectedly while is was running

Inbound Replication stops after first start

When Inbound Replication stops after having started for the first time, the 2 common reasons are:

  • user didn’t tell to MDS instance the status of the initial load
  • the source doesn’t have the required data anymore in its binary logs

No dataset definition

Usually, when you create a replica in MDS, you need first to perform a dump using MySQL Shell and loading it in MDS.

If you use Object Storage to dump the data, when you create the MDS instance, you can specify the initial data to load. This is the recommended way.

If that method is used, there won’t be any problem regarding the state of the dataset imported (GTID purged). However, if you do that initial import manually using MySQL Shell or any other logical method, you will have to tell MDS what are the GTIDs defining that dataset.

If you don’t, once Inbound Replication will start, it will start to replicate from the first binlog still available on the source and most probably try to apply transactions already present in the dataset and fail with a duplicate entry error or a missing key.

To illustrate this, I imported manually a dump from the MySQL source server and then I added an inbound replication channel.

Replication started and then failed with the following error message:

Slave SQL for channel 'replication_channel': Worker 1 failed executing transaction 'b545b0e8-139e-11ec-b9ee-c8e0eb374015:35' at master log binlog.000008, end_log_pos 32561; Could not execute Write_rows event on table rss.ttrss_feeds; Duplicate entry '2' for key 'ttrss_feeds.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log binlog.000008, end_log_pos 32561, Error_code: MY-001062

We can see that this is indeed a duplicate entry error: HA_ERR_FOUND_DUPP_KEY.

If we use the mds.replication_status_extended procedure, we will see the following:

 SQL > call mds.replication_status_extended \G
*************************** 1. row ***************************
            channel_name: replication_channel
               IO_thread: ON
              SQL_thread: OFF
 last_queued_transaction: b545b0e8-139e-11ec-b9ee-c8e0eb374015:105
last_applied_transaction: b545b0e8-139e-11ec-b9ee-c8e0eb374015:34
         rep delay (sec): 1999483.584932
          transport time: 209.787793
                 time RL: 0.000008
              apply time: 0.002185
              lag_in_sec: 172787

And we can also verify the value of gtid_executed and gtid_purged:

 SQL > select @@gtid_executed, @@gtid_purged\G
*************************** 1. row ***************************
@@gtid_executed: 34337faa-1c79-11ec-a36f-0200170403db:1-88,
b545b0e8-139e-11ec-b9ee-c8e0eb374015:1-34
  @@gtid_purged: 

Oups… we realize that we forgot to purge the GTIDs present in the dump. We can now retrieve that info from the dump itself:

So we need to fix that. In MDS, you are allowed to change the value of GTID_PURGED using sys.set_gtid_purged():

 SQL > call sys.set_gtid_purged("b545b0e8-139e-11ec-b9ee-c8e0eb374015:35-101");

Note that I started from :35

You can now resume the replication channel from the OCI console.

As for sys.set_gtid_purgeg(), it will be very nice to have such procedure to restart the Replica_SQL_Running thread.

It’s common to see similar contradiction as the status on the console can take several minutes to be updated… this can lead to huge replication lag in case of a small issue (imagine is there are 2 small consecutive issues… extra 20minutes to restart replication!)

Missing Binlogs

Another common error, is to use a too old dump or not keeping the binary logs long enough on the source. If the next transaction if not present anymore in the source’s binlogs, the replication won’t be possible and the following error will be returned:

 last_error_message: Got fatal error 1236 from master when reading data from binary log: 'Cannot replicate because the master purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the master's binary log expiration period. The GTID set sent by the slave is '4b663fdc-1c7e-11ec-8634-020017076ca2:1-100,
b545b0e8-139e-11ec-b9ee-c8e0eb374015:35-101', and the missing transactions are 'b545b0e8-139e-11ec-b9ee-c8e0eb374015:1-34:102-108''

Nothing can be done if you want to keep data integrity. The only solution is to create the dump again and insure to not purge the binary logs that will be required.

Other Replication Problems

Inbound replication channel can also break later when everything is working fine. Let’s see how we can deal with that.

On the on-premise source I run this query:

SOURCE SQL > create table oups (id int) engine=MyISAM;

And now on my replica in MDS:

 MDS REPLICA SQL > call replication_status;
+---------------------+-----------+------------+------------+
| channel_name        | IO_thread | SQL_thread | lag_in_sec |
+---------------------+-----------+------------+------------+
| replication_channel | ON        | OFF        | 142        |
+---------------------+-----------+------------+------------+
1 row in set (0.1291 sec)

Let’s verify what is the problem:

MDS REPLICA SQL > call replication_errors()\G
Empty set (0.1003 sec)

*************************** 1. row ***************************
        channel_name: replication_channel
last_error_timestamp: 2021-09-23 16:09:59.079998
  last_error_message: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'b545b0e8-139e-11ec-b9ee-c8e0eb374015:121' at master log binlog.000014, end_log_pos 6375. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
1 row in set (0.1003 sec)

*************************** 1. row ***************************
        channel_name: replication_channel
last_error_timestamp: 2021-09-23 16:09:59.080003
  last_error_message: Worker 1 failed executing transaction 'b545b0e8-139e-11ec-b9ee-c8e0eb374015:121' at master log binlog.000014, end_log_pos 6375; Error 'Storage engine MyISAM is disabled (Table creation is disallowed).' on query. Default database: 'test'. Query: 'create table oups (id int) engine=MyISAM'
1 row in set (0.1003 sec)

Query OK, 0 rows affected (0.1003 sec)

We can get some more information about the GTIDs too:

We need to skip that transaction on our replica and then maybe create the table manually or recreate it as InnoDB on the source.

To skip it, we need to deal with GTID_NEXT and genera an empty transaction. I’ve created a procedure to perform all the steps for me:

MDS REPLICA SQL > call skip_replication_error();
+------------------------------------------+
| skipped GTID                             |
+------------------------------------------+
| b545b0e8-139e-11ec-b9ee-c8e0eb374015:121 |
+------------------------------------------+
1 row in set (0.1080 sec)

This is an example using MySQL Shell Plugin:

And now, you can go on the console and as soon as you can resume the replication channel:

Let’s see what can break MDS Inbound Replication that usually doesn’t break MySQL Asynchronous replication:

  • FLUSH PRIVILEGES
  • FLUSH STATUS
  • Create something (table, view, function, procedure) in mysql or sys
  • Unsupported STORAGE ENGINE
  • Unsupported or removed syntax (like creating a user from GRANT statement),
  • … see the full limitations in the manual.

This is an old bad practice to use the FLUSH statement without specifying to be LOCAL as many FLUSH statements are replicated. See the manual.

It’s time to change your habit and specify the LOCAL keyword to all your FLUSH statements to be sure they don’t get replicated (this is also recommended in any replication topology, not only with MDS).

So FLUSH PRIVILEGES becomes FLUSH LOCAL PRIVILEGES !

Conclusion

MySQL Database Service Inbound Replication is already great between MDS servers (for DR, etc..) but the DBA must be very attentive when having to deal with external load (on-premise, other clouds, …), as some bad practices won’t be allowed (old engines, FLUSH without LOCAL, adding stuff in system schema like mysql or sys).

What I’m really looking forward for Inbound Replication Channel is:

  • being able to restart it more quickly from SQL
  • being able to use filters (would be very useful for migration from RDS and Aurora to MDS)
  • the possibility to set the replica READ_ONLY
  • being supported in MDS HA

Thank you for using MySQL and MySQL Database Service.

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.