Set up DR for your MySQL InnoDB Cluster

MySQL InnoDB Cluster is the High Availability solution for MySQL. It delivers automatic failover and guarantees zero data loss (RPO=0).

RPO: the Recovery Point Objective describes the maximum amount of data, measured in time, that can be lost during a disruption before exceeding the Business Continuity Plan’s tolerance.

Example: our business architecture needs RPO=2 minutes. This means that, in case of failure, up to 2 minutes of data can be lost.

However, and we saw this recently in Europe, an entire data center can “disappear” instantaneously… So it’s also important to have a Disaster Recovery plan.

One solution is to have an InnoDB Cluster (Group Replication) that spans multiple regions. However, this is often not feasible because of the high latency between regions.

Another solution is an InnoDB Cluster in one region with asynchronous replication to another region for DR. With asynchronous replication it’s impossible to guarantee zero data loss between the source and the replica (RPO>0), but this is usually acceptable for Disaster Recovery, as the lag is typically measured in seconds.

Let’s see how to create such a DR solution. The solution illustrated here is based on MySQL 8.0.23.

Prepare the production environment

mysql-node1> CREATE USER 'repl'@'%' IDENTIFIED BY 'password' REQUIRE SSL;
mysql-node1> GRANT REPLICATION SLAVE, BACKUP_ADMIN, CLONE_ADMIN ON *.* TO 'repl'@'%';
mysql-node1> GRANT SELECT ON performance_schema.* TO 'repl'@'%';

These statements must be entered on the Primary member.

Pay attention to the privileges: CLONE_ADMIN is required to use the wonderful CLONE plugin. Usually, when using asynchronous replication, the last privilege (SELECT ON performance_schema.*) is not required, but we will see later why it’s very important in our case.
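You can double-check that the account received the expected privileges by listing its grants on the Primary:

mysql-node1> SHOW GRANTS FOR 'repl'@'%';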

Prepare the Asynchronous Replica

After installing MySQL on the server we will use as DR, we need to prepare it to act as an asynchronous replica:

mysql-replica> SET PERSIST_ONLY gtid_mode=on;
mysql-replica> SET PERSIST_ONLY enforce_gtid_consistency=true;
mysql-replica> RESTART;
mysql-replica> INSTALL PLUGIN clone SONAME 'mysql_clone.so';
mysql-replica> SET GLOBAL clone_valid_donor_list='mysql-node1:3306';
mysql-replica> CLONE INSTANCE FROM repl@"mysql-node1":3306 IDENTIFIED BY 'password';
mysql-replica> SET PERSIST server_id=round(uuid_short()/1000000000);

mysql-node1 is the Primary member of the production MySQL InnoDB Cluster.

server_id must be unique. You can use any value you want; in this example I use a random one generated by uuid_short().
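You can verify the resulting settings on the replica after the restart (your server_id will of course differ from mine):

mysql-replica> SELECT @@server_id, @@gtid_mode, @@enforce_gtid_consistency;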

Asynchronous Connection Failover

If you follow my blog, you know that I used to advise going through MySQL Router when setting up asynchronous replication from an InnoDB Cluster. This is not required anymore!

Now the best solution is to use the Asynchronous Connection Failover mechanism.

This is what we will deploy now. The first thing to do is to retrieve the Group Replication UUID from the production MySQL InnoDB Cluster:

mysql-nodeX> show global variables like 'group_replication_group_name'\G
*************************** 1. row ***************************
Variable_name: group_replication_group_name
        Value: b2b2b6de-9ad7-11eb-88a0-020017018a7b

And on the future Asynchronous Replica used as DR, we can setup the automated asynchronous failover:

mysql-replica> SELECT asynchronous_connection_failover_add_managed("async_from_idc",
                      "GroupReplication", "b2b2b6de-9ad7-11eb-88a0-020017018a7b",
                      "mysql-node1", 3306, "", 80, 60);

The call to this UDF will return the following text:

The UDF asynchronous_connection_failover_add_managed() executed successfully.
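Note that if you ever need to remove this configuration, for example to reconfigure the channel from scratch, there is a matching UDF taking the channel name and the group’s UUID:

mysql-replica> SELECT asynchronous_connection_failover_delete_managed("async_from_idc",
                      "b2b2b6de-9ad7-11eb-88a0-020017018a7b");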

Setup & Start Asynchronous Replication Channel

We are now ready to setup the asynchronous replication channel and start it:

mysql-replica> CHANGE REPLICATION SOURCE TO  source_host='mysql-node1',
              source_port=3306, source_user='repl', source_password='password',
              source_auto_position=1, source_ssl=1, source_retry_count=3, 
              source_connect_retry=10, source_connection_auto_failover=1 
              FOR CHANNEL 'async_from_idc';

async_from_idc is just a name; you can use whatever you want, but it must be the same as the one defined in the asynchronous connection failover configuration.

And we can start replication:

mysql-replica> start replica for channel 'async_from_idc';

We are done!

Now, if the production MySQL InnoDB Cluster promotes a new Primary node, the asynchronous replica will also automatically change its source! \o/

In case you need to use your DR site for production, it’s very easy to “upgrade” it to an InnoDB Cluster once it has been promoted, after having stopped and reset the asynchronous replication channel.
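Stopping and clearing the channel before promoting the DR instance can be done like this:

mysql-replica> STOP REPLICA FOR CHANNEL 'async_from_idc';
mysql-replica> RESET REPLICA ALL FOR CHANNEL 'async_from_idc';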

Setting up an asynchronous replication channel from one MySQL InnoDB Cluster to another MySQL InnoDB cluster is not supported and does not work at this time.

Observability

It’s also possible to verify the setup using performance_schema tables:

mysql-replica> select * from performance_schema.replication_asynchronous_connection_failover_managed\G
*************************** 1. row ***************************
 CHANNEL_NAME: async_from_idc
 MANAGED_NAME: b2b2b6de-9ad7-11eb-88a0-020017018a7b
 MANAGED_TYPE: GroupReplication
CONFIGURATION: {"Primary_weight": 80, "Secondary_weight": 60}

mysql-replica> select * from performance_schema.replication_asynchronous_connection_failover\G
*************************** 1. row ***************************
     CHANNEL_NAME: async_from_idc
             HOST: mysql-node1
             PORT: 3306
NETWORK_NAMESPACE: 
           WEIGHT: 80
     MANAGED_NAME: b2b2b6de-9ad7-11eb-88a0-020017018a7b
*************************** 2. row ***************************
     CHANNEL_NAME: async_from_idc
             HOST: mysql-node2
             PORT: 3306
NETWORK_NAMESPACE: 
           WEIGHT: 60
     MANAGED_NAME: b2b2b6de-9ad7-11eb-88a0-020017018a7b
*************************** 3. row ***************************
     CHANNEL_NAME: async_from_idc
             HOST: mysql-node3
             PORT: 3306
NETWORK_NAMESPACE: 
           WEIGHT: 60
     MANAGED_NAME: b2b2b6de-9ad7-11eb-88a0-020017018a7b
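To see which source the channel is currently connected to, you can join two other performance_schema tables (a quick sketch):

mysql-replica> SELECT c.channel_name, c.host, c.port, s.service_state
               FROM performance_schema.replication_connection_configuration c
               JOIN performance_schema.replication_connection_status s
                    USING (channel_name);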

Troubleshooting

If you remember what I wrote at the beginning of this post, the privileges of the replication user are very important. If the user is not allowed to select from performance_schema tables, you will see the following error in the error log and the failover won’t succeed:

The IO thread failed to detect if the source belongs to the group majority on the source (host:mysql-node1 port:3306 network_namespace:) for channel 'async_from_idc'.

Don’t forget that in MySQL 8.0 you can now also access the error log from the performance_schema.error_log table!
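For example, the most recent entries can be listed with a simple query:

mysql-replica> SELECT logged, prio, error_code, data
               FROM performance_schema.error_log
               ORDER BY logged DESC LIMIT 5;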


14 Comments

  1. Thanks, this was a lot of help!

    If anyone finds this and is a WINDOWS user who has an InnoDB Cluster set up in one location and would like a disaster recovery copy at another location / site, then here is what I did.

    First, you need to choose one node in the live cluster to connect to.
    I chose node B5. You need to forward the port in your router to allow TCP traffic.
    I set up external port 9500 forwarding to that node’s internal port 3306.

    The reason for this is that the Master_Log_File changes name from node to node: on one node it is called SQL-Cluster-B5-bin.00032, but on another node it is called SQL-Cluster-B1-bin.00032.
    That just doesn’t work if you point the DR node to port 6446 hoping the cluster router will send the DR node to the next node (B1) if the one you are connected to (B5) goes down: you will get the wrong BIN file.
    So this DR is based on the availability of a NODE in your cluster, not the entire cluster. But I will work on this, so check if there are other posts below.

    Anyway, here we go….

    In your LIVE cluster, log into B5 and run this in workbench

    CREATE USER 'home123'@'%' IDENTIFIED BY 'password123' REQUIRE SSL;
    GRANT REPLICATION SLAVE, BACKUP_ADMIN, CLONE_ADMIN ON *.* TO 'home123'@'%';
    GRANT SELECT ON performance_schema.* TO 'home123'@'%';
    INSTALL PLUGIN clone SONAME 'mysql_clone.dll';
    flush privileges;

    On the DR node, run Workbench and update this setting (I found the clone timed out otherwise):
    Edit → Preferences → SQL Editor → DBMS connection read timeout (in seconds): 600
    I changed the value to 6000.

    then run this in workbench
    SET PERSIST_ONLY gtid_mode=on;
    SET PERSIST_ONLY enforce_gtid_consistency=true;
    INSTALL PLUGIN clone SONAME 'mysql_clone.dll';
    CREATE USER 'home123'@'%' IDENTIFIED BY 'password123' REQUIRE SSL;
    GRANT REPLICATION SLAVE, BACKUP_ADMIN, CLONE_ADMIN ON *.* TO 'home123'@'%';
    GRANT SELECT ON performance_schema.* TO 'home123'@'%';
    RESTART;

    then run this (still on the DR node)
    (Note: <WAN_IP>:<PORT> is the WAN IP and forwarded port of the LIVE node you are using, e.g. 111.222.333.444:9500)

    flush privileges;
    SET GLOBAL clone_valid_donor_list='<WAN_IP>:<PORT>';
    CLONE INSTANCE FROM home123@<WAN_IP>:<PORT> IDENTIFIED BY 'password123';

    the clone will now start; you can check how it is going by looking at the size in the properties window of your "C:\ProgramData\MySQL\MySQL Server 8.0\Data" folder

    once done run this (still on the DR node)
    CHANGE REPLICATION SOURCE TO source_host='<WAN_IP>',
    source_port=<PORT>, source_user='home123', source_password='password123',
    source_auto_position=1, source_ssl=1, source_retry_count=10000,
    source_connect_retry=10, source_connection_auto_failover=1;

    That is it.
    Run a select statement to count(*) a table to check the DR is running.
    select count(*) from table1

    also SHOW REPLICA STATUS will give you all the details.

    If you shutdown the live node and then restart it you will see the Cluster recovers the node and then the DR copy gets updated.

    Anyway thanks
    Damon

    • just noticed that some details have been removed by the formatting
      please replace the last bit with this


      then run this (still on the DR node)
      (Note: you need the WAN IP of the LIVE server and the port you are forwarding,
      e.g. 201.222.33.99:9500, so in the below change it to your WAN IP. Pretty sure you could use a hostname like http://www.acme.com.au as well.)

      flush privileges;
      SET GLOBAL clone_valid_donor_list='201.222.33.99:9500';
      CLONE INSTANCE FROM home123@201.222.33.99:9500 IDENTIFIED BY 'password123';

      the clone will now start; you can check how it is going by looking at the size in the properties window of your "C:\ProgramData\MySQL\MySQL Server 8.0\Data" folder

      once done run this (still on the DR node)
      CHANGE REPLICATION SOURCE TO source_host='201.222.33.99',
      source_port=9500, source_user='home123', source_password='password123',
      source_auto_position=1, source_ssl=1, source_retry_count=10000,
      source_connect_retry=10, source_connection_auto_failover=1;

    • Hi Damon,

      When using GTID (and AUTO_POSITION), the fact that the binary logs change name or position is not a problem at all. Also, with the feature presented in this post, replicating over MySQL Router is not needed anymore.

      Thank you for reading my blog.

      Cheers,
      lefred

  2. Hi lefred,

    If we want to promote the DR as production in case of a DC crash where the InnoDB Cluster is totally down, does the application still need to change its connection string to the MySQL DR? You mentioned it is easily upgradable to an InnoDB Cluster.
    My questions are:

    1) Can we promote it to primary like the traditional master-slave procedure? Once the InnoDB Cluster DC is available again, we need to back up and restore the DR to the InnoDB Cluster, right? And reset the replication channel.

    2) If it can be upgraded to an InnoDB Cluster, do we need to install MySQL Router on the DR? Or will the existing Router do?

    Thanks,
    Daniel

    • Looks like I found a way to perform DR recovery to HQ cluster:

      a) After activity is done at DR, convert it to an InnoDB Cluster and create tempCluster
      b) Bring up one DB at HQ after troubleshooting, dissolve the cluster.
      c) At DR, join the HQ DB instance (clone)
      d) After sync, promote the HQ DB as primary, remove the DR instance
      e) Dissolve the cluster (check the innodb_cluster_metadata tables (instances and clusters) and delete the unnecessary old cluster_id)
      f) Create a new cluster, add Node2 and Node3 by cloning
      g) Re-bootstrap MySQL Router

      any thoughts?

  3. Hello:

    I have the same question as Daniel. If there is a total failure in the main data center and I am going to use my single instance in the alternate data center, we operate without problem… but then, when the main data center is OK again, how do I “bounce back”, considering that the proxy has already been sending data to the DR? The data will be inconsistent if I simply point my applications back to the InnoDB Cluster in the main data center.

  4. Hi lefred,

    As you have mentioned, replicating from one InnoDB Cluster to another is not supported and will not work. But we are already in the process of testing this type of DR solution (async master-master replication between two InnoDB Clusters), and it is actually working for me as of now.

    So, can you please explain what is not going to work here, or do I need to test some specific scenario to trigger an issue with this DR solution? It would be great if you could tell me that scenario. Meanwhile, I am sharing the details of our setup below.

    1. Set up two InnoDB Clusters, on DC1 and DC2 (with three nodes in each DC).
    2. Set up async replication from DC1 > DC2 and DC2 > DC1 (async master-master replication).
    3. There is also HAProxy on top of both DCs, so when DC1 goes down, HAProxy will route the traffic to DC2 (yes, the app will point to one DC at a time).

    Here is what we followed to set up the async replication:

    – We started the InnoDB Cluster on DC1.
    – Copied data from DC1 to DC2 (to a single MySQL node on DC2; the InnoDB Cluster was not set up at this time on DC2) using mysqlbackup (with skip-gtid-purged=off, to set gtid_purged). And started async replication (DC1 > DC2, GTID auto-position set to ON).
    – After we checked successful execution of DC1 > DC2 replication, we stopped the replication and performed RESET SLAVE to remove the replication setup (because the next step is to create the InnoDB Cluster, and var c=dba.createCluster() will generate an error if it finds a replication filter on the underlying MySQL node).
    – We created the DC2 InnoDB Cluster and added the other two nodes with the clone recovery option.
    – After successful InnoDB Cluster creation on DC2, we started DC1 > DC2 async replication again on the primary node using auto-position=1, and it is working fine.
    – Lastly, we set up DC2 > DC1, using auto-position=1, also working fine.
    – The async replication thread is connected via Router for both DC1 > DC2 and DC2 > DC1.

    It would be really great if you could point me in the right direction.
    Thanks.

    • Hi Shubham,

      Master-Master replication is not something I would ever recommend with Asynchronous Replication in between.

      Also, what happens when you change the Primary in DC2, for example?

      The metadata will be a problem. You are lucky that, for the moment, your routing does not rely on the metadata; with MySQL Router it would, of course.

  5. Hi Lefred

    How do we set up a high availability solution using InnoDB Cluster? Can you point me to any of your blog posts?

    Thankyou
    Charandeep
