MySQL HA Architecture #1 : InnoDB Cluster & Consul


I received many requests about MySQL High Availability architectures. There are a lot of tools available, but how can we use them to achieve MySQL HA without over-engineering everything?

There is no single generic architecture that answers such a demand: of course some solutions are more popular than others, but putting them together can result in many different designs. So I will start a series of articles on that topic. Feel free, as usual, to comment, and also to recommend other tools that could be used in a future post.

So today’s post is related to MySQL InnoDB Cluster with HashiCorp’s Consul.

Architecture

This is the overview of the deployed architecture:

As you can see, we have 3 data centers, although in fact only two are real DCs on premises; the third one is in the cloud. A large number of requests are related to having only 2 DCs, or at least capacity for only two DCs (storage and CPU). But we already know that to achieve High Availability we need at least 3 of everything (3 nodes, 3 DCs, …). So in our design, DC1 and DC2 each run an independent MySQL InnoDB Cluster in Single-Primary Mode. We also have asynchronous replication going from one member of dc1 to the Primary Master of dc2, and the reverse way too.

In this post, I won’t cover the asynchronous replication failover that moves the asynchronous slave task to another member of the same group when the Primary role is moved to another member.

Then we have one Consul member in DC3 that is there only to provide quorum and to answer DNS requests.
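
How that third agent is started is not covered in this post; the sketch below shows what a minimal Consul server on the dc3 node could look like, assuming it answers DNS directly on port 53 with the domain set to mysql (as the dig test further down suggests) and is WAN-joined to agents in dc1 and dc2. All addresses and flags here are illustrative assumptions, not the exact setup used in this architecture.

# hypothetical arbiter/DNS agent on monitordc3 (illustrative values)
consul agent -server -bootstrap-expect=1 \
       -datacenter=dc3 -node=monitordc3 \
       -data-dir=/var/lib/consul \
       -client=0.0.0.0 -dns-port=53 -domain=mysql \
       -retry-join-wan=192.168.50.2 -retry-join-wan=192.168.70.2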

mysqlha service

In Consul, we create the mysqlha service like this:

{
    "services": [
        {
            "name": "mysqlha",
            "tags": [
                "writes",
                "reads"
            ],
            "port": 3306,
            "checks": [

                {
                    "script": "/etc/consul.d/scripts/check_mysql.sh",
                    "interval": "5s"
                }
            ]
        }
    ]
}
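
The service definition simply goes in the agent’s configuration directory, /etc/consul.d/ in this setup (as the check’s path suggests), and the agent then needs to be reloaded. For example (the file name is arbitrary):

# drop the service definition in place and reload the local agent
cp mysqlha.json /etc/consul.d/
consul reload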

The check_mysql.sh script is just a simple bash script that connects to the read/write port of the local MySQL Router running on each member of the InnoDB Cluster (Group) and checks whether there is a Primary, non-R/O member behind it.
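
The original script is not included in the post, but a minimal sketch of such a check could look like this (port 6446 is MySQL Router’s default R/W port, the credentials are placeholders, and any exit code above 1 is reported as critical by Consul):

#!/bin/bash
# hypothetical sketch of check_mysql.sh: pass only when the local MySQL Router
# R/W port reaches a writable (non super_read_only) Primary member
ROUTER_PORT=6446            # MySQL Router default R/W port
MYSQL_USER=consul_check     # placeholder monitoring account
MYSQL_PASS=secret

READ_ONLY=$(mysql -h 127.0.0.1 -P $ROUTER_PORT -u $MYSQL_USER -p$MYSQL_PASS \
                  -N -B -e "SELECT @@global.super_read_only" 2>/dev/null)

if [ "$READ_ONLY" = "0" ]; then
    echo "Primary reachable and writable through the local router"
    exit 0
fi
echo "no writable Primary behind the local router"
exit 2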

This is what we can see in consul’s UI:

Failover rule

Now we will add a rule (query) for the failover:

curl --request POST --data '{
"Name": "mysqlha",
"Service": {
   "Service": "mysqlha",
   "Failover": {
     "Datacenters": ["dc1","dc2"]
   }, "OnlyPassing": true
  }
}' http://127.0.0.1:8500/v1/query

We need to perform this operation on one member of each DC (dc1, dc2 and dc3).
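
Besides DNS, the prepared query can also be executed through Consul’s HTTP API, which is convenient to verify from any node which addresses would be returned (jq is only used here to keep the output readable):

# execute the prepared query on the local agent and list the returned addresses
curl -s http://127.0.0.1:8500/v1/query/mysqlha/execute | jq -r '.Nodes[].Node.Address'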

Once this is done, we can test the query on dc3 (where no mysqlha service is running):

[root@monitordc3 ~]# dig @127.0.0.1 -p 53 mysqlha.query.mysql

; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.4 <<>> @127.0.0.1 -p 53 mysqlha.query.mysql
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28736
;; flags: qr aa rd; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;mysqlha.query.mysql.		IN	A

;; ANSWER SECTION:
mysqlha.query.mysql.	0	IN	A	192.168.50.4
mysqlha.query.mysql.	0	IN	A	192.168.50.3
mysqlha.query.mysql.	0	IN	A	192.168.50.2

We can see that the DNS returns all 3 nodes of DC1 (192.168.50.x), perfect! We can of course connect to any of them using the router ports (6446 and 6447).
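
For example, an application (or a quick test from a shell) can simply use the service name and let the router on the resolved node do the final routing; the user below is of course a placeholder:

# writes go through the R/W port of the router on the resolved node
mysql -h mysqlha.query.mysql -P 6446 -u app_user -p
# reads can use the R/O port
mysql -h mysqlha.query.mysql -P 6447 -u app_user -p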

The next step is related to the application servers. We have two options here: either we run a Consul client on each application server to handle the DNS resolution, or we simply configure DNS resolution on the servers themselves. In the latter case, the resolvers must point to the node in dc3 and to one node in the other DC (or we could create a fake extra DC used only for DNS resolution). The reason is that in dc2 the service is running anyway, but we want to be redirected to the primary DC as long as it is up. We could query mysqlha.query.dc1.mysql, but that only works as long as a Consul server in dc1 is reachable, which is not the case if the whole DC goes down.

So if we have an application server in dc2, its name resolution setup should be:

search lan localdomain
nameserver 192.168.70.2
nameserver 192.168.50.2

Asynchronous Slave

As the latency between our two data centers can be high, we opted for asynchronous replication between the two DCs (remember, dc2 is used as a GEO DR site).

So now we need to set up asynchronous replication between both DCs. I opted for the Primary Master of each DC to act as both the async master and the async slave. Being the slave is mandatory for the Primary Master, as it’s the only member allowed to write data.

I created the replication grants on each Primary Master (in dc1 and dc2), and I set the purged GTIDs to start the asynchronous replication:

Traditional replication setup

I started with the user and the required privileges:

mysql> CREATE USER 'async_repl'@'%' IDENTIFIED BY 'asyncpwd';
Query OK, 0 rows affected (0.50 sec)

mysql> GRANT REPLICATION SLAVE ON *.* TO 'async_repl'@'%';
Query OK, 0 rows affected (0.11 sec)

Don’t forget to set GTID_PURGED according to what you see in the other Primary Master.

Example:

First I check on mysql4dc2 (the current Primary Master in DC2):

mysql4dc2> show global variables like 'GTID_EXECUTED';
+---------------+--------------------------------------------------------------------------------------+
| Variable_name | Value                                                                                |
+---------------+--------------------------------------------------------------------------------------+
| gtid_executed | 0de7c84a-118e-11e8-b11c-08002718d305:1-44,
a6ae95b5-118d-11e8-9ad8-08002718d305:1-27 |
+---------------+--------------------------------------------------------------------------------------+

And then, on the server in DC1, do:

mysql> SET @@GLOBAL.GTID_PURGED='0de7c84a-118e-11e8-b11c-08002718d305:1-44,
                                 a6ae95b5-118d-11e8-9ad8-08002718d305:1-27';

We do the same on all machines in dc1 with the GTID_EXECUTED from DC2 (and, for the reverse direction, on the machines in dc2 with the GTID_EXECUTED from DC1).
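
The replication channel itself is not shown in this post; a minimal sketch of what it could look like on the current Primary Master of DC1, pointing to mysql4dc2, is given below. The use of a named channel ('async_from_dc2') is my own choice, a plain CHANGE MASTER works just as well:

# hypothetical: run on the current Primary Master of DC1
mysql -u root -p <<'SQL'
CHANGE MASTER TO
    MASTER_HOST = 'mysql4dc2',
    MASTER_USER = 'async_repl',
    MASTER_PASSWORD = 'asyncpwd',
    MASTER_AUTO_POSITION = 1
    FOR CHANNEL 'async_from_dc2';
START SLAVE FOR CHANNEL 'async_from_dc2';
SQL

The same is then done on the Primary Master of DC2, pointing to the Primary Master of DC1.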

Now we will add a new service in Consul (on each MySQL node in DC1 and DC2):

        {
            "name": "mysqlasync",
            "tags": [
                "slave"
            ],
            "port": 3306,
            "checks": [
                {
                    "script": "/etc/consul.d/scripts/check_async_mysql.sh",
                    "interval": "5s"
                }
            ]
        }

And don’t forget to copy the check_async_mysql.sh script into /etc/consul.d/scripts/ (this script is not made for production and may contain bugs, it’s just an example). You also need to edit this script to set SLAVEOFDC to the opposite DC’s name.

This script also uses Consul’s key/value store to record which node is the Primary Master in each DC:

[root@mysql1dc1 consul.d]# curl -s "http://localhost:8500/v1/kv/primary?raw&dc=dc1"
mysql1dc1
[root@mysql1dc1 consul.d]# curl -s "http://localhost:8500/v1/kv/primary?raw&dc=dc2"
mysql4dc2
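
As the real check_async_mysql.sh is not reproduced here either, this is a rough, illustrative sketch of what such a check could do: publish the local Primary Master’s name in the KV store and report passing only on the node that is both the local Primary and a running asynchronous slave (credentials and exact logic are assumptions):

#!/bin/bash
# hypothetical sketch of check_async_mysql.sh (not the original script)
SLAVEOFDC=dc2                            # set to the opposite DC's name
MYSQL="mysql -N -B -u root -psecret"     # placeholder credentials

# only the Primary (writable) member of the local group is a candidate
RO=$($MYSQL -e "SELECT @@global.super_read_only" 2>/dev/null)
if [ "$RO" != "0" ]; then
    echo "not the local Primary Master"
    exit 2
fi

# record the local Primary Master's name in Consul's KV store
curl -s -X PUT -d "$(hostname -s)" http://localhost:8500/v1/kv/primary > /dev/null

# the Primary must also be a running asynchronous slave of the other DC
STATUS=$($MYSQL -e "SHOW SLAVE STATUS\G" 2>/dev/null \
         | awk '/Slave_IO_Running:|Slave_SQL_Running:/ {print $2}' | sort -u)
if [ "$STATUS" = "Yes" ]; then
    echo "asynchronous slave of $SLAVEOFDC is running"
    exit 0
fi
echo "asynchronous slave of $SLAVEOFDC is not running"
exit 2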

After reloading Consul, you can see something like this in the UI on the Primary Master that is also the asynchronous slave:

And on another node:


In Action

Let’s see it in action:

In the video we can see that when we kill the remaining non-Primary member in DC1, the Group loses quorum. Of course the size of the group at that moment is 2 of 2, and when it gets killed even if the Primary one is still active, it has a quorum of 1/2 which is not bigger than 50%. Therefore it stops being available to avoid any split-brain situation.

Then, when we restart the nodes we killed in DC1, we have to force the quorum on the single remaining node to be able to rejoin the other instances after they reboot. This is how:
Force Quorum:

 MySQL  mysql2dc1:3306 ssl  JS > cluster.forceQuorumUsingPartitionOf('root@mysql2dc1:3306')
Restoring replicaset 'default' from loss of quorum, by using the partition 
composed of [mysql2dc1:3306]

Please provide the password for 'root@mysql2dc1:3306': 
Restoring the InnoDB cluster ...

The InnoDB cluster was successfully restored using the partition 
from the instance 'root@mysql2dc1:3306'.

WARNING: To avoid a split-brain scenario, ensure that all other members of the replicaset 
         are removed or joined back to the group that was restored.

Rejoin the first instance:

MySQL mysql2dc1:3306 ssl JS > cluster.rejoinInstance('root@mysql3dc1:3306')
...
The instance 'root@mysql3dc1:3306' was successfully rejoined on the cluster.

The instance 'mysql3dc1:3306' was successfully added to the MySQL Cluster.

And rejoin the last one:

 MySQL  mysql2dc1:3306 ssl  JS > cluster.rejoinInstance('root@mysql1dc1:3306')
...
Rejoining instance to the cluster ...

The instance 'root@mysql1dc1:3306' was successfully rejoined on the cluster.

The instance 'mysql1dc1:3306' was successfully added to the MySQL Cluster.

And finally we can see that all members are back online:

 MySQL  mysql2dc1:3306 ssl  JS > cluster.status();
{
    "clusterName": "MyGroupDC1", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "mysql2dc1:3306", 
        "ssl": "REQUIRED", 
        "status": "OK", 
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
        "topology": {
            "mysql1dc1:3306": {
                "address": "mysql1dc1:3306", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE"
            }, 
            "mysql2dc1:3306": {
                "address": "mysql2dc1:3306", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE"
            }, 
            "mysql3dc1:3306": {
                "address": "mysql3dc1:3306", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE"
            }
        }
    }, 
    "groupInformationSourceMember": "mysql://root@mysql2dc1:3306"

Conclusion

As you noticed, MySQL Group Replication and MySQL InnoDB Cluster can be the base of a full MySQL High Availability architecture spanning multiple data centers. MySQL InnoDB Cluster can of course be used with a multitude of other solutions like Orchestrator, ProxySQL, … I’ll try to illustrate these solutions when I have some time, stay tuned!


12 Comments

  1. >We also have an asynchronous replication going from one member of
    >dc1 to the primary master of dc2 and reverse way too.

    Thank you for nice blog posts.
    Just confirmation about replication topology of
    the asynchronous replication between dc1 and dc2.

    Is it best for users to implement this configuration after the MySQL 8.0 GA?
    It includes new features like “WL#7361: MSR: per-channel replication filters”.

    • Hi Shinya.

      This works also on 5.7. I think you are asking this to filter out the metadata (mysql_innodb_cluster_metadata) schema. I’m not a big fan of filtering out stuff usually, but in this case it might help.
      The metadata is already aware of having multiple clusters and instances. The router is the one not being happy with multiple clusters for bootstrap. But you can set it up manually.

      My metadata looks like this:

       MySQL  mysql1dc1:3306   mysql_innodb_cluster_metadata  SQL  select * from clusters\G
      *************************** 1. row ***************************
               cluster_id: 1
             cluster_name: MyGroupDC1
       default_replicaset: 1
              description: Default Cluster
      mysql_user_accounts: NULL
                  options: null
               attributes: {"default": true}
      *************************** 2. row ***************************
               cluster_id: 2
             cluster_name: MyGroupDC2
       default_replicaset: 2
              description: Default Cluster
      mysql_user_accounts: NULL
                  options: null
               attributes: {"default": true}
      2 rows in set (0.00 sec)
      

      and my instances:

       MySQL  mysql1dc1:3306   mysql_innodb_cluster_metadata  SQL  select instance_id, host_id, replicaset_id, instance_name from instances;
      +-------------+---------+---------------+----------------+
      | instance_id | host_id | replicaset_id | instance_name  |
      +-------------+---------+---------------+----------------+
      |           1 |       1 |             1 | mysql1dc1:3306 |
      |           2 |       2 |             1 | mysql2dc1:3306 |
      |           3 |       3 |             1 | mysql3dc1:3306 |
      |           7 |       4 |             2 | mysql4dc2:3306 |
      |          12 |       5 |             2 | mysql5dc2:3306 |
      |          19 |       6 |             2 | mysql6dc2:3306 |
      +-------------+---------+---------------+----------------+
      6 rows in set (0.00 sec)
  2. > Of course the size of the group at that moment is 2 of 2, and when it gets killed even if the Primary one is still active, it has a quorum of 1/2 which is not bigger than 50%. Therefore it stops being available to avoid any split-brain situation.

    The group is still expecting 3 nodes. The sentence should be fixed to:

    Of course the size of the group at that moment is 2 of 3, and when it gets killed even if the Primary one is still active, it has a vote of 1/3 which is not bigger than 50%. Therefore it stops being available to avoid any split-brain situation.

    • Hi Shlomi,
      Thank you for your comment, but in fact, at the moment I kill the 2nd node, the group is not 2/3 but 2/2… because the size of the Group is always updated.

      This means that at start the group is 3/3, then when node is killed, the view of the world is 2/3… then in GCS/XCOM they all agree that this node is gone and the view of the world is 2/2 (but in the metadata of InnoDB Cluster, the cluster.status() knows that there were 3 members… but not GCS/XCOM, Group Replication). This view of the world is always agreed by the consensus between the members in the group.

      I hope this clarifies it a bit more.

      • Hi Fred,
        Then this confuses me. According to the flow you describe, if we begin with 5 good nodes, one goes down, we’re at 4/4. And then another goes down, and we’re at 3/3. And then another goes down and we’re at 2/2 and supposedly there’s a quorum, where in fact 2/5 is not a quorum.
        Nodes that went down need to be considered as being down, or else we lose consensus guarantee.

        Also, if I understand correctly your description is inconsistent, because in order to be consistent, this:

        > Of course the size of the group at that moment is 2 of 2, and when it gets killed even if the Primary one is still active, it has a quorum of 1/2 which is not bigger than 50%. Therefore it stops being available to avoid any split-brain situation.

        should actually change to:

        [ Of course the size of the group at that moment is 2 of 2, and when it gets killed even if the Primary one is still active, it has a quorum of 1/1 which is 100%. ]

        What is the reason you update the group size from 3 to 2 when the first node dies, but not update from 2 to 1 when the second node dies? This is an inconsistency.

        • I may need to write a full blog post on this but indeed… also it depends how you stop a node. So if you have 3 nodes and you stop one (clean shutdown), you go from 3/3 to 2/2; if you properly stop another one, you go from 2/2 to 1/1 and your cluster/resource is still available. Now if you have a 5/5 group and you kill -9 one node you go from 5/5 to 4/5… but then it’s 4/4; then if you kill -9 another one you go from 4/4 to 3/4… and after it’s 3/3… (3/3 = tolerate up to ONE failure)… but if you have 5/5 and you kill -9 3 of them (or 1 by 1 but very quickly, so they don’t have time to agree on the view of the world), the 2 remaining will stop because they don’t have quorum as they go from 5/5 to 2/5.

  3. Hi Fred,

    Thank you for your advice.

    >This works also on 5.7. I think you are asking this to filter out the metadata >(mysql_innodb_cluster_metadata) schema.
    >I’m not a big fan of filtering out stuff usually but in this case it might help.

    Yes. I thought it was required for bidirectional replication, to protect mysql_innodb_cluster_metadata.

    >The metadata is already aware of having multiple clusters and instances.
    >The router is the one not being happy with mutliple clusters for bootstrap.
    >But you can set it up manually.

    I see. I will try it on both MySQL Router and ProxySQL.

    Thanks again.

  4. Hi,
    Thanks for this very useful article.
    Could you please tell us how the asynchronous replication failover, that moves the asynchronous slave task to another member of the same group when the Primary role is moved to another member, is done?

    Thanks
    Youcef HILEM

