MySQL InnoDB Cluster : avoid split-brain while forcing quorum

Disclaimer: it’s better to never change the value of group_replication_unreachable_majority_timeout

This post provides you an overview of what could happen in the worse case but may lead in a very unstable setup

We saw yesterday that when an issue (like network splitting), it’s possible to remain with a partitioned cluster where none of the partition have quorum (majority of members). For more info read how to manage a split-brain situation.

If your read the previous article you notice the red warning about forcing the quorum. As an advice is never too much, let me write it down again here : “Be careful that the best practice is to shutdown the other nodes to avoid any kind of conflicts if they reappear during the process of forcing quorum“.

But if some network problem is happening it might not be possible to shutdown those other nodes. Would it be really bad ?

YES !

Split-Brain

Remember, we were in this situation:

We decided to force the quorum on one of the nodes (maybe the only one we could connect to):

But what could happen if while we do this, or just after, the network problem got resolved ?

In fact we will have that split-brain situation we would like to avoid as much as possible.

Details

So what happen ? And why ?

When we ran cluster.forceQuorumUsingPartitionOf('clusteradmin@mysql1'), this is what we could read in the MySQL error log of that server:

[Warning] [MY-011498] [Repl] Plugin group_replication reported: 
'The member has resumed contact with a majority of the members in the group.
Regular operation is restored and transactions are unblocked.'
[Warning] [MY-011499] [Repl] Plugin group_replication reported:
'Members removed from the group: mysql2:3306, mysql3:3306'

The node ejected the other nodes of the cluster and of course no decision was communicate to these servers are they were not reachable anyway.

Now when the network situation was solved, this is what we could read on mysql2:

[Warning] [MY-011494] [Repl] Plugin group_replication reported: 
'Member with address mysql3:3306 is reachable again.'
[Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The
member has resumed contact with a majority of the members in the group.
Regular operation is restored and transactions are unblocked.'
[Warning] [MY-011499] [Repl] Plugin group_replication reported:
'Members removed from the group: mysql1:3306

Same on mysql3, this means these two nodes reached majority together and ejected mysql1 from “their” cluster.

On mysql1, we can see in performance_schema:

mysql> select * from performance_schema.replication_group_members\G
************************** 1. row **************************
CHANNEL_NAME: group_replication_applier
MEMBER_ID: fb819b30-5b90-11e9-bf8a-08002718d305
MEMBER_HOST: mysql4
MEMBER_PORT: 3306
MEMBER_STATE: ONLINE
MEMBER_ROLE: PRIMARY
MEMBER_VERSION: 8.0.16
1 row in set (0.0013 sec)

An on mysql2 and mysql3:

mysql> select * from performance_schema.replication_group_members\G
************************** 1. row **************************
CHANNEL_NAME: group_replication_applier
MEMBER_ID: 4ff0a33f-5c49-11e9-abc9-08002718d305
MEMBER_HOST: mysql6
MEMBER_PORT: 3306
MEMBER_STATE: ONLINE
MEMBER_ROLE: SECONDARY
MEMBER_VERSION: 8.0.16
************************** 2. row **************************
CHANNEL_NAME: group_replication_applier
MEMBER_ID: f8ac8d14-5b90-11e9-a22a-08002718d305
MEMBER_HOST: mysql5
MEMBER_PORT: 3306
MEMBER_STATE: ONLINE
MEMBER_ROLE: PRIMARY
MEMBER_VERSION: 8.0.16

This is of course the worse situation that could happen when dealing with a cluster.

Solution

The solution is to prevent the nodes not being part of the the forced quorum partition to agree making their own group as they will have a majority.

This can be achieve by setting these variables on an majority of nodes (on two servers if your InnoDB Cluster is made of 3 nodes for example):

When I fixed again my cluster and all were again online, I changed these settings on mysql1 and mysql2:

set global group_replication_unreachable_majority_timeout=30;
set global group_replication_exit_state_action = 'ABORT_SERVER';

This means that if there a problem and the node is not able to join the majority after 30 seconds it will go in ERROR state and then shutdown `mysqld`.

Pay attention that the 30sec is only an example. The time should allow me to remove that timer on the node I want to use for forcing the quorum (mysql1 in the example) but also be sure that time is elapsed on some nodes I can’t access to be sure they removed themselves from the group (mysql2 in the example).

So, if we try again with our example, once the network problem is happening, after 30sec, we can see in mysql2‘s error log that is working as expected:

[ERROR] [MY-011711] [Repl] Plugin group_replication reported: 'This member could not reach 
a majority of the members for more than 30 seconds. The member will now leave
the group as instructed by the group_replication_unreachable_majority_timeout
option.'
[ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically
set into read only mode after an error was detected.'
[Warning] [MY-013373] [Repl] Plugin group_replication reported: 'Started
auto-rejoin procedure attempt 1 of 1'
[ERROR] [MY-011735] [Repl] Plugin group_replication reported:
'[GCS] Timeout while waiting for the group communication engine to exit!'
[ERROR] [MY-011735] [Repl] Plugin group_replication reported:
'[GCS] The member has failed to gracefully leave the group.'
[System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier'
executed'. Previous state master_host='', master_port= 0,
master_log_file='', master_log_pos= 798,
master_bind=''. New state master_host='', master_port= 0,
master_log_file='', master_log_pos= 4, master_bind=''.
[ERROR] [MY-011735] [Repl] Plugin group_replication reported:
'[GCS] Error connecting to the local group communication engine instance.'
[ERROR] [MY-011735] [Repl] Plugin group_replication reported:
'[GCS] The member was unable to join the group. Local port: 33061'
[Warning] [MY-013374] [Repl] Plugin group_replication reported:
'Timeout while waiting for a view change event during the auto-rejoin procedure'
[Warning] [MY-013375] [Repl] Plugin group_replication reported:
'Auto-rejoin procedure attempt 1 of 1 finished.
Member was not able to join the group.'
[ERROR] [MY-013173] [Repl] Plugin group_replication reported:
'The plugin encountered a critical error and will abort:
Could not rejoin the member to the group after 1 attempts'
[System] [MY-013172] [Server] Received SHUTDOWN from user .
Shutting down mysqld (Version: 8.0.16).
[Warning] [MY-010909] [Server] /usr/sbin/mysqld:
Forcing close of thread 10 user: 'clusteradmin'.
[Warning] [MY-010909] [Server] /usr/sbin/mysqld:
Forcing close of thread 35 user: 'root'.
[ERROR] [MY-011735] [Repl] Plugin group_replication reported:
'[GCS] The member is leaving a group without being on one.'
[System] [MY-010910] [Server] /usr/sbin/mysqld:
Shutdown complete (mysqld 8.0.16) MySQL Community Server - GPL.
[Warning] [MY-010909] [Server] /usr/sbin/mysqld: Forcing close
of thread 10 user: 'clusteradmin'.
[Warning] [MY-010909] [Server] /usr/sbin/mysqld: Forcing close
of thread 35 user: 'root'.
[ERROR] [MY-011735] [Repl] Plugin group_replication reported:
'[GCS] The member is leaving a group without being on one.'
[System] [MY-010910] [Server] /usr/sbin/mysqld:
Shutdown complete (mysqld 8.0.16) MySQL Community Server - GPL

And when the quorum has been forced on mysql1, as soon as the network issue is resolved, none will join the Group and the DBA will have to use the shell to perform cluster.rejoinInstance(instance) or restart mysqld on the instances that shutdown themselves.

Conclusion

So as you can see, by default MySQL InnoDB Cluster and Group Replication is very protective for split-brain situation. And it can even be enforced to avoid problem when human interaction is needed.

The rule of the thumb to avoid problem is to set group_replication_unreachable_majority_timeoutto something you can deal with and group_replication_exit_state_action to ABORT_SERVER on (total amount of members in the cluster /2 )+1 as integer 😉

If you have 3 nodes, on 2 then ! Of course it might be much simpler to set it on all nodes.

Be aware that if you don’t react in the time frame defined by group_replication_unreachable_majority_timeout, all your servers will shutdown and you will have to restart one.

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

16 Comments

  1. Please DO NOT USE THIS IN PRODUCTION, by setting group_replication_unreachable_majority_timeout to not default value of 0 (infinite), network partition handling is no longer automatic, meaning: members will decide to leave the group if they lose connection with the majority of the group. (in the blog post example, a 30 second network outage CAUSES ALL MEMBERS TO SHUT DOWN leaving you with no cluster).

  2. Hi lefred ,

    I have tested your scenarios more than 5 times , and the scenario that might stop my application from connect to cluster in case of network failure was very simple :
    1. One router setup
    2. 3 DB nodes ( Cluster setup done)
    simple what i did i just disable the network interface from the VM and the result was
    router logs :
    WARNING [7f5eb378c700] Failed connecting with Metadata Server mysqldb01.localhost.com:3306: Can’t connect to MySQL server on ‘mysqldb01.localhost.com’ (113) (2003)
    2020-03-06 14:04:01 metadata_cache ERROR [7f5eb378c700] Failed to connect to metadata server 6ffa0517-4137-11ea-8cf9-0800271c9427
    2020-03-06 14:04:01 metadata_cache DEBUG [7f5eb378c700] Connected with metadata server running on mysqldb02.localhost.com:3306
    2020-03-06 14:04:01 metadata_cache DEBUG [7f5eb378c700] Updating metadata information for cluster ‘testCluster’
    2020-03-06 14:04:01 sql DEBUG [7f5eb378c700] Executing query: SELECT R.replicaset_name, I.mysql_server_uuid, I.role, I.weight, I.version_token, H.location, I.addresses->>’$.mysqlClassic’, I.addresses->>’$.mysqlX’ FROM mysql_innodb_cluster_metadata.clusters AS F JOIN mysql_innodb_cluster_metadata.replicasets AS R ON F.cluster_id = R.cluster_id JOIN mysql_innodb_cluster_metadata.instances AS I ON R.replicaset_id = I.replicaset_id JOIN mysql_innodb_cluster_metadata.hosts AS H ON I.host_id = H.host_id WHERE F.
    2020-03-06 14:04:01 sql DEBUG [7f5eb378c700] Done executing query
    2020-03-06 14:04:01 metadata_cache DEBUG [7f5eb378c700] Updating replicaset status from GR for ‘default’
    2020-03-06 14:04:04 metadata_cache WARNING [7f5eb378c700] While updating metadata, could not establish a connection to replicaset ‘default’ through mysqldb01.localhost.com:3306

    and DB Node 3
    [ERROR] [MY-011495] [Repl] Plugin group_replication reported: ‘This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.’

    how to react in this kind of scenario ?

  3. How does Innodb cluster behaves in case of 7 nodes cluster in multi-primary mode. This 7 nodes are spitted across 2 DCs. Meaning, 4 nodes from DC-1 and 3 nodes from DC-2 forming one single InnoDB cluster, each DC having its own Router. This architecture is meant for Active-Active on two DCs.
    Now In case, if there is network partition amongst the nodes and 4 nodes at DC-1 are separated from the other 3 nodes of DC-2 then :~~~~~>>
    Will the Router on DC-2 be able to make client connection and write on the separated nodes on DC-2. If this happens, data will be inconsistent? OR the connection will be routed to DC-1 router where it can see the majority of nodes in the cluster online (4 nodes)?
    Ideally it should be routed the connection to DC-1 since the majority of certification will be successfully done from DC-1 nodes.

    Please suggest how does Innodb cluster will behave and let us make sure what will be the ideal behaviour in this case.

    • For automated HA, rules of 3 is always the key rule !

      3 nodes, 3 datacenters…

      4 +3, if there is a network issue isolating the two sites, the one with 4 will be the one where the members will still work. In case of dc with 4 nodes blow up (or is not reachable), an operator will have to manually explicitly force the remaining nodes (minority) in DC2 to take the lead by forcing the Quorum.

      • About the last scenario: An operator will have to manually explicitly force the remaining nodes (minority) in DC2 to take the lead by forcing the Quorum.

        –>> We have tested the Quorum situation as well, that if out of 7 nodes even 5 nodes are down, cluster is able to accept the client connection on remaining 2 nodes left online in the cluster. Instead it should have lost quorum in this case, since majority of nodes (4 to 5 nodes) are missing from the cluster.
        Hence we are uncertain about this behaviour that, whether both the DCs will be able to write on the nodes (as all nodes are R/W). This seems to be risky and will make the data inconsistent.

          • This is what I meant too, this is a wrong test 😉

            Your Group is 7 of 7, then you kill one it becomes 6/7 for 1 min (time to eject the missing of the group) and then you are 6 of 6… and then you kill another one 5 of 6…and it becomes 5 of 5… and so on… you get the idea… 😉

            If you kill 4 of 7 in one shot, you will get 3 of 7 –> no quorum

          • Thanks for the quick support!

            We will test this case by killing 4 of 7 in one shot, to check quorum loss situation.

  4. I have one last question, in 4 + 3, If one node goes down in DC1, this will make 3 + 3 cluster (even no of nodes), Immediately after this, a network partition happen. What will happen at this stage?. I think the whole cluster goes down because they can’t form any majority.

  5. So, in 3 + 3 case, we need to manually down on of the node to get Odd number of nodes (will make 5 = 3+2) otherwise, We have to prepare for the down time,

    Or Is there anything, which can make a cluster in always odd no of nodes, like cluster automatically drop one node If it find even no nodes (Finger crossed).

      • Could you please elaborate, how it will benefit in this case. As per our views, 3+3+3 in Single Multi-primary InnoDB Cluster across 3 DCs will behave similarly in case of network partition between all three DCs. So, it will form split brain of 3/3 + 3/3 +3/3.

        If I am not wrong, whole cluster will go down.

        • With 3 DC’s what are the chances that 2 blow up at the same time? So you will always have 2/3 online. Now if 1 node dies on each, your are 6 of 9 … no problem, then one DC crashes, 4 of 9 still quorum

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.