MySQL InnoDB Cluster: how to handle a performance issue on one member?

Sometimes when you are using a MySQL InnoDB Cluster, you might encounter performance issues because one node becomes dramatically slow.

Why?

First of all, why? A node can apply transactions more slowly than the other nodes for many different reasons. The most frequent ones are slower disks (remember, it’s advised to have nodes with the same specifications); also, if you are using a RAID controller with a BBU, the write performance can decrease by a factor of 10 or even more during its learning cycle. Another example is an increase in I/O operations that saturates the full I/O capacity of the system: making a local backup or sharing the server resources with other components can lead to such behavior.

Flow Control

To avoid having a node lagging too much behind, and to try to sustain the same throughput all over the cluster, Group Replication uses a flow control mechanism (see this post to understand how it works). In summary, when a node’s apply queue grows and reaches a threshold, the other members slow down the rate at which they commit transactions to give the lagging member time to recover.
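
As a reference, you can inspect the flow control configuration of a member with the statement below (the defaults mentioned afterwards are the usual ones, verify them on your own version):

mysql> SHOW GLOBAL VARIABLES LIKE 'group_replication_flow_control%';

On a default configuration, group_replication_flow_control_mode is QUOTA and both group_replication_flow_control_applier_threshold and group_replication_flow_control_certifier_threshold are set to 25000 transactions.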

And me?

But in such a situation, is there something I can do?

In fact, yes! There are two types of situations leading to this:

  1. an unforeseen incident
  2. a scheduled operation

The first case is exactly why you must monitor your MySQL InnoDB Cluster. In the performance_schema.replication_group_member_stats table, you can find the amount of transactions in the apply queue (COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE):

select * from performance_schema.replication_group_member_stats\G
*************************** 1. row ***************************
                              CHANNEL_NAME: group_replication_applier
                                   VIEW_ID: 15100558540053980:3
                                 MEMBER_ID: a2f6f75f-c3b2-11e7-9324-08002718d305
               COUNT_TRANSACTIONS_IN_QUEUE: 0
                COUNT_TRANSACTIONS_CHECKED: 0
                  COUNT_CONFLICTS_DETECTED: 0
        COUNT_TRANSACTIONS_ROWS_VALIDATING: 0
        TRANSACTIONS_COMMITTED_ALL_MEMBERS: a38ed37b-c3b2-11e7-94e6-08002718d305:1-14,
d59a083e-c3b2-11e7-b358-08002718d305:1-18
            LAST_CONFLICT_FREE_TRANSACTION: d59a083e-c3b2-11e7-b358-08002718d305:18
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
         COUNT_TRANSACTIONS_REMOTE_APPLIED: 0
         COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
         COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
*************************** 2. row ***************************
                              CHANNEL_NAME: group_replication_applier
                                   VIEW_ID: 15100558540053980:3
                                 MEMBER_ID: a301c6af-c3b2-11e7-98c8-08002718d305
               COUNT_TRANSACTIONS_IN_QUEUE: 0
                COUNT_TRANSACTIONS_CHECKED: 7
                  COUNT_CONFLICTS_DETECTED: 0
        COUNT_TRANSACTIONS_ROWS_VALIDATING: 0
        TRANSACTIONS_COMMITTED_ALL_MEMBERS: a38ed37b-c3b2-11e7-94e6-08002718d305:1-14,
d59a083e-c3b2-11e7-b358-08002718d305:1-18
            LAST_CONFLICT_FREE_TRANSACTION: d59a083e-c3b2-11e7-b358-08002718d305:18
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
         COUNT_TRANSACTIONS_REMOTE_APPLIED: 8
         COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
         COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
*************************** 3. row ***************************
                              CHANNEL_NAME: group_replication_applier
                                   VIEW_ID: 15100558540053980:3
                                 MEMBER_ID: a38ed37b-c3b2-11e7-94e6-08002718d305
               COUNT_TRANSACTIONS_IN_QUEUE: 0
                COUNT_TRANSACTIONS_CHECKED: 15
                  COUNT_CONFLICTS_DETECTED: 0
        COUNT_TRANSACTIONS_ROWS_VALIDATING: 0
        TRANSACTIONS_COMMITTED_ALL_MEMBERS: a38ed37b-c3b2-11e7-94e6-08002718d305:1-14,
d59a083e-c3b2-11e7-b358-08002718d305:1-18
            LAST_CONFLICT_FREE_TRANSACTION: d59a083e-c3b2-11e7-b358-08002718d305:18
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
         COUNT_TRANSACTIONS_REMOTE_APPLIED: 2
         COUNT_TRANSACTIONS_LOCAL_PROPOSED: 15
         COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
3 rows in set (0.00 sec)

Here is an example of a query you can then use:

mysql> SELECT MEMBER_HOST, COUNT_TRANSACTIONS_IN_QUEUE TRX_LOCAL_Q, 
              COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE TRX_APPLY_Q 
       FROM performance_schema.replication_group_member_stats t1 
       JOIN performance_schema.replication_group_members t2 
         ON t2.MEMBER_ID=t1.MEMBER_ID;
+-------------+-------------+-------------+
| MEMBER_HOST | TRX_LOCAL_Q | TRX_APPLY_Q |
+-------------+-------------+-------------+
| mysql1      |           0 |           0 |
| mysql3      |           0 |        8415 |
| mysql2      |           0 |           0 |
+-------------+-------------+-------------+
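
If you want to be alerted automatically, a simple variation of the query above flags any member whose apply queue exceeds a threshold of your choice (the value 1000 below is arbitrary, pick one that matches your workload):

mysql> SELECT t2.MEMBER_HOST, 
              t1.COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE TRX_APPLY_Q 
       FROM performance_schema.replication_group_member_stats t1 
       JOIN performance_schema.replication_group_members t2 
         ON t2.MEMBER_ID=t1.MEMBER_ID 
       WHERE t1.COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE > 1000;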

You can also use a SYS schema view I created (get it here):

select * from sys.gr_member_routing_candidate_status;
+------------------+-----------+---------------------+----------------------+
| viable_candidate | read_only | transactions_behind | transactions_to_cert |
+------------------+-----------+---------------------+----------------------+
| YES              | NO        |                   0 |                    0 |
+------------------+-----------+---------------------+----------------------+
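
The transactions_behind column of this view can also serve as the lagging indicator in your own health checks, for example (the threshold of 100 is just an illustration):

mysql> SELECT viable_candidate, read_only, transactions_behind 
       FROM sys.gr_member_routing_candidate_status 
       WHERE transactions_behind > 100;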

Safe Procedure

So when this happens (expected or not), how should our cluster behave with that member?

When the incident lasts too long and flow control starts to kick in, the whole cluster starts slowing down… If the incident is expected to take a long time to solve, or if it is a maintenance, the first thing we want to do is to stop sending MySQL traffic to the node (reads, and/or writes in the case of a Multi-Primary cluster). Then we want that specific node to stop sending its flow control statistics to the other nodes.

If you are using ProxySQL as the router for your MySQL InnoDB Cluster, as soon as the transactions_behind threshold is reached, ProxySQL automatically stops sending traffic to that member (see this post). If you are using MySQL Router, currently the best way to handle this is to add a firewall rule blocking the communication between the router and the MySQL instance running on that member.

Once we have stopped sending traffic to the member, we can finally tell it to stop sending its flow control statistics (so they are ignored):

mysql> set global group_replication_flow_control_mode='DISABLED';

Now that this node won’t trigger flow control anymore, the cluster will run at its optimal speed, and this gives you extra time to fix the problem or finish the maintenance. The node is still part of the cluster! It’s just lagging, and nobody cares.

That’s why, in a multi-primary setup, it’s recommended not to write to it, as conflicts may happen very often; and since it’s lagging, reading stale data from it might also not be optimal.

As soon as the maintenance is finished and the queue has recovered, you can set it back to ‘QUOTA’. This feature works as expected in 8.0.3; on lower releases the statistics were not completely ignored and flow control kicked in anyway.
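
Setting it back is just the reverse of the statement above:

mysql> set global group_replication_flow_control_mode='QUOTA';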

Let’s see this in action in the following video cast:
