MySQL InnoDB Cluster: how to handle performance issue on one member ?

lefred
November 9, 2017
General, Group Replication, InnoDB Cluster, MySQL
13 Comments

Sometimes when you are using a MySQL InnoDB Cluster, you might encounter some performance issue because one node becomes dramatically slow.

Why ?

First of all, why ? A node can apply the transactions slower than the other nodes for many different reasons. The most frequents are for example, slower disks (remember, it’s advised to have nodes with the same specifications), but if you are using a RAID controller with a BBU, during the learning cycle, the write performance can decrease by 10 or even more. Another example could be an increase of IO operations that will flood the full IO capacity of the system. Making a local backup or sharing the server resources with some other components could lead in such behavior.

Flow Control

To avoid to have a node lagging to much behind and try to sustain the same throughput all over the cluster, Group Replication uses a flow control mechanism (see this post to understand how it works). In summary, when a node as an apply queue increasing and reaching a threshold, the other ones will slow down the amount of transactions they commit to let the slowing member the time to recover.

And me ?

But in such situation, is there something I can do ?

In fact, yes ! There are two types of cases leading to such situation:

an unforeseen incident
a scheduled operation

For the first case, that’s why you must monitor your MySQL InnoDB Cluster. In the performance_schema.replication_group_member_stats table, you have the mount of transaction in the apply queue (COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE):

select * from replication_group_member_stats\G
*************************** 1. row ***************************
                              CHANNEL_NAME: group_replication_applier
                                   VIEW_ID: 15100558540053980:3
                                 MEMBER_ID: a2f6f75f-c3b2-11e7-9324-08002718d305
               COUNT_TRANSACTIONS_IN_QUEUE: 0
                COUNT_TRANSACTIONS_CHECKED: 0
                  COUNT_CONFLICTS_DETECTED: 0
        COUNT_TRANSACTIONS_ROWS_VALIDATING: 0
        TRANSACTIONS_COMMITTED_ALL_MEMBERS: a38ed37b-c3b2-11e7-94e6-08002718d305:1-14,
d59a083e-c3b2-11e7-b358-08002718d305:1-18
            LAST_CONFLICT_FREE_TRANSACTION: d59a083e-c3b2-11e7-b358-08002718d305:18
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
         COUNT_TRANSACTIONS_REMOTE_APPLIED: 0
         COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
         COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
*************************** 2. row ***************************
                              CHANNEL_NAME: group_replication_applier
                                   VIEW_ID: 15100558540053980:3
                                 MEMBER_ID: a301c6af-c3b2-11e7-98c8-08002718d305
               COUNT_TRANSACTIONS_IN_QUEUE: 0
                COUNT_TRANSACTIONS_CHECKED: 7
                  COUNT_CONFLICTS_DETECTED: 0
        COUNT_TRANSACTIONS_ROWS_VALIDATING: 0
        TRANSACTIONS_COMMITTED_ALL_MEMBERS: a38ed37b-c3b2-11e7-94e6-08002718d305:1-14,
d59a083e-c3b2-11e7-b358-08002718d305:1-18
            LAST_CONFLICT_FREE_TRANSACTION: d59a083e-c3b2-11e7-b358-08002718d305:18
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
         COUNT_TRANSACTIONS_REMOTE_APPLIED: 8
         COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
         COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
*************************** 3. row ***************************
                              CHANNEL_NAME: group_replication_applier
                                   VIEW_ID: 15100558540053980:3
                                 MEMBER_ID: a38ed37b-c3b2-11e7-94e6-08002718d305
               COUNT_TRANSACTIONS_IN_QUEUE: 0
                COUNT_TRANSACTIONS_CHECKED: 15
                  COUNT_CONFLICTS_DETECTED: 0
        COUNT_TRANSACTIONS_ROWS_VALIDATING: 0
        TRANSACTIONS_COMMITTED_ALL_MEMBERS: a38ed37b-c3b2-11e7-94e6-08002718d305:1-14,
d59a083e-c3b2-11e7-b358-08002718d305:1-18
            LAST_CONFLICT_FREE_TRANSACTION: d59a083e-c3b2-11e7-b358-08002718d305:18
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
         COUNT_TRANSACTIONS_REMselect * from sys.gr_member_routing_candidate_statusOTE_APPLIED: 2
         COUNT_TRANSACTIONS_LOselect * from sys.gr_member_routing_candidate_statusCAL_PROPOSED: 15
         COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
3 rows in set (0.00 sec)

This is a example of query you can then use:

mysql> SELECT MEMBER_HOST, COUNT_TRANSACTIONS_IN_QUEUE TRX_LOCAL_Q, 
              COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE TRX_APPLY_Q 
       FROM performance_schema.replication_group_member_stats t1 
       JOIN performance_schema.replication_group_members t2 
         ON t2.MEMBER_ID=t1.MEMBER_ID;
+-------------+-------------+-------------+
| MEMBER_HOST | TRX_LOCAL_Q | TRX_APPLY_Q |
+-------------+-------------+-------------+
| mysql1      |           0 |           0 |
| mysql3      |           0 |        8415 |
| mysql2      |           0 |           0 |
+-------------+-------------+-------------+

You can also use a SYS schema view I created (get it here):

select * from sys.gr_member_routing_candidate_status;
+------------------+-----------+---------------------+----------------------+
| viable_candidate | read_only | transactions_behind | transactions_to_cert |
+------------------+-----------+---------------------+----------------------+
| YES              | NO        |                   0 |                    0 |
+------------------+-----------+---------------------+----------------------+

Safe Procedure

So when this happens (expected or not), how should our cluster behave with that member ?

When the incident is too long and flow controls starts to kick in, all the cluster will start slowing down… if this is an incident expected to be long to solve or a maintenance, the first thing we would like to do is to stop sending MySQL traffic to the node (reads and/or writes in case of Multi-Primary cluster). Then we want that the specific node stops sending its statistics to the other nodes.

If you are using ProxySQL as router for your MySQL InnoDB Cluster, as soon as the threshold of transactions_behind is reached, ProxySQL will automatically stop sending traffic to it (see this post). If you are using MySQL-Router, actually the best way to handle this is to add a firewall rule blocking the communication between the router and the MySQL instance running on that member.

Then when we stop sending traffic to the member, we can finally tell it to stop sending its statistics related to the flow control (to be ignored):

mysql> set global group_replication_flow_control_mode='DISABLED';

Now that this node won’t trigger any flow control anymore, the cluster will run as its optimal speed and this provides you extra time to fix the problem or finish the maintenance. The node is still part of the cluster !… it’s just lagging and nobody cares.

That’s why in a multi-primary setup, it’s recommended to not write to it as conflicts may happen very often and as it’s lagging, reading old data might also not be optimal.

As soon as the maintenance is finished and the queue recovered, you can set it back to ‘QUOTA’. This feature works as expected in 8.0.3, on lower releases the statistics were not completely ignored and flow control was kicked anyway.

Let’s see this in action in the following video cast:

13 Comments

MySQL InnoDB Cluster: how to handle performance issue on one member ? - SSWUG.ORG

December 27, 2017 / 09:10 Reply

[…] Read More (Community content) […]
Amit Wadkar

October 4, 2018 / 16:51 Reply

Can you please write a post for InnoDB cluster monitoring (in detail)
Saeed

October 17, 2018 / 09:45 Reply

Can you please help us with following issue?
why is the auto increment option is acting as in group replication after configuration of MySQL innodb cluster as:
auto_increment_increment=7 and offset is 2
while there is only one primary node for read and write so it should be like
auto_increment_increment=1 and offset=2 .
Nandish

April 5, 2020 / 21:12 Reply

Getting below error while creating the 2nd function gr_member_in_primary_partition()
Error :ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ” at line 9
- lefred
  
  April 6, 2020 / 10:49 Reply
  
  Hi Nandish,
  
  Thank you for trying MySQL InnoDB Cluster.
  
  Which version of MySQL are you using ?
  
  The latest script to add this view is here: https://gist.github.com/lefred/6f79fd02d333851b8d18f52716f04d91
  
  Regards,
Felix

October 28, 2020 / 16:24 Reply

hi lefred,
i’m just trying the innodb cluster, and find its performance is kind of bad. I used yahoo’s opensource benchmark tool YCSB to make the performace test, and found: when insert 10000 items, the throughput is nearly 1300 ops/sec, by contract, when using mysql server alone, the throughput is nearly 3500+. I have two quetions:
1. is my performace normal or too slow?
2. which vairiables should we focus on , to tune the performance?

here is my environment details:
1. mysql version: 8.0.21, single-primary
2. vm: 16c 32G ,100G ssd,
cpu： Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
3. variables for tuning
innodb_buffer_pool_size=20G
# 480M
innodb_log_file_size=503316480
innodb_log_files_in_group=4
innodb_flush_log_at_trx_commit=2
innodb_flush_method=O_DIRECT
tmp_table_size=256M
max_heap_table_size=256M
#sync_binlog=1000
gtid_mode=ON
enforce_gtid_consistency=ON
master_info_repository=TABLE
relay_log_info_repository=TABLE
binlog_checksum=NONE
log_slave_updates=ON
log_bin=binlog
binlog_format=ROW
max_connections=3000
max_allowed_packet=64M

sysctl.conf
fs.aio-max-nr = 1048576
fs.file-max = 681574400
# 20G
kernel.shmmax = 21737418240
# default 4096
kernel.shmmni = 4096
kernel.sem = 1024 32000 100 2048
net.ipv4.ip_local_port_range = 9000 65000
net.core.rmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_default = 262144
net.core.wmem_max = 16777216
vm.swappiness = 0

hmm, I’m a chinese engineer and not so good at English. Hope i have made myself clear.
Looking forward for your reply. Thanks for your great post.
- lefred
  
  October 28, 2020 / 16:28 Reply
  
  Hi Felix,
  
  Thank you for using MySQL InnoDB Cluster. The most important information is what is the latency between your nodes ?
  
  I don’t know Yahoo!’s benchmark, but 3500 doesn’t seems high to me either.
  
  Regards,
  - Felix
    
    October 28, 2020 / 16:56 Reply
    
    hi lefred,
    Thanks for your reply. I don’t think “latency between nodes” is an issue, the nodes are in a same network segment, and sharing the 10.0.0.0/24 ip.
    
    Why my cluster’s performance turns out to be so bad. Is there any mistakes in my configuration files? Do you have any best practices for the my.cnf file?
    
    And also, if the innodb cluster server tunes well, how good will be the insert throughput with a threads of 50?
    
    Regards
    - lefred
      
      October 28, 2020 / 17:01 Reply
      
      please verify the latency, and do they all have same specs ?
      - Felix
        
        October 28, 2020 / 17:15
        
        hi lefred,
        mysql nodes: 10.0.0.233/234/235
        router: 10.0.0.231
        YCSB: 10.0.0.230
        [root@node3 ycsb-0.17.0]# tracepath 10.0.0.234
        1?: [LOCALHOST] pmtu 1500
        1: 10.0.0.234 0.306ms reached
        1: 10.0.0.234 0.181ms reached
        Resume: pmtu 1500 hops 1 back 1
        [root@node3 ycsb-0.17.0]# tracepath 10.0.0.233
        1?: [LOCALHOST] pmtu 1500
        1: 10.0.0.233 0.217ms reached
        1: 10.0.0.233 0.183ms reached
        Resume: pmtu 1500 hops 1 back 1
        [root@node3 ycsb-0.17.0]# tracepath 10.0.0.235
        1?: [LOCALHOST] pmtu 1500
        1: 10.0.0.235 0.321ms reached
        1: 10.0.0.235 0.160ms reached
        [root@node3 ycsb-0.17.0]# tracepath 10.0.0.231
        1?: [LOCALHOST] pmtu 1500
        1: iZwz9bhr4680qgnhkweovbZ 0.187ms reached
        1: iZwz9bhr4680qgnhkweovbZ 0.128ms reached
        Resume: pmtu 1500 hops 1 back 1
        
        Mysql nodes all have the same specs.
      - lefred
        
        October 28, 2020 / 17:18
        
        How do you connect to your cluster ? via MySQL Router ?
Felix

October 28, 2020 / 17:25 Reply

Hi lefred,
Yes, when do the benchmark test, i connect the server via router. I’ve also tried directly connect the server via cluster’s r/w node’s 3306 port. The performace data shows the same.

Regards
- Adriaan de Haan
  
  October 30, 2020 / 21:39 Reply
  
  Did you ever determine the reason for the slowness?

Cookie	Duration	Description
cookielawinfo-checbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

MySQL InnoDB Cluster: how to handle performance issue on one member ?

Why ?

Flow Control

And me ?

Safe Procedure

follow me

Subscribe to Blog via Email

13 Comments

Leave a ReplyCancel Reply