When using MySQL Group Replication, some members may lag behind the rest of the group because of load, hardware limitations, and so on. This lag makes it harder to keep certification performing well and to keep certification failures as low as possible: the bigger the applier queue, the higher the risk of conflicts with transactions that have not been applied yet (this is problematic on Multi-Primary groups).
Galera users are already familiar with this concept, known as flow control. MySQL Group Replication’s implementation differs in two main aspects:
- the Group is never totally stalled
- the node having issues doesn’t send flow control messages to the rest of the group asking it to slow down
In fact, every member of the Group sends statistics about its queues (applier queue and certification queue) to the other members. Each node then decides whether to slow down when it sees that one node has reached the threshold for one of the queues:
- group_replication_flow_control_applier_threshold (default is 25000)
- group_replication_flow_control_certifier_threshold (default is 25000)
So when group_replication_flow_control_mode is set to QUOTA on a node that sees one of the other members of the cluster lagging behind (threshold reached), that node throttles its write operations to the minimum quota. This quota is calculated from the number of transactions applied in the last second, and is then reduced further by subtracting the “over the quota” messages from the last period.
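As a rough illustration of the calculation described above, here is a minimal Python sketch. It is not the server's actual code: the function name `throttled_quota`, its parameters, and the explicit `floor` argument are hypothetical, and the lower bound is supplied by the caller because its exact definition is not given at this point.

```python
def throttled_quota(applied_last_second: int,
                    over_quota_last_period: int,
                    floor: int = 1) -> int:
    """Illustrative quota: last second's applied transactions minus the
    'over the quota' messages of the last period, never below `floor`.

    All names are hypothetical; `floor` is an assumed lower bound.
    """
    if min(applied_last_second, over_quota_last_period, floor) < 0:
        raise ValueError("all inputs must be non-negative")
    # Start from what was applied during the last second...
    quota = applied_last_second - over_quota_last_period
    # ...and never let the result fall below the assumed minimum.
    return max(quota, floor)
```

For instance, with 1000 transactions applied in the last second and 200 messages over the quota, this sketch yields a quota of 800. If the subtraction would fall below the floor, the floor is used instead.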
This means that, contrary to Galera, where the decision is taken on the node that is slow, in MySQL Group Replication the node writing a transaction checks its own flow control threshold values and compares them to the statistics received from the other nodes to decide whether to throttle.
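The decision made on the writing node can be sketched in Python as follows. This is only a model of the behavior described here, not the plugin's implementation: `MemberStats`, `lagging_members`, and `should_throttle` are hypothetical names, and treating "threshold reached" as a strict "greater than" comparison is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Defaults of group_replication_flow_control_applier_threshold and
# group_replication_flow_control_certifier_threshold.
APPLIER_THRESHOLD = 25000
CERTIFIER_THRESHOLD = 25000


@dataclass
class MemberStats:
    """Queue sizes one member reports to the rest of the group (hypothetical type)."""
    member_id: str
    certifier_queue: int
    applier_queue: int


def lagging_members(stats: Iterable[MemberStats],
                    applier_threshold: int = APPLIER_THRESHOLD,
                    certifier_threshold: int = CERTIFIER_THRESHOLD) -> List[str]:
    """Return the ids of members whose queues exceed the local thresholds."""
    return [
        s.member_id
        for s in stats
        # Assumption: strict comparison; the exact boundary is not specified.
        if s.applier_queue > applier_threshold
        or s.certifier_queue > certifier_threshold
    ]


def should_throttle(stats: Iterable[MemberStats],
                    mode: str = "QUOTA") -> bool:
    """The local node throttles only in QUOTA mode and if some member lags."""
    return mode == "QUOTA" and bool(lagging_members(stats))
```

Note that the thresholds used are those configured on the node doing the writing, not on the lagging node. On MySQL 8.0 the queue sizes could be read from performance_schema.replication_group_member_stats (COUNT_TRANSACTIONS_IN_QUEUE and COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE); mapping those columns onto this sketch is left as an assumption.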
You can find more information about Group Replication Flow Control in Vitor’s article, Zooming-in on Group Replication Performance.
Fred,
What happens in MySQL Group Replication if some node drastically slows down? Would the quota be adjusted, or would such an overly slow node leave the cluster?
Hi Peter,
Thank you for your comment. In fact, the cluster will just continue to slow down.
The group quota is calculated based on the number of transactions applied in the last second, and then it is reduced below that by subtracting the “over the quota” messages from the last period (with a 5% minimum). A stopped node would maintain that throughput indefinitely while the blocked node is not applying.
So even if a node is not applying anything (applier queue growing), the node won’t leave the group. The decision to leave the cluster is based only on network reliability. So if the node is not able to apply but continues to receive the events, keeps certifying them and inserts them into its relay log, it won’t be expelled from the group.
[…] sustain the same throughput all over the cluster, Group Replication uses a flow control mechanism (see this post to understand how it works). In summary, when a node as an apply queue increasing and reaching a threshold, the other ones […]
Fred,
I wonder, is there any way to avoid slave lag? Thank you.
Hi jfxu,
Do you mean in Group Replication ?
Using multiple workers, using LOGICAL_CLOCK as the parallel type, and keeping transactions small already helps.
Now, as writesets are applied asynchronously, you can’t be 100% sure lag won’t happen, but you can check for it. For example, ProxySQL 2.0 (not yet GA) makes use of session_track_gtids (see http://lefred.be/content/mysql-group-replication-read-your-own-write-across-the-group/) to point to nodes that already have the data. See https://fosdem.org/2018/schedule/event/proxysql_gtid/
Regards,
We have a Multi-Primary MGR with three nodes (GR1, GR2, GR3). GR1 is the only writer and GR3 reaches the flow control threshold. If we execute “set global group_replication_flow_control_mode='DISABLED'” on GR3, will the flow control disappear?
Or should we execute “set global group_replication_flow_control_mode='DISABLED'” on all nodes in the group to stop flow control?
I have three nodes in a cluster, let’s say node1, node2 and node3, and node1 is the master of this InnoDB Cluster. If the average load increases on the master (node1), then after reaching the threshold the group will start flow control for the other, slower nodes (say node2 or node3). In this case there are some points about the flow control concept that need clarification.
1. In this case, will the master be able to accept more RW operations from clients? Since flow control has been triggered in the group, if the certifier or applier queue has grown on the secondary nodes, will there be any impact on the master serving the workload?
2. Will it wait to send the remaining transactions to secondary servers?
3. Will it wait for the certification of the upcoming new transactions?
4. Will the performance of the master be impacted by flow control? Flow control ensures there is a minimal difference between primary and secondaries in terms of backlog.
Hello,
Yes, there is an impact. When group_replication_flow_control_mode is set to QUOTA on a node that sees one of the other members of the cluster lagging behind (threshold reached), it throttles write operations to a quota that is calculated from the number of transactions applied in the last second and then reduced further by subtracting the “over the quota” messages from the last period.
Check https://www.slideshare.net/lefred.descamps/dataopsbarcelona-2019-deep-dive-into-mysql-group-replication-the-magic-explained from slide 171