MySQL Group Replication: understanding Flow Control

When using MySQL Group Replication, some members may lag behind the group because of load, hardware limitations, and so on. This lag makes it harder to keep certification performing well and to keep the certification failure rate as low as possible. The larger the applier queue, the higher the risk of conflicts with transactions that have not yet been applied (this is a problem on Multi-Primary Groups).

Galera users are already familiar with this concept. MySQL Group Replication’s implementation differs in two main aspects:

  • the Group is never totally stalled
  • the node having issues doesn’t send flow control messages to the rest of the group asking them to slow down

In fact, every member of the Group sends statistics about its queues (applier queue and certification queue) to the other members. Each node then decides whether to slow down when it sees that a node has reached the threshold for one of its queues:

group_replication_flow_control_applier_threshold   (default is 25000)
group_replication_flow_control_certifier_threshold (default is 25000)
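
The decision described above can be illustrated with a small Python sketch. It is not the server's implementation: the MemberStats class, the function names and the sample queue sizes are all hypothetical, and treating "reached" as greater-than-or-equal is an assumption. Only the two default thresholds come from the variables listed above.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Defaults of the two system variables listed above.
APPLIER_THRESHOLD = 25000
CERTIFIER_THRESHOLD = 25000


@dataclass
class MemberStats:
    """Hypothetical container for the queue statistics a member shares."""
    name: str
    applier_queue: int
    certifier_queue: int


def lagging_members(stats: Iterable[MemberStats],
                    applier_threshold: int = APPLIER_THRESHOLD,
                    certifier_threshold: int = CERTIFIER_THRESHOLD) -> List[str]:
    """Return the names of members whose queues reached a threshold."""
    return [m.name for m in stats
            if m.applier_queue >= applier_threshold
            or m.certifier_queue >= certifier_threshold]


def should_throttle(stats: Iterable[MemberStats]) -> bool:
    """A writer slows down as soon as any member is lagging."""
    return bool(lagging_members(stats))


# Illustrative statistics as seen by one member of a three-node group.
group = [
    MemberStats("gr1", applier_queue=120, certifier_queue=0),
    MemberStats("gr2", applier_queue=300, certifier_queue=15),
    MemberStats("gr3", applier_queue=31000, certifier_queue=40),
]
print(lagging_members(group), should_throttle(group))
```

Every member would run this kind of check locally against the statistics it receives, which is why no node has to ask the others to slow down.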

So when group_replication_flow_control_mode is set to QUOTA on a node that sees another member of the cluster lagging behind (threshold reached), that node throttles its write operations to the minimum quota. This quota is calculated based on the number of transactions applied in the last second, and then it is reduced below that by subtracting the “over the quota” messages from the last period.
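
As a rough illustration of that arithmetic, here is a minimal Python sketch. It only mirrors the sentence above and is not the server's actual algorithm: the function name, the parameter names and the floor of one transaction are assumptions made for the example.

```python
def next_quota(applied_last_second: int,
               over_quota_last_period: int,
               floor: int = 1) -> int:
    """Simplified quota: what was applied in the last second, minus the
    transactions that went over the quota in the last period.

    The floor keeps the quota positive so that writers are slowed down
    but never blocked completely (assumed value, for illustration only).
    """
    return max(floor, applied_last_second - over_quota_last_period)


# 1000 transactions applied, 200 over the quota in the last period.
print(next_quota(1000, 200))   # 800
# A member that barely applies anything drives the quota to the floor.
print(next_quota(100, 500))    # 1
```

The floor reflects the first difference with Galera listed earlier: the Group slows down but is never totally stalled.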

This means that, contrary to Galera, where the decision is made on the slow node, in MySQL Group Replication the node writing a transaction checks its flow control threshold values and compares them to the statistics from the other nodes to decide whether to throttle.

You can find more information about Group Replication Flow Control in Vitor’s article Zooming-in on Group Replication Performance.

10 Comments

  1. Fred,

    What happens in MySQL Group Replication if a node drastically slows down? Would the quota be adjusted, or would such an overly slow node leave the cluster?

  2. Hi Peter,
    Thank you for your comment. In fact, the cluster will just continue to slow down.
    The group quota is calculated based on the number of transactions applied in the last second, and then it is reduced below that by subtracting the “over the quota” messages from the last period (with a 5% minimum). A stopped node would maintain that throughput indefinitely while the blocked node is not applying.

    So even if a node is not applying anything (applying queue growing), the node won’t leave the group. The decision to leave the cluster is based only on network reliability. So if the node is not able to apply but continues to receive the events, keeps certifying them and inserts them into its relay log, it won’t be expelled from the group.

  3. […] sustain the same throughput all over the cluster, Group Replication uses a flow control mechanism (see this post to understand how it works). In summary, when a node as an apply queue increasing and reaching a threshold, the other ones […]

  6. Fred,

    I wonder, is there any way to avoid slave lag? Thank you.

  7. We have a Multi-Primary MGR with three nodes (GR1, GR2, GR3). GR1 is the only writer and GR3 reaches the flow control threshold. If we execute “set global group_replication_flow_control_mode='DISABLED'” on GR3, will the flow control disappear?
    Or should we execute “set global group_replication_flow_control_mode='DISABLED'” on all nodes in the group to stop flow control?

  8. I have three nodes in a cluster, let’s say node1, node2 and node3, and node1 is the master of this InnoDB Cluster. If the average load increases on the master (node1), then after reaching the threshold, the group will start flow control for the slower nodes (say node2 or node3). In this case, some points about the Flow Control concept need clarification.

    1. In this case, will the master be able to accept more RW operations from clients? Since flow control has been triggered in the group, if the certifier or applier queue has grown on the secondary nodes, will this have any impact on the master serving its workload?
    2. Will it wait to send the remaining transactions to the secondary servers?
    3. Will it wait for the certification of the upcoming new transactions?
    4. Will the performance of the master be impacted by flow control, given that flow control ensures a minimal difference between the primary and the secondaries in terms of backlog?
