Recently, my friend Marco Tusa (MySQL Daddy, or the Grinch) wrote about his first impressions of MySQL Group Replication (part of InnoDB Cluster), and his conclusion was not that positive. But when I analyzed his setup, I realized that his assumptions were not quite right.
Let’s try to explain what the issues were and why his test wasn’t correct.
Before commenting on Marco’s tests, I would like to clarify the flow-control implementation in Group Replication:
We designed the flow-control feature in Group Replication as a safety measure that delays writer nodes when they consistently exceed the write capacity of the Group, so that a large backlog does not make it hard to switch over from one member to another.
Flow-control is a coarse-grained measure, and the default threshold for that safety measure was set to about one second of throughput with small update transactions on modern machines (hence the 25000). When using larger transactions, the threshold can be reduced without harm, but for performance it is better to keep it above that one second of throughput. Note, however, that in a well balanced system flow-control is not expected to limit throughput, because the slaves are able to keep executing transactions as fast as the server delivers them.
I know that GR flow-control can be a complicated concept, and that’s why I have already blogged about it. You can expect more posts on this topic soon, especially covering the changes in 8.0.2.
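To make this concrete, here is a small sketch of how to check the flow-control settings and the queues flow-control actually watches (a minimal example, assuming the group_replication plugin is loaded; the thresholds shown are the shipped defaults):

-- current flow-control settings (25000 is the default for both thresholds)
SELECT @@group_replication_flow_control_mode,
       @@group_replication_flow_control_applier_threshold,
       @@group_replication_flow_control_certifier_threshold;

-- size of the certification queue that flow-control compares against the threshold
SELECT MEMBER_ID, COUNT_TRANSACTIONS_IN_QUEUE, COUNT_TRANSACTIONS_CHECKED
  FROM performance_schema.replication_group_member_stats;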
Let’s go back to Marco’s test. My first remark is related to the sizing of the nodes: are all the nodes equal, as recommended? We don’t know… From what I can see, nodes 3 and 4 are slow to apply the workload, and this is not related to the wrong flow-control value (see the next point below). Also, Marco created a cluster of 4 nodes with two in one location and two in another (10ms apart). As a majority is required for consensus, the price of those 10ms is always paid.
The second point is related to the value used for the flow-control threshold: 25?! In contrast with Galera, the queues checked for flow-control also include the transactions going through the applier pipeline. This means that very low thresholds will trigger flow-control even when replication is functioning perfectly well and with low latency. In fact, you won’t gain anything from that.
Another misconception is related to the measurement. Marco suggested that he could safely say that the incoming GTID (last_received_transaction_set from replication_connection_status) is for sure the last transaction applied on the master that a node can know about.
In Group Replication, that doesn’t hold, since before applying, all group members perform certification (or at least the majority does, with those 10ms). In fact, we only know that this “master” (writer) will apply the transaction soon; it may not yet be applied. Certification happens independently at each member once a majority of the members agree on the transaction delivery. This means that a transaction listed in last_received_transaction_set is certified and queued to apply on this member. On the other members, it may already be certified/applied, or it soon will be.
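To illustrate what a member can and cannot tell you, here is a rough sketch (not an official lag metric) comparing what this member has received and queued with what it has actually applied locally:

-- what this member has received on the group_replication_applier channel
SELECT RECEIVED_TRANSACTION_SET
  FROM performance_schema.replication_connection_status
 WHERE CHANNEL_NAME = 'group_replication_applier';

-- what this member has actually applied
SELECT @@global.gtid_executed;

-- transactions certified and queued here but not yet applied here
SELECT GTID_SUBTRACT(
         (SELECT RECEIVED_TRANSACTION_SET
            FROM performance_schema.replication_connection_status
           WHERE CHANNEL_NAME = 'group_replication_applier'),
         @@global.gtid_executed) AS certified_not_yet_applied;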
In addition to this, Marco complains (just a little) that MySQL Group Replication uses binlogs and relay logs… this is completely true! Our new replication is based on proven technologies, mastered by a lot of people, with the Group Communication layer on top. Galera invented the gcache; we reused the relay log 😉
And finally, I completely agree with Marco when he says that Group Replication is based on asynchronous replication… but this is exactly the same for Galera (and PXC), where flow-control completely blocks the cluster, which is not the case in Group Replication (with other differences, some good, some bad).
So yes, if you change the defaults or if you don’t follow the recommendations, the experience can hurt (and it’s the same with all HA solutions when playing with data). But the current defaults are good and should provide you with a stable environment.
I’m very happy to see that Marco is taking a look at our HA solution, and I also suggest that everyone interested in MySQL Group Replication check out the new improvements we made in 8.0.2. I am looking forward to Marco’s next post on MySQL Group Replication, and grazie per averlo provato (thank you for having tried it)!
Hi Fred, first of all thanks for reading the article.
Second, as you know me very well, I have no problem at all saying when I am wrong about something, that “oops, my mistake”, but first I need to be proven wrong 😀 .
So let’s see, from the start.
1) The scope of my blog was, and still is, to set the right expectations against what a product can do.
Starting from what we have and looking at what is coming (as advertised).
In most of the recent articles/presentations/videos we have seen GR and InnoDB Cluster presented as a read scaling solution. Yes, I know MySQL replication was presented as that as well.
But the point is that whenever it is presented, and in your comments as well, it seems that GR and its controlling mechanism will allow the final user to set the application to write on a Master and (consistently) read from slave(s).
Now, aside from all the technical mumbo jumbo and the possible polemics we may want to have around how asynchronous Galera or GR really is, I think what is relevant is the final result.
Playing the dummy, I can say that if I have PXC, I am safe enough to read from other nodes and expect that the data is aligned, and it will be (please let us skip the polemics about the millisecond or less needed for applying).
Again, if as a user I read that I can have similar behaviour with GR, I will set up the data layer and start to perform tests (dummy yes… idiot no).
And wow, I will magically see that my Master flies… my readers will not, at least with the defaults (no parallel applier and so on).
Now about FC… and here I may have gotten it wrong (I was following http://mysqlhighavailability.com/gr/doc/performance.html#flow-control & https://dev.mysql.com/worklog/task/?id=9838 & http://lefred.be/content/mysql-group-replication-understanding-flow-control/).
My understanding is that if I reduce the value of 25000 to a smaller number, I reduce the threshold for the given time interval (1 sec).
So if I reduce it to X (say 2000):
group_replication_flow_control_applier_threshold = 2000
group_replication_flow_control_certifier_threshold = 2000
“So when group_replication_flow_control_mode is set to QUOTA on the node seeing that one of the other members of the cluster is lagging behind (threshold reached), it will throttle the write operations to the minimum quota. This quota is calculated based on the number of transactions applied in the last second, and then it is reduced below that by subtracting the “over the quota” messages from the last period.”
I am not expecting the QUOTA to be 2000, but I am expecting the threshold to be.
So while in Galera, if I set FC=16, I expect something to happen as soon as I reach that value, and to stay stuck to it.
In GR I was expecting FC action if I set the threshold to X, and expecting the QUOTA to be calculated on the basis of the THRESHOLD and the interval.
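To make my reading concrete with a back-of-the-envelope example (numbers completely invented, and this is only how I read the text quoted above): if the slowest member applied 5000 transactions in the last second and 800 writes went “over the quota” in that period, the writer quota for the next period would be roughly 5000 - 800 = 4200 writes/second, no matter whether the threshold was set to 2000 or 25000; the threshold would only decide when the throttling kicks in, not what the quota ends up being.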
Of course I may be wrong here… and I will be more than happy to coordinate additional tests with you.
But aside from that, there are facts (numbers at hand) that show how GR will allow nodes to become stale, diverging from the Master (also with the default settings).
So in the case of use for read scaling (as often advertised), the risk is accessing stale data, unless we also use tools like ProxySQL (still testing that).
About measuring the lag, you said I am wrong in using last_received_transaction_set and that it is valid only for the single node (which is what I am doing, btw), but this raises another question.
How to measure it? Do you currently have another way to estimate the real lag between the master and the other nodes, one that is related to/visible by FC?
Finally, I can build up a perfect environment and have everything going smoothly, but that is not what happens in real life. Having distributed nodes or network slowdowns is what happens in real life.
Once more, I am available to repeat the tests together, and if it turns out that, wow, all is working as it should, I have no problem recognizing it.
Also, no problem agreeing to try different tests.
But just saying… that I am wrong, and that all is working right, when the tests I have done show a different picture, well, that doesn’t sound sweet at all.
Hi Marco!
(To be clear: these are my personal thoughts and opinions and don’t necessarily reflect those of my employer.)
First of all, thank you for trying out Group Replication and helping us to make it better! All feedback is helpful, and even more so “negative” or critical feedback. We want to know the issues, concerns, and pain points that users feel. I very much appreciate your input.
Just a few additional points:
1. I completely agree that the flow control mechanism in Group Replication is a bit opaque. We’ve already done some work in MySQL 8 to provide the user with greater control and visibility around it. We still have plenty more work to do there over time.
2. I agree that Galera’s flow control mechanism is much more aggressive and thus, from your perspective, more effective (and it’s fair to say that your perspective here would match the average user’s too). Group Replication favors a minimum throughput on a primary (a RW instance), whereas Galera favors limiting the applier queue size on the acceptors (the “secondaries”, where the user trx was not executed). With Group Replication, you would not see major write stalls for your user transactions because one or more of the other nodes is lagging behind, whereas with Galera you would. So it’s a trade-off, ideally one that a user should make, and we would like to make things more configurable over time so that users can explicitly decide on the trade-offs they wish to make.
3. It’s hard to objectively compare two products when you misrepresent the behaviors and characteristics of one of them. Both Galera and Group Replication are eventually consistent systems. Neither is “synchronous”. You can say “virtually synchronous”, but then you’re describing what the system is not, more than what it is. Neither offers consistent reads in multi-master/multi-primary mode of operation. wsrep_sync_wait offers no consistency guarantees; it simply totally orders a null transaction before executing your user transaction. But that only means that all nodes have received the null transaction and it’s in their applier queue, as it’s been sent through the consensus protocol. It doesn’t tell you where any of the nodes are in their applier queue. There are no guarantees about what anyone has actually applied/executed. So that, combined with the flow control, simply offers the illusion of synchronicity and consistency in most cases. When you misrepresent how this works and call it “synchronous replication” or “consistent reads”, it’s hard to compare products, and it hurts your users as they build applications around false assumptions.
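To make that concrete, here is a minimal sketch of the mechanism in question on the Galera/PXC side (wsrep_sync_wait is a bitmask where bit 1 covers reads; the table in the query is just an invented example):

-- Galera/PXC: enable the causality check for reads in this session
SET SESSION wsrep_sync_wait = 1;
-- this SELECT is held until the sync point described above is reached,
-- with the caveats noted above on what that does and does not guarantee
SELECT balance FROM accounts WHERE id = 42;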
4. We’ve talked about Group Replication / InnoDB Cluster as an HA solution today. Our mentions of read scale-out are about the next phase, phase 2:
https://www.slideshare.net/mattalord/why-mysql-high-availability-matters-71246233/39
The goal there is to support adding RO “slaves” (learners in paxos and paxos-like consensus systems) to the cluster. We also hope to add an option to support synchronous writes, where another message-passing round happens to verify that the transaction has been *applied* everywhere before we return control to the user. From what you’ve said, in your mind we would then have a true “read scale out” solution. You would pay the penalty on write latency, but if your traffic is 99.999% reads, it’s more than worth it to have read-your-own-write consistency guarantees.
4A. But XtraDB Cluster does not offer that today either, given your own definition. You can still get stale reads, even when using wsrep_sync_wait (and then it’s not much of a read scale-out solution anyway, due to the added read latency). You can say that it’s *closer* to being virtually synchronous or closer to being consistent, etc., but those are not guaranteed properties of either system today. XtraDB Cluster only has more effective means to limit the window of inconsistency. That point is valid, but it’s not valid to talk of one as being synchronous and the other asynchronous, or one as being consistent and the other not. When mixing inaccurate claims about either product, it all gets a bit confusing.
5. These are all reasons why Group Replication operates in single primary mode by default. Then you have automated HA, but w/o any behavioral changes for the DB apps and users. You’re still only ever talking to a single MySQL instance–the PRIMARY–and MySQL Router ensures that your connections get routed to the current PRIMARY. If the PRIMARY fails, then a new PRIMARY will be automatically elected and the Router adjusts its routing information accordingly. It allows you to move from a single MySQL instance to a highly available distributed MySQL service w/o visible changes (around DDL handling, limitations around FKs etc, commit rollbacks, stale reads, …), hiding that complexity from the users and apps. So in the end it’s very similar to all the NoSQL/NewSQL databases out there that use RAFT, such as MongoDB.
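As a small illustration (a sketch against the 5.7 status variables; in 8.0 the replication_group_members table gained a MEMBER_ROLE column that makes this simpler), this is one way to see which member is currently the PRIMARY:

-- which member is currently the PRIMARY (single-primary mode, 5.7)
SELECT MEMBER_HOST, MEMBER_PORT
  FROM performance_schema.replication_group_members
 WHERE MEMBER_ID = (SELECT VARIABLE_VALUE
                      FROM performance_schema.global_status
                     WHERE VARIABLE_NAME = 'group_replication_primary_member');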
5A. When it comes to sending reads to SECONDARY instances, then you have the same consistency issues/concerns that you’ve always had with MySQL replication when sending reads to slaves. (MongoDB won’t even allow reads on secondaries, unless you explicitly enable it with rs.slaveOk().)
All this being said, I think that your final point was/is certainly valid:
XtraDB Cluster 5.7 contains more effective means to limit the window of inconsistency, in these eventually consistent systems, than InnoDB Cluster 5.7 does.
We’re working on things in Group Replication (the Database component in the full InnoDB Cluster stack of MySQL+Group Replication, MySQL Router, and MySQL Shell) that I feel will address your concerns and meet your stated use case better than ever before. So please stay tuned, and continue to provide feedback! I can’t thank you enough for trying it out and letting us know your impressions, feelings, thoughts, and results. You’re completely right that your initial impressions–without getting into all the nitty gritty details–are important and indicative of the average user experience.
Thanks again!
Matt
Matt, thanks… for the answers, and I have to go through them… carefully.
But one thing jumps out at me: “With Group Replication, you would not see major write stalls for your user transactions because one or more of the other nodes is lagging behind, whereas with Galera you would.”
That is also one of the points in the article…
I show in the graphs that the Master (writer) WAS significantly affected by the other node(s) lagging. That was probably because in the end we had 2 nodes lagging (out of 4), which is 50% of the cluster.
Please also note 2 things.
1) Please do not get confused by the mention of PXC or “virtually sync”. As I mentioned, the scope was NOT to compare the products but a specific usage, as specified (http://www.tusacentral.net/joomla/index.php/work-in-progress/9-temp/192-sweet-and-sour-can-become-bitter.html).
2) I was not providing negative feedback at all. I was clarifying a possibly incorrect use, exactly because the two products are NOT comparable. But given the kind of advertising around it (GR/InnoDB Cluster), we (or at least I) have already got several customers wondering if they can use PXC or GR/InnoDB Cluster as an immediate replacement for one another.
Marco,
We already made some changes to FC in 8.0.2 that might also be what you are looking for. Next week (as you know, I’m currently traveling), with João’s help, I will show you a way to make better measurements, and I will be very happy to discuss all this in Dublin (at Percona Live EU) with you and the person responsible for Group Replication Flow Control (Vitor) around some beers (Martini for you).