The documentation states that “For a transaction to commit, the majority of the group have to agree on the order of a given transaction in the global sequence of transactions.”
This means that certification can start as soon as a majority of the group's members acknowledge reception of the writeset. And as a picture is worth a thousand words, this is what it looks like if we take the illustrations from my previous post:
So theoretically, having 2 nodes in one DC and 1 node in another DC shouldn’t be affected by the latency between both sites if writes are happening on the DC with the majority of nodes. But this is not the case.
As you can see in the video above, every 3rd write to the system is affected by the latency. That's because the system has to wait for the noop (a single skip message) from the "distant" node. As Alfranio explained in his blog post about our homegrown Paxos-based consensus, XCom is a multi-leader, or more precisely a multi-proposer, solution. In this protocol, every member has an associated unique number and a reserved slot in the stream of totally ordered messages. There is no leader election, and each member is the leader of its own slots in the stream of messages. Members can propose messages for their slots without having to wait for other members. But if they have nothing to say, they still need to say so, and this is where we are affected by the latency.
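To make the slot mechanism concrete, here is a toy model of the behavior described above. It is a simplification, not XCom's actual implementation (real XCom batches and pipelines messages): slots in the global stream rotate round-robin over the members, an idle member's slot is filled by a noop, and that noop only arrives after that member's network latency. The node names and RTT values are hypothetical.

```python
# Toy model of a multi-proposer total order (simplified sketch, NOT real XCom).
# Slot i in the global stream belongs to member i % N. If the owner has
# nothing to propose, the slot is filled with a "noop" skip message that
# arrives only after that member's round-trip latency.

LOCAL_RTT = 1      # ms between the two co-located nodes (hypothetical)
DISTANT_RTT = 100  # ms to the remote DC (hypothetical)

members = [
    {"name": "node1", "rtt": LOCAL_RTT,   "writes": True},
    {"name": "node2", "rtt": LOCAL_RTT,   "writes": True},
    {"name": "node3", "rtt": DISTANT_RTT, "writes": False},  # distant, read-only
]

def deliver(n_slots):
    """Return (slot, owner, payload, wait_ms) for each slot, in total order."""
    out = []
    for slot in range(n_slots):
        owner = members[slot % len(members)]
        payload = "writeset" if owner["writes"] else "noop"
        out.append((slot, owner["name"], payload, owner["rtt"]))
    return out

for slot, owner, payload, wait in deliver(6):
    print(f"slot {slot}: {owner:5s} {payload:8s} waited ~{wait:3d} ms")
```

Running this, every third slot belongs to node3 and carries a noop that stalls delivery for the distant RTT, which is exactly the "1 write in 3 is slow" pattern visible in the video.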
So in conclusion, having a distant node (or one with higher latency) as a member of a group slows down the full workload. Not constantly, but in proportion to its share of the cluster: 1/3 for a 3-node cluster, for example.
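A rough back-of-envelope illustrates that ratio, under the simplifying assumption that slots rotate evenly and that a write stalls on the distant node's noop whenever its slot comes up. The RTT figures (100 ms distant, 1 ms local) are hypothetical:

```python
# Back-of-envelope: with round-robin slots, a fraction 1/N of the stream is
# owned by the distant node, so roughly 1 write in N stalls on its noop.
def avg_added_latency(n_nodes, distant_rtt_ms, local_rtt_ms=1):
    """Average per-write wait, assuming one distant node and even slot rotation."""
    distant_share = 1 / n_nodes
    return distant_share * distant_rtt_ms + (1 - distant_share) * local_rtt_ms

# 3-node cluster, one distant node at 100 ms: 1/3 of writes pay ~100 ms.
print(avg_added_latency(3, 100))  # → 34.0 ms on average per write
```

So even though only a third of the writes are hit, the average commit latency of the whole workload moves far closer to the distant RTT than to the local one.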
At least for now, if you need to have a distant node and use it only for reads, I would advise using asynchronous replication between your two sites, or being prepared to pay the cost of the latency.
Was your test case with Single-Primary mode or Multi-Primary mode? Or does it matter?
Thank you for your question. In fact currently it doesn’t matter. This is indeed somewhere we could implement some optimization.
As you said: “So theoretically, having 2 nodes in one DC and 1 node in another DC shouldn’t be affected by the latency between both sites if writes are happening on the DC with the majority of nodes. But this is not the case.”
Any 2021 update concerning this behavior and the “latency price to pay”, with your example of having 2 local nodes (the majority) + 1 distant node?
It’s still the same, as Group Replication needs consistency to protect your data. It would be a very bad idea to simply not care about the other site.
Of course there have been improvements for low-latency networks, but if one node is far away with much higher latency than the other nodes, at a certain point, even if you can add delays, you might still be affected if the write load is faster than the acknowledgments.