It’s wonderful news: we have released MySQL 5.7.17 with the Group Replication plugin (GA quality).
By definition, Group Replication is a multi-master, update-everywhere replication plugin for MySQL with built-in conflict detection and resolution, automatic distributed recovery, and group membership.
So we can indeed compare this solution with Galera from Codership, which is a replication plugin implementing the wsrep API. wsrep (Write Set Replication) extends the replication API to provide all the information and hooks required for true multi-master, “virtually synchronous” replication.
With Group Replication, MySQL implemented all of this in the plugin itself. Our engineers leveraged existing standard MySQL infrastructure (GTIDs, Multi-Source Replication, Multi-threaded Slave Applier, binary logs, …) and have been preparing InnoDB over several releases to provide all the necessary features, such as High Priority Transactions in InnoDB since 5.7.6.
This means that Group Replication is based on well-known and trusted components, which makes integration and adoption easier.
Both solutions are based on Replicated Database State Machine theory.
What are the similarities between the two solutions?
MySQL Group Replication and Galera use write sets. A write set is a set of globally unique identifiers for each logical item changed by the transaction when it executed (an item may be a row, a table, a metadata object, …).
So Group Replication and Galera both use ROW-based binary log events, and together with the transaction data, the write sets are streamed synchronously from the server that received the write (the master for that specific transaction) to the other members/nodes in the cluster.
Both solutions then use the write sets to check for conflicts between concurrent transactions executing on different replicas. This procedure is named certification. Each node certifies the write set locally and asynchronously queues the accepted changes to be applied.
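To make certification concrete, here is a minimal sketch of what a conflict looks like from the client side, assuming a two-member multi-master group and a hypothetical InnoDB table t1 with primary key id:

-- Session on member A:
BEGIN;
UPDATE t1 SET val = 10 WHERE id = 1;
COMMIT;

-- Concurrent session on member B:
BEGIN;
UPDATE t1 SET val = 20 WHERE id = 1;
COMMIT;
-- Both write sets touch the same primary key, so only the first
-- transaction ordered by the group passes certification. The other
-- one is rolled back with an error similar to:
-- ERROR 3101 (HY000): Plugin instructed the server to rollback the
-- current transaction.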
Both implementations use a group communication engine that manages quorums, membership, message passing, …
So what is different, then?
The biggest difference is that Group Replication (GR) is a plugin for MySQL, made by MySQL, packaged and distributed with MySQL by default. GR is also available and supported on all MySQL platforms: Linux, Windows, Solaris, OS X, FreeBSD.
As said before, GR also uses the same infrastructure people are already used to (binlogs, GTIDs, …). In addition to familiarity and trust, this makes it much easier to integrate a Group Replication cluster into more complex topologies where asynchronous masters/slaves are also involved.
There are many implementation differences. I’ll list them in these categories:
- Group Communication
- InnoDB
- Binary Log & Snapshot
- GTID, Master-Master & Master-Slaves
- Monitoring
Group Communication
Galera uses a proprietary group communication system layer, which implements a virtual synchrony QoS based on the Totem single-ring ordering protocol. MySQL Group Replication uses a Group Communication System (GCS) based on a variant of the popular Paxos algorithm.
This allows GR to achieve much better network performance, greatly reducing the overall latency within the distributed system (more information about this in Vitor’s blog post). In fact, the more nodes you add (GR currently supports up to 9 nodes per group), the more the commit time increases in Galera, while it stays almost stable with GR. This is due to GR using peer-to-peer style communication versus Galera’s token ring.
InnoDB
Where Galera needs to patch MySQL and add an extra layer to be able to kill a local transaction when there are certification conflicts, Group Replication uses High Priority Transactions in InnoDB, which allows Group Replication to ensure that conflicts are detected and handled properly.
Binary Log
Even though it requires binlog_format=ROW, Galera doesn’t need the binary logs to be enabled. It’s nevertheless recommended to enable them for point-in-time recovery, for asynchronous replication to a slave outside of the cluster, or for forensic purposes. So Galera doesn’t use the binary log to perform the incremental synchronization between the nodes.
Galera uses an extra file called gcache (Galera Cache). Until the latest Galera release (3.19), this file was not resilient to crashes, and even now recovery from it is not guaranteed. The data stored inside this file can’t be used for anything other than IST (Incremental State Transfer).
In Group Replication, we keep using the binary log files for that purpose. So if a node was out for a short period, it will perform the synchronization from the binary logs of the node that has been elected as donor. This is called IST in Galera (from the gcache, when the data is available) and Automated Distributed Recovery in GR.
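In practice, distributed recovery in GR runs over a regular replication channel, so a joining member only needs replication credentials for it. A minimal sketch, assuming a hypothetical replication user ‘rpl_user’ that you would create yourself:

-- Executed on the joining member before START GROUP_REPLICATION;
-- recovery then streams the missing binary log events from a donor:
CHANGE MASTER TO
  MASTER_USER = 'rpl_user',
  MASTER_PASSWORD = 'rpl_pass'
  FOR CHANNEL 'group_replication_recovery';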
Basing our solution on binary logs allows us to have the data safely persisted (flushed and synced). It is also a well-known format and, as mentioned above, binary logs serve many purposes (distributed recovery, asynchronous replication, point-in-time recovery, streaming or piping to other systems like Kafka, … they can even be used to perform schema changes!).
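And since the binary logs are a first-class part of the design, the related settings are all standard 5.7 variables. A quick way to verify them on any member:

-- Binary-log related settings Group Replication relies on
-- (expected values: ON, ROW, ON, ON, ON):
SELECT VARIABLE_NAME, VARIABLE_VALUE
  FROM performance_schema.global_variables
 WHERE VARIABLE_NAME IN ('log_bin', 'binlog_format', 'gtid_mode',
                         'enforce_gtid_consistency', 'log_slave_updates');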
The Galera Cache file stores the write sets in a circular-buffer style and has a pre-defined size. So it might happen that IST is impossible and a full state transfer (SST) is required.
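For reference, on a Galera node you can check how far back the gcache still reaches, which determines whether a joiner can use IST (a quick check using a standard Galera status variable):

-- If a joiner's last applied seqno is older than this, SST is required:
SHOW GLOBAL STATUS LIKE 'wsrep_local_cached_downto';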
And this is maybe one of the advantages of Galera for people having a lot of network or hardware problems: full data provisioning. It’s true that with Galera, when a new node is added to the cluster, it’s possible not to prepare the new node in advance. This is very convenient for newcomers. We understand the need for a better solution in Group Replication; currently this process is pretty much the same as provisioning a slave when using regular replication.
However, every experienced Galera DBA can also tell you that they try to avoid SST as much as possible.
GTID, Master-Master, Master-Slave
Like Galera, GR has one UUID attributed to the cluster. The difference with Galera is that even though all nodes in the same group share the same UUID, in GR each member has its own sequence number range (defined by group_replication_gtid_assignment_block_size).
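You can see this on any member (a quick illustration, assuming the plugin is loaded; the UUID shown in gtid_executed is whatever group_replication_group_name you configured):

-- GTIDs of all members use the group's UUID, not each server's UUID;
-- distinct sequence number blocks reveal the different writers:
SELECT @@GLOBAL.gtid_executed;
-- Size of the per-member sequence number block:
SELECT @@GLOBAL.group_replication_gtid_assignment_block_size;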
And like Galera, if your workload allows it (more to come in a future post), you can use a multi-master cluster and write on all the nodes at the same time. But as writes are somehow synchronized, that won’t scale up writes anyway. So, even if it’s not really advertised for Galera either, with Group Replication we recommend writing to a single master at a time to reduce the probability of conflicts.
Writing on one single master also avoids the issues that can arise when performing schema changes while modifying data on another node at the same time.
This is why, by default, your MySQL Group Replication cluster runs in Single Primary Mode (controlled by group_replication_single_primary_mode). This means the group itself will automatically elect a leader and keep managing this task as the group changes (in case of failure of the leader, for example). Don’t forget that Group Replication is first and foremost a High Availability solution.
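To see which member is the current primary, you can combine a status variable with the membership table (on 5.7, the primary’s ID is exposed as group_replication_primary_member):

SELECT MEMBER_HOST, MEMBER_PORT
  FROM performance_schema.replication_group_members
 WHERE MEMBER_ID = (SELECT VARIABLE_VALUE
                      FROM performance_schema.global_status
                     WHERE VARIABLE_NAME = 'group_replication_primary_member');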
Of course, even when using the cluster in Single Primary Mode, the limitations and recommendations related to Group Replication still apply (like disabling binlog checksums, using only InnoDB tables, letting the group manage the auto_increment related variables, …), but there are fewer of them.
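As a quick illustration, most of these requirements can be checked from SQL (standard 5.7 variables; the last query is just one way of listing tables that would need converting):

SELECT @@GLOBAL.binlog_checksum;   -- must be NONE for Group Replication
SELECT @@GLOBAL.binlog_format;     -- must be ROW
-- User tables not using InnoDB:
SELECT table_schema, table_name
  FROM information_schema.tables
 WHERE engine <> 'InnoDB'
   AND table_schema NOT IN ('mysql', 'information_schema',
                            'performance_schema', 'sys');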
Monitoring
Unlike Galera, which uses only status variables (if I remember correctly), Group Replication uses Performance Schema to expose information. (The Galera fork present in Percona XtraDB Cluster also uses performance_schema in its 5.7 version.)
For example, in Galera it’s not easy to find out from any node which other nodes are in the cluster and what their status is. With Group Replication we expose all of that in performance_schema:
select * from replication_group_members\G
*************************** 1. row ***************************
 CHANNEL_NAME: group_replication_applier
    MEMBER_ID: e8fe7c39-ada4-11e6-8891-08002718d305
  MEMBER_HOST: mysql3
  MEMBER_PORT: 3306
 MEMBER_STATE: ONLINE
*************************** 2. row ***************************
 CHANNEL_NAME: group_replication_applier
    MEMBER_ID: e920a7cf-ada4-11e6-8971-08002718d305
  MEMBER_HOST: mysql2
  MEMBER_PORT: 3306
 MEMBER_STATE: ONLINE
*************************** 3. row ***************************
 CHANNEL_NAME: group_replication_applier
    MEMBER_ID: e92186b1-ada4-11e6-ba00-08002718d305
  MEMBER_HOST: mysql1
  MEMBER_PORT: 3306
 MEMBER_STATE: ONLINE
As you can see, the Performance_Schema tables offer an easy and intuitive way to get information and stats on an individual node and the group as a whole.
If you are using a solution that requires a health check to monitor the nodes and decide on the routing from the application to the right node(s), you can also base your script on the sys schema, which provides views with all the information you need to make the right routing decision.
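As a minimal sketch (querying performance_schema directly; the sys schema views mentioned above would wrap similar logic), a health check could simply verify that the local member is ONLINE:

-- Returns 1 when this node is a healthy group member, 0 otherwise:
SELECT COUNT(*) AS healthy
  FROM performance_schema.replication_group_members
 WHERE MEMBER_ID = @@GLOBAL.server_uuid
   AND MEMBER_STATE = 'ONLINE';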
Conclusion
So, it’s really true that Galera benefits from many years of experience and still has many more features, some major like the arbitrator[1], some minor like node weight, sync wait, segments, … but Group Replication is a solid contender, certainly if you are looking for great performance.
If you think you are missing something before adopting this technology, just drop me a comment explaining your need. Also, don’t hesitate to comment on this blog post if I missed something or if you don’t agree on some points; I can always review my thoughts.
[1] I was never a big fan of the use of an arbitrator in Galera, as all the data needs to reach the node anyway. At today’s storage prices, I consider it much safer to have a real cluster node where the data is also replicated. 3 copies of the data are always better than 2 😉
The benefit of the arbitrator (witness, log-only node) is to make it possible to get consensus faster without having an extra replica for deployments that prefer at most one copy per geographic region and have geo regions that are far apart. The incremental cost of an arbitrator is much less than the cost of another replica. But maybe that doesn’t matter to most deployments.
Hi Mark,
Thank you for your comment, but in Galera the arbitrator gets all the replication data like any other node. In fact, we could see it as a mysqld storing the data to /dev/null (although this is not actually the case; there is no mysqld).
And for that reason, since the data already reaches the node and the latency is already added to the process, I think it’s better to save that data too.
Google had “witness replicas” in some of their Paxos implementations. We have something like witness replicas with our lossless semisync implementation. It is OK for us to agree that Group Replication is missing this feature. This is a feature and Group Replication isn’t better for not having it because without witness replicas the cost is either slower commit — because the replicas are farther apart on the network — or more hardware — because I added replicas that are only needed to make consensus faster.
Unfortunately the Galera arbitrator (like the MongoDB one, for that matter) doesn’t store a transaction log, so it is not useful for this purpose either.
Hello Frédéric,
Your article is very interesting; however, I agree with Mark, the witness/arbiter is a must-have.
Many companies work with only 2 DCs and just have witnesses on a third one, notably for VMware and other infrastructure technologies. Having a third copy of the data is still costly.
All the more so since, on the open-source front, many technologies make this possible (PostgreSQL Patroni, PostgreSQL RepMgr).
Given your connections at MySQL, a little word to the contributors about adding this feature would be welcome 😀
Have a nice day
Pierre
Thanks for the comment.
In Patroni, 3 DCs are also necessary according to the documentation (https://patroni.readthedocs.io/en/latest/ha_multi_dc.html).
The problem with an arbitrator for a replication system like Group Replication (or Galera) is that this arbitrator must also take part in certification and will therefore receive all the replicated data anyway.
Group Replication (compared to Galera) uses Paxos, which allows a transaction to continue its process (its commit) as soon as the majority of the nodes have received the information… so what about a cluster with 2 nodes and an arbitrator? If the arbitrator answers first… well, we risk losing data. The most important thing for us is the data.
Now, if the goal is to span multiple DCs, there are other solutions such as MySQL InnoDB ClusterSet.
Have a nice day
I believe it’s good that the arbiter gets the data, so it can act as a relay between two nodes that might not be able to communicate directly. Not sure how that happens within GR?
“Galera is using a proprietary group communication system layer”
I’m wondering, what exactly is proprietary there? I’m not a Galera expert, but IIRC the code is open source; it’s based on some PhD thesis.
“MySQL Group Replication uses a Group Communication System (GCS) based on a variant of the popular Paxos algorithm.”
So, is that one less proprietary? Is Galera’s mechanism further from Paxos than GCS in some way?
Hi Sergei,
You are right, Galera is open source and based on some PhD theses, but in Galera’s documentation, http://galeracluster.com/documentation-webpages/architecture.html, you can read: “Galera Cluster is built on top of a proprietary group communication system layer, which implements a virtual synchrony QoS. Virtual synchrony unifies the data delivery and cluster membership services, providing clear formalism for message delivery semantics.”
You can also find some references to this “proprietary” group communication system on Galera’s mailing list. Now, as it is open source, this just means it’s their own communication system; I’ve not seen anything unusual in the license included with the sources.
In any case, Galera’s communication system is not based on Paxos at all.
I hope this answers your questions.
Hi Sergei,
If you or anyone else is interested in the relevant academic papers:
1. Galera — Virtual Synchrony:
A. http://dsn.jhu.edu/~yairamir/Yair_phd.pdf
B. https://pdfs.semanticscholar.org/87fc/089838a3959d652d503c7ee7b39f5c94a34b.pdf
2. Group Replication — Paxos (more specifically a Paxos variant called Mencius): http://sysnet.ucsd.edu/~yamao/pub/mencius-osdi.pdf
Thanks for this, Matt.
Hi Fred,
Have you started using GR yet?
I’d very much like to see feedback from the community.
Thanks for this article.
James
[…] As expected from Oracle’s MySQL Community Manager, he makes the rival Galera Cluster sound completely worthless XDDD: “Group Replication is GA with MySQL 5.7.17 – comparison with Galera”. […]
Nice article! Better comments!
[…] Group Replication is GA with MySQL 5.7.17 – comparison with Galera […]
[…] And finally, I completely agree with Marco when he says that Group Replication is based on asynchronous replication… but this is exactly the same for Galera (and PXC) where flow-control is totally blocking the cluster which is not the case in Group Replication (with other differences some good, some bad). […]