MySQL 8.3: Purging data from your InnoDB Cluster

Maintaining a production dataset at a manageable size can present a considerable challenge during the administration of a MySQL InnoDB Cluster.

Old Days

Back in the day when we only had one main copy of our data (the source), and one read copy (the replica) that we used to look at current and old data from our main system, we used a special trick to remove data without affecting the replica. The trick was to turn off writes to the binary log for our removal commands in the main system. External tools like pt-archiver were also able to use that trick. To stop bypass writing into the binary log, we used the command: SET SQL_LOG_BIN=0.

This mean that on the main production server (replication source), we were purging the data without writing the delete operation into the binary logs:

Current Days

These days, nobody runs a single MySQL node in production. High Availability is highly recommended (required!).

So people are using MySQL InnoDB Cluster as main production solution. In such environment, having a reasonable size dataset is also recommended (CLONE, backups, …).

However the old solution to archive data is not working anymore as if we skip writing events in the binary logs, those events won’t be replicated across the other members of the cluster as Group Replication is also relying on those binary logs.

This means we will end up with an inconsistent cluster where all the nodes have a different dataset!

GTIDs can assist in the process; however, it is often challenging to recall the particular UUID component of the GTID that is utilized for archiving purposes. Moreover, it gets complicated to figure out where these transactions come from.

MySQL 8.3, which launched today, introduces even more efficient ways to streamline our purging and archiving workflows.

GTID TAGS

We MySQL 8.3, the GTID format has evolved and now includes the possibility to have some identifiable tags in it:

source_id:tag:transaction_id

Tags are of course not mandatory and GTIDs will work as before too.

So on the replica we want to be used as the archiving server, we will change the GTID set to include a large range of transaction coming from the cluster but with a tag we defined. In this example, I will use “archiving”.

On a member of the MySQL InnoDB Cluster, the following query was executed to retrieve the UUID segment of the GTID used by the Group:

InnoDB Cluster (any node) > show global variables like '%group_name';
+------------------------------+--------------------------------------+
| Variable_name                | Value                                |
+------------------------------+--------------------------------------+
| group_replication_group_name | 1e4516ec-b47c-11ee-bb6e-c8cb9e32df8e |
+------------------------------+--------------------------------------+

Now on the Replica used as archived server, I run the following statement:

Replica > set global gtid_purged=
             "+1e4516ec-b47c-11ee-bb6e-c8cb9e32df8e:archiving:1-9999";

On the Primary instance, we can now start the archiving process like this:

InnoDB Cluster Primary > set gtid_next="automatic:archiving";
InnoDB Cluster Primary > delete from t1 where date(timestamp)<"2023-02-01";
InnoDB Cluster Primary > set gtid_next="automatic"

It’s of course better to use a dedicated session to archive the data.

This is the illustration of the process:

It’s imperative to ensure to use a large range and closely monitor its usage on the cluster. In this example we could only have 9,999 purging transaction, upon reaching this threshold, next actions will be applied on the replica too.

Of course it was already possible to perform such operation with standard GTIDs but it was more complicated, or you needed to use specific UUID. With the new GTID Tag, we can directly and easily identify the purging transactions.

But the integration with external tools is also easier, for example this is an example on how to use GTID and TAGS with pt-archiver:

./pt-archiver \
--source h=production_cluster,P=3306,u=user,p=xxxx,D=mydata,t=t1 \
--purge --where 'date(timestamp)<"2023-04-01"' \
--set-vars 'GTID_NEXT="automatic:archiving"'

Conclusion

MySQL 8.3 marks a notable improvement for database administrators who manage High Availability environments using MySQL InnoDB Cluster. With the introduction of GTID tags, the archiving and purging process becomes substantially more manageable and less prone to human error. These tags allow for clear identification of transaction sources, thus avoiding the pitfalls of inconsistent datasets across cluster nodes.

Beyond simplifying the identification process, GTID tags facilitate the integration with external tools.

Enjoy archiving and purging your data !

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

One comment

  1. Hello,

    Thank you for blogging.

    If we have a 3-node GR cluster and I would like to use the “old” approach, I need to run the delete on each of the three nodes using SET SQL_LOG_BIN=0 on each. This way, I achieve a synchronized cluster, and the replica still retains all the deleted data.

    Using the tag, I save time by running the delete only on one node (then the delete is replicated to the rest of the cluster using GR), while the replica still retains all the deleted data.

    I can see the benefits of both approaches, just for different scenarios.

    Thanks again.

Leave a Reply

Your email address will not be published. Required fields are marked *

As MySQL Community Manager, I am an employee of Oracle and the views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

You can find articles I wrote on Oracle’s blog.