As a DBA, develop a critical mind when receiving advice

On the Internet, you can get a lot of advice from almost anywhere when you’re looking for information on how to do your DBA job.

My blog is one of those sources of advice, and depending on where the advice comes from, we tend to trust it more or less.

But sometimes the advice doesn't take the whole picture into account, and even when it comes from a recognized authority, it can lead to serious problems.

Let’s consider the following situation:

We have an InnoDB ReplicaSet with 1 replication source (primary) and 2 replicas:

 JS > rs.status()
{
    "replicaSet": {
        "name": "myReplicaSet", 
        "primary": "127.0.0.1:3310", 
        "status": "AVAILABLE", 
        "statusText": "All instances available.", 
        "topology": {
            "127.0.0.1:3310": {
                "address": "127.0.0.1:3310", 
                "instanceRole": "PRIMARY", 
                "mode": "R/W", 
                "status": "ONLINE"
            }, 
            "127.0.0.1:3320": {
                "address": "127.0.0.1:3320", 
                "instanceRole": "SECONDARY", 
                "mode": "R/O", 
                "replication": {
                    "applierStatus": "APPLIED_ALL", 
                    "applierThreadState": "Waiting for an event from Coordinator", 
                    "applierWorkerThreads": 4, 
                    "receiverStatus": "ON", 
                    "receiverThreadState": "Waiting for source to send event", 
                    "replicationLag": null, 
                    "replicationSsl": "TLS_AES_256_GCM_SHA384 TLSv1.3"
                }, 
                "status": "ONLINE"
            }, 
            "127.0.0.1:3330": {
                "address": "127.0.0.1:3330", 
                "instanceRole": "SECONDARY", 
                "mode": "R/O", 
                "replication": {
                    "applierStatus": "APPLIED_ALL", 
                    "applierThreadState": "Waiting for an event from Coordinator", 
                    "applierWorkerThreads": 4, 
                    "receiverStatus": "ON", 
                    "receiverThreadState": "Waiting for source to send event", 
                    "replicationLag": null, 
                    "replicationSsl": "TLS_AES_256_GCM_SHA384 TLSv1.3"
                }, 
                "status": "ONLINE"
            }
        }, 
        "type": "ASYNC"
    }
}

As good DBAs, we also have trending graphs, and we can see some aborted connections… If we check the documentation of this metric in Prometheus/PMM, we can read that we can clean up the host cache table when a host has reached the max_connect_errors threshold.
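Before doing that, and just for reference, we can check which hosts actually accumulated connection errors. This is a quick inspection using the global Aborted_connects counter and the standard columns of the performance_schema.host_cache table:

 SQL > show global status like 'Aborted_connects';
 SQL > select ip, host, sum_connect_errors, count_host_blocked_errors
         from performance_schema.host_cache
        order by sum_connect_errors desc;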

So let’s do that:

 SQL > flush hosts;
Query OK, 0 rows affected, 1 warning (0.0080 sec)
Warning (code 1287): 'FLUSH HOSTS' is deprecated and will be removed in a future release.
Please use TRUNCATE TABLE performance_schema.host_cache instead

Ok, it seems that this command is deprecated, but it worked…

But after that if we check the status of our ReplicaSet, we can notice something strange:

 JS > rs.status()
{
    "replicaSet": {
        "name": "myReplicaSet", 
        "primary": "127.0.0.1:3310", 
        "status": "AVAILABLE_PARTIAL", 
        "statusText": "The PRIMARY instance is available, but one or more SECONDARY instances are not.", 
        "topology": {
            "127.0.0.1:3310": {
                "address": "127.0.0.1:3310", 
                "instanceRole": "PRIMARY", 
                "mode": "R/W", 
                "status": "ONLINE"
            }, 
            "127.0.0.1:3320": {
                "address": "127.0.0.1:3320", 
                "fenced": true, 
                "instanceErrors": [
                    "ERROR: 1 errant transaction(s) detected. Topology changes will not be possible until the instance is removed from the replicaset to have the inconsistency repaired."
                ], 
                "instanceRole": "SECONDARY", 
                "mode": "R/O", 
                "replication": {
                    "applierStatus": "APPLIED_ALL", 
                    "applierThreadState": "Waiting for an event from Coordinator", 
                    "applierWorkerThreads": 4, 
                    "receiverStatus": "ON", 
                    "receiverThreadState": "Waiting for source to send event", 
                    "replicationLag": null, 
                    "replicationSsl": "TLS_AES_256_GCM_SHA384 TLSv1.3"
                }, 
                "status": "INCONSISTENT", 
                "transactionSetConsistencyStatus": "INCONSISTENT", 
                "transactionSetConsistencyStatusText": "There are 1 transactions that were executed in this instance that did not originate from the PRIMARY.", 
                "transactionSetErrantGtidSet": "6b5c4051-1bf5-11ee-8304-c8cb9e32df8e:1"
            }, 
            "127.0.0.1:3330": {
                "address": "127.0.0.1:3330", 
                "instanceRole": "SECONDARY", 
                "mode": "R/O", 
                "replication": {
                    "applierStatus": "APPLIED_ALL", 
                    "applierThreadState": "Waiting for an event from Coordinator", 
                    "applierWorkerThreads": 4, 
                    "receiverStatus": "ON", 
                    "receiverThreadState": "Waiting for source to send event", 
                    "replicationLag": null, 
                    "replicationSsl": "TLS_AES_256_GCM_SHA384 TLSv1.3"
                }, 
                "status": "ONLINE"
            }
        }, 
        "type": "ASYNC"
    }
}

Ouch! That FLUSH HOSTS statement created a transaction on my replica!

Let’s verify that this errant transaction belongs to the previous FLUSH command:

$ mysqlbinlog dell-bin.000002 | grep '6b5c4051-1bf5-11ee-8304-c8cb9e32df8e:1' -A 5
SET @@SESSION.GTID_NEXT= '6b5c4051-1bf5-11ee-8304-c8cb9e32df8e:1'/*!*/;
# at 3752
#230706 14:19:40 server id 338567193  end_log_pos 3829 CRC32 0x7d36a7d0 	Query	thread_id=27	exec_time=0	error_code=0
SET TIMESTAMP=1688645980/*!*/;
SET @@session.sql_mode=1168113696/*!*/;
flush hosts

So indeed, the GTID matches…
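Another way to identify errant transactions, without parsing binary logs, is to subtract the PRIMARY's gtid_executed set from the replica's. This is just a sketch run on the replica; the PRIMARY's GTID set below is an example value and would normally be retrieved from the PRIMARY itself:

 SQL > select gtid_subtract(@@global.gtid_executed,
          'f72446ef-1bf4-11ee-8b28-c8cb9e32df8e:1-105') as errant_gtids;
 -- anything returned here was executed on the replica but not on the PRIMARY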

It could have been worse: if the replica had been restarted (maintenance, crash), it would never have been able to join the ReplicaSet.


The same problem arises with plain asynchronous replication (configured manually) or with InnoDB ClusterSet.

Since the errant transaction was only related to a `FLUSH HOSTS` command, there is no data inconsistency, and we can fix it by injecting an empty transaction with the same GTID on the PRIMARY instance:

SQL >  SET GTID_NEXT="6b5c4051-1bf5-11ee-8304-c8cb9e32df8e:1";
SQL >  START TRANSACTION;
SQL >  COMMIT;
SQL >  SET GTID_NEXT="AUTOMATIC"; 
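To double-check, still on the PRIMARY, we can verify that the errant GTID is now part of its gtid_executed set (GTID_SUBSET() returns 1 when the first set is contained in the second):

 SQL > select gtid_subset('6b5c4051-1bf5-11ee-8304-c8cb9e32df8e:1',
          @@global.gtid_executed) as is_included;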

And now the replica will be able to rejoin the ReplicaSet.

The monitoring system gave us bad advice, which we followed without thinking about our architecture.

When replication is involved, it's always recommended to use the LOCAL keyword with FLUSH statements, so that the statement is not written to the binary log.

And for the host cache, truncating the performance_schema.host_cache table is also recommended, as the deprecation warning above suggested.

To illustrate this, take a look at the following commands and the created GTIDs:

 SQL > select @@gtid_executed\G
*************************** 1. row ***************************
@@gtid_executed: f72446ef-1bf4-11ee-8b28-c8cb9e32df8e:1-105
1 row in set (0.0007 sec)

 SQL > flush hosts;
Query OK, 0 rows affected, 1 warning (0.0078 sec)

 SQL > select @@gtid_executed\G
*************************** 1. row ***************************
@@gtid_executed: 61314e4f-1bfd-11ee-8bf1-c8cb9e32df8e:1,
f72446ef-1bf4-11ee-8b28-c8cb9e32df8e:1-105
1 row in set (0.0004 sec)

 SQL > flush local hosts;
Query OK, 0 rows affected, 1 warning (0.0003 sec)

 SQL > select @@gtid_executed\G
*************************** 1. row ***************************
@@gtid_executed: 61314e4f-1bfd-11ee-8bf1-c8cb9e32df8e:1,
f72446ef-1bf4-11ee-8b28-c8cb9e32df8e:1-105
1 row in set (0.0003 sec)

 SQL > truncate performance_schema.host_cache;
Query OK, 0 rows affected (0.0006 sec)

 SQL > select @@gtid_executed\G
*************************** 1. row ***************************
@@gtid_executed: 61314e4f-1bfd-11ee-8bf1-c8cb9e32df8e:1,
f72446ef-1bf4-11ee-8b28-c8cb9e32df8e:1-105
1 row in set (0.0007 sec)

We can see that only the first command, the plain FLUSH HOSTS, generated a new GTID.

Conclusion

When applying recommendations, it's always necessary to understand the implications of the commands and whether they apply to your environment without side effects.

Enjoy MySQL!

