MySQL InnoDB Cluster – consistency levels

Consistency during reads have been a small concern from the adopters of MySQL InnoDB Cluster (see this post and this one).

This is why MySQL supports now (since 8.0.14) a new consistency model to avoid such situation when needed.

Nuno Carvalho and Aníbal Pinto already posted a blog series I highly encourage you to read:

After those great articles, let’s check how that does work with some examples.

The environment

This is how the environment is setup:

  • 3 members: mysql1, mysql2 & mysql3
  • the cluster runs in Single-Primay mode
  • mysql1 is the Primary Master
  • some extra sys views are installed

Example 1 – EVENTUAL

This is the default behavior (group_replication_consistency='EVENTUAL'). The scenario is the following:

  • we display the default value of the session variable controlling the Group Replication Consistency on the Primary and on one Secondary
  • we lock a table on a Secondary master (mysql3) to block the apply of the transaction coming from the Primary
  • we demonstrate that even if we commit a new transaction on mysql1, we can read the table on mysql3 and the new record is missing (the write could not happen due to the lock)
  • once unlocked, the transaction is applied and the record is visible on the Secondary master (mysql3) too.

Example 2 – BEFORE

In this example, we will illustrate how we can avoid inconsistent reads on a Secondary master:

As you could notice, once we have set the session variable controlling the consistency, operations on the table (the server is READ-ONLY) are waiting for the Apply Queue to be empty before returning the result set.

We could also notice that the wait time (timeout) for this read operation is very long (8 hours by default) and can be modified to a shorter period:

We used SET wait_timeout=10 to define it to 10 seconds.

When the timeout is reached, the following error is returned:

ERROR: 3797: Error while waiting for group transactions commit on group_replication_consistency= 'BEFORE'

Example 3 – AFTER

It’s also possible to return from commit on the writer only when all members applied the change too. Let’s check this in action too:

This can be considered as synchronous writes as the return from commit happens only when all members have applied it. However you could also notice that in this consistency level, wait_timeout has not effect on the write. In fact wait_timeout has only effect on read operations when the consistency level is different than EVENTUAL.

This means that this can lead to several issues if you lock a table for any reason. If the DBA needs to perform some maintenance operations and requires to lock a table for a long time, it’s mandatory to not operate queries in AFTER or BEFORE_AND_AFTERwhile in such maintenance.

Example 4 – Scope

In the following video, I just want to show you the “scope” of these “waits” for transactions that are in the applying queue.

We will lock again t1 but on a Secondary master, we will perform a SELECT from table t2, the first time we will keep the default value of group_replication_consistency(EVENTUAL) and the second time we will change the consistency level to BEFORE :

We could see that as soon as they are transactions in the apply queue, if you change the consistency level to something BEFORE, it needs to wait for the previous transactions in the queue to be applied even if those events are related or not to the same table(s) or record(s). It doesn’t matter.

Example 5 – Observability

Of course it’s possible to check what’s going on and if queries are waiting for something.

BEFORE

When group_replication_consistency is set to BEFORE (or includes it), while a transaction is waiting for the applying queue to be committed, it’s possible to track those waiting transactions by running the following query:

SELECT * FROM information_schema.processlist 
WHERE state='Executing hook on transaction begin.';

AFTER

When group_replication_consistency is set to AFTER (or includes it), while a transaction is waiting for the transaction to be committed on the other members too, it’s possible to track those waiting transactions by running the following query:

SELECT * FROM information_schema.processlist 
WHERE state='waiting for handler commit';

It’s also possible to have even more information joining the processlist and InnoDB Trx tables:

SELECT *, TIME_TO_SEC(TIMEDIFF(now(),trx_started)) lock_time_sec 
FROM information_schema.innodb_trx JOIN information_schema.processlist
ON processlist.ID=innodb_trx.trx_mysql_thread_id
WHERE state='waiting for handler commit' ORDER BY trx_started\G

Conclusion

This consistency level is a wonderful feature but it could become dangerous if abused without full control of your environment.

I would avoid to set anything AFTER globally if you don’t control completely your environment. Table locks, DDLs, logical backups, snapshots could all delay the commits and transactions could start pilling up on the Primary Master. But if you control your environment, you have now the complete freedom to control completely the consistency you need on your MySQL InnoDB Cluster.

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

5 Comments

  1. Hi Lefred,

    I have a Innodb Cluster Mysql 8.0.19 single primary with 3 nodes in production and I have problems with queries that including longblobs. In the processlist I can see a lot of querys “waiting for handler commit” and in the logs of mysql, the nodes switch in reachable and unreachable mode.
    I have a Innodb Cluster Mysql 8.0.23 single primary with 2 nodes to test and I have the same problems. I test running querys with mysqlslap, insert 30,40,50 querys simultaneos including a longblob around about 13MB and I have the same problem the processlist with a lot of querys “waiting for handler commit” and the process not finish.
    I test the querys usind only one node in the cluster and the problem disappear.
    Could you tell me what is it happening?
    Thanks a lot

      • Hi Lefred,

        Thanks for you quick response.
        The size of the trx in the binary log is 14MB for all transact. If I send 50 queries, 50 results in binary log.
        localhost:33060+ ssl JS > check.showTrxSize(‘binlog.000020’)
        Transactions in binary log binlog.000020:
        14 mb
        14 mb
        14 mb
        14 mb
        14 mb
        14 mb
        14 mb
        14 mb
        14 mb
        …..

        localhost:33060+ ssl JS > check.getBinlogsIO()
        +————————-+————+————+————+————-+—————+————+————+———–+
        | file | count_read | total_read | avg_read | count_write | total_written | avg_write | total | write_pct |
        +————————-+————+————+————+————-+—————+————+————+———–+
        | @@datadir/binlog.000001 | 21 | 10.31 KiB | 502 bytes | 0 | 0 bytes | 0 bytes | 10.31 KiB | 0.00 |
        | @@datadir/binlog.000002 | 12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
        | @@datadir/binlog.000003 | 12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
        | @@datadir/binlog.000004 | 12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
        | @@datadir/binlog.000005 | 12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
        | @@datadir/binlog.000006 | 12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
        | @@datadir/binlog.000007 | 12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
        | @@datadir/binlog.000008 | 12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
        | @@datadir/binlog.000009 | 12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
        | @@datadir/binlog.000010 | 12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
        | @@datadir/binlog.000011 | 12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
        | @@datadir/binlog.000012 | 12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
        | @@datadir/binlog.000013 | 12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
        | @@datadir/binlog.000014 | 12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
        | @@datadir/binlog.000015 | 12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
        | @@datadir/binlog.000016 | 5523 | 3.54 GiB | 672.73 KiB | 0 | 0 bytes | 0 bytes | 3.54 GiB | 0.00 |
        | @@datadir/binlog.000017 | 18 | 13.40 MiB | 762.55 KiB | 66775 | 1.01 GiB | 15.80 KiB | 1.02 GiB | 98.72 |
        | @@datadir/binlog.000018 | 3227 | 2.07 GiB | 672.41 KiB | 68649 | 1.03 GiB | 15.80 KiB | 3.10 GiB | 33.33 |
        | @@datadir/binlog.000019 | 3706 | 2.38 GiB | 673.73 KiB | 66798 | 1.01 GiB | 15.80 KiB | 3.39 GiB | 29.71 |
        | @@datadir/binlog.000020 | 1436 | 943.40 MiB | 672.73 KiB | 51724 | 798.25 MiB | 15.80 KiB | 1.70 GiB | 45.83 |
        | @@datadir/binlog.index | 2 | 640 bytes | 320 bytes | 0 | 0 bytes | 0 bytes | 640 bytes | 0.00 |
        +————————-+————+————+————+————-+—————+————+————+———–+
        total_latency: 19.58 s count_read: 14109 (8.93 GiB) count_write: 253954 (3.83 GiB)

        My param
        select @@group_replication_transaction_size_limit;
        +——————————————–+
        | @@group_replication_transaction_size_limit |
        +——————————————–+
        | 150000000 |
        +——————————————–+
        1 row in set (0.00 sec)

        The benchmark of 50 queries. ( –concurrency=50 –iterations=1 )

        With only one node
        Benchmark
        Average number of seconds to run all queries: 141.006 seconds
        Minimum number of seconds to run all queries: 141.006 seconds
        Maximum number of seconds to run all queries: 141.006 seconds
        Number of clients running queries: 50
        Average number of queries per client: 13

        With two nodes
        Benchmark
        Average number of seconds to run all queries: 202.321 seconds
        Minimum number of seconds to run all queries: 202.321 seconds
        Maximum number of seconds to run all queries: 202.321 seconds
        Number of clients running queries: 50
        Average number of queries per client: 13

        Although with two nodes normally don’t finish. In my tests I wait for more than 30 min, and some queries continue “waiting for handler commit”.
        For example, I send 70 queries concurrent, 27 register in binlog 21, 22 register in binlog 22 and pending 21 that in processlist are “waiting for handler commit”
        When crash, binlog don’t grow.
        mysql> show processlist;
        +—–+——————-+———————-+————-+———+——-+——————————————————–+———————————-+
        | Id | User | Host | db | Command | Time | State | Info |
        +—–+——————-+———————-+————-+———+——-+——————————————————–+———————————-+
        | 6 | event_scheduler | localhost | NULL | Daemon | 67267 | Waiting on empty queue | NULL |
        | 10 | system user | | NULL | Connect | 67267 | waiting for handler commit | Group replication applier module |
        | 13 | system user | | NULL | Query | 1868 | Slave has read all relay log; waiting for more updates | NULL |
        | 14 | system user | | NULL | Connect | 67267 | Waiting for an event from Coordinator | NULL |
        | 15 | system user | | NULL | Connect | 67267 | Waiting for an event from Coordinator | NULL |
        | 16 | system user | | NULL | Connect | 67267 | Waiting for an event from Coordinator | NULL |
        | 17 | system user | | NULL | Connect | 67267 | Waiting for an event from Coordinator | NULL |
        | 247 | clustermanagerpre | localhost:52602 | NULL | Sleep | 407 | NULL | PLUGIN |
        | 248 | clustermanagerpre | localhost | NULL | Query | 0 | init | show processlist |
        | 249 | clustermanagerpre | localhost:36012 | NULL | Sleep | 798 | | NULL |
        | 423 | clusterpre | 192.168.202.57:32862 | NULL | Sleep | 507 | | NULL |
        | 428 | clusterpre | 192.168.202.57:32874 | NSR_MNG_PRE | Query | 304 | waiting for handler commit | COMMIT |
        | 429 | clusterpre | 192.168.202.57:32866 | NSR_MNG_PRE | Query | 304 | waiting for handler commit | COMMIT |
        | 442 | clusterpre | 192.168.202.57:32900 | NSR_MNG_PRE | Query | 305 | waiting for handler commit | COMMIT |
        | 447 | clusterpre | 192.168.202.57:32910 | NSR_MNG_PRE | Query | 305 | waiting for handler commit | COMMIT |
        | 454 | clusterpre | 192.168.202.57:32924 | NSR_MNG_PRE | Query | 305 | waiting for handler commit | COMMIT |
        | 457 | clusterpre | 192.168.202.57:32930 | NSR_MNG_PRE | Query | 305 | waiting for handler commit | COMMIT |
        | 458 | clusterpre | 192.168.202.57:32932 | NSR_MNG_PRE | Query | 304 | waiting for handler commit | COMMIT |
        | 460 | clusterpre | 192.168.202.57:32936 | NSR_MNG_PRE | Query | 304 | waiting for handler commit | COMMIT |
        | 465 | clusterpre | 192.168.202.57:32946 | NSR_MNG_PRE | Query | 304 | waiting for handler commit | COMMIT |
        | 471 | clusterpre | 192.168.202.57:32958 | NSR_MNG_PRE | Query | 305 | waiting for handler commit | COMMIT |
        | 475 | clusterpre | 192.168.202.57:32966 | NSR_MNG_PRE | Query | 305 | waiting for handler commit | COMMIT |
        | 476 | clusterpre | 192.168.202.57:32968 | NSR_MNG_PRE | Query | 304 | waiting for handler commit | COMMIT |
        | 477 | clusterpre | 192.168.202.57:32970 | NSR_MNG_PRE | Query | 303 | waiting for handler commit | COMMIT |
        | 478 | clusterpre | 192.168.202.57:32972 | NSR_MNG_PRE | Query | 305 | waiting for handler commit | COMMIT |
        | 482 | clusterpre | 192.168.202.57:32980 | NSR_MNG_PRE | Query | 303 | waiting for handler commit | COMMIT |
        | 483 | clusterpre | 192.168.202.57:32982 | NSR_MNG_PRE | Query | 304 | waiting for handler commit | COMMIT |
        | 485 | clusterpre | 192.168.202.57:32986 | NSR_MNG_PRE | Query | 305 | waiting for handler commit | COMMIT |
        | 487 | clusterpre | 192.168.202.57:32988 | NSR_MNG_PRE | Query | 303 | waiting for handler commit | COMMIT |
        | 488 | clusterpre | 192.168.202.57:32992 | NSR_MNG_PRE | Query | 304 | waiting for handler commit | COMMIT |
        | 492 | clusterpre | 192.168.202.57:33000 | NSR_MNG_PRE | Query | 303 | waiting for handler commit | COMMIT |
        | 493 | clusterpre | 192.168.202.57:33002 | NSR_MNG_PRE | Query | 303 | waiting for handler commit | COMMIT |
        +—–+——————-+———————-+————-+———+——-+——————————————————–+———————————-+

        In the 8.0.23 don’t appear any error in mysql.log.
        In the 8.0.19 appear the following message in mysql.log of both nodes:

        2021-01-26T10:37:48.744425Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: ‘Member with address mysqlclustertest:3306 has become unreachable.’
        2021-01-26T10:37:48.744486Z 0 [ERROR] [MY-011495] [Repl] Plugin group_replication reported: ‘This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.’
        2021-01-26T10:37:49.744486Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: ‘Member with address mysqlclustertest:3306 is reachable again.’
        2021-01-26T10:37:49.744545Z 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: ‘The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.’

        And the queries “waiting for handler commit”
        Sorry, but we are desperated. We dont find any way to solve the problem.
        Thanks in advanced

        • Hi Marta,

          From the logs it seems there is a network glitch, a member becomes unreachable and comes back 1 second later.

          What kind of link do you have between your nodes ?

          I advise you to submit a bug, follow up will be easier.

          Cheers.

          • Hi Lefred!
            The nodes run over 1Gbps LAN. We made a test moving the nodes to a unique server and the problem is solved.
            We are trying detect the LAN problem now.
            Very thanks for you help.
            Cheers.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.