MySQL InnoDB Cluster – consistency levels

lefred
March 14, 2019
8.0, General, Group Replication, HA, InnoDB Cluster, MySQL, MySQL 8.0
5 Comments

Consistency during reads have been a small concern from the adopters of MySQL InnoDB Cluster (see this post and this one).

This is why MySQL supports now (since 8.0.14) a new consistency model to avoid such situation when needed.

Nuno Carvalho and Aníbal Pinto already posted a blog series I highly encourage you to read:

After those great articles, let’s check how that does work with some examples.

The environment

This is how the environment is setup:

3 members: mysql1, mysql2 & mysql3
the cluster runs in Single-Primay mode
mysql1 is the Primary Master
some extra sys views are installed

Example 1 – EVENTUAL

This is the default behavior (group_replication_consistency='EVENTUAL'). The scenario is the following:

we display the default value of the session variable controlling the Group Replication Consistency on the Primary and on one Secondary
we lock a table on a Secondary master (mysql3) to block the apply of the transaction coming from the Primary
we demonstrate that even if we commit a new transaction on mysql1, we can read the table on mysql3 and the new record is missing (the write could not happen due to the lock)
once unlocked, the transaction is applied and the record is visible on the Secondary master (mysql3) too.

Example 2 – BEFORE

In this example, we will illustrate how we can avoid inconsistent reads on a Secondary master:

As you could notice, once we have set the session variable controlling the consistency, operations on the table (the server is READ-ONLY) are waiting for the Apply Queue to be empty before returning the result set.

We could also notice that the wait time (timeout) for this read operation is very long (8 hours by default) and can be modified to a shorter period:

We used SET wait_timeout=10 to define it to 10 seconds.

When the timeout is reached, the following error is returned:

ERROR: 3797: Error while waiting for group transactions commit on group_replication_consistency= 'BEFORE'

Example 3 – AFTER

It’s also possible to return from commit on the writer only when all members applied the change too. Let’s check this in action too:

This can be considered as synchronous writes as the return from commit happens only when all members have applied it. However you could also notice that in this consistency level, wait_timeout has not effect on the write. In fact wait_timeout has only effect on read operations when the consistency level is different than EVENTUAL.

This means that this can lead to several issues if you lock a table for any reason. If the DBA needs to perform some maintenance operations and requires to lock a table for a long time, it’s mandatory to not operate queries in AFTER or BEFORE_AND_AFTERwhile in such maintenance.

Example 4 – Scope

In the following video, I just want to show you the “scope” of these “waits” for transactions that are in the applying queue.

We will lock again t1 but on a Secondary master, we will perform a SELECT from table t2, the first time we will keep the default value of group_replication_consistency(EVENTUAL) and the second time we will change the consistency level to BEFORE :

We could see that as soon as they are transactions in the apply queue, if you change the consistency level to something BEFORE, it needs to wait for the previous transactions in the queue to be applied even if those events are related or not to the same table(s) or record(s). It doesn’t matter.

Example 5 – Observability

Of course it’s possible to check what’s going on and if queries are waiting for something.

BEFORE

When group_replication_consistency is set to BEFORE (or includes it), while a transaction is waiting for the applying queue to be committed, it’s possible to track those waiting transactions by running the following query:

SELECT * FROM information_schema.processlist 
WHERE state='Executing hook on transaction begin.';

AFTER

When group_replication_consistency is set to AFTER (or includes it), while a transaction is waiting for the transaction to be committed on the other members too, it’s possible to track those waiting transactions by running the following query:

SELECT * FROM information_schema.processlist 
WHERE state='waiting for handler commit';

It’s also possible to have even more information joining the processlist and InnoDB Trx tables:

SELECT *, TIME_TO_SEC(TIMEDIFF(now(),trx_started)) lock_time_sec 
FROM information_schema.innodb_trx JOIN information_schema.processlist
ON processlist.ID=innodb_trx.trx_mysql_thread_id 
WHERE state='waiting for handler commit' ORDER BY trx_started\G

Conclusion

This consistency level is a wonderful feature but it could become dangerous if abused without full control of your environment.

I would avoid to set anything AFTER globally if you don’t control completely your environment. Table locks, DDLs, logical backups, snapshots could all delay the commits and transactions could start pilling up on the Primary Master. But if you control your environment, you have now the complete freedom to control completely the consistency you need on your MySQL InnoDB Cluster.

5 Comments

Marta Vazquez

January 25, 2021 / 18:30 Reply

Hi Lefred,

I have a Innodb Cluster Mysql 8.0.19 single primary with 3 nodes in production and I have problems with queries that including longblobs. In the processlist I can see a lot of querys “waiting for handler commit” and in the logs of mysql, the nodes switch in reachable and unreachable mode.
I have a Innodb Cluster Mysql 8.0.23 single primary with 2 nodes to test and I have the same problems. I test running querys with mysqlslap, insert 30,40,50 querys simultaneos including a longblob around about 13MB and I have the same problem the processlist with a lot of querys “waiting for handler commit” and the process not finish.
I test the querys usind only one node in the cluster and the problem disappear.
Could you tell me what is it happening?
Thanks a lot
- lefred
  
  January 25, 2021 / 18:40 Reply
  
  Hi Marta,
  
  Please verify the size of the trx in the binary log (see https://github.com/lefred/mysqlshell-plugins/wiki/check#showtrxsizesort).
  
  Cheers
  - Marta Vazquez Gonzalez
    
    January 26, 2021 / 12:42 Reply
    
    Hi Lefred,
    
    Thanks for you quick response.
    The size of the trx in the binary log is 14MB for all transact. If I send 50 queries, 50 results in binary log.
    localhost:33060+ ssl JS > check.showTrxSize(‘binlog.000020’)
    Transactions in binary log binlog.000020:
    14 mb
    14 mb
    14 mb
    14 mb
    14 mb
    14 mb
    14 mb
    14 mb
    14 mb
    …..
    
    localhost:33060+ ssl +———̵ | file +———̵ | @@datadir/binlog.000001 | | @@datadir/binlog.000002 | | @@datadir/binlog.000003 | | @@datadir/binlog.000004 | | @@datadir/binlog.000005 | | @@datadir/binlog.000006 | | @@datadir/binlog.000007 | | @@datadir/binlog.000008 | | @@datadir/binlog.000009 | | @@datadir/binlog.000010 | | @@datadir/binlog.000011 | | @@datadir/binlog.000012 | | @@datadir/binlog.000013 | | @@datadir/binlog.000014 | | @@datadir/binlog.000015 | | @@datadir/binlog.000016 | | @@datadir/binlog.000017 | | @@datadir/binlog.000018 | | @@datadir/binlog.000019 | | @@datadir/binlog.000020 | | @@datadir/binlog.index | +———̵ total_latency: 19.58 s JS > check.getBinlogsIO()
    2;————-+————+————+————+————-+—————+————+————+———–+
    | count_read | total_read | avg_read | count_write | total_written | avg_write | total | write_pct |
    2;————-+————+————+————+————-+—————+————+————+———–+
    21 | 10.31 KiB | 502 bytes | 0 | 0 bytes | 0 bytes | 10.31 KiB | 0.00 |
    12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
    12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
    12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
    12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
    12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
    12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
    12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
    12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
    12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
    12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
    12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
    12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
    12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
    12 | 48 bytes | 4 bytes | 0 | 0 bytes | 0 bytes | 48 bytes | 0.00 |
    5523 | 3.54 GiB | 672.73 KiB | 0 | 0 bytes | 0 bytes | 3.54 GiB | 0.00 |
    18 | 13.40 MiB | 762.55 KiB | 66775 | 1.01 GiB | 15.80 KiB | 1.02 GiB | 98.72 |
    3227 | 2.07 GiB | 672.41 KiB | 68649 | 1.03 GiB | 15.80 KiB | 3.10 GiB | 33.33 |
    3706 | 2.38 GiB | 673.73 KiB | 66798 | 1.01 GiB | 15.80 KiB | 3.39 GiB | 29.71 |
    1436 | 943.40 MiB | 672.73 KiB | 51724 | 798.25 MiB | 15.80 KiB | 1.70 GiB | 45.83 |
    2 | 640 bytes | 320 bytes | 0 | 0 bytes | 0 bytes | 640 bytes | 0.00 |
    2;————-+————+————+————+————-+—————+————+————+———–+
    count_read: 14109 (8.93 GiB) count_write: 253954 (3.83 GiB)
    
    My param
    select @@group_replication_transaction_size_limit;
    +——————————————–+
    | @@group_replication_transaction_size_limit |
    +——————————————–+
    | 150000000 |
    +——————————————–+
    1 row in set (0.00 sec)
    
    The benchmark of 50 queries. ( –concurrency=50 –iterations=1 )
    
    With only one node
    Benchmark
    Average number of seconds to run all queries: 141.006 seconds
    Minimum number of seconds to run all queries: 141.006 seconds
    Maximum number of seconds to run all queries: 141.006 seconds
    Number of clients running queries: 50
    Average number of queries per client: 13
    
    With two nodes
    Benchmark
    Average number of seconds to run all queries: 202.321 seconds
    Minimum number of seconds to run all queries: 202.321 seconds
    Maximum number of seconds to run all queries: 202.321 seconds
    Number of clients running queries: 50
    Average number of queries per client: 13
    
    Although with two nodes normally don’t finish. In my tests I wait for more than 30 min, and some queries continue “waiting for handler commit”.
    For example, I send 70 queries concurrent, 27 register in binlog 21, 22 register in binlog 22 and pending 21 that in processlist are “waiting for handler commit”
    When crash, binlog don’t grow.
    mysql> show processlist;
    +—–+——————-+———————-+————-+———+——-+——————————————————–+———————————-+
    | Id | User | Host | db | Command | Time | State | Info |
    +—–+——————-+———————-+————-+———+——-+——————————————————–+———————————-+
    | 6 | event_scheduler | localhost | NULL | Daemon | 67267 | Waiting on empty queue | NULL |
    | 10 | system user | | NULL | Connect | 67267 | waiting for handler commit | Group replication applier module |
    | 13 | system user | | NULL | Query | 1868 | Slave has read all relay log; waiting for more updates | NULL |
    | 14 | system user | | NULL | Connect | 67267 | Waiting for an event from Coordinator | NULL |
    | 15 | system user | | NULL | Connect | 67267 | Waiting for an event from Coordinator | NULL |
    | 16 | system user | | NULL | Connect | 67267 | Waiting for an event from Coordinator | NULL |
    | 17 | system user | | NULL | Connect | 67267 | Waiting for an event from Coordinator | NULL |
    | 247 | clustermanagerpre | localhost:52602 | NULL | Sleep | 407 | NULL | PLUGIN |
    | 248 | clustermanagerpre | localhost | NULL | Query | 0 | init | show processlist |
    | 249 | clustermanagerpre | localhost:36012 | NULL | Sleep | 798 | | NULL |
    | 423 | clusterpre | 192.168.202.57:32862 | NULL | Sleep | 507 | | NULL |
    | 428 | clusterpre | 192.168.202.57:32874 | NSR_MNG_PRE | Query | 304 | waiting for handler commit | COMMIT |
    | 429 | clusterpre | 192.168.202.57:32866 | NSR_MNG_PRE | Query | 304 | waiting for handler commit | COMMIT |
    | 442 | clusterpre | 192.168.202.57:32900 | NSR_MNG_PRE | Query | 305 | waiting for handler commit | COMMIT |
    | 447 | clusterpre | 192.168.202.57:32910 | NSR_MNG_PRE | Query | 305 | waiting for handler commit | COMMIT |
    | 454 | clusterpre | 192.168.202.57:32924 | NSR_MNG_PRE | Query | 305 | waiting for handler commit | COMMIT |
    | 457 | clusterpre | 192.168.202.57:32930 | NSR_MNG_PRE | Query | 305 | waiting for handler commit | COMMIT |
    | 458 | clusterpre | 192.168.202.57:32932 | NSR_MNG_PRE | Query | 304 | waiting for handler commit | COMMIT |
    | 460 | clusterpre | 192.168.202.57:32936 | NSR_MNG_PRE | Query | 304 | waiting for handler commit | COMMIT |
    | 465 | clusterpre | 192.168.202.57:32946 | NSR_MNG_PRE | Query | 304 | waiting for handler commit | COMMIT |
    | 471 | clusterpre | 192.168.202.57:32958 | NSR_MNG_PRE | Query | 305 | waiting for handler commit | COMMIT |
    | 475 | clusterpre | 192.168.202.57:32966 | NSR_MNG_PRE | Query | 305 | waiting for handler commit | COMMIT |
    | 476 | clusterpre | 192.168.202.57:32968 | NSR_MNG_PRE | Query | 304 | waiting for handler commit | COMMIT |
    | 477 | clusterpre | 192.168.202.57:32970 | NSR_MNG_PRE | Query | 303 | waiting for handler commit | COMMIT |
    | 478 | clusterpre | 192.168.202.57:32972 | NSR_MNG_PRE | Query | 305 | waiting for handler commit | COMMIT |
    | 482 | clusterpre | 192.168.202.57:32980 | NSR_MNG_PRE | Query | 303 | waiting for handler commit | COMMIT |
    | 483 | clusterpre | 192.168.202.57:32982 | NSR_MNG_PRE | Query | 304 | waiting for handler commit | COMMIT |
    | 485 | clusterpre | 192.168.202.57:32986 | NSR_MNG_PRE | Query | 305 | waiting for handler commit | COMMIT |
    | 487 | clusterpre | 192.168.202.57:32988 | NSR_MNG_PRE | Query | 303 | waiting for handler commit | COMMIT |
    | 488 | clusterpre | 192.168.202.57:32992 | NSR_MNG_PRE | Query | 304 | waiting for handler commit | COMMIT |
    | 492 | clusterpre | 192.168.202.57:33000 | NSR_MNG_PRE | Query | 303 | waiting for handler commit | COMMIT |
    | 493 | clusterpre | 192.168.202.57:33002 | NSR_MNG_PRE | Query | 303 | waiting for handler commit | COMMIT |
    +—–+——————-+———————-+————-+———+——-+——————————————————–+———————————-+
    
    In the 8.0.23 don’t appear any error in mysql.log.
    In the 8.0.19 appear the following message in mysql.log of both nodes:
    
    2021-01-26T10:37:48.744425Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: ‘Member with address mysqlclustertest:3306 has become unreachable.’
    2021-01-26T10:37:48.744486Z 0 [ERROR] [MY-011495] [Repl] Plugin group_replication reported: ‘This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.’
    2021-01-26T10:37:49.744486Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: ‘Member with address mysqlclustertest:3306 is reachable again.’
    2021-01-26T10:37:49.744545Z 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: ‘The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.’
    
    And the queries “waiting for handler commit”
    Sorry, but we are desperated. We dont find any way to solve the problem.
    Thanks in advanced
    - lefred
      
      January 26, 2021 / 15:07 Reply
      
      Hi Marta,
      
      From the logs it seems there is a network glitch, a member becomes unreachable and comes back 1 second later.
      
      What kind of link do you have between your nodes ?
      
      I advise you to submit a bug, follow up will be easier.
      
      Cheers.
      - Marta Vazquez Gonzalez
        
        January 27, 2021 / 09:28
        
        Hi Lefred!
        The nodes run over 1Gbps LAN. We made a test moving the nodes to a unique server and the problem is solved.
        We are trying detect the LAN problem now.
        Very thanks for you help.
        Cheers.

Cookie	Duration	Description
cookielawinfo-checbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.