MariaDB Innovation: InnoDB-Based Binary Log

I am starting a new series on what makes MariaDB Server distinct from MySQL, highlighting innovations that make the difference.

MariaDB 12.3 introduces a new binary log implementation that stores binlog events directly in InnoDB-managed tablespaces rather than in separate flat files on disk.

This is an incredible innovation; for a long time, binary logs have been a performance bottleneck. DimK pointed it out several times. [1], [2]

The new binlog design halves the number of fsyncs and improves performance, as Mark Callaghan noted in his blog posts. [3], [4]

How does the new binary log implementation look?

Let’s have a look at what to expect as a DBA when the feature is enabled.

To use the InnoDB-Based Binary Log, you need to have these two variables in your configuration file before starting MariaDB Server:

log_bin
binlog_storage_engine=innodb

I also recommend using the ROW format, enabling log_slave_updates, and defining a unique server_id if you plan to use your server in a replication chain.

MariaDB [fred]> select @@binlog_storage_engine;
+-------------------------+
| @@binlog_storage_engine |
+-------------------------+
| innodb                  |
+-------------------------+
1 row in set (0.001 sec)

This new feature supports only Global Transaction IDs (GTIDs); the old binlog name + position is not supported, which is not a problem, as I highly recommend using GTIDs anyway.

On the filesystem, we can see by default 2 files of 1GB (there are pre-allocated, and the default 1GB is defined by max_binlog_size):

root@mariadb1:/var/lib/mysql# ls -lh binlo*
-rw-rw----. 1 mysql mysql 1.0G Feb 19 11:44 binlog-000000.ibb
-rw-rw----. 1 mysql mysql 1.0G Feb 19 11:19 binlog-000001.ibb

There are no other files, such as indexes.

When performance_schema is enabled, it’s also possible to see those files from the SQL layer:

MariaDB [fred]> select * from sys.io_global_by_file_by_bytes 
                where file like '%.ibb'\G
*************************** 1. row ***************************
         file: @@datadir/binlog-000000.ibb
   count_read: 18
   total_read: 288.00 KiB
     avg_read: 16.00 KiB
  count_write: 0
total_written:    0 bytes
    avg_write:    0 bytes
        total: 288.00 KiB
    write_pct: 0.00
*************************** 2. row ***************************
         file: @@datadir/binlog-000001.ibb
   count_read: 1
   total_read: 16.00 KiB
     avg_read: 16.00 KiB
  count_write: 0
total_written:    0 bytes
    avg_write:    0 bytes
        total: 16.00 KiB
    write_pct: 0.00
2 rows in set (0.023 sec)

But what we can see here is that even under some load, those values are not really updated:

MariaDB [fred]> show global status like 'binl%written';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| Binlog_bytes_written | 1997715 |
+----------------------+---------+
1 row in set (0.000 sec)

Flushing Binary Logs

When we flush the binary logs, the binlog in use will be shrunk to its actual content size, and a new file will be created:

MariaDB [fred]> flush binary logs;
Query OK, 0 rows affected (0.024 sec)
root@mariadb1:/var/lib/mysql# ls -lh binlog-00000*
-rw-rw----. 1 mysql mysql  58M Feb 25 11:05 binlog-000000.ibb
-rw-rw----. 1 mysql mysql 1.0G Feb 25 11:05 binlog-000001.ibb
-rw-rw----. 1 mysql mysql 1.0G Feb 25 11:05 binlog-000002.ibb

Unfortunately, this appears to have no effect on the sys table:

MariaDB [fred]> select * from sys.io_global_by_file_by_bytes 
                where file like '%.ibb'\G
*************************** 1. row ***************************
         file: @@datadir/binlog-000000.ibb
   count_read: 18
   total_read: 288.00 KiB
     avg_read: 16.00 KiB
  count_write: 0
total_written:    0 bytes
    avg_write:    0 bytes
        total: 288.00 KiB
    write_pct: 0.00
*************************** 2. row ***************************
         file: @@datadir/binlog-000001.ibb
   count_read: 1
   total_read: 16.00 KiB
     avg_read: 16.00 KiB
  count_write: 0
total_written:    0 bytes
    avg_write:    0 bytes
        total: 16.00 KiB
    write_pct: 0.00
2 rows in set (0.036 sec)

The new InnoDB-Based Binary Log files are not yet correctly instrumented in performance_schema. None of the “file_instances” tables is yet 100% usable. For example, after a restart, the filesystem is like this:

root@mariadb1:/var/lib/mysql# ls -lh binlog-00000*
-rw-rw----. 1 mysql mysql  58M Feb 25 11:05 binlog-000000.ibb
-rw-rw----. 1 mysql mysql 1.0G Feb 25 12:04 binlog-000001.ibb
-rw-rw----. 1 mysql mysql 1.0G Feb 25 11:05 binlog-000002.ibb

But in performance_schema, we still have only:

MariaDB [performance_schema]> select * from performance_schema.file_instances where file_name like '%ibb';
+----------------------------------+--------------------------------------+------------+
| FILE_NAME                        | EVENT_NAME                           | OPEN_COUNT |
+----------------------------------+--------------------------------------+------------+
| /var/lib/mysql/binlog-000002.ibb | wait/io/file/innodb/innodb_data_file |          1 |
| /var/lib/mysql/binlog-000001.ibb | wait/io/file/innodb/innodb_data_file |          1 |
+----------------------------------+--------------------------------------+------------+
2 rows in set (0.000 sec)

Hopefully, the binary log files are correct using the SHOW BINARY LOGS statement:

MariaDB [performance_schema]> show binary logs;
+-------------------+------------+
| Log_name          | File_size  |
+-------------------+------------+
| binlog-000000.ibb |   60735488 |
| binlog-000001.ibb | 1073741824 |
| binlog-000002.ibb | 1073741824 |
+-------------------+------------+
3 rows in set (0.000 sec)

InnoDB-Based Binary Logs and the Buffer Pool

As the binlogs are using the InnoDB storage engine, one can think that the pages of these tablespaces are also using the InnoDB Buffer Pool as any other InnoDB tables, but in fact, the binlog pages are not using the normal InnoDB Buffer Pool. They are using a separate “binlog page fifo” analogous to the Buffer Pool. The logic behind it is much simpler, as binlog pages are written sequentially and never updated in place, only appended.

Those pages use the redo log mechanism without contention with other InnoDB pages in the Buffer Pool, as the mechanisms are separate. This explanation is from Kristian Nielsen; you can find it in the MariaDB Developer mailing list.

Another notable difference between the Buffer Pool and the binlog page fifo is that within the InnoDB Buffer Pool, it’s generally preferable to delay writing pages to disk for performance. Having many dirty pages is generally better. I mostly suggest having 70-90%. On the binlog page fifo, however, it’s advantageous to write pages to the filesystem cache as soon as they’re completely written, since they won’t be changed anyway.

Kristian here discusses the O/S filesystem cache and leaving the O/S with maximum flexibility in scheduling physical disk writes. I was thinking that could be affected by the value of innodb_flush_method, but this doesn’t control it anymore since MariaDB 11.0.

Also, it’s important to note that InnoDB-based binary logs do not follow the innodb_fast_shutdown setting as one should generally expect; we are talking about InnoDB, aren’t we? This means that regardless of the value of innodb_fast_shutdown, even if it’s>0, the pages in the “binlog page fifo” are flushed to the binlog files on disk at shutdown.

Understanding O_DIRECT and Its Role in InnoDB File Caching

So let’s go back a little to the O/S filesystem and cache.

I usually recommend using O_DIRECT for InnoDB files. O_DIRECT lets InnoDB bypass the O/S filesystem cache, which avoids double buffering. Indeed, InnoDB already has all the modified data in its buffer pool (in RAM) and flushes it to disk during checkpointing. And if we let the kernel handle writing them to files, we waste memory (cached in RAM). For typical InnoDB servers, O_DIRECT avoids wasted caching, improves memory use, and usually aligns better with how InnoDB manages its I/O. Remember, DBAs hate when their systems swap!

In the past, only Percona Server could bypass the filesystem cache for data and log files by using ALL_O_DIRECT.

But, like the old binary logs, InnoDB-based binlogs don’t align with how the server handles buffering for InnoDB files.

So even if the following variables are set to OFF (the default), which is equivalent to the old innodb_flush_method=O_DIRECT for data + same behavior for logs, InnoDB-based binlogs will continue to use the filesystem cache. Meaning that some memory will be used for the binlog page fifo and for the binlog files too:

We can easily verify this. I’ve run some sysbench load, and we can see that the InnoDB data files are not using the filesystem cache:

root@mariadb1:/var/lib/mysql# python3.11 ~/bin/dbsake fincore sbtest/*ibd
sbtest/sbtest1.ibd: total_pages=36864 cached=52 percent=0.14
sbtest/sbtest2.ibd: total_pages=36864 cached=52 percent=0.14

And this is the same for the InnoDB log files:

root@mariadb1:/var/lib/mysql# python3.11 ~/bin/dbsake fincore ib_logfile0 
ib_logfile0: total_pages=24576 cached=0 percent=0.00

But if we check the binary log files, we can see that the filesystem is using memory to cache them:

root@mariadb1:/var/lib/mysql# python3.11 ~/bin/dbsake fincore binlog-00000*
binlog-000000.ibb: total_pages=14828 cached=0 percent=0.00
binlog-000001.ibb: total_pages=262144 cached=8924 percent=3.40
binlog-000002.ibb: total_pages=262144 cached=262144 percent=100.00
binlog-000003.ibb: total_pages=262144 cached=262144 percent=100.00
binlog-000004.ibb: total_pages=262144 cached=8308 percent=3.17
binlog-000005.ibb: total_pages=262144 cached=0 percent=0.00

And indeed a bit more than 2G of RAM is used by the filesystem cache:

root@mariadb1:/var/lib/mysql# ls -lh binlog-00000*
-rw-rw----. 1 mysql mysql  58M Feb 25 11:05 binlog-000000.ibb
-rw-rw----. 1 mysql mysql 1.0G Mar 16 12:28 binlog-000001.ibb
-rw-rw----. 1 mysql mysql 1.0G Mar 16 12:40 binlog-000002.ibb
-rw-rw----. 1 mysql mysql 1.0G Mar 16 16:37 binlog-000003.ibb
-rw-rw----. 1 mysql mysql 1.0G Mar 16 16:38 binlog-000004.ibb
-rw-rw----. 1 mysql mysql 1.0G Mar 16 16:37 binlog-000005.ibb

root@mariadb1:/var/lib/mysql# free -m
               total        used        free      shared  buff/cache   available
Mem:           11508         921        8088          18        2772       10586
Swap:           4095           0        4095

A small tip for DBAs, if you need to reclaim memory for your system and the filesystem’s cache is not shrinking fast enough, instead of dropping all cached pages, I recommend only removing from the cache all the binlog but the last 2:

root@mariadb1:/var/lib/mysql# python3.11 ~/bin/dbsake uncache binlog-00000[0-3].ibb
Uncached binlog-000000.ibb
Uncached binlog-000001.ibb
Uncached binlog-000002.ibb
Uncached binlog-000003.ibb
root@mariadb1:/var/lib/mysql# free -m
               total        used        free      shared  buff/cache   available
Mem:           11508         909       10217          18         653       10599
Swap:           4095           0        4095

Performance

Now that we have discussed the implementation and what it looks like, the main goal of all this work is, of course, performance.

Mark Callaghan illustrates the improvement in his posts in detail. And I also recommend reading Kristian Nielsen’s comment, which explains why the benefit is even more than expected!

I know that Mark is often using innodb_flush_log_at_trx_commit=2 and sync_binlog = 0 in his tests, so I wanted to double-check with the safer values I usually recommend in production:

innodb_flush_log_at_trx_commit = 1
sync_binlog = 1
innodb_data_file_buffering=OFF
innodb_data_file_write_through=OFF
innodb_log_file_buffering=OFF
innodb_log_file_write_through=OFF

And there I saw a huge benefit of using the new InnoDB-based binary logs!

Using sysbench OTLP Write Only on a very small system, I could pass from 24,475 events to 77,232!!

As you can see TPS (Transactions per second) and QPS (Queries per second) are much higher (higher is better). Those numbers, if confirmed by someone else as well (hey, Marco Tusa!), are impressive! Good job, the engineers!

Additionally, using InnoDB-based binary logs improves crash safety because the binary logs and InnoDB data files are always in sync.

In conclusion, the performance gain is significant. I would just love to see a bit more instrumentation (idea for contributions?).

While writing this post, I also found this very nice article by Adrien Obernesser. I recommend that you read on the same topic: MariaDB 12.3 – Binlog Inside InnoDB.

Give it a try and keep improving your workload thanks to MariaDB’s innovation!

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Leave a Reply

Your email address will not be published. Required fields are marked *