MySQL: Character Sets, Unicode, and UCA compliant collations

Recently, a benchmark comparing MariaDB and MySQL performance of ORDER BY over text columns didn’t really compare the same things. It was like comparing apples and oranges.

The results are interesting of course, but I would like to point that MySQL is focusing on Unicode Collation Algorithm compliant collations. That’s the reason why we didn’t even try to optimize non compliant collations like utf8_general_ci, utf8mb4_general_ci, utf8mb4_unicode_ci, utf8_unicode_ci and more…

Collations like utf8mb4_unicode_520_ci and utf8mb4_0900_ai_ci are based on UCA. The number in the collation defines the UCA version:

  • UCA 9.0.0 (recommended)
    – example: utf8mb4_0900_ai_ci
  • UCA 5.2.0 (not recommended, see problems below)
    – example: utf8mb4_unicode_520_ci

The default collation in MySQL 8.0 is utf8mb4_0900_ai_ci

For the comparison, it would have been better to compare utf8mb4_unicode_520_nopad_ci (in MariaDB, collation id 1270) with utf8mb4_unicode_520_ci (even if that collation is already obsolete) or utf8mb4_0900_ai_ci (in MySQL).

Now let’s have a look at what those collations are used for. We start with creating a table like this:

CREATE TABLE collation_ex (
   id int NOT NULL AUTO_INCREMENT,
   0900_ai_ci varchar(50) CHARACTER SET utf8mb4 
        COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
   unicode_520_ci varchar(50) CHARACTER SET utf8mb4 
        COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
   general_ci varchar(50) CHARACTER SET utf8mb4 
        COLLATE utf8mb4_general_ci DEFAULT NULL,
   PRIMARY KEY (id)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

We can see that we have 3 varchar columns using different collations where the first 2 are UCA compliant (even if the second is less).

Let’s see the content of that table:

Now let’s see the difference when we sort them using those varchar columns:

We can see the difference and notice that the sorting using the default MySQL 8.0 collation (utf8mb4_0900_ai_ci, on the first column) is the one giving the correct result. This UCA compliant collation is sorting the different ‘a’ (𝕒 and 𝒶) as ‘a’. We can also notice difference in ordering the Japanese characters.

Currently, MariaDB doesn’t support such collation.

For more information about UCA support in MySQL and why the correct results are important take a look at those previous post by my colleagues in the engineering team:

Let’s have a look at another example with Cyrillic characters to illustrate the difference between the Unicode 5.2 and 9.0:

Of course I could not write a blog post on Charsets and Collations without showing the famous Shushi = Beer example:

If you don’t remember what _as and _ci mean in the collation name, this means accent sensitive and case insensitive.

The accent sensitivity over Unicode charsets can be illustrated for example with some Japanese words where it takes all its importance:

I hope this clarifies a bit why using UCA compliant collations (default in MySQL 8.0) is important and why we are focusing on those.

Currently MySQL 8.0 support 49 utf8mb4 collations compatible with UCA 9.0.0. To get the list just run the following statement:

mysql> show collation like 'u%900%';

Thank you for using MySQL !

In summary:

we recommend using:
utf8mb4_0900_*
utf8mb4_bin

please do not use:
utf8_* (no support for Emoji, missing CJK characters, etc)
utf8mb3 (no support for Emoji, missing CJK characters, etc)
utf8mb4_general_ci (have problems like the Sushi-Beer explained above)
utf8mb4_unicode_ci (have problems like the Sushi-Beer explained above)
utf8mb4_unicode_520_ci: (have problems like Mother-Daddy issue, for Japanese as if it matches characters of p sound and b sound and cannot be resolved)

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

2 Comments

  1. Hi Frederic,

    thank you for following up our post and the discussion! There were other requests to benchmark utf8mb4_0900_ai_ci (e.g in the Twitter discussion https://twitter.com/morgo/status/1412552857102929920), so Axel Schwenke has ran additional tests comparing MySQL with utf8mb4_0900_ai_ci and MariaDB with utf8mb4_general_ci – the results are at the below of our original post with “UPDATE” mark. You didn’t address performance in this post, but performance of MariaDB utf8mb4_general_ci is still faster than MySQL utf8mb4_0900_ai_ci.

    In our post I also addressed differences between the MariaDB and MySQL collations implementations. In short, MariaDB templates generate more specific code than MySQL. I believe that would be better for the whole community if MySQL team just update the code to get even faster utf8mb4_0900_ai_ci implementation. The code won’t run faster if you just say that the benchmarks are irrelevant. There is still room for improvement.

    I see that there was a report for MySQL about comparing sushi with beer, but I’m not sure if emoji comparison really matters for the real life applications… Besides emoji there are language specific collations which do the right things with language specific characters.

    The both latest releases of Ubuntu and Debian go with the default configuration file /etc/mysql/mariadb.conf.d/50-server.cnf which has

    #
    # * Character sets
    #
    # MySQL/MariaDB default is Latin1, but in Debian we rather default to the full
    # utf8 4-byte character set. See also client.cnf
    #
    character-set-server = utf8mb4
    collation-server = utf8mb4_general_ci

    So by default all MariaDB tables go with utf8mb4_general_ci and now I’m pretty sure that with the last update the comparison of the MySQL and MariaDB collations is 100% apples with apples 🙂

    • Hi Alexander,

      Thank you for your comment. Being UCA 9.0.0 compliant and then also make differences between emojis or other special characters (at least for me not being familiar with those languages) makes it different. My sentence about apple and oranges is related that you cannot compare a collation that takes care of this differences with one that does not.
      Cheers.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.