MySQL: Character Sets, Unicode, and UCA compliant collations

Recently, a benchmark comparing MariaDB and MySQL performance of ORDER BY over text columns didn’t really compare the same things. It was like comparing apples and oranges.

The results are interesting of course, but I would like to point that MySQL is focusing on Unicode Collation Algorithm compliant collations. That’s the reason why we didn’t even try to optimize non compliant collations like utf8_general_ci, utf8mb4_general_ci, utf8mb4_unicode_ci, utf8_unicode_ci and more…

Collations like utf8mb4_unicode_520_ci and utf8mb4_0900_ai_ci are based on UCA. The number in the collation defines the UCA version:

UCA 9.0.0 (recommended)
– example: utf8mb4_0900_ai_ci
UCA 5.2.0 (not recommended, see problems below)
– example: utf8mb4_unicode_520_ci

The default collation in MySQL 8.0 is utf8mb4_0900_ai_ci

For the comparison, it would have been better to compare utf8mb4_unicode_520_nopad_ci (in MariaDB, collation id 1270) with utf8mb4_unicode_520_ci (even if that collation is already obsolete) or utf8mb4_0900_ai_ci (in MySQL).

Now let’s have a look at what those collations are used for. We start with creating a table like this:

CREATE TABLE collation_ex (
   id int NOT NULL AUTO_INCREMENT,
   0900_ai_ci varchar(50) CHARACTER SET utf8mb4 
        COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
   unicode_520_ci varchar(50) CHARACTER SET utf8mb4 
        COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
   general_ci varchar(50) CHARACTER SET utf8mb4 
        COLLATE utf8mb4_general_ci DEFAULT NULL,
   PRIMARY KEY (id)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

We can see that we have 3 varchar columns using different collations where the first 2 are UCA compliant (even if the second is less).

Let’s see the content of that table:

Now let’s see the difference when we sort them using those varchar columns:

We can see the difference and notice that the sorting using the default MySQL 8.0 collation (utf8mb4_0900_ai_ci, on the first column) is the one giving the correct result. This UCA compliant collation is sorting the different ‘a’ (𝕒 and 𝒶) as ‘a’. We can also notice difference in ordering the Japanese characters.

Currently, MariaDB doesn’t support such collation.

For more information about UCA support in MySQL and why the correct results are important take a look at those previous post by my colleagues in the engineering team:

Let’s have a look at another example with Cyrillic characters to illustrate the difference between the Unicode 5.2 and 9.0:

Of course I could not write a blog post on Charsets and Collations without showing the famous Shushi = Beer example:

If you don’t remember what _as and _ci mean in the collation name, this means accent sensitive and case insensitive.

The accent sensitivity over Unicode charsets can be illustrated for example with some Japanese words where it takes all its importance:

I hope this clarifies a bit why using UCA compliant collations (default in MySQL 8.0) is important and why we are focusing on those.

Currently MySQL 8.0 support 49 utf8mb4 collations compatible with UCA 9.0.0. To get the list just run the following statement:

mysql> show collation like 'u%900%';

Thank you for using MySQL !

In summary:

we recommend using:
– utf8mb4_0900_*
– utf8mb4_bin

please do not use:
– utf8_* (no support for Emoji, missing CJK characters, etc)
– utf8mb3 (no support for Emoji, missing CJK characters, etc)
– utf8mb4_general_ci (have problems like the Sushi-Beer explained above)
– utf8mb4_unicode_ci (have problems like the Sushi-Beer explained above)
– utf8mb4_unicode_520_ci: (have problems like Mother-Daddy issue, for Japanese as if it matches characters of p sound and b sound and cannot be resolved)

MySQL Shell Dump of the collation_ex table:

collation_ex Download

2 Comments

Alexander Krizhanovsky

July 20, 2021 / 16:12 Reply

Hi Frederic,

thank you for following up our post and the discussion! There were other requests to benchmark utf8mb4_0900_ai_ci (e.g in the Twitter discussion https://twitter.com/morgo/status/1412552857102929920), so Axel Schwenke has ran additional tests comparing MySQL with utf8mb4_0900_ai_ci and MariaDB with utf8mb4_general_ci – the results are at the below of our original post with “UPDATE” mark. You didn’t address performance in this post, but performance of MariaDB utf8mb4_general_ci is still faster than MySQL utf8mb4_0900_ai_ci.

In our post I also addressed differences between the MariaDB and MySQL collations implementations. In short, MariaDB templates generate more specific code than MySQL. I believe that would be better for the whole community if MySQL team just update the code to get even faster utf8mb4_0900_ai_ci implementation. The code won’t run faster if you just say that the benchmarks are irrelevant. There is still room for improvement.

I see that there was a report for MySQL about comparing sushi with beer, but I’m not sure if emoji comparison really matters for the real life applications… Besides emoji there are language specific collations which do the right things with language specific characters.

The both latest releases of Ubuntu and Debian go with the default configuration file /etc/mysql/mariadb.conf.d/50-server.cnf which has

#
# * Character sets
#
# MySQL/MariaDB default is Latin1, but in Debian we rather default to the full
# utf8 4-byte character set. See also client.cnf
#
character-set-server = utf8mb4
collation-server = utf8mb4_general_ci

So by default all MariaDB tables go with utf8mb4_general_ci and now I’m pretty sure that with the last update the comparison of the MySQL and MariaDB collations is 100% apples with apples 🙂
- lefred
  
  July 20, 2021 / 16:18 Reply
  
  Hi Alexander,
  
  Thank you for your comment. Being UCA 9.0.0 compliant and then also make differences between emojis or other special characters (at least for me not being familiar with those languages) makes it different. My sentence about apple and oranges is related that you cannot compare a collation that takes care of this differences with one that does not.
  Cheers.

Cookie	Duration	Description
cookielawinfo-checbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

MySQL: Character Sets, Unicode, and UCA compliant collations

follow me

Subscribe to Blog via Email

2 Comments

Leave a Reply to Alexander KrizhanovskyCancel Reply