Recently, a benchmark comparing MariaDB and MySQL performance of ORDER BY over text columns didn’t really compare the same things. It was like comparing apples and oranges.
The results are interesting of course, but I would like to point that MySQL is focusing on Unicode Collation Algorithm compliant collations. That’s the reason why we didn’t even try to optimize non compliant collations like utf8_general_ci, utf8mb4_general_ci, utf8mb4_unicode_ci, utf8_unicode_ci and more…
Collations like utf8mb4_unicode_520_ci and utf8mb4_0900_ai_ci are based on UCA. The number in the collation defines the UCA version:
- UCA 9.0.0 (recommended)
– example: utf8mb4_0900_ai_ci
- UCA 5.2.0 (not recommended, see problems below)
– example: utf8mb4_unicode_520_ci
The default collation in MySQL 8.0 is utf8mb4_0900_ai_ci
For the comparison, it would have been better to compare utf8mb4_unicode_520_nopad_ci (in MariaDB, collation id 1270) with utf8mb4_unicode_520_ci (even if that collation is already obsolete) or utf8mb4_0900_ai_ci (in MySQL).
Now let’s have a look at what those collations are used for. We start with creating a table like this:
idint NOT NULL AUTO_INCREMENT,
0900_ai_civarchar(50) CHARACTER SET utf8mb4
COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
unicode_520_civarchar(50) CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
general_civarchar(50) CHARACTER SET utf8mb4
COLLATE utf8mb4_general_ci DEFAULT NULL,PRIMARY KEY (
id) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
We can see that we have 3 varchar columns using different collations where the first 2 are UCA compliant (even if the second is less).
Let’s see the content of that table:
Now let’s see the difference when we sort them using those varchar columns:
We can see the difference and notice that the sorting using the default MySQL 8.0 collation (utf8mb4_0900_ai_ci, on the first column) is the one giving the correct result. This UCA compliant collation is sorting the different ‘a’ (𝕒 and 𝒶) as ‘a’. We can also notice difference in ordering the Japanese characters.
Currently, MariaDB doesn’t support such collation.
For more information about UCA support in MySQL and why the correct results are important take a look at those previous post by my colleagues in the engineering team:
- Sushi = Beer ?! An introduction of UTF8 support in MySQL 8.0
- MySQL 8.0 Collations: The devil is in the details
Let’s have a look at another example with Cyrillic characters to illustrate the difference between the Unicode 5.2 and 9.0:
Of course I could not write a blog post on Charsets and Collations without showing the famous Shushi = Beer example:
If you don’t remember what _as and _ci mean in the collation name, this means accent sensitive and case insensitive.
The accent sensitivity over Unicode charsets can be illustrated for example with some Japanese words where it takes all its importance:
I hope this clarifies a bit why using UCA compliant collations (default in MySQL 8.0) is important and why we are focusing on those.
Currently MySQL 8.0 support 49 utf8mb4 collations compatible with UCA 9.0.0. To get the list just run the following statement:
mysql> show collation like 'u%900%';
Thank you for using MySQL !
we recommend using:
please do not use:
– utf8_* (no support for Emoji, missing CJK characters, etc)
– utf8mb3 (no support for Emoji, missing CJK characters, etc)
– utf8mb4_general_ci (have problems like the Sushi-Beer explained above)
– utf8mb4_unicode_ci (have problems like the Sushi-Beer explained above)
– utf8mb4_unicode_520_ci: (have problems like Mother-Daddy issue, for Japanese as if it matches characters of p sound and b sound and cannot be resolved)
MySQL Shell Dump of the