MySQL 8.0 Histograms

Prior to MySQL 8.0, MySQL was missing a feature that is very well known in other RDBMSs: optimizer histograms.

The Optimizer Team implemented this feature as more and more MySQL DBAs were requesting it.

Definition

But what are histograms? According to Wikipedia, a histogram is an accurate representation of the distribution of numerical data. For an RDBMS, a histogram is an approximation of the data distribution within a specific column.

So in MySQL, histograms help the optimizer find the most efficient query plan.

Example

To illustrate how histograms influence the optimizer, I will use data from dbt3.

My query is simple:

SELECT * FROM orders  
  JOIN customer ON o_custkey = c_custkey 
WHERE o_orderdate < '1993-01-01' 
  AND c_mktsegment = "AUTOMOBILE"\G

Let’s have a look at the EXPLAIN output:

mysql> EXPLAIN SELECT * FROM orders  
       JOIN customer ON o_custkey = c_custkey 
       WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: customer
   partitions: NULL
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 149050
     filtered: 10.00
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: orders
   partitions: NULL
         type: ref
possible_keys: i_o_custkey,i_o_orderdate
          key: i_o_custkey
      key_len: 5
          ref: dbt3.customer.c_custkey
         rows: 14
     filtered: 30.62
        Extra: Using where
2 rows in set, 1 warning (0.28 sec)

We can see that the optimizer starts with a full table scan on customer, and that the selectivity estimate (filtered) is 10%. Without any statistics on the column, 10% is simply the optimizer’s default guess for an equality condition on a non-indexed column.
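
If you want to see the cost figures behind this plan, EXPLAIN FORMAT=JSON exposes them. This is just a quick way to peek at the estimates, the classic output is enough for the rest of this post:

EXPLAIN FORMAT=JSON
SELECT * FROM orders
  JOIN customer ON o_custkey = c_custkey
WHERE o_orderdate < '1993-01-01'
  AND c_mktsegment = 'AUTOMOBILE'\G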

Let’s run the query (using COUNT(*)) and see how many rows match and how long it takes:

mysql> SELECT count(*) FROM orders  
       JOIN customer ON o_custkey = c_custkey 
       WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"\G
*************************** 1. row ***************************
count(*): 45127
1 row in set (49.98 sec)

Histogram Creation

Now I will create a histogram on customer for the column c_mktsegment:

mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_mktsegment WITH 1024 BUCKETS;
+---------------+-----------+----------+---------------------------------------------------------+
| Table         | Op        | Msg_type | Msg_text                                                |
+---------------+-----------+----------+---------------------------------------------------------+
| dbt3.customer | histogram | status   | Histogram statistics created for column 'c_mktsegment'. |
+---------------+-----------+----------+---------------------------------------------------------+

We can immediately check the new query execution plan:

mysql> EXPLAIN SELECT * FROM orders  
               JOIN customer ON o_custkey = c_custkey 
               WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: orders
   partitions: NULL
         type: ALL
possible_keys: i_o_custkey,i_o_orderdate
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1494230
     filtered: 30.62
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: customer
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: dbt3.orders.o_custkey
         rows: 1
     filtered: 19.84
        Extra: Using where
2 rows in set, 1 warning (1.06 sec)

Now, with the histogram, we can see that it becomes less attractive to start with the customer table, since almost twice as many rows (19.84% instead of the guessed 10%) would cause look-ups into the orders table.
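
If you are curious where this estimate comes from, the optimizer trace is a good place to look, as it records the filtering estimates considered for each table. This is a side note, not needed to follow the rest of the example:

SET optimizer_trace='enabled=on';
EXPLAIN SELECT * FROM orders
  JOIN customer ON o_custkey = c_custkey
WHERE o_orderdate < '1993-01-01'
  AND c_mktsegment = 'AUTOMOBILE';
-- the trace is one large JSON document; search it for the
-- filtering estimate of the c_mktsegment condition
SELECT TRACE FROM information_schema.OPTIMIZER_TRACE\G
SET optimizer_trace='enabled=off';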

Scanning the whole orders table may look more expensive, but that is the plan the optimizer now selects. Let’s have a look at the execution time:

mysql> SELECT count(*) FROM orders  
       JOIN customer ON o_custkey = c_custkey 
       WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"\G
*************************** 1. row ***************************
count(*): 45127
1 row in set (6.35 sec)

Much faster!
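
If you want to confirm where the reads go with each plan, one simple trick (my own habit, not part of the original test) is to compare the handler counters around the query:

FLUSH STATUS;
SELECT COUNT(*) FROM orders
  JOIN customer ON o_custkey = c_custkey
WHERE o_orderdate < '1993-01-01'
  AND c_mktsegment = 'AUTOMOBILE';
-- Handler_read_rnd_next counts rows read by table scans,
-- Handler_read_key / Handler_read_next count rows read through indexes
SHOW SESSION STATUS LIKE 'Handler_read%';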

View the distribution

The histogram is stored in the data dictionary and is visible through the information_schema.COLUMN_STATISTICS table. This is the table definition:

+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| SCHEMA_NAME | varchar(64) | NO   |     | NULL    |       |
| TABLE_NAME  | varchar(64) | NO   |     | NULL    |       |
| COLUMN_NAME | varchar(64) | NO   |     | NULL    |       |
| HISTOGRAM   | json        | NO   |     | NULL    |       |
+-------------+-------------+------+-----+---------+-------+
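
To get a quick inventory of the histograms that already exist on a server, a simple query against this table is enough:

SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME
FROM information_schema.COLUMN_STATISTICS;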

This is what a record looks like:

SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, JSON_PRETTY(HISTOGRAM) 
FROM information_schema.column_statistics 
WHERE COLUMN_NAME = 'c_mktsegment'\G
*************************** 1. row ***************************
           SCHEMA_NAME: dbt3
            TABLE_NAME: customer
           COLUMN_NAME: c_mktsegment
JSON_PRETTY(HISTOGRAM): {
  "buckets": [
    [
      "base64:type254:QVVUT01PQklMRQ==",
      0.19837010534684954
    ],
    [
      "base64:type254:QlVJTERJTkc=",
      0.3983104750546611
    ],
    [
      "base64:type254:RlVSTklUVVJF",
      0.5978433710991851
    ],
    [
      "base64:type254:SE9VU0VIT0xE",
      0.799801232359372
    ],
    [
      "base64:type254:TUFDSElORVJZ",
      1.0
    ]
  ],
  "data-type": "string",
  "null-values": 0.0,
  "collation-id": 255,
  "last-updated": "2018-03-02 20:21:48.271523",
  "sampling-rate": 0.6709158000670916,
  "histogram-type": "singleton",
  "number-of-buckets-specified": 1024
}
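
A word about the sampling-rate field above: the server does not necessarily read every row to build the histogram. If I’m not mistaken, the memory it may use for this is controlled by histogram_generation_max_mem_size, so raising it increases the sampling rate (the value below is just an example):

-- value in bytes (the default is 20000000)
SET histogram_generation_max_mem_size = 50000000;
ANALYZE TABLE customer UPDATE HISTOGRAM ON c_mktsegment WITH 1024 BUCKETS;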

And with JSON_TABLE and a window function, it’s also possible to see the distribution:

SELECT FROM_BASE64(SUBSTRING_INDEX(v, ':', -1)) value, concat(round(c*100,1),'%') cumulfreq, 
       CONCAT(round((c - LAG(c, 1, 0) over()) * 100,1), '%') freq  
FROM information_schema.column_statistics, JSON_TABLE(histogram->'$.buckets', 
     '$[*]' COLUMNS(v VARCHAR(60) PATH '$[0]', c double PATH '$[1]')) hist  
WHERE schema_name  = 'dbt3' and table_name = 'customer' and column_name = 'c_mktsegment';
+------------+-----------+-------+
| value      | cumulfreq | freq  |
+------------+-----------+-------+
| AUTOMOBILE | 19.8%     | 19.8% |
| BUILDING   | 39.9%     | 20.1% |
| FURNITURE  | 59.9%     | 19.9% |
| HOUSEHOLD  | 79.9%     | 20.1% |
| MACHINERY  | 100.0%    | 20.1% |
+------------+-----------+-------+

You can also remove a histogram using the following syntax:

mysql> ANALYZE TABLE customer DROP HISTOGRAM on c_mktsegment;
+---------------+-----------+----------+---------------------------------------------------------+
| Table         | Op        | Msg_type | Msg_text                                                |
+---------------+-----------+----------+---------------------------------------------------------+
| dbt3.customer | histogram | status   | Histogram statistics removed for column 'c_mktsegment'. |
+---------------+-----------+----------+---------------------------------------------------------+
1 row in set (0.00 sec)
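
Keep in mind that histogram statistics are not refreshed automatically when the data changes; after significant data modifications you have to run the UPDATE HISTOGRAM statement again:

-- e.g. after loading or purging a lot of rows in customer
ANALYZE TABLE customer UPDATE HISTOGRAM ON c_mktsegment WITH 1024 BUCKETS;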

Buckets

You noticed that when we create a histogram, we need to specify the number of buckets. In fact, the data is split into a set of buckets, each holding one or more values together with their cumulative frequency. If you check the type of histogram in the previous example, you can see that it was singleton:

"histogram-type": "singleton",

This is the most precise type, as the frequency is recorded for each individual value. Now I will recreate the histogram using only 2 buckets this time (remember there are 5 distinct values in the column c_mktsegment):

mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_mktsegment WITH 2 BUCKETS;
+---------------+-----------+----------+---------------------------------------------------------+
| Table         | Op        | Msg_type | Msg_text                                                |
+---------------+-----------+----------+---------------------------------------------------------+
| dbt3.customer | histogram | status   | Histogram statistics created for column 'c_mktsegment'. |
+---------------+-----------+----------+---------------------------------------------------------+

If we check the histogram type:

mysql> SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, 
              JSON_PRETTY(HISTOGRAM) 
       FROM information_schema.column_statistics 
      WHERE COLUMN_NAME = 'c_mktsegment'\G
*************************** 1. row ***************************
           SCHEMA_NAME: dbt3
            TABLE_NAME: customer
           COLUMN_NAME: c_mktsegment
JSON_PRETTY(HISTOGRAM): {
  "buckets": [
    [
      "base64:type254:QVVUT01PQklMRQ==",
      "base64:type254:RlVSTklUVVJF",
      0.5996992690844636,
      3
    ],
    [
      "base64:type254:SE9VU0VIT0xE",
      "base64:type254:TUFDSElORVJZ",
      1.0,
      2
    ]
  ],
  "data-type": "string",
  "null-values": 0.0,
  "collation-id": 255,
  "last-updated": "2018-03-02 20:42:26.165898",
  "sampling-rate": 0.6709158000670916,
  "histogram-type": "equi-height",
  "number-of-buckets-specified": 2
}

Now the histogram-type is equi-height, which means that contiguous ranges of values are grouped into buckets such that roughly the same number of rows falls into each bucket.
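
As you can see in the JSON above, each equi-height bucket holds four elements: the lower bound, the upper bound, the cumulative frequency and the number of distinct values. The distribution query shown earlier can be adapted to read them (a quick sketch, the column aliases are mine):

SELECT FROM_BASE64(SUBSTRING_INDEX(bmin, ':', -1)) min_value,
       FROM_BASE64(SUBSTRING_INDEX(bmax, ':', -1)) max_value,
       CONCAT(ROUND(c*100,1),'%') cumulfreq,
       d ndv
FROM information_schema.column_statistics,
     JSON_TABLE(histogram->'$.buckets', '$[*]'
       COLUMNS(bmin VARCHAR(60) PATH '$[0]',
               bmax VARCHAR(60) PATH '$[1]',
               c DOUBLE PATH '$[2]',
               d INT PATH '$[3]')) hist
WHERE schema_name = 'dbt3' AND table_name = 'customer' AND column_name = 'c_mktsegment';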

Conclusion

Histograms are useful for columns that are not the first column of an index and that are used in WHERE conditions of queries with JOIN, IN-subqueries or ORDER BY… LIMIT.

Also, try to use enough buckets to get singleton histograms whenever possible.
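
Checking the number of distinct values in the column beforehand tells you whether that is feasible (the maximum is 1024 buckets):

-- if this is <= 1024, a singleton histogram can be built
SELECT COUNT(DISTINCT c_mktsegment) FROM customer;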

You can find more information on histograms in Erik’s blog post and Øystein’s presentation during FOSDEM.


Comments

  1. I wanted to add that histograms are also very powerful when you filter (in the where clause) on a column that is not indexed and you are matching against a value that doesn’t exist. The Optimizer will know very quickly there won’t be any rows matching.


