Prior to MySQL 8.0, MySQL was missing a very well know feature in other RDBMS: optimizer histograms.
The Optimizer Team implemented this feature as more and more MySQL DBAs were requesting it.
Definition
But what are histograms ? For wikipedia, a histogram is an accurate representation of the distribution of numerical data. For RDBMS, a histogram is an approximation of the data distribution within a specific column.
So in MySQL, histograms help the optimizer to find the most efficient Query Plan.
Example
To illustrate how histograms influence the optimizer, I will use data from dbt3.
My query is simple:
SELECT * FROM orders JOIN customer ON o_custkey = c_custkey WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"\G
Let’s have a look at the traditional EXPLAIN output and then a the VISUAL one:
mysql> EXPLAIN SELECT * FROM orders JOIN customer ON o_custkey = c_custkey WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: customer partitions: NULL type: ALL possible_keys: PRIMARY key: NULL key_len: NULL ref: NULL rows: 149050 filtered: 10.00 Extra: Using where *************************** 2. row *************************** id: 1 select_type: SIMPLE table: orders partitions: NULL type: ref possible_keys: i_o_custkey,i_o_orderdate key: i_o_custkey key_len: 5 ref: dbt3.customer.c_custkey rows: 14 filtered: 30.62 Extra: Using where 2 rows in set, 1 warning (0.28 sec)
We can see that we first do a full table scan on customer
and the selectivity estimate (filtered) is 10%.
Let’s run the query (I used count(*)
) and see how many rows we went through:
mysql> SELECT count(*) FROM orders JOIN customer ON o_custkey = c_custkey WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"\G *************************** 1. row *************************** count(*): 45127 1 row in set (49.98 sec)
Histogram Creation
Now I will create an histogram on customer
for the column c_mktsegment
:
mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_mktsegment WITH 1024 BUCKETS; +---------------+-----------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +---------------+-----------+----------+---------------------------------------------------------+ | dbt3.customer | histogram | status | Histogram statistics created for column 'c_mktsegment'. | +---------------+-----------+----------+---------------------------------------------------------+
We can already verify the query execution plan:
mysql> EXPLAIN SELECT * FROM orders JOIN customer ON o_custkey = c_custkey WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: orders partitions: NULL type: ALL possible_keys: i_o_custkey,i_o_orderdate key: NULL key_len: NULL ref: NULL rows: 1494230 filtered: 30.62 Extra: Using where *************************** 2. row *************************** id: 1 select_type: SIMPLE table: customer partitions: NULL type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: dbt3.orders.o_custkey rows: 1 filtered: 19.84 Extra: Using where 2 rows in set, 1 warning (1.06 sec)
Now with the histogram we can see that it becomes less attractive to start with customer
table since almost twice as many rows (19.84%) will cause look-ups into the order
table.
The cost seem higher but the optimizer select to fully scan the orders table, let’s have a look at the execution time:
mysql> SELECT count(*) FROM orders JOIN customer ON o_custkey = c_custkey WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"\G *************************** 1. row *************************** count(*): 45127 1 row in set (6.35 sec)
Much faster !
View the distribution
The histogram is stored in the Information_Schema.column_statistics
table. This is the table definition:
+-------------+-------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------------+-------------+------+-----+---------+-------+ | SCHEMA_NAME | varchar(64) | NO | | NULL | | | TABLE_NAME | varchar(64) | NO | | NULL | | | COLUMN_NAME | varchar(64) | NO | | NULL | | | HISTOGRAM | json | NO | | NULL | | +-------------+-------------+------+-----+---------+-------+
This is how a record looks like:
SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, JSON_PRETTY(HISTOGRAM) FROM information_schema.column_statistics WHERE COLUMN_NAME = 'c_mktsegment'\G *************************** 1. row *************************** SCHEMA_NAME: dbt3 TABLE_NAME: customer COLUMN_NAME: c_mktsegment JSON_PRETTY(HISTOGRAM): { "buckets": [ [ "base64:type254:QVVUT01PQklMRQ==", 0.19837010534684954 ], [ "base64:type254:QlVJTERJTkc=", 0.3983104750546611 ], [ "base64:type254:RlVSTklUVVJF", 0.5978433710991851 ], [ "base64:type254:SE9VU0VIT0xE", 0.799801232359372 ], [ "base64:type254:TUFDSElORVJZ", 1.0 ] ], "data-type": "string", "null-values": 0.0, "collation-id": 255, "last-updated": "2018-03-02 20:21:48.271523", "sampling-rate": 0.6709158000670916, "histogram-type": "singleton", "number-of-buckets-specified": 1024 }
And it’s also possible to see the distribution:
SELECT FROM_BASE64(SUBSTRING_INDEX(v, ':', -1)) value, concat(round(c*100,1),'%') cumulfreq, CONCAT(round((c - LAG(c, 1, 0) over()) * 100,1), '%') freq FROM information_schema.column_statistics, JSON_TABLE(histogram->'$.buckets', '$[*]' COLUMNS(v VARCHAR(60) PATH '$[0]', c double PATH '$[1]')) hist WHERE schema_name = 'dbt3' and table_name = 'customer' and column_name = 'c_mktsegment'; +------------+-----------+-------+ | value | cumulfreq | freq | +------------+-----------+-------+ | AUTOMOBILE | 19.8% | 19.8% | | BUILDING | 39.9% | 20.1% | | FURNITURE | 59.9% | 19.9% | | HOUSEHOLD | 79.9% | 20.1% | | MACHINERY | 100.0% | 20.1% | +------------+-----------+-------+
You can also remove an histogram using the following syntax:
mysql> ANALYZE TABLE customer DROP HISTOGRAM on c_mktsegment; +---------------+-----------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +---------------+-----------+----------+---------------------------------------------------------+ | dbt3.customer | histogram | status | Histogram statistics removed for column 'c_mktsegment'. | +---------------+-----------+----------+---------------------------------------------------------+ 1 row in set (0.00 sec)
Buckets
You noticed that when we create an histogram, we need to specify the amount of buckets. In fact the data is split into a set of buckets containing specific values and their cardinality. If you check back the type of histogram in the previous example you could see that it was singleton:
"histogram-type": "singleton",
This is the best as the cardinality is for a single specific value. If I recreate the histogram using only 2 buckets this time (remember there where 4 distinct value in the column c_mktsegment
):
mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_mktsegment WITH 2 BUCKETS; +---------------+-----------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +---------------+-----------+----------+---------------------------------------------------------+ | dbt3.customer | histogram | status | Histogram statistics created for column 'c_mktsegment'. | +---------------+-----------+----------+---------------------------------------------------------+
If we check the histogram type:
mysql> SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, JSON_PRETTY(HISTOGRAM) FROM information_schema.column_statistics WHERE COLUMN_NAME = 'c_mktsegment'\G *************************** 1. row *************************** SCHEMA_NAME: dbt3 TABLE_NAME: customer COLUMN_NAME: c_mktsegment JSON_PRETTY(HISTOGRAM): { "buckets": [ [ "base64:type254:QVVUT01PQklMRQ==", "base64:type254:RlVSTklUVVJF", 0.5996992690844636, 3 ], [ "base64:type254:SE9VU0VIT0xE", "base64:type254:TUFDSElORVJZ", 1.0, 2 ] ], "data-type": "string", "null-values": 0.0, "collation-id": 255, "last-updated": "2018-03-02 20:42:26.165898", "sampling-rate": 0.6709158000670916, "histogram-type": "equi-height", "number-of-buckets-specified": 2 }
Now the histogram-type is equi-height
, which means that contiguous ranges of values are grouped into buckets such that the number of data items falling into each bucket is the same.
Conclusion
Histograms are useful for columns that are not the first column of an index and that are used in WHERE conditions of queries with JOIN, IN-subqueries or ORDER BY… LIMIT.
Also consider to try using enough buckets to get singleton histograms.
You can find more information on histograms in Erik’s blog post and Øystein’s presentation during FOSDEM.
I wanted to add that histograms are also very powerful when you filter (in the where clause) on a column that is not indexed and you are matching against a value that doesn’t exist. The Optimizer will know very quickly there won’t be any rows matching.
[…] If you are using an old application, no the new authentication plugin doesn’t break your application, until you don’t create a new user for it and not specify an authentication method compatible with your connector. But of course other things, like reserved keywords in this case, can be problematic. This is why an major release upgrade always need to be tested in advance. Not only for schema and syntax compatibility but also for performance as the query execution plan might not be the one you expect event if in most cases the MySQL Optimizer becomes smarter and smarter with the releases and has the support of new features like the histograms. […]
[…] Histograms — гистограммы для анализа плана выполнения запроса. […]
[…] Histograms — гистограммы для анализа плана выполнения запроса. […]
[…] 关于直方图影响查询计划,这篇博客 及 这篇博客 […]
Thank you for this tutorial. Just what I needed!