Apache Superset with MySQL HeatWave

We already saw how to deploy Apache Superset with MySQL HeatWave.

I just released a new version of the Terraform modules, including some configuration parameters for upcoming work related to High Availability.

So let’s see the easiest way to deploy Superset on OCI using MySQL HeatWave.

Quick Deployment

The default deployment is the following:

The fastest and easiest way to deploy this architecture is to click the button below:

Deploy to Oracle Cloud

When you click on the button, you will reach the following screen if you are already connected to OCI’s web console:

You first need to review and accept the Oracle Terms of Use; then you can proceed with the wizard to deploy the Superset architecture:

It’s important to select the latest version of Terraform (1.2.x) and to fill in the required variables. They are already pre-filled, but you can change them:

We will focus on loading the sample data. We will address High Availability in the next section.

Once ready we can create all the resources from the stack:

This will take some time, and the sample data load is also a long process (+/- 49 minutes). When finished, you can find all the useful information in the job’s details:

We can click on the Superset IP link and log in with the Superset Admin Credentials:

Everything is deployed and, as you can see, it was very easy. Apache Superset is ready to be used.

Database High Availability

Now, let’s work on High Availability.

We could not deploy HA directly from the stack because we need a specific configuration setting that is not yet configurable via Terraform: sql_generate_invisible_primary_key.

Let’s see first why we need to enable this setting.
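As a quick illustration of what GIPK (Generated Invisible Primary Keys, available since MySQL 8.0.30) does, here is a small sketch you could run on any MySQL 8.0.30+ instance (the table name t1 is just an example; setting the variable at session level requires the SESSION_VARIABLES_ADMIN or SYSTEM_VARIABLES_ADMIN privilege):

```sql
-- Enable GIPK for this session
SET SESSION sql_generate_invisible_primary_key = ON;

-- Create a table without any primary key
CREATE TABLE t1 (c1 INT);

SHOW CREATE TABLE t1\G
-- The generated definition now includes an invisible primary key:
--   `my_row_id` bigint unsigned NOT NULL AUTO_INCREMENT /*!80023 INVISIBLE */,
--   PRIMARY KEY (`my_row_id`)
```

With this mode enabled server-side, any InnoDB table created without an explicit primary key automatically gets one, which is exactly what Group Replication needs.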

Now that we have our architecture deployed, if we want to migrate to the following architecture, we could just enable HA:

Let’s try:

We click on “Enable” in the High Availability section (we may also be required to change the DB System configuration):

The DB System will configure MySQL Group Replication and provision two additional nodes. The system will be marked as UPDATING:

When done, the DB System will be again ACTIVE:

But we can see that it’s not Highly Available?!… Let’s have a look at the Work Requests:

We can see that the operation failed. Let’s click on it to have more details:

The reason why it failed is now obvious.

Let’s have a look at the database itself to see what those problematic tables are:

SELECT i.TABLE_ID, t.NAME  
FROM INFORMATION_SCHEMA.INNODB_INDEXES i
JOIN INFORMATION_SCHEMA.INNODB_TABLES t 
  ON (i.TABLE_ID = t.TABLE_ID) 
  WHERE i.NAME='GEN_CLUST_INDEX';
+----------+----------------------------------+
| TABLE_ID | NAME                             |
+----------+----------------------------------+
|     1192 | superset/embedded_dashboards     |
|     1212 | superset/wb_health_population    |
|     1213 | superset/birth_names             |
|     1214 | superset/long_lat                |
|     1215 | superset/birth_france_by_region  |
|     1216 | superset/sf_population_polygons  |
|     1217 | superset/flights                 |
|     1218 | superset/bart_lines              |
|     1219 | superset/video_game_sales        |
|     1220 | superset/users_channels          |
|     1221 | superset/users                   |
|     1222 | superset/unicode_test            |
|     1223 | superset/threads                 |
|     1224 | superset/messages                |
|     1225 | superset/exported_stats          |
|     1226 | superset/covid_vaccines          |
|     1227 | superset/cleaned_sales_data      |
|     1228 | superset/channels                |
|     1229 | superset/channel_members         |
|     1230 | superset/FCC@00202018@0020Survey |
+----------+----------------------------------+
20 rows in set (0.0048 sec)

We can see that 19 of the tables are related to the sample data but there is also one default table not having a primary key defined: superset.embedded_dashboards (even if the table is empty).

Having a Primary Key defined in MySQL InnoDB is mandatory when using Group Replication. In fact, it’s always recommended to have a primary key! (see [1], [2], [3] and [4])

So if we want to enable High Availability for Superset, we will need to:

  1. add an invisible primary key to the previous tables
  2. create a new HA config to enable GIPK mode.

Adding an invisible primary key

Let’s create all the statements we need to add an invisible primary key to all problematic tables:

SELECT concat("ALTER TABLE ", replace(t.NAME, '/','.'), 
" add my_id bigint auto_increment invisible primary key first;") statements  
FROM INFORMATION_SCHEMA.INNODB_INDEXES i  
JOIN INFORMATION_SCHEMA.INNODB_TABLES t 
  ON (i.TABLE_ID = t.TABLE_ID)  
WHERE i.NAME='GEN_CLUST_INDEX';
+-----------------------------------------------------------------------------------------------------------+
| statements                                                                                                |
+-----------------------------------------------------------------------------------------------------------+
| ALTER TABLE superset.embedded_dashboards add my_id bigint auto_increment invisible primary key first;     |
| ALTER TABLE superset.wb_health_population add my_id bigint auto_increment invisible primary key first;    |
| ALTER TABLE superset.birth_names add my_id bigint auto_increment invisible primary key first;             |
| ALTER TABLE superset.long_lat add my_id bigint auto_increment invisible primary key first;                |
| ALTER TABLE superset.birth_france_by_region add my_id bigint auto_increment invisible primary key first;  |
| ALTER TABLE superset.sf_population_polygons add my_id bigint auto_increment invisible primary key first;  |
| ALTER TABLE superset.flights add my_id bigint auto_increment invisible primary key first;                 |
| ALTER TABLE superset.bart_lines add my_id bigint auto_increment invisible primary key first;              |
| ALTER TABLE superset.video_game_sales add my_id bigint auto_increment invisible primary key first;        |
| ALTER TABLE superset.users_channels add my_id bigint auto_increment invisible primary key first;          |
| ALTER TABLE superset.users add my_id bigint auto_increment invisible primary key first;                   |
| ALTER TABLE superset.unicode_test add my_id bigint auto_increment invisible primary key first;            |
| ALTER TABLE superset.threads add my_id bigint auto_increment invisible primary key first;                 |
| ALTER TABLE superset.messages add my_id bigint auto_increment invisible primary key first;                |
| ALTER TABLE superset.exported_stats add my_id bigint auto_increment invisible primary key first;          |
| ALTER TABLE superset.covid_vaccines add my_id bigint auto_increment invisible primary key first;          |
| ALTER TABLE superset.cleaned_sales_data add my_id bigint auto_increment invisible primary key first;      |
| ALTER TABLE superset.channels add my_id bigint auto_increment invisible primary key first;                |
| ALTER TABLE superset.channel_members add my_id bigint auto_increment invisible primary key first;         |
| ALTER TABLE superset.FCC@00202018@0020Survey add my_id bigint auto_increment invisible primary key first; |
+-----------------------------------------------------------------------------------------------------------+
20 rows in set (0.0049 sec)

And we paste them one by one in MySQL Shell to modify all tables.

The last statement will not work as generated: in INNODB_TABLES, @0020 is the encoded form of a space, so the real table name is FCC 2018 Survey and it must be quoted with backticks:

ALTER TABLE `superset`.`FCC 2018 Survey` 
      add my_id bigint auto_increment invisible primary key first;

We can do a final check by re-running the query that finds tables without a primary key; it should now return 0 rows.

We could already enable High Availability but in the event of new data being loaded, we have no guarantee that Superset won’t try to recreate a table without PK… and fail.

New HA config creation

As we need to create a new configuration for the shape currently used by our instance, the easiest approach is to find the HA version of the current shape’s default configuration, copy it, and modify the copy:

Now we need to modify our DB System instance to use the newly created configuration:

As soon as the DB instance is using the new configuration, we can enable HA:

This will take longer than the first try. When done, we can see that High Availability is now Enabled:

We can also verify it in performance_schema:
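For example, we can list the members of the group with a standard Group Replication check (assuming we connect to the DB System with MySQL Shell):

```sql
SELECT MEMBER_HOST, MEMBER_ROLE, MEMBER_STATE
  FROM performance_schema.replication_group_members;
-- With HA enabled, this should return 3 members: one PRIMARY and
-- two SECONDARY, all in ONLINE state.
```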

Perfect! Now we have Superset using a Highly Available MySQL DB System in OCI.

MySQL HeatWave as Query Accelerator

Now let’s see if Apache Superset could also benefit from the Query Accelerator HeatWave engine.

We also need to be aware that a HeatWave cluster requires a HeatWave-ready shape (one with HeatWave in its name).

We will have to change the shape if we plan to use HeatWave, and then re-enable HA, to end up with the following architecture:

Before creating a HeatWave cluster, let’s check if there are queries that could benefit from being accelerated. Usually those are the slow queries or those with a high estimated cost (the default cost threshold above which a query is offloaded to HeatWave is 100,000, the value of secondary_engine_cost_threshold).
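We can check the current threshold directly with a standard system variable query:

```sql
SHOW VARIABLES LIKE 'secondary_engine_cost_threshold';
-- By default this returns 100000.000000: queries whose estimated cost
-- exceeds this value are candidates for offload to the secondary engine.
```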

We can use the following query to have a list of potential candidates:

select schema_name, format_pico_time(total_latency) tot_lat,
       exec_count, format_pico_time(total_latency/exec_count) latency_per_call, 
       query_sample_text
  from sys.x$statements_with_runtimes_in_95th_percentile as t1      
  join performance_schema.events_statements_summary_by_digest as t2        
    on t2.digest=t1.digest
 where schema_name ='superset' and query_sample_text like 'select%' 
 order by (total_latency/exec_count) desc limit 7\G
*************************** 1. row ***************************
      schema_name: superset
          tot_lat: 626.85 ms
       exec_count: 1
 latency_per_call: 626.85 ms
query_sample_text: SELECT country_code AS country_code,
       sum(`SP_RUR_TOTL_ZS`) AS `sum__SP_RUR_TOTL_ZS`,
       sum(`SP_RUR_TOTL`) AS `Rural Population`
FROM superset.wb_health_population
WHERE year >= STR_TO_DATE('2014-01-01 00:00:00.000000', '%Y-%m-%d %H:%i:%s.%f')
  AND year < STR_TO_DATE('2014-01-02 00:00:00.000000', '%Y-%m-%d %H:%i:%s.%f')
GROUP BY country_code
LIMIT 5000
*************************** 2. row ***************************
      schema_name: superset
          tot_lat: 513.12 ms
       exec_count: 1
 latency_per_call: 513.12 ms
query_sample_text: SELECT CASE
           WHEN is_software_dev = 0 THEN 'Aspiring Developer'
           WHEN is_software_dev = 1 THEN 'Currently A Developer'
       END AS developer_type,
       COUNT(*) AS count
FROM superset.`FCC 2018 Survey`
GROUP BY CASE
             WHEN is_software_dev = 0 THEN 'Aspiring Developer'
             WHEN is_software_dev = 1 THEN 'Currently A Developer'
         END
ORDER BY count DESC
LIMIT 10000
*************************** 3. row ***************************
      schema_name: superset
          tot_lat: 491.93 ms
       exec_count: 1
 latency_per_call: 491.93 ms
query_sample_text: SELECT CASE
           WHEN is_ethnic_minority = 0 THEN 'No, not an ethnic minority'
           WHEN is_ethnic_minority = 1 THEN 'Yes, an ethnic minority'
           ELSE 'No Answer'
       END AS ethnic_minority,
       gender AS gender,
       COUNT(*) AS count
FROM superset.`FCC 2018 Survey`
WHERE CASE
          WHEN is_ethnic_minority = 0 THEN 'No, not an ethnic minority'
          WHEN is_ethnic_minority = 1 THEN 'Yes, an ethnic minority'
          ELSE 'No Answer'
      END != 'NULL'
  AND gender != 'NULL'
GROUP BY CASE
             WHEN is_ethnic_minority = 0 THEN 'No, not an ethnic minority'
             WHEN is_ethnic_minority = 1 THEN 'Yes, an ethnic minority'
             ELSE 'No Answer'
         END,
         gender
LIMIT 5000
*************************** 4. row ***************************
      schema_name: superset
          tot_lat: 488.80 ms
       exec_count: 1
 latency_per_call: 488.80 ms
query_sample_text: SELECT gender AS gender,
       COUNT(*) AS count
FROM superset.`FCC 2018 Survey`
GROUP BY gender
ORDER BY count DESC
LIMIT 10000
*************************** 5. row ***************************
      schema_name: superset
          tot_lat: 485.92 ms
       exec_count: 1
 latency_per_call: 485.92 ms
query_sample_text: SELECT CASE
           WHEN is_ethnic_minority = 0 THEN 'No, not an ethnic minority'
           WHEN is_ethnic_minority = 1 THEN 'Yes, an ethnic minority'
           ELSE 'No Answer'
       END AS ethnic_minority
FROM superset.`FCC 2018 Survey`
GROUP BY CASE
             WHEN is_ethnic_minority = 0 THEN 'No, not an ethnic minority'
             WHEN is_ethnic_minority = 1 THEN 'Yes, an ethnic minority'
             ELSE 'No Answer'
         END
LIMIT 1000
*************************** 6. row ***************************
      schema_name: superset
          tot_lat: 374.23 ms
       exec_count: 1
 latency_per_call: 374.23 ms
query_sample_text: SELECT CASE
           WHEN is_ethnic_minority = 0 THEN 'No, not an ethnic minority'
           WHEN is_ethnic_minority = 1 THEN 'Yes, an ethnic minority'
           ELSE 'No Answer'
       END AS ethnic_minority,
       COUNT(*) AS count
FROM superset.`FCC 2018 Survey`
GROUP BY CASE
             WHEN is_ethnic_minority = 0 THEN 'No, not an ethnic minority'
             WHEN is_ethnic_minority = 1 THEN 'Yes, an ethnic minority'
             ELSE 'No Answer'
         END
ORDER BY count DESC
LIMIT 10000
*************************** 7. row ***************************
      schema_name: superset
          tot_lat: 97.84 ms
       exec_count: 2
 latency_per_call: 48.92 ms
query_sample_text: SELECT channel_1 AS channel_1,        
                          channel_2 AS channel_2,
                          sum(cnt) AS `SUM(cnt)` 
                   FROM   (
                       SELECT uc1.name as channel_1,
                              uc2.name as channel_2,
                              count(*) AS cnt    
                       FROM users_channels uc1    
                       JOIN users_channels uc2 ON uc1.user_id = uc2.user_id
                   GROUP BY uc1.name, uc2.name    
                     HAVING uc1.name <> uc2.name) AS virtual_table 
                   GROUP BY channel_1,
                            channel_2 LIMIT 1000

Let’s have a look at the cost of these queries:

\P grep query_cost
Pager has been set to 'grep query_cost'.

EXPLAIN format=json SELECT country_code AS country_code,
       sum(`SP_RUR_TOTL_ZS`) AS `sum__SP_RUR_TOTL_ZS`,
       sum(`SP_RUR_TOTL`) AS `Rural Population`
FROM superset.wb_health_population
WHERE year >= STR_TO_DATE('2014-01-01 00:00:00.000000', '%Y-%m-%d %H:%i:%s.%f')
  AND year < STR_TO_DATE('2014-01-02 00:00:00.000000', '%Y-%m-%d %H:%i:%s.%f')
GROUP BY country_code
LIMIT 5000;

 "query_cost": "2576.20"

EXPLAIN format=json SELECT CASE
           WHEN is_software_dev = 0 THEN 'Aspiring Developer'
           WHEN is_software_dev = 1 THEN 'Currently A Developer'
       END AS developer_type,
       COUNT(*) AS count
FROM superset.`FCC 2018 Survey`
GROUP BY CASE
             WHEN is_software_dev = 0 THEN 'Aspiring Developer'
             WHEN is_software_dev = 1 THEN 'Currently A Developer'
         END
ORDER BY count DESC
LIMIT 10000

"query_cost": "3736.20"

EXPLAIN format=json SELECT CASE
           WHEN is_ethnic_minority = 0 THEN 'No, not an ethnic minority'
           WHEN is_ethnic_minority = 1 THEN 'Yes, an ethnic minority'
           ELSE 'No Answer'
       END AS ethnic_minority,
       gender AS gender,
       COUNT(*) AS count
FROM superset.`FCC 2018 Survey`
WHERE CASE
          WHEN is_ethnic_minority = 0 THEN 'No, not an ethnic minority'
          WHEN is_ethnic_minority = 1 THEN 'Yes, an ethnic minority'
          ELSE 'No Answer'
      END != 'NULL'
  AND gender != 'NULL'
GROUP BY CASE
             WHEN is_ethnic_minority = 0 THEN 'No, not an ethnic minority'
             WHEN is_ethnic_minority = 1 THEN 'Yes, an ethnic minority'
             ELSE 'No Answer'
         END,
         gender
LIMIT 5000

"query_cost": "3736.20"

EXPLAIN format=json SELECT channel_1 AS channel_1,        
                          channel_2 AS channel_2,
                          sum(cnt) AS `SUM(cnt)` 
                   FROM   (
                       SELECT uc1.name as channel_1,
                              uc2.name as channel_2,
                              count(*) AS cnt    
                       FROM users_channels uc1    
                       JOIN users_channels uc2 ON uc1.user_id = uc2.user_id
                   GROUP BY uc1.name, uc2.name    
                     HAVING uc1.name <> uc2.name) AS virtual_table 
                   GROUP BY channel_1,
                            channel_2 LIMIT 1000\G

"query_cost": "823560.21"

After browsing some dashboards of the sample dataset, we can see that most queries are already fast but the last one will certainly benefit from MySQL HeatWave’s Query Accelerator… let’s enable HeatWave Cluster and test:

Now let’s load the Superset database into the HeatWave Cluster:

CALL sys.heatwave_load(JSON_ARRAY('superset'), NULL);
....
+-------------------------------------------------------------------------------+
| LOAD SUMMARY                                                                  |
+-------------------------------------------------------------------------------+
|                                                                               |
| SCHEMA                          TABLES       TABLES      COLUMNS         LOAD |
| NAME                            LOADED       FAILED       LOADED     DURATION |
| ------                          ------       ------      -------     -------- |
| `superset`                          85            0         1377      19.88 s |
|                                                                               |
+-------------------------------------------------------------------------------+

Now I will just browse the different dashboards in Superset again…

Then we check again:

show status like 'rapid_query_offload_count';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| rapid_query_offload_count | 1     |
+---------------------------+-------+
1 row in set (0.0010 sec)

There is only one, and even if I keep browsing the dashboards, it seems no new queries are offloaded to HeatWave.

I loaded extra data (airportdb), and added a new query I knew would benefit from HeatWave:

I could see that when I was running it in Superset, the value of rapid_query_offload_count didn’t increase…

But when I was running it manually in MySQL Shell, HeatWave (RAPID) secondary engine was used:

EXPLAIN SELECT airlinename, sum(sold_seat)/sum(capacity) 'load factor' 
          FROM ( 
              SELECT  flight_id, COUNT(*) sold_seat 
                FROM booking GROUP BY flight_id) seat_sold 
                JOIN flight using (flight_id) 
                JOIN airplane USING (airplane_id) 
                JOIN airline ON airline.airline_id = flight.airline_id 
            GROUP BY airlinename ORDER BY airlinename LIMIT 1001\G
*************************** 1. row ***************************
           id: 1
  select_type: NONE
        table: NULL
   partitions: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
     filtered: NULL
        Extra: Using secondary engine RAPID. Use EXPLAIN FORMAT=TREE to show the plan.

And of course, the counter rapid_query_offload_count also increased…

I checked the Query Execution Plan (EXPLAIN) directly from Superset:

Ah !? The query is not offloaded to HeatWave !

My colleague Olivier provided me with the solution: HeatWave requires autocommit to offload queries, and this is not the default in SQLAlchemy, which Superset uses.

We just need to add {"isolation_level":"AUTOCOMMIT"} in the extra parameters for the Database Connection:
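For reference, a sketch of what the connection’s Extra field can look like (the engine_params nesting is how Superset passes options to SQLAlchemy’s create_engine; verify against your Superset version’s documentation):

```json
{
  "engine_params": {
    "isolation_level": "AUTOCOMMIT"
  }
}
```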

And we can verify again the query and its Query Execution Plan in Superset:

The query is already much faster!

And this is because it is now correctly accelerated by HeatWave!

After the database connection change and after browsing several dashboards, we can already notice that Superset benefits from HeatWave:

show status like 'rapid_query_offload_count';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| rapid_query_offload_count | 16    |
+---------------------------+-------+
1 row in set (0.0010 sec)

And we can see that 888MB of data were scanned by HeatWave:

show global status like 'hw_data%';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| hw_data_scanned | 888   |
+-----------------+-------+
1 row in set (0.0011 sec)

For more information about how to manage HeatWave engine, check this article.

Conclusion

It’s very easy to deploy Apache Superset on OCI with MySQL HeatWave Database Service as the backend. This combination creates a powerful ecosystem for data analytics and visualization. We also saw how to upgrade the architecture to use High Availability and/or a HeatWave Cluster to achieve great performance, scalability and security for any kind of data exploration and visualization.


As MySQL Community Manager, I am an employee of Oracle and the views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

You can find articles I wrote on Oracle’s blog.