MySQL data archiving: another use for HeatWave Lakehouse

The ability to store data in Object Storage and retrieve it dynamically when needed is a notable advantage of Lakehouse for managing MySQL historical data that we would like to archive.

Let’s illustrate this with the following table:

CREATE TABLE `temperature_history` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `time_stamp` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
  `device_id` varchar(30) DEFAULT NULL,
  `value` decimal(5,2) NOT NULL DEFAULT '0.00',
  `day_date` date GENERATED ALWAYS AS (cast(`time_stamp` as date)) STORED NOT NULL,
  PRIMARY KEY (`id`,`day_date`),
  KEY `device_id_idx` (`device_id`)
) ENGINE=InnoDB AUTO_INCREMENT=129428417 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci SECONDARY_ENGINE=RAPID
/*!50500 PARTITION BY RANGE  COLUMNS(day_date)
(PARTITION p0_before2023_11 VALUES LESS THAN ('2023-11-01') ENGINE = InnoDB,
 PARTITION p2023_12 VALUES LESS THAN ('2023-12-01') ENGINE = InnoDB,
 PARTITION p2024_01 VALUES LESS THAN ('2024-01-01') ENGINE = InnoDB,
 PARTITION p2024_02 VALUES LESS THAN ('2024-02-01') ENGINE = InnoDB,
 PARTITION p2024_03 VALUES LESS THAN ('2024-03-01') ENGINE = InnoDB,
 PARTITION p2024_04 VALUES LESS THAN ('2024-04-01') ENGINE = InnoDB,
 PARTITION p9_future VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB) */

You can notice that the table is also loaded in the MySQL HeatWave Cluster (see this previous post).
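
As a reminder, and assuming the standard procedure was used here, a table is typically loaded into the HeatWave cluster either with Auto Parallel Load or with an explicit SECONDARY_LOAD:

-- Auto Parallel Load for the whole schema
CALL sys.heatwave_load(JSON_ARRAY('piday'), NULL);

-- or load this single table explicitly
ALTER TABLE piday.temperature_history SECONDARY_LOAD;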

This table is full of records that were generated by IoT devices:

select count(*) from temperature_history;
+-----------+
| count(*)  |
+-----------+
| 129428416 |
+-----------+
1 row in set (0.0401 sec)

Mind the response time 😉

You can also notice that we have partitions. The plan to save disk space and archive the data on cheap storage is the following:

  1. once a partition is not required anymore, we dump its content to an Object Storage bucket
  2. we drop the partition
  3. if it’s the first time, we create a second archive table from the HeatWave load report
  4. if needed, we load/unload the data on demand
  5. we can create a new future partition (optional)

Dumping a Partition to Object Storage

Now the partition with data before December can be archived. Let’s see how much data this represents:

select count(*) from temperature_history partition(p2023_12);
+----------+
| count(*) |
+----------+
|  1894194 |
+----------+
1 row in set (0.0373 sec) 

Object Storage Bucket & PAR

Now we need to create a bucket where we will archive the data for the temperature_history table.

We will use a Pre-Authenticated Request (PAR) to write and read data in Object Storage. It’s important to allow reads and writes as well as the listing of the objects.

This is the PAR’s URL we will use in the next steps.
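
If you prefer the command line to the OCI Console, the same bucket and PAR can be created with the OCI CLI. The sketch below is only illustrative: the compartment OCID, PAR name and expiry date are placeholders, and you should check the options against the current CLI reference.

# create the archive bucket (compartment OCID is a placeholder)
oci os bucket create --compartment-id <compartment_ocid> --name temperature_history_archive

# create a bucket-level PAR allowing reads, writes and object listing
oci os preauth-request create --bucket-name temperature_history_archive \
    --name temperature_history_archive_par \
    --access-type AnyObjectReadWrite \
    --bucket-listing-action ListObjects \
    --time-expires "2025-01-01T00:00:00Z"

The CLI returns the PAR’s access URI in its output; prepend your Object Storage endpoint to build the full URL used below.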

Data Transfer

We use the MySQL Shell dumpTables() utility to copy the data from the partition to Object Storage using the PAR URL:

util.dumpTables("piday", ["temperature_history"],
     "https://<namespace>.objectstorage.<region>.oci.customer-oci.com/p/<random>/n/<namespave>/b/temperature_history_archive/o/",
     {"dialect": "csv", "compression": "none", "partitions":
                 {"piday.temperature_history": ["p2023_12"]}
     })

It’s very important to specify not to compress the files, as they are compressed by default.

From the OCI Console, we can see all the generated files; we need to keep only the csv ones.

If you have a very large table with a lot of files (chunks), deleting all the .idx files manually is a long process; you can then use a tool like Fileon – S3 Browser.
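
Alternatively, the OCI CLI can remove the index files in bulk. This is just a sketch, under the assumption that your CLI version supports include filters for bulk-delete; run it with --dry-run first to verify what would be removed.

# preview which objects would be deleted (assumes --include is available in your CLI version)
oci os object bulk-delete --bucket-name temperature_history_archive --include "*.idx" --dry-run

# then run it for real
oci os object bulk-delete --bucket-name temperature_history_archive --include "*.idx" --force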

Partition Management

Now that the data is stored in Object Storage, we can delete the partition:

alter table temperature_history drop partition p2023_12;

As we are working on the partitions, we can already add an extra one (optional point 5 above) using the following syntax:

alter table temperature_history reorganize partition p9_future 
      into (
            partition p2024_05 values less than ('2024-05-01'),
            partition p9_future values less than (maxvalue)
      );
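
To verify the result of these partition operations, you can list the remaining partitions and their approximate row counts. The query below is just a convenience check, not part of the original workflow:

SELECT PARTITION_NAME, PARTITION_DESCRIPTION, TABLE_ROWS
  FROM information_schema.PARTITIONS
 WHERE TABLE_SCHEMA = 'piday'
   AND TABLE_NAME = 'temperature_history'
 ORDER BY PARTITION_ORDINAL_POSITION;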

Archive Table Creation

The first time, we need to create the archive table into which we will load the Object Storage data using Lakehouse.

Lakehouse

We need to have a HeatWave Cluster with Lakehouse enabled.

We need to prepare the system with the table and partition we want to load. For this operation, we need to set 3 variables:

  • db_list: list of the databases we will load
  • dl_tables: list of the tables we will load and the information related to the format and the Object Storage location
  • options: preparation of the arguments for the heatwave_load procedure. We also parse and include the dl_tables variable

db_list

We start by defining the db_list. In our case it’s easy as we only use one database: piday

SET @db_list = '["piday"]';

dl_tables

We need to provide information related to the table we want to create and specify where and how the table is stored:

SET @dl_tables='[{"db_name": "piday","tables": 
    [
     {"table_name": "temperature_history_archive",
      "dialect": 
        {"format": "csv", 
          "field_delimiter": "\,", 
          "record_delimiter": "\\n"
        },
      "file": [{"par": "https://...<the_par_url>..."}]
     }
    ]
}]';

options

We can now generate the options variable that we will use as argument for our procedure:

SET @options = JSON_OBJECT('mode', 'dryrun', 'policy',
 'disable_unsupported_columns', 
 'external_tables', CAST(@dl_tables AS JSON));

Auto Parallel Load

Lakehouse has the capability to create the table for us and load the data into it. But as we want to use specific column names instead of the generic ones, we will use the report to create the table and then load the data manually, in two separate steps. This is why we specified dryrun as mode in the @options definition:

call sys.heatwave_load(@db_list, @options);

We can now retrieve the table’s creation statement and manually modify the name of the columns while creating the table:

 SELECT log->>"$.sql" AS "Load Script"
 FROM sys.heatwave_autopilot_report WHERE type = "sql" ORDER BY id\G

Let’s replace all the col_X columns with the field names we want:

CREATE TABLE `piday`.`temperature_history_archive`( 
   `id` int unsigned NOT NULL, `time_stamp` timestamp(0) NOT NULL,
   `device_id` varchar(28) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=VARLEN',
   `value` decimal(4,2) NOT NULL
) ENGINE=lakehouse SECONDARY_ENGINE=RAPID 
  ENGINE_ATTRIBUTE='{"file": [{"par": "https://...<PAR>..."}], 
"dialect": {"format": "csv", "field_delimiter": ",", "record_delimiter": "\\n"}}';

Once created, we can load the data to the secondary engine:

ALTER TABLE /*+ AUTOPILOT_DISABLE_CHECK */ 
   `piday`.`temperature_history_archive` SECONDARY_LOAD;

We can verify that the data was loaded correctly:

select count(*) from temperature_history_archive;
+----------+
| count(*) |
+----------+
|  1894194 |
+----------+
1 row in set (0.0299 sec)
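
Since the archive table lives next to the production one, both can be queried together when needed. The query below is a hypothetical example, not part of the original workflow, that combines live and archived measurements; it requires the archive table to be loaded in Lakehouse:

SELECT device_id,
       MIN(time_stamp) AS first_reading,
       MAX(time_stamp) AS last_reading,
       AVG(value)      AS avg_value
  FROM (
        SELECT device_id, time_stamp, value FROM temperature_history
        UNION ALL
        SELECT device_id, time_stamp, value FROM temperature_history_archive
       ) AS all_readings
 GROUP BY device_id;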

And later?

Now let’s move forward in time and let’s assume we can archive the data present in the partition p2024_01:

select count(*) from temperature_history partition(p2024_01);
+----------+
| count(*) |
+----------+
| 50034435 |
+----------+
1 row in set (0.1562 sec)

We need to dump the data to our Object Storage bucket; unfortunately, we need to use a different folder, as the dump needs an empty destination. We will use a temporary folder in our bucket:

util.dumpTables("piday",["temperature_history"],"https://<PAR URL>/tmp/",
 {"dialect": "csv","compression":"none",
  "partitions": {"piday.temperature_history": ["p2024_01"]}})
Acquiring global read lock
Global read lock acquired
Initializing - done 
1 tables and 0 views will be dumped.
Gathering information - done 
All transactions have been started
Locking instance for backup
Global read lock has been released
Writing global DDL files
Running data dump using 4 threads.
NOTE: Progress information uses estimated values and may not be accurate.
Writing schema metadata - done       
Writing DDL - done       
Writing table metadata - done       
Starting data dump
40% (50.03M rows / ~124.50M rows), 1.63M rows/s, 104.01 MB/s                 
Dump duration: 00:00:24s
Total duration: 00:00:24s
Schemas dumped: 1
Tables dumped: 1
Data size: 3.44 GB
Rows written: 50034435
Bytes written: 3.44 GB
Average throughput: 141.20 MB/s

This produces a lot of files.

As we only need the csv ones, I will use a fuse module to mount the Object Storage bucket on my system and use the usual commands to move and delete files (see this post on how to set up s3fs-fuse).

[fred@dell ~] $ mkdir mnt
[fred@dell ~] $ s3fs temperature_history_archive ~/mnt/ -o endpoint=us-ashburn-1 \
-o passwd_file=~/.passwd-ocifs \
-o url=https://xxx.compat.objectstorage.us-ashburn-1.oraclecloud.com/ \
-onomultipart -o use_path_request_style
[fred@dell ~] $ ls mnt
piday@temperature_history@p2023_12@0.csv   tmp
piday@temperature_history@p2023_12@@1.csv
[fred@dell ~/mnt] $ mv tmp/*.csv .
[fred@dell ~/mnt] $ rm -rf tmp

We can now unload and load the data back in Lakehouse:

ALTER TABLE /*+ AUTOPILOT_DISABLE_CHECK */ 
   `piday`.`temperature_history_archive` SECONDARY_UNLOAD;

ALTER TABLE /*+ AUTOPILOT_DISABLE_CHECK */ 
   `piday`.`temperature_history_archive` SECONDARY_LOAD;

select count(*) from temperature_history_archive;
+----------+
| count(*) |
+----------+
| 51928629 |
+----------+
1 row in set (0.0244 sec)

We can safely remove the partition from the production table:

alter table temperature_history drop partition p2024_01;

If we don’t need the archive data, we can simply unload it again (and load it back later):

ALTER TABLE /*+ AUTOPILOT_DISABLE_CHECK */ 
   `piday`.`temperature_history_archive` SECONDARY_UNLOAD;

select count(*) from temperature_history_archive;
ERROR: 3877: Secondary engine operation failed. 
Reason: "Table `piday`.`temperature_history_archive` is not loaded in HeatWave"

Conclusion

In this article, we explored the advantages of using HeatWave Lakehouse to archive MySQL data in Object Storage and reload it as needed. It is worth mentioning that the entire archived dataset, consisting of 51 million records, was loaded from Object Storage in an impressive 26.58 seconds on my MySQL HeatWave OCI instance.

This can help save disk space on your MySQL HeatWave instance and improve performance by cleaning up large tables.

The ability to bypass the creation of the json and idx files, and the possibility to dump data to a non-empty destination, would be two very nice features for the MySQL Shell dump utility.

Enjoy archiving your data in MySQL, HeatWave and Lakehouse!

