The ability to store data on Object Storage and retrieve it dynamically when needed is a notable advantage of Lakehouse when managing MySQL historical data that we want to archive.
Let’s illustrate this with the following table:
CREATE TABLE `temperature_history` (
`id` bigint unsigned NOT NULL AUTO_INCREMENT,
`time_stamp` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`device_id` varchar(30) DEFAULT NULL,
`value` decimal(5,2) NOT NULL DEFAULT '0.00',
`day_date` date GENERATED ALWAYS AS (cast(`time_stamp` as date)) STORED NOT NULL,
PRIMARY KEY (`id`,`day_date`),
KEY `device_id_idx` (`device_id`)
) ENGINE=InnoDB AUTO_INCREMENT=129428417 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci SECONDARY_ENGINE=RAPID
/*!50500 PARTITION BY RANGE COLUMNS(day_date)
(PARTITION p0_before2023_11 VALUES LESS THAN ('2023-11-01') ENGINE = InnoDB,
PARTITION p2023_12 VALUES LESS THAN ('2023-12-01') ENGINE = InnoDB,
PARTITION p2024_01 VALUES LESS THAN ('2024-01-01') ENGINE = InnoDB,
PARTITION p2024_02 VALUES LESS THAN ('2024-02-01') ENGINE = InnoDB,
PARTITION p2024_03 VALUES LESS THAN ('2024-03-01') ENGINE = InnoDB,
PARTITION p2024_04 VALUES LESS THAN ('2024-04-01') ENGINE = InnoDB,
PARTITION p9_future VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB) */
You can notice that the table is also loaded in the MySQL HeatWave Cluster (see this previous post).
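To double-check that the table is indeed loaded in the HeatWave Cluster, we can query the Performance Schema RAPID views (a minimal sketch, assuming the rpd_table_id / rpd_tables views documented for HeatWave):
-- hedged sketch: verify the load status of the table in the RAPID secondary engine
select i.SCHEMA_NAME, i.TABLE_NAME, t.LOAD_STATUS, t.NROWS
  from performance_schema.rpd_table_id i
  join performance_schema.rpd_tables t on t.ID = i.ID
 where i.SCHEMA_NAME = 'piday' and i.TABLE_NAME = 'temperature_history';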
This table is full of records that were generated by IoT devices:
select count(*) from temperature_history;
+-----------+
| count(*)  |
+-----------+
| 129428416 |
+-----------+
1 row in set (0.0401 sec)
Mind the response time 😉
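That response time over 129 million rows is possible because the COUNT(*) is offloaded to the HeatWave secondary engine; a quick EXPLAIN should confirm it (the Extra column mentions the secondary engine when the query is offloaded):
explain select count(*) from temperature_history\G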
You can also notice that we have partitions. The plan to save disk space and archive the data on cheap storage is the following (a sketch of the whole cycle follows the list):
- once a partition is not required anymore, we dump its content to an Object Storage bucket
- we drop the partition
- if it’s the first time, we create the archive table from the HeatWave Auto Parallel Load report
- if needed, we load/unload the data on demand
- we can create a new future partition (optional)
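The whole monthly cycle, run from MySQL Shell, roughly looks like this (a hedged sketch; the PAR URL and partition name are illustrative, and each step is detailed in the rest of this post):
// hedged sketch of one archiving iteration, from MySQL Shell in JS mode
const par = "https://<PAR URL>/";   // Pre-Authenticated Request of the archive bucket
// 1. dump the obsolete partition to Object Storage as uncompressed CSV
util.dumpTables("piday", ["temperature_history"], par,
    {dialect: "csv", compression: "none",
     partitions: {"piday.temperature_history": ["p2023_12"]}});
// 2. drop the partition from the production table (assumes an open session)
session.runSql("ALTER TABLE piday.temperature_history DROP PARTITION p2023_12");
// 3. first time only: create the archive table from the Auto Parallel Load report,
//    then load/unload it on demand with SECONDARY_LOAD / SECONDARY_UNLOAD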
Dumping a Partition to Object Storage
Now the partition with data before December can be archived. Let’s see how much data this represents:
select count(*) from temperature_history partition(p2023_12);
+----------+
| count(*) |
+----------+
|  1894194 |
+----------+
1 row in set (0.0373 sec)
Object Storage Bucket & PAR
Now we need to create a bucket where we will archive the data for the temperature_history table:
We will use a Pre-Authenticated Request (PAR) to write and read data in Object Storage:
It’s important to allow reads and writes, as well as the listing of the objects:
And this is the PAR’s url we will use:
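If you prefer the command line to the OCI Console, a bucket and a PAR with the same permissions can also be created with the OCI CLI (a hedged sketch; the compartment OCID, the PAR name and the expiry date are placeholders):
# create the bucket that will hold the archived CSV files
oci os bucket create --name temperature_history_archive \
    --compartment-id ocid1.compartment.oc1..aaaa...

# create a PAR allowing reads, writes and object listing on the whole bucket
oci os preauth-request create --bucket-name temperature_history_archive \
    --name temperature_history_archive_par \
    --access-type AnyObjectReadWrite \
    --bucket-listing-action ListObjects \
    --time-expires "2025-12-31T23:59:59Z"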
Data Transfer
We use the MySQL Shell dumpTables() utility to copy the data from the partition to Object Storage using the PAR URL:
util.dumpTables("piday", ["temperature_history"],
"https://<namespace>.objectstorage.<region>.oci.customer-oci.com/p/<random>/n/<namespave>/b/temperature_history_archive/o/",
{"dialect": "csv", "compression": "none", "partitions":
{"piday.temperature_history": ["p2023_12"]}
})
It’s very important to explicitly disable compression, as the dump files are compressed by default.
From the OCI Console, we can see all the generated files; we only need to keep the csv ones:
If you have a very large table with a lot of files (chunks), deleting all the .idx files from the Console is a long process; you can then use a tool like Fileon – S3 Browser:
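Another option, if the OCI CLI is configured, is to bulk-delete everything that is not a csv file directly (a hedged sketch; run it with --dry-run first to review what would be removed):
# preview the objects that would be deleted (everything but the *.csv chunks)
oci os object bulk-delete --bucket-name temperature_history_archive \
    --exclude "*.csv" --dry-run
# then actually delete them
oci os object bulk-delete --bucket-name temperature_history_archive \
    --exclude "*.csv" --force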
Partition Management
Now that the data is stored in Object Storage, we can drop the partition:
alter table temperature_history drop partition p2023_12;
As we are working on the partitions, we can already add an extra future one (the optional last point of the plan above) using the following syntax:
alter table temperature_history reorganize partition p9_future
into (
partition p2024_05 values less than ('2024-05-01'),
partition p9_future values less than (maxvalue)
);
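We can quickly verify the resulting partition layout (and the estimated row counts) from INFORMATION_SCHEMA:
select partition_name, partition_description, table_rows
  from information_schema.partitions
 where table_schema = 'piday' and table_name = 'temperature_history'
 order by partition_ordinal_position;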
Archive Table Creation
The first time, we need to create the archive table into which we will load the Object Storage data used by Lakehouse.
Lakehouse
We need to have a HeatWave Cluster with Lakehouse enabled:
We need to prepare the system with the table and the data we want to load. For this operation, we need to set 3 variables:
- db_list: the list of databases we will load
- dl_tables: the list of tables we will load, with the information related to the format and the Object Storage location
- options: the arguments for the heatwave_load procedure; we also parse and include the dl_tables variable
db_list
We start by defining the db_list. In our case it’s easy as we only use one database: piday
SET @db_list = '["piday"]';
dl_tables
We need to provide the information related to the table we want to create and specify where and how the data is stored:
SET @dl_tables='[{"db_name": "piday","tables":
[
{"table_name": "temperature_history_archive",
"dialect":
{"format": "csv",
"field_delimiter": "\,",
"record_delimiter": "\\n"
},
"file": [{"par": "https://...<the_par_url>..."}]
}
]
}]';
options
We can now generate the options variable that we will use as argument for our procedure:
SET @options = JSON_OBJECT('mode', 'dryrun', 'policy',
'disable_unsupported_columns',
'external_tables', CAST(@dl_tables AS JSON));
Auto Parallel Load
Lakehouse has the capability to create the table for us and load the data into it. But as we want to explicitly use specific column names instead of the generic ones, we will use the report to create the table and then load the data manually, in two separate steps. This is why we specified dryrun as the mode in the @options definition:
call sys.heatwave_load(@db_list, @options);
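Before retrieving the generated DDL, we can check the same Autopilot report for possible warnings or errors:
select id, type, log
  from sys.heatwave_autopilot_report
 where type in ("warn", "error")
 order by id;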
We can now retrieve the table’s creation statement and manually modify the name of the columns while creating the table:
SELECT log->>"$.sql" AS "Load Script"
FROM sys.heatwave_autopilot_report WHERE type = "sql" ORDER BY id\G
Let’s replace all the col_X columns with the field names we want:
CREATE TABLE `piday`.`temperature_history_archive`(
`id` int unsigned NOT NULL, `time_stamp` timestamp(0) NOT NULL,
`device_id` varchar(28) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=VARLEN',
`value` decimal(4,2) NOT NULL
) ENGINE=lakehouse SECONDARY_ENGINE=RAPID
ENGINE_ATTRIBUTE='{"file": [{"par": "https://...<PAR>..."}],
"dialect": {"format": "csv", "field_delimiter": ",", "record_delimiter": "\\n"}}';
Once created, we can load the data to the secondary engine:
ALTER TABLE /*+ AUTOPILOT_DISABLE_CHECK */
`piday`.`temperature_history_archive` SECONDARY_LOAD;
We can verify that the data was loaded correctly:
select count(*) from temperature_history_archive;
+----------+
| count(*) |
+----------+
|  1894194 |
+----------+
1 row in set (0.0299 sec)
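Once loaded in Lakehouse, the archive behaves like any other table and can even be combined with the production one in a single query. A minimal sketch (the device id sensor-01 is just an illustrative value):
select 'archive' as source, min(time_stamp), max(time_stamp), avg(`value`)
  from temperature_history_archive
 where device_id = 'sensor-01'   -- illustrative device id
union all
select 'live', min(time_stamp), max(time_stamp), avg(`value`)
  from temperature_history
 where device_id = 'sensor-01';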
And later?
Now let’s move forward in time and assume we can archive the data present in the partition p2024_01:
select count(*) from temperature_history partition(p2024_01);
+----------+
| count(*) |
+----------+
| 50034435 |
+----------+
1 row in set (0.1562 sec)
We need to dump the data to our Object Storage bucket; unfortunately, we have to use a different folder as the dump requires an empty destination. We will use a temporary folder in our bucket:
util.dumpTables("piday",["temperature_history"],"https://<PAR URL>/tmp/",
{"dialect": "csv","compression":"none",
"partitions": {"piday.temperature_history": ["p2024_01"]}})
Acquiring global read lock
Global read lock acquired
Initializing - done
1 tables and 0 views will be dumped.
Gathering information - done
All transactions have been started
Locking instance for backup
Global read lock has been released
Writing global DDL files
Running data dump using 4 threads.
NOTE: Progress information uses estimated values and may not be accurate.
Writing schema metadata - done
Writing DDL - done
Writing table metadata - done
Starting data dump
40% (50.03M rows / ~124.50M rows), 1.63M rows/s, 104.01 MB/s
Dump duration: 00:00:24s
Total duration: 00:00:24s
Schemas dumped: 1
Tables dumped: 1
Data size: 3.44 GB
Rows written: 50034435
Bytes written: 3.44 GB
Average throughput: 141.20 MB/s
This produces a lot of files:
As we only need the csv ones, I will use a fuse module to mount the Object Storage bucket on my system and use the usual commands to move and delete files (see this post on how to set up s3fs-fuse).
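As a reminder, s3fs authenticates with a Customer Secret Key pair stored in a plain text file (a hedged sketch; the key values below are placeholders generated from the OCI Console):
# ~/.passwd-ocifs holds the Customer Secret Key pair (placeholder values)
echo 'MY_ACCESS_KEY_ID:MY_SECRET_ACCESS_KEY' > ~/.passwd-ocifs
chmod 600 ~/.passwd-ocifs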
[fred@dell ~] $ mkdir mnt
[fred@dell ~] $ s3fs temperature_history_archive ~/mnt/ -o endpoint=us-ashburn-1 \
-o passwd_file=~/.passwd-ocifs \
-o url=https://xxx.compat.objectstorage.us-ashburn-1.oraclecloud.com/ \
-onomultipart -o use_path_request_style
[fred@dell ~] $ ls mnt
piday@temperature_history@p2023_12@0.csv tmp
piday@temperature_history@p2023_12@@1.csv
[fred@dell ~/mnt] $ mv tmp/*.csv .
[fred@dell ~/mnt] $ rm -rf tmp
We can now unload and load the data back in Lakehouse:
ALTER TABLE /*+ AUTOPILOT_DISABLE_CHECK */
`piday`.`temperature_history_archive` SECONDARY_UNLOAD;
ALTER TABLE /*+ AUTOPILOT_DISABLE_CHECK */
`piday`.`temperature_history_archive` SECONDARY_LOAD;
select count(*) from temperature_history_archive;
+----------+
| count(*) |
+----------+
| 51928629 |
+----------+
1 row in set (0.0244 sec)
We can safely remove the partition from the production table:
alter table temperature_history drop partition p2024_01;
If we don’t need the archive data, we can simply unload it again (and load it back later):
ALTER TABLE /*+ AUTOPILOT_DISABLE_CHECK */
`piday`.`temperature_history_archive` SECONDARY_UNLOAD;
select count(*) from temperature_history_archive;
ERROR: 3877: Secondary engine operation failed.
Reason: "Table `piday`.`temperature_history_archive` is not loaded in HeatWave"
Conclusion
In this article, we explored the advantages of utilizing HeatWave Lakehouse to effectively store archived MySQL data and reload it as needed. It is worth mentioning that the entire archived dataset, consisting of 51 million records, was loaded from Object Storage within a relatively impressive time frame of 26.58 seconds on my MySQL HeatWave OCI instance.
This can help save disk space on your MySQL HeatWave instance and increase performance by cleaning up large tables.
Bypassing the creation of the json and idx files, and the possibility to dump data to a non-empty destination, would be two very nice features for the MySQL Shell dump utility.
Enjoy archiving your data in MySQL, HeatWave and Lakehouse!