Using MySQL Database Service in OCI – Part 2: Importing Data to a MySQL DB System

This post is the second of a series of articles dedicated to MySQL Database Service (MDS):

The goal of this series is to get familiar with MDS, HeatWave and useful MySQL Shell utilities.

Before we start, I would like to highlight that the fastest and recommended way to import data into a MySQL DB System in OCI is to create a parallel dump with MySQL Shell to Object Storage and to load it during the creation of the MDS instance. MDS enables special features to increase ingestion speed when the load happens at instance creation:

You can find some more information about this process in this article.

Getting the data

As our goal is to get familiar with MDS and HeatWave, we need a substantial amount of data. If you take the Oracle University courses related to MySQL HeatWave, you will already have played with the airportdb database.

We will then use another source of data but stay in the same aviation domain 😉

We will get the data from https://www.transportation.gov/. Several other databases already use this dataset as an example (TiDB, ClickHouse, …).

On our compute instance, we will start by downloading all the data (7.4G compressed):

for s in `seq 1987 2021`
do
   for m in `seq 1 12`
   do
    wget -nc --no-check-certificate \
    https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_${s}_${m}.zip
   done
done

Of course, you can specify a smaller range if you want less data.

Importing the data – creating the table

It’s time to create a database/schema and a table to import all those files:

[opc@my-compute ~]$ mysqlsh admin@10.0.1.126 --sql
MySQL Shell 8.0.27

Copyright (c) 2016, 2021, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.

Type '\help' or '\?' for help; '\quit' to exit.
Creating a session to 'admin@10.0.1.126'
Fetching schema names for autocompletion... Press ^C to stop.
Your MySQL connection id is 83 (X protocol)
Server version: 8.0.27-u3-cloud MySQL Enterprise - Cloud
No default schema selected; type \use <schema> to set one.
 MySQL > 10.0.1.126:33060+ > 2021-12-16 11:19:55 > 
 SQL > create database airline;

10.0.1.126 is the private IP of my MySQL DB System.

Now we need to create the table in which we will import all the data.

We have two options:

  1. we do it manually by checking the content of one of the files
  2. we use MySQL Shell with a plugin to generate the create statement for us

I choose option 2!

We need to unzip one of the files to use as an input sample. Pick one; I recommend avoiding the most recent file or the very first ones, in case the data is not complete.

I choose On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2019_11.zip:

[opc@my-compute ~]$ unzip On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2019_11.zip
Archive:  On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2019_11.zip
  inflating: On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_11.csv  
  inflating: readme.html

To make it more readable, let me rename the CSV file:

[opc@my-compute ~]$ mv *2019_11.csv ontime.csv

Now with MySQL Shell (we installed the plugins in part 1):

 Py > schema_utils.create_from_csv(
   -> 'ontime.csv',
   -> first_as_pk=False, limit=1000)
CREATE TABLE ontime (
   `Year` smallint unsigned,
   `Quarter` tinyint unsigned,
   `Month` tinyint unsigned,
   `DayofMonth` tinyint unsigned,
   `DayOfWeek` tinyint unsigned,
   `FlightDate` date,
   `Reporting_Airline` varchar(2),
   `DOT_ID_Reporting_Airline` smallint unsigned,
   `IATA_CODE_Reporting_Airline` varchar(2),
   `Tail_Number` varchar(6),
   `Flight_Number_Reporting_Airline` smallint unsigned,
   `OriginAirportID` smallint unsigned,
   `OriginAirportSeqID` smallint unsigned,
   `OriginCityMarketID` smallint unsigned,
   `Origin` varchar(3),
   `OriginCityName` varchar(29),
   `OriginState` varchar(2),
   `OriginStateFips` tinyint unsigned,
   `OriginStateName` varchar(14),
   `OriginWac` tinyint unsigned,
   `DestAirportID` smallint unsigned,
   `DestAirportSeqID` smallint unsigned,
   `DestCityMarketID` smallint unsigned,
   `Dest` varchar(3),
   `DestCityName` varchar(25),
   `DestState` varchar(2),
   `DestStateFips` tinyint unsigned,
   `DestStateName` varchar(14),
   `DestWac` tinyint unsigned,
   `CRSDepTime` smallint unsigned,
   `DepTime` smallint unsigned,
   `DepDelay` decimal(4,2),
   `DepDelayMinutes` decimal(4,2),
   `DepDel15` decimal(4,2),
   `DepartureDelayGroups` tinyint unsigned,
   `DepTimeBlk` varchar(9),
   `TaxiOut` decimal(4,2),
   `WheelsOff` smallint unsigned,
   `WheelsOn` smallint unsigned,
   `TaxiIn` decimal(3,2),
   `CRSArrTime` smallint unsigned,
   `ArrTime` smallint unsigned,
   `ArrDelay` decimal(3,2),
   `ArrDelayMinutes` decimal(4,2),
   `ArrDel15` decimal(4,2),
   `ArrivalDelayGroups` tinyint,
   `ArrTimeBlk` varchar(9),
   `Cancelled` decimal(4,2),
   `CancellationCode` varchar(1),
   `Diverted` decimal(4,2),
   `CRSElapsedTime` decimal(5,2),
   `ActualElapsedTime` decimal(5,2),
   `AirTime` decimal(5,2),
   `Flights` decimal(3,2),
   `Distance` decimal(6,2),
   `DistanceGroup` tinyint unsigned,
   `CarrierDelay` varchar(6),
   `WeatherDelay` varchar(6),
   `NASDelay` varchar(5),
   `SecurityDelay` varchar(4),
   `LateAircraftDelay` varchar(6),
   `FirstDepTime` varchar(4),
   `TotalAddGTime` varchar(5),
   `LongestAddGTime` varchar(5),
   `DivAirportLandings` tinyint unsigned,
   `DivReachedDest` varchar(0),
   `DivActualElapsedTime` varchar(0),
   `DivArrDelay` varchar(0),
   `DivDistance` varchar(0),
   `Div1Airport` varchar(0),
   `Div1AirportID` varchar(0),
   `Div1AirportSeqID` varchar(0),
   `Div1WheelsOn` varchar(0),
   `Div1TotalGTime` varchar(0),
   `Div1LongestGTime` varchar(0),
   `Div1WheelsOff` varchar(0),
   `Div1TailNum` varchar(0),
   `Div2Airport` varchar(0),
   `Div2AirportID` varchar(0),
   `Div2AirportSeqID` varchar(0),
   `Div2WheelsOn` varchar(0),
   `Div2TotalGTime` varchar(0),
   `Div2LongestGTime` varchar(0),
   `Div2WheelsOff` varchar(0),
   `Div2TailNum` varchar(0),
   `Div3Airport` varchar(0),
   `Div3AirportID` varchar(0),
   `Div3AirportSeqID` varchar(0),
   `Div3WheelsOn` varchar(0),
   `Div3TotalGTime` varchar(0),
   `Div3LongestGTime` varchar(0),
   `Div3WheelsOff` varchar(0),
   `Div3TailNum` varchar(0),
   `Div4Airport` varchar(0),
   `Div4AirportID` varchar(0),
   `Div4AirportSeqID` varchar(0),
   `Div4WheelsOn` varchar(0),
   `Div4TotalGTime` varchar(0),
   `Div4LongestGTime` varchar(0),
   `Div4WheelsOff` varchar(0),
   `Div4TailNum` varchar(0),
   `Div5Airport` varchar(0),
   `Div5AirportID` varchar(0),
   `Div5AirportSeqID` varchar(0),
   `Div5WheelsOn` varchar(0),
   `Div5TotalGTime` varchar(0),
   `Div5LongestGTime` varchar(0),
   `Div5WheelsOff` varchar(0),
   `Div5TailNum` varchar(0)
);

I limited parsing to 1,000 rows (there are 602,454 in that file).
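Under the hood, such a helper samples the first rows and picks the narrowest MySQL type that fits each column; the varchar(0) columns above come from columns that were empty in the entire sample. Here is a minimal, hypothetical Python sketch of that kind of inference (an illustration, not the actual code of the schema_utils plugin):

```python
# Sketch of "create table from CSV" type inference: sample rows, then pick
# the narrowest MySQL type that fits every value seen in each column.
import csv
import io


def infer_mysql_type(values):
    """Pick a MySQL column type from a sample of string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        # Nothing to go on: fall back to a zero-length VARCHAR,
        # which explains the varchar(0) columns generated above.
        return "varchar(0)"
    if all(v.isdigit() for v in non_empty):
        m = max(int(v) for v in non_empty)
        if m <= 255:
            return "tinyint unsigned"
        if m <= 65535:
            return "smallint unsigned"
        return "int unsigned"
    try:
        [float(v) for v in non_empty]
        return "decimal(10,2)"  # arbitrary precision choice for the sketch
    except ValueError:
        return f"varchar({max(len(v) for v in non_empty)})"


sample = io.StringIO('"Year","Origin","DepDelay"\n"2019","JFK","14.00"\n"2019","LAX",""\n')
reader = csv.reader(sample)
header = next(reader)
rows = list(reader)
cols = {h: infer_mysql_type([r[i] for r in rows]) for i, h in enumerate(header)}
print(cols)
```

This also shows why sampling too few rows can be risky: a type that fits the first 1,000 rows may be too narrow for later files, which is exactly why we will widen some columns below.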

As you can see, there is no Primary Key, while one is mandatory for HeatWave (and for HA, and recommended with InnoDB). Not having a Primary Key is a very bad practice with MySQL; had we tried to use that table in HeatWave without one, we would have received an error.

Fortunately, to avoid breaking compatibility with any existing application, MySQL can use an invisible column as the Primary Key. This is what we are going to do. Let's run the MySQL Shell command again, but this time we ask it to generate a new Primary Key for us:

 Py > schema_utils.create_from_csv(
   -> 'On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2020_5.csv',
   -> pk_auto_inc=True, first_as_pk=False, limit=1000)

CREATE TABLE On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2020_5 (
   id int unsigned auto_increment invisible primary key,
   `Year` smallint unsigned,
   `Quarter` tinyint unsigned,
   `Month` tinyint unsigned,
   `DayofMonth` tinyint unsigned,
   `DayOfWeek` tinyint unsigned,
   `FlightDate` date,
   ...
   `Div5TailNum` varchar(0)
);
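The invisible primary key will not break existing applications because invisible columns (available since MySQL 8.0.23) are excluded from `SELECT *` and from `INSERT` statements that do not list their columns. A small illustration:

```sql
CREATE TABLE t1 (
   id int unsigned auto_increment invisible primary key,
   val varchar(10)
);
-- No need to mention `id`: it is populated automatically
INSERT INTO t1 (val) VALUES ('a'), ('b');
SELECT * FROM t1;        -- returns only `val`
SELECT id, val FROM t1;  -- invisible columns must be named explicitly
```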

Let's now create the table. I will modify the generated SQL statement to name the table ontime. I will also change every VARCHAR(0) to VARCHAR(10), in case those columns are used in one of the other files, and since some IDs were generated as smallint, I changed them to int:

 SQL > CREATE TABLE ontime (
   id int unsigned auto_increment invisible primary key,
   `Year` smallint unsigned,
   `Quarter` tinyint unsigned,
   `Month` tinyint unsigned,
   ...
   `OriginAirportID` int unsigned,
   `OriginAirportSeqID` int unsigned,
   `OriginCityMarketID` int unsigned,
    ...
   `Div5TailNum` varchar(10)
);
Query OK, 0 rows affected (0.0191 sec)

 SQL > show tables;
+-------------------+
| Tables_in_airline |
+-------------------+
| ontime            |
+-------------------+
1 row in set (0.0013 sec)

Depending on the sample file you used to generate the CREATE TABLE statement, some adjustments might be necessary.

Importing the data – load the data

Now that the table is created, we can load the files into our MySQL DB System.

We uncompress the files one by one and load each of them using the MySQL Shell Parallel Table Import Utility with 4 threads (my compute instance has 4 cores):

[opc@my-compute ~]$ for i in `ls On_Time*zip`
do
  # extract the archive (-o overwrites readme.html each time)
  unzip -o $i
  # strip the parentheses from the csv file name
  file_csv=$(ls *\(*.csv | sed 's/(//; s/)//')
  mv *.csv "$file_csv"
  # load the file with 4 parallel threads, skipping the header line
  mysqlsh mysql://admin@10.0.1.126/airline -- \
  util import-table "${file_csv}" \
  --table="ontime" --dialect="csv-unix" \
  --skipRows=1 --showProgress --threads=4
  rm -f "${file_csv}"
done
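The same load can also be scripted from MySQL Shell's Python mode (\py) instead of the command-line integration; this only runs inside an open mysqlsh session to the DB System (a sketch, with the same options as above):

```python
# Inside MySQL Shell (\py), connected to admin@10.0.1.126/airline.
# util.import_table takes the file and a dictionary of options.
util.import_table("ontime.csv", {
    "table": "ontime",
    "dialect": "csv-unix",
    "skipRows": 1,        # skip the CSV header line
    "threads": 4,
    "showProgress": True,
})
```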

This will take a few minutes to complete…

Let’s have a look at how much data we imported:

 SQL > select count(*) from ontime;
+-----------+
| count(*)  |
+-----------+
| 201575308 |
+-----------+
1 row in set (33.8058 sec)
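To see how much space that represents on disk, we can query information_schema (the figures are InnoDB estimates, so treat them as approximate):

```sql
SELECT table_name,
       format_bytes(data_length + index_length) AS size
  FROM information_schema.tables
 WHERE table_schema = 'airline' AND table_name = 'ontime';
```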

In the next article, we will see how to enable a HeatWave Cluster to improve what matters most: query response time!

Enjoy MySQL, MySQL Database Service & MySQL Shell!
