MySQL HeatWave as Vector Store and more

MySQL HeatWave integrates GenAI capabilities into MySQL on OCI. We have demonstrated how HeatWave GenAI can leverage RAG’s capability to utilize ingested documents (unstructured data) in LakeHouse and generate responses to specific questions or chats.

See:

The common theme here is the use of data stored in Object Storage (LakeHouse).

I previously discussed how to leverage Oracle GenAI alongside MySQL HeatWave‘s VECTOR capabilities to integrate Large Language Models (LLMs) with the content of a WordPress blog for context.

This involved utilizing external GenAI APIs within an application.

However, what if we could achieve all of this directly within MySQL HeatWave? That is the focus of this article.

Prerequisites

For this exercise, we need to have a MySQL HeatWave instance in OCI as the database for our WordPress (it can also be a backup or a replica).

The MySQL HeatWave instance must be version 9.2 or newer, and a sufficiently large MySQL HeatWave Cluster should be enabled and ready.

Process

First, we need to generate embeddings of our content to generate a RAG.

I will use the same example as my first post on GenAI, my personal blog, using WordPress.

Once the embeddings are created and stored in MySQL HeatWave as vectors, we can apply the same method to generate an embedding for a question:

This allows us to identify relevant articles stored in our database that could potentially answer our question by utilizing a function comparing the distance between our vectors in the database and the generated vector from the question:

Finally, we leverage the HeatWave GenAI RAG function with the articles we have identified as context in order to generate an answer:

Embeddings

To store the vectors from the embeddings, we will create a new table wp_embeddings. As the content from the blog posts might be too large, we will split the content. This means one article will have multiple vectors.

Let’s start by creating the table:

SQL> CREATE TABLE `wp_embeddings` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `vec` vector(1024) DEFAULT NULL,
  `wp_post_id` bigint unsigned DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

Each embedding will have a unique ID and a vector and will point to the WordPress post to which it belongs.

We will use the minilm model to generate our embeddings. If you want to see which models are available, from the embedding task, you can run the following query:

SQL> SELECT * FROM sys.ML_SUPPORTED_LLMS WHERE model_type = "embedding";
+--------------------------------------+------------+
| model_name                           | model_type |
+--------------------------------------+------------+
| minilm                               | embedding  |
| all_minilm_l12_v2                    | embedding  |
| multilingual-e5-small                | embedding  |
| cohere.embed-english-light-v3.0      | embedding  |
| cohere.embed-multilingual-v3.0       | embedding  |
| cohere.embed-multilingual-light-v3.0 | embedding  |
| cohere.embed-english-v3.0            | embedding  |
+--------------------------------------+------------+
7 rows in set (0.0013 sec)

Before creating the embeddings of all the published posts, I will just create the embeddings of the titles.

We will use the function sys.ML_EMBED_ROW():

SQL> SET @embeddOptions = '{"model_id": "minilm"}';
SQL> INSERT INTO wp_embeddings SELECT 0, 
       sys.ML_EMBED_ROW(post_title, @embeddOptions), ID 
     FROM wp_posts WHERE post_type="post" AND post_status="publish";
Query OK, 607 rows affected (6 min 20.0115 sec)

The post_content column presents complexity as it may contain significant text, along with various HTML elements or other code that needs to be removed before embedding the values stored in that column.

We don’t want to use external calls and leave our data inside our MySQL HeatWave instance, so we will use a store procedure and write it in JavaScript!

As we don’t know yet the amount of rows and the amount of parts we will split the content, we will write two different procedures:

the first one loops all the published posts: wp_create_embb_js
the second one will retrieve the content of a post, clean it, split it, and embed the parts: wp_create_row_embb_js

JavaScript Store Procedures

wp_create_embb_js

DROP PROCEDURE IF EXISTS wp_create_embb_js;
CREATE PROCEDURE wp_create_embb_js()
LANGUAGE JAVASCRIPT
AS $$
    let s = session.sql("SELECT count(*) FROM wp_posts where post_type='post' and post_status='publish'")
    let res = s.execute()
    let row = res.fetchOne()
    let total = row[0]
    let batch = 10
    let processed = 0
    let last_id = 0
    while (processed < total) {
        s = session.sql("SELECT ID FROM wp_posts where post_type='post' 
             and post_status='publish' and ID > " + last_id 
             +" ORDER BY ID LIMIT " + batch)
        res = s.execute()
        row = res.fetchOne()
        while(row) {
            let id = row[0]
            last_id = id
            let s2 = session.sql("CALL wp_create_row_embb_js(" + id + ")")
            s2.execute()
            processed += 1
            row = res.fetchOne()
        }
    }
    let stmt_out = session.sql('Select "Embeddings done! ' + total +
        ' posts processed" as "wp_create_embb_js"', {passResultToClient: true})
    stmt_out.execute()

$$;

wp_create_row_embb_js

DROP PROCEDURE IF EXISTS wp_create_row_embb_js;
CREATE PROCEDURE wp_create_row_embb_js(id INT)
LANGUAGE JAVASCRIPT
AS $$
    let s = session.sql("SELECT post_content FROM wp_posts WHERE ID=" + id)
    let res = s.execute()
    let row = res.fetchOne()

    while(row) {
        let content = row[0]
        let content_text = content.replace(/<\/?[^>]+(>|$)/g, "");
        content_text = content_text.replace(/\r\n/g, " ");
        content_text = content_text.replace(/\n/g, " ");
        // replace all double quotes with \" to escape them
        content_text = content_text.replace(/(?<!\\)"/g, '\\"');
        // replace all single quotes with \'
        content_text = content_text.replace(/(?<!\\)'/g, "\\'");
        // replace all lines starting with +- with empty string
        content_text = content_text.replace(/[+-].*[-+]/gm, "")
        // remove all empty lines
        content_text = content_text.replace(/^\s*[\r\n]/gm, "")

        // check if content_text is empty
        if (content_text.length == 0) {
            return "Content is empty"
        }

        // split content_text into strings of 100 characters
        // but not breaking words
        let tokens = []
        let token = ""
        let words = content_text.split(" ")
        for (let i = 0; i < words.length; i++) {
            if (token.length + words[i].length < 100) {
                token += " " + words[i]
            } else {
                tokens.push(token)
                token = words[i]
            }
        }
        tokens.push(token)

        // insert into wp_embeddings for each token
        for (let i = 0; i < tokens.length; i++) {
            // check if tokeks[i] is not empty
            if (tokens[i].length == 0) {
                row = res.fetchOne()
                continue
            }

            if (tokens[i].trim().length > 0) {
                let q = "insert into wp_embeddings values (0, sys.ML_EMBED_ROW(\""+ tokens[i] +"\", '{\"model_id\": \"minilm\", \"truncate\": false}'), " + id + ")"
                let s2 = session.sql(q)
                s2.execute()
            }
        }
        row = res.fetchOne()
    }

    return "Done"

$$;

Let’s see those procedures in action:

SQL>  call wp_create_embb_js;
+--------------------------------------+
| wp_create_embb_js                    |
+--------------------------------------+
| Embeddings done! 607 posts processed |
+--------------------------------------+
1 row in set (1 hour 12 min 52.9051 sec)

We can verify the content of our embeddings stored in the table wp_embeddings:

SQL> select count(*) from wp_embeddings;
+----------+
| count(*) |
+----------+
|     7347 |
+----------+
1 row in set (0.0199 sec)

Distance

We can now see if the distance functions are working with our content.

We first create the embeddings for our question and we store it into a variable:

SQL> select sys.ML_EMBED_ROW("What are roles in MySQL 8?",
      @embeddOptions) into @questionEmb;
Query OK, 1 row affected (0.5705 sec)

Let’s now search for content close to our question:

SQL> select wp_post_id, post_title, 
      min(distance(@questionEmb, vec, 'COSINE')) as distance 
     from wp_embeddings e join wp_posts p 
     where p.ID = e.wp_post_id group by wp_post_id, post_title 
     order by distance limit 3\G
*************************** 1. row ***************************
wp_post_id: 2494
post_title: Some queries related to MySQL Roles
  distance: 0.10369491577148438
*************************** 2. row ***************************
wp_post_id: 1478
post_title: MySQL 8.0: Listing Roles
  distance: 0.19010812044143677
*************************** 3. row ***************************
wp_post_id: 1402
post_title: MySQL 8.0 Roles and Graphml
  distance: 0.19620484113693237
3 rows in set (0.0124 sec)

Yes!

RAG

Everything seems good. We can then try asking an LLM a question using the database content as context.

Once again, we will write a stored routine in JavaScript to make our life easier as it will include the distance search and the RAG:

wp_ask_js

DROP PROCEDURE IF EXISTS wp_ask_js;
CREATE PROCEDURE wp_ask_js(question TEXT)
LANGUAGE JAVASCRIPT
AS $$
    let s = session.sql("SELECT sys.ML_EMBED_ROW(\"" + question + "\", '{\"model_id\": \"minilm\", \"truncate\": false}') into @questionEmb")
    let res = s.execute()
    let q = "select wp_post_id from (select wp_post_id, post_title, min(distance(@questionEmb, vec, 'COSINE')) as distance from wp_embeddings e join wp_posts p where p.ID = e.wp_post_id  group by wp_post_id, post_title order by distance limit 5) a where distance < 0.25"
    s = session.sql(q)
    res = s.execute()
    let row = res.fetchOne()
    let output = []
    while(row) {
        output.push(row)
        row = res.fetchOne()
    }
    if (output.length == 0) {
        let smt_out = session.sql('Select "No results found" as "Asking WordPress with HeatWave GenAI"', {passResultToClient: true})
        smt_out.execute()
        return "No results found"
    }
    let post_contents = []
    for (let i = 0; i < output.length; i++) {
        let wp_post_id = output[i][0]
        let s3 = session.sql("SELECT post_content FROM wp_posts WHERE ID=" + wp_post_id)
        let res3 = s3.execute()
        let row3 = res3.fetchOne()
        let content = row3[0]
        let content_text = content.replace(/<\/?[^>]+(>|$)/g, "");
        content_text = content_text.replace(/\r\n/g, " ");
        content_text = content_text.replace(/\n/g, " ");
        // replace all double quotes with \" to escape them
        content_text = content_text.replace(/(?<!\\)"/g, '\\"');
        // replace all single quotes with \'
        content_text = content_text.replace(/(?<!\\)'/g, "\\'");
        // replace all lines starting with +- with empty string
        content_text = content_text.replace(/[+-].*[-+]/gm, "")
        // remove all empty lines
        content_text = content_text.replace(/^\s*[\r\n]/gm, "")
        post_contents.push(content_text)
    }
    let context_to_send = post_contents.join(" ")
    s = session.sql("SET @inferenceSetup = JSON_OBJECT('task', 'generation', 'model_id', 'mistral-7b-instruct-v1', 'context', \""+ context_to_send +"\");")
    s.execute()
    q = "Select sys.ML_GENERATE(\"" + question + "\", @inferenceSetup)"
    s = session.sql(q)
    res = s.execute()
    row = res.fetchOne()
    let answer = row[0]["text"]
    let smt_out = session.sql('Select "' + answer +'" as "Asking WordPress with HeatWave GenAI"', {passResultToClient: true})
    smt_out.execute()

    return "Done"

$$;

In this procedure, we use mistral-7b-instruct-v1, but we could have used any of the LLM language provided by HeatWave:

SQL> SELECT * FROM sys.ML_SUPPORTED_LLMS WHERE model_type = "generation";
+----------------------------------+------------+
| model_name                       | model_type |
+----------------------------------+------------+
| llama2-7b-v1                     | generation |
| mistral-7b-instruct-v1           | generation |
| llama3-8b-instruct-v1            | generation |
| cohere.command-text-v14          | generation |
| cohere.command-light-text-v14    | generation |
| cohere.command-r-v1:0            | generation |
| cohere.command-r-plus-v1:0       | generation |
| meta.llama3-70b-instruct-v1:0    | generation |
| meta.llama3-1-70b-instruct-v1:0  | generation |
| meta.llama3-1-405b-instruct-v1:0 | generation |
| cohere.command-r-plus            | generation |
| cohere.command-r-16k             | generation |
| meta.llama-3.1-70b-instruct      | generation |
| meta.llama-3.1-405b-instruct     | generation |
| cohere.command-r-08-2024         | generation |
| cohere.command-r-plus-08-2024    | generation |
+----------------------------------+------------+
16 rows in set (0.0013 sec)

We can also use more articles, by increasing the distance limit on line 7 (<0.25).

Now let’s use the procedure:

SQL> call wp_ask_js("How can I split reads and writes with MySQL Router?")\G
*************************** 1. row ***************************
Asking WordPress with HeatWave GenAI:  Answer: MySQL Router can be used to
split reads and writes in a WordPress application by configuring it to use 
Transparent Read/Write Splitting. This involves setting up a MySQL Router 
instance that acts as a proxy between the WordPress application server and 
the MySQL HeatWave Database Service, with the WordPress application server 
configured to connect to the MySQL Router instance instead of the primary 
database instance directly. The MySQL Router instance can then 
automatically direct read and write queries to the appropriate instance 
based on the configuration settings. This allows for automatic load 
balancing and fault tolerance in the WordPress application, without 
requiring any modifications or configuration changes within WordPress 
itself.
1 row in set (11.4867 sec)

To demonstrate that we are indeed using our content, let’s pose a question about a topic not covered in my blog:

SQL> call wp_ask_js("Who won the last World Cup ?")\G
*************************** 1. row ***************************
Asking WordPress with HeatWave GenAI: No results found
1 row in set (0.6156 sec)

Conclusion

In this article, we explored how to leverage Generative AI (GenAI) with the content stored in our MySQL database, specifically utilizing MySQL HeatWave as a vector store for our WordPress blog content, all without the need for an external application. We employed stored routines written in JavaScript to prepare the data. This approach ensures that our data remains within the database, eliminating reliance on any external GenAI APIs, as we utilized HeatWave’s GenAI capabilities instead. This method provides a secure way to enhance the value of our data.

If you’re interested in learning more about the various GenAI possibilities within HeatWave, I highly recommend reading my colleague Oliver’s blog.

Cookie	Duration	Description
cookielawinfo-checbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.