Schedule Meeting

a

How to query a REST API with MariaDB CONNECT engine

by | Apr 13, 2024 | MariaDB

Need Help?  Click Here for Expert Support

Querying a remote REST API directly from the database: does it sound like pure madness? Some people will surely answer that yes, it does. So, first of all, let me spend a paragraph to show some great benefits of doing so. Then I’ll show you how to query a REST API with MariaDB CONNECT storage engine. I will explore a practical use case with Hashicorp Releases API.

The benefits of querying a REST API in SQL

The most obvious use case is for data analysts, data scientists or product managers. Some of them can’t program at all, but chances are that they know SQL and use it in their job. So they can’t write a script to gather the data they need from an API. But if they can just run an SQL query, their problem is solved. In no time, without involving anyone else.

Now to data engineers, developers and DevOps. Sometimes applications read data from a remote API. This means that, for every used API, some developers will have to write some code. When a very similar result can be achieved by writing a CREATE TABLE statement, I call it a productivity problem.

Another problem is about optimising a process. The typical approach is that an application retrieves some data from an API call, transforms them, and writes them into a database. Many things can go wrong, corrupting data or preventing an update. The process might be too slow, because data have to travel twice and the application has to process them. And sending some types of data to/from the application might increase security risks if your data is sensitive.

MariaDB CONNECT allows us to interact with a huge variety of data sources. This includes other databases, data files in many formats, and REST APIs. Every time you send a SELECT to a CONNECT table backed by a REST API, data will be retrieved via HTTP or HTTPS, and will be presented to the client in the form of a normal resultset.

This is obviously slower than querying local data. And it’s theoretically slower than querying an API directly from an application. But often the purpose is to populate a table with remote data. This table can be a regular InnoDB table, which will answer queries quickly. And the application can launch an ALTER TABLE ... ENGINE or INSERT ... SELECT query to move the data directly from the remote API to the local database. This will be showed later in this article.

There are also cases when the performance difference doesn’t matter. In the case of a data analyst, it should be ok to wait for more time to run a query, as long as running a query is all they need to do to read data from the API.

Querying the Hashicop Releases API in SQL

Let’s use progressively more complex examples so explore interesting CONNECT features.

The examples are based on the Hashicorp Release API, that can be used to find out whether HashiCorp software needs to be upgraded.

Exmple 1: Querying an array of strings

Let’s start by creating a table that contains very simple results from a REST API call. It returns an array of strings. In MariaDB, we’ll want to see it as a one-column table, where each string is a row.

CREATE OR REPLACE TABLE hashicorp_product (
    name VARCHAR(100) NOT NULL
)
    ENGINE = CONNECT,
    TABLE_TYPE = JSON,
    HTTP = 'https://api.releases.hashicorp.com/v1/products'
;
  • ENGINE: We’re specifying that the table must use the CONNECT engine, rather than the default engine (normally InnoDB).
  • TABLE_TYPE: CONNECT can handle a big range of data sources and formats, so we need to specify which one needs to be used in this case: JSON.
  • HTTP: For JSON tables, we can specify a URL. An HTTP call with the GET method, is expected to return the data we want to query. HTTPS is supported, as showed in the example below. Querystrings are not used in our examples, but they are supported, too.

Example 2: Array of objects, using table discovery

The most common case is REST APIs returning an array of objects. In MariaDB, we want to see every object as a row.

Objects are key/value data structures where each property is a column. Therefore CONNECT is able to create a table even if we don’t specify the columns we want! Each property name will be used as a column name, and it will infer the right types to use. This functionality is called table discovery, and there some other storage engines are able to use it, too.

CREATE OR REPLACE TABLE consul_release
    ENGINE = CONNECT,
    TABLE_TYPE = JSON,
    HTTP = 'https://api.releases.hashicorp.com/v1/releases/consul'
;

Now, let’s see the resulting table structure:

> SHOW CREATE TABLE consul_release \G
*************************** 1. row ***************************
       Table: consul_release
Create Table: CREATE TABLE `consul_release` (
  `builds_arch` char(5) NOT NULL `JPATH`='$.builds[0].arch',
  `builds_os` char(6) NOT NULL `JPATH`='$.builds[0].os',
  `builds_url` char(100) NOT NULL `JPATH`='$.builds[0].url',
  `is_prerelease` tinyint(1) NOT NULL `JPATH`='$.is_prerelease',
  `license_class` char(10) NOT NULL `JPATH`='$.license_class',
  `name` char(6) NOT NULL `JPATH`='$.name',
  `status_state` char(9) NOT NULL `JPATH`='$.status.state',
  `status_timestamp_updated` char(24) NOT NULL `JPATH`='$.status.timestamp_updated',
  `timestamp_created` char(24) NOT NULL `JPATH`='$.timestamp_created',
  `timestamp_updated` char(24) NOT NULL `JPATH`='$.timestamp_updated',
  `url_changelog` char(79) NOT NULL `JPATH`='$.url_changelog',
  `url_docker_registry_dockerhub` char(52) NOT NULL `JPATH`='$.url_docker_registry_dockerhub',
  `url_docker_registry_ecr` char(51) NOT NULL `JPATH`='$.url_docker_registry_ecr',
  `url_license` char(53) NOT NULL `JPATH`='$.url_license',
  `url_project_website` char(37) NOT NULL `JPATH`='$.url_project_website',
  `url_release_notes` char(40) NOT NULL `JPATH`='$.url_release_notes',
  `url_shasums` char(95) NOT NULL `JPATH`='$.url_shasums',
  `url_shasums_signatures` char(99) NOT NULL `JPATH`='$.url_shasums_signatures[0]',
  `url_source_repository` char(35) DEFAULT NULL `JPATH`='$.url_source_repository',
  `version` char(19) NOT NULL `JPATH`='$.version'
) ENGINE=CONNECT DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_uca1400_ai_ci `TABLE_TYPE`='JSON' `HTTP`='https://api.releases.hashicorp.com/v1/releases/consul'

Interesting, isn’t it? Let’s look at the first column, as an example:

`builds_arch` char(5) NOT NULL `JPATH`='$.builds[0].arch',

We can see that every row-object contains a nested array called builds that contains objects with these properties: arch, os, url. By default, the related columns are colled builds_arch, builds_os and builds_url, and they contain the values from the first element of the builds array.

CONNECT uses a purpose-built language called JPath to specify this. If you know JSONPath you won’t find JPATH hard to understand.

Example 3: Array of objects, specifying columns manually

Let’s use the above example again. But suppose that the table we want to build is different from the one created by default by MariaDB. We can run a statement like this:

CREATE OR REPLACE TABLE consul_release (
    version VARCHAR(100) NOT NULL,
    not_stable VARCHAR(100) NOT NULL
        FIELD_FORMAT = 'is_prerelease',
    state VARCHAR(100) NOT NULL
        FIELD_FORMAT = 'status.state',
    timestamp_created TIMESTAMP NOT NULL
)
    ENGINE = CONNECT,
    TABLE_TYPE = JSON,
    HTTP = 'https://api.releases.hashicorp.com/v1/releases/consul'
;

Now, here’s what we did:

  • We only specified a few columns. Some JSON properties will simply be ignored and won’t be part of the table.
  • The version and timestamp_created columns have the same name as the matching properties. So there is no need to specify where these values should come from.
  • The not_stable column has the same value as the is_prerelease property, that is not nested.
  • The state column has the same values as the state property, nested in an object property called status.

Let’s query the table:

> SELECT * FROM consul_release;
+---------------------+------------+-----------+---------------------+
| version             | not_stable | state     | timestamp_created   |
+---------------------+------------+-----------+---------------------+
| 1.16.5+ent.fips1402 | false      | supported | 2024-01-23 20:40:10 |
| 1.16.5+ent          | false      | supported | 2024-01-23 20:40:05 |
| 1.16.5              | false      | supported | 2024-01-23 20:31:17 |
| 1.15.9+ent          | false      | supported | 2024-01-23 20:28:28 |
| 1.17.2+ent.fips1402 | false      | supported | 2024-01-23 20:26:59 |
| 1.17.2+ent          | false      | supported | 2024-01-23 20:26:53 |
| 1.15.9              | false      | supported | 2024-01-23 20:23:38 |
| 1.17.2              | false      | supported | 2024-01-23 20:23:21 |
| 1.17.1+ent.fips1402 | false      | supported | 2023-12-14 23:16:04 |
| 1.17.1+ent          | false      | supported | 2023-12-14 23:15:57 |
+---------------------+------------+-----------+---------------------+

Note that only one status nested object exists (it’s not in an array).

Example 4: Expanding a nested array

Now let’s discuss a different case. As mentioned before, the builds property is an array. So, for every table row, an array exists with multiple values.

Suppose we want to expand it, and change the logic: we want to have a row for each builds element. Values that are outside that array will be repeated multiple times.

CREATE OR REPLACE TABLE consul_build (
    version VARCHAR(100) NOT NULL,
    arch VARCHAR(100) NOT NULL
        FIELD_FORMAT = 'builds.[*].arch',
    os VARCHAR(100) NOT NULL
        FIELD_FORMAT = 'builds[*].os',
    timestamp_created TIMESTAMP NOT NULL
)
    ENGINE = CONNECT,
    TABLE_TYPE = JSON,
    HTTP = 'https://api.releases.hashicorp.com/v1/releases/consul'
;

As you can see, we specified builds.[*].arch. This is a JPATH syntax which wouldn’t work in JSONPath. And it means that the builds array must be expanded. Here’s a query example:

> SELECT * FROM consul_build
  ORDER BY LEFT(version, 6) DESC
  FETCH FIRST 10 ROWS WITH TIES;
+---------------------+-------+---------+---------------------+
| version             | arch  | os      | timestamp_created   |
+---------------------+-------+---------+---------------------+
| 1.17.2+ent.fips1402 | amd64 | linux   | 2024-01-23 20:26:59 |
| 1.17.2+ent.fips1402 | arm64 | linux   | 2024-01-23 20:26:59 |
| 1.17.2+ent.fips1402 | amd64 | windows | 2024-01-23 20:26:59 |
| 1.17.2+ent          | amd64 | darwin  | 2024-01-23 20:26:53 |
| 1.17.2+ent          | arm64 | darwin  | 2024-01-23 20:26:53 |
| 1.17.2+ent          | 386   | freebsd | 2024-01-23 20:26:53 |
| 1.17.2+ent          | amd64 | freebsd | 2024-01-23 20:26:53 |
| 1.17.2+ent          | 386   | linux   | 2024-01-23 20:26:53 |
| 1.17.2+ent          | amd64 | linux   | 2024-01-23 20:26:53 |
| 1.17.2+ent          | arm   | linux   | 2024-01-23 20:26:53 |
| 1.17.2+ent          | arm64 | linux   | 2024-01-23 20:26:53 |
| 1.17.2+ent          | s390x | linux   | 2024-01-23 20:26:53 |
| 1.17.2+ent          | amd64 | solaris | 2024-01-23 20:26:53 |
| 1.17.2+ent          | 386   | windows | 2024-01-23 20:26:53 |
| 1.17.2+ent          | amd64 | windows | 2024-01-23 20:26:53 |
| 1.17.2              | amd64 | darwin  | 2024-01-23 20:23:21 |
| 1.17.2              | arm64 | darwin  | 2024-01-23 20:23:21 |
| 1.17.2              | 386   | freebsd | 2024-01-23 20:23:21 |
| 1.17.2              | amd64 | freebsd | 2024-01-23 20:23:21 |
| 1.17.2              | 386   | linux   | 2024-01-23 20:23:21 |
| 1.17.2              | amd64 | linux   | 2024-01-23 20:23:21 |
| 1.17.2              | arm   | linux   | 2024-01-23 20:23:21 |
| 1.17.2              | arm64 | linux   | 2024-01-23 20:23:21 |
| 1.17.2              | amd64 | solaris | 2024-01-23 20:23:21 |
| 1.17.2              | 386   | windows | 2024-01-23 20:23:21 |
| 1.17.2              | amd64 | windows | 2024-01-23 20:23:21 |
+---------------------+-------+---------+---------------------+

If you don’t understand the WITH TIES syntax, see MariaDB: WITH TIES syntax.

As you can see, a row was returned for every element of builds. Builds of the same release share the same values for version and timestamp_created.

Example 5: Transforming values

Sometimes the values returned by a remote API are not convenient for our applications. In this case, we may want to process and transform some values. This is something we can do with some MariaDB features that are not specific to CONNECT.

Let’s see an example first:

CREATE OR REPLACE TABLE consul_release (
    version VARCHAR(100) NOT NULL,
    state VARCHAR(100) NOT NULL
        FIELD_FORMAT = 'status.state',
    is_prerelease BOOL NOT NULL DEFAULT 0 INVISIBLE,
    is_stable BOOL GENERATED ALWAYS AS (NOT is_prerelease) VIRTUAL,
    timestamp_created TIMESTAMP NOT NULL
)
    ENGINE = CONNECT,
    TABLE_TYPE = JSON,
    HTTP = 'https://api.releases.hashicorp.com/v1/releases/consul'
;
  • is_prerelease is the column that contains the original value returned by the API. It is defined as an invisible column.
  • is_prerelase has a default value. This is required for invisible columns, but it has no effect for CONNECT JSON tables.
  • is_stable is a generated column, which contains the result off an expression (NOT is_prerelease).

SELECT * FROM consul_release will return the is_stable column, but not the is_prerelease column. It will still be possible to require this column explicitly by running:

SELECT *, is_prerelease FROM conssul_release;

Example 6: Querying a details API method with arguments

Many REST APIs have two types of GET methods:

  1. To retrieve an array of objects (list);
  2. To retrieve a single object (details).

Above we used two methods of the first type. Now we’re approaching a method of the second type.

And we have a problem: we have a single parametrised URL to call for any object we want to see. But for CONNECT there’s no such thing as parameters in URLs, so every API call should be handled by a different table. And we don’t want to have a CONNECT table for every release of every HashiCorp product.

What can we do about it? Well, at a low level we’ll actually have those tables. But we can hide the mechanism that creates and queries the table inside a stored procedure.

DELIMITER ||
CREATE OR REPLACE PROCEDURE select_consul_release(IN i_release VARCHAR(100))
    NOT DETERMINISTIC
    MODIFIES SQL DATA
BEGIN
    DECLARE v_sql TEXT;
    DECLARE v_table_name VARCHAR(64) DEFAULT CONCAT('consul_release_', CONNECTION_ID());
    SET v_sql := SFORMAT('
    CREATE OR REPLACE TABLE {} (
        version VARCHAR(100) NOT NULL,
        state VARCHAR(100) NOT NULL
            FIELD_FORMAT = ''status.state'',
        is_prerelease BOOL NOT NULL DEFAULT 0 INVISIBLE,
        is_stable BOOL GENERATED ALWAYS AS (NOT is_prerelease) VIRTUAL,
        timestamp_created TIMESTAMP NOT NULL
    )
        ENGINE = CONNECT,
        TABLE_TYPE = JSON,
        HTTP = ''https://api.releases.hashicorp.com/v1/releases/consul/{}''
    ;
    ', v_table_name, i_release);
    EXECUTE IMMEDIATE v_sql;

    EXECUTE IMMEDIATE CONCAT('SELECT * FROM ', v_table_name, ';');
    EXECUTE IMMEDIATE CONCAT('DROP TABLE ', v_table_name, ';');
END;
||
DELIMITER ;

This procedure creates a CONNECT table from a REST URL that contains the desired argument (the release name). Then it runs a SELECT to return its contents, and destroys the table. If multiple connections are likely to use this procedure concurrently, they won’t interfere with each other because the table name contains the current connection id.

Alternatively, you might want multiple connections to share existing CONNECT tables. To do so, use CREATE TABLE IF NOT EXISTS (rather than OR REPLACE) and use the release name as part of the table name (rather than the connection id).

Proof that it works:

> CALL select_consul_release('1.17.2');
+---------+-----------+-----------+---------------------+
| version | state     | is_stable | timestamp_created   |
+---------+-----------+-----------+---------------------+
| 1.17.2  | supported |         1 | 2024-01-23 20:23:21 |
+---------+-----------+-----------+---------------------+

These procedures won’t work in MySQL because they use the unique MariaDB features:

  • The CONNECT engine, obviously;
  • CREATE OR REPLACE TABLE, which is similar to DROP TABLE + CREATE TABLE, but it’s atomic and less verbose;
  • SFORMAT(), to compose strings by interpolation (see How to compose strings in MariaDB);
  • EXECUTE IMMEDIATE, which has some limitations but can normally be used as a shortcut for PREPARE + EXECUTE + DEALLOCATE PREPARE.

Materialising a CONNECT table

We mentioned before that a CONNECT table can be materialised. This means that it becomes a regular table, which normally uses the InnoDB storage engine, and the remote data is written to the local database.

One off materialisation

The simplest and fastest way to materialise a CONNECT table is the following:

ALTER TABLE table_name ENGINE = InnoDB;
RENAME TABLE table_name TO new_name;

I don’t like to give numbers that might depend on a lot of factors, but in most cases it will take less than 5 seconds even for tables that contain a few millions of rows.

Incremental materialisation

Your logic might be more complex than this. Maybe you want to incrementally import data from a remote API every night. This means that you can’t create a new table every time.

The first time you import data you will need to create the table with the current data. You can use the method I have shown above, for a one off materialisation.

Incremental imports can be done in this way:

INSERT INTO materialised_table
    (column1, column2, column3)
    SELECT column1, column2, column3 FROM connect_table
;

JSON file

I left this method as the last one because it’s rarely useful. But you might want to know about it if you need a JSON file, and anyway it tells us something about how CONNECT works internally.

You might have noticed that all the above CONNECT tables use the JSON TABLE_TYPE. This table type was originally designed to use JSON files as relational tables. To do this, one has to specify a file name:

CREATE TABLE table_name
...
ENGINE = CONNECT,
TABLE_TYPE = JSON,
FILE_NAME = 'filename.json';

If the file exists in the data directory, it will be used. If not, it will be created.

But even if we didn’t specify a file name, one was created anyway. If you ran the CREATE TABLEs above, you probably noticed warnings like this:

*************************** 1. row ***************************
  Level: Warning
   Code: 1105
Message: No file name. Table will use hashicorp_product.json

And if you look in the data directory, in the database subdirectory, we’ll see this:

# ls -l /var/lib/mysql/test/hashicorp_product.*
-rw-rw---- 1 mysql mysql 907 Jan 27 14:23 /var/lib/mysql/test/hashicorp_product.frm
-rw-rw---- 1 mysql mysql   0 Jan 27 14:23 /var/lib/mysql/test/hashicorp_product.json

Initially the file is empty. After we run a query, it fill be filled with the output received from the REST API, regardless any WHERE in our query:

# cat hashicorp_product.json 
["athena-cli","atlas-upload-cli",...]

So when we query a CONNECT JSON table, CONNECT will check the HTTP table option. If it was specified, CONNECT will call that URL and populate (or overwrite) the JSON file. The rest of the operations are performed as if it was a regular file-based CONNECT table.

Limitations

CONNECT JSON tables have some important limitations:

  • Only the GET method is supported. This is theoretically correct because only GET should be used to read data, but in practice some APIs might require a different method.
  • Some APIs require an additional call for authorisation. CONNECT has no way to make an additional call.
  • No way to specify an HTTP(S) header to add to the requests. This is often required for authorisation (you call an authorisation method, you receive back a token valid for a period of time, and you add that token to the next requests, in the Autorization HTTP header). But these features are independent: tokens are not always passed in an HTTP header, and HTTP headers have other uses too.
  • There is no way to see the response HTTP code, nor any other response headers (useful for redirects and errors).
  • If a token or an API key is written in the URL, it will be readable with the SHOW CREATE TABLE command or by querying the information_schema.TABLES table.
  • Every URL is unique, even if it contains parameters. We can’t use the WHERE clause to modify the actual URL that is called.
  • Some APIs have a usage limit. For example they might accept no more than 10 calls per hour, or 100 per day. While the contents of an API can be materialised as discussed above, it would be very handy if CONNECT could cache results of an API call for a period of time.

Some of these limitations can be removed or lifted by implementing the necessary features as stored procedures. I will demonstrate this in a future article.

Conclusions

The CONNECT storage engine allows easy data integration between MariaDB and other technologies. In this article we saw how to use CONNECT to query a REST API. We discussed several examples, progressively more complex, which covers the majority of the problems that you might have when working with CONNECT and a REST API.

Should you have questions, or problems with CONNECT, don’t hesitate to contact us for a consultation.

Federico Razzoli

All content in this blog is distributed under the CreativeCommons Attribution-ShareAlike 4.0 International license. You can use it for your needs and even modify it, but please refer to Vettabase and the author of the original post. Read more about the terms and conditions: https://creativecommons.org/licenses/by-sa/4.0/

About Federico Razzoli
Federico Razzoli is a database professional, with a preference for open source databases, who has been working with DBMSs since year 2000. In the past 20+ years, he served in a number of companies as a DBA, Database Engineer, Database Consultant and Software Developer. In 2016, Federico summarized his extensive experience with MariaDB in the “Mastering MariaDB” book published by Packt. Being an experienced database events speaker, Federico speaks at professional conferences and meetups and conducts database trainings. He is also a supporter and advocate of open source software. As the Director of Vettabase, Federico does business worldwide but prefers to do it from Scotland where he lives.

Recent Posts

Validating rows with CHECK constraints in MariaDB

Validating rows with CHECK constraints in MariaDB

Relational databases provide several ways to validate data. CHECK constraints are a powerful tool for in-database data validation. Their impact on performance is minimal, if any. In this article we'll discuss MariaDB support for CHECK constraints. Note that the CHECK...

MariaDB ColumnStore SQL limitations

MariaDB ColumnStore SQL limitations

MariaDB ColumnStore is an extremely fast and scalable solution for analytics built on MariaDB, which Vettabase supports. MariaDB ColumnStore inherits the MariaDB SQL dialect, and many MariaDB features. However, numerous MariaDB features are not available for the...

Installing MariaDB ColumnStore on Ubuntu, for production

Installing MariaDB ColumnStore on Ubuntu, for production

Let's see how to install MariaDB ColumnStore (single node) on Ubuntu. Let's also prepare the system to run ColumnStore with good performance, as we should always do in production. The following steps should work on any modern Ubuntu version up to and including 24.04...

Services

Need Help?  Click Here for Expert Support

0 Comments

Submit a Comment

Your email address will not be published. Required fields are marked *