Querying a remote REST API directly from the database: does it sound like pure madness? Some people will surely answer that yes, it does. So, first of all, let me spend a paragraph showing some great benefits of doing so. Then I’ll show you how to query a REST API with the MariaDB CONNECT storage engine, exploring a practical use case with the HashiCorp Releases API.
The benefits of querying a REST API in SQL
The most obvious use case is for data analysts, data scientists or product managers. Some of them can’t program at all, but chances are that they know SQL and use it in their job. So they can’t write a script to gather the data they need from an API. But if they can just run an SQL query, their problem is solved. In no time, without involving anyone else.
Now to data engineers, developers and DevOps. Sometimes applications read data from a remote API. This means that, for every API they use, some developers have to write some code. When a very similar result can be achieved by writing a CREATE TABLE statement, I call it a productivity problem.
Another problem is about optimising a process. The typical approach is that an application retrieves some data from an API call, transforms them, and writes them into a database. Many things can go wrong, corrupting data or preventing an update. The process might be too slow, because data have to travel twice and the application has to process them. And sending some types of data to/from the application might increase security risks if your data is sensitive.
MariaDB CONNECT allows us to interact with a huge variety of data sources. This includes other databases, data files in many formats, and REST APIs. Every time you send a SELECT to a CONNECT table backed by a REST API, data will be retrieved via HTTP or HTTPS, and will be presented to the client in the form of a normal resultset.
This is obviously slower than querying local data. And it’s theoretically slower than querying an API directly from an application. But often the purpose is to populate a table with remote data. This table can be a regular InnoDB table, which will answer queries quickly. And the application can launch an ALTER TABLE ... ENGINE or INSERT ... SELECT query to move the data directly from the remote API to the local database. This will be shown later in this article.
There are also cases when the performance difference doesn’t matter. For a data analyst, it should be fine if a query takes longer, as long as running a query is all they need to do to read data from the API.
Querying the HashiCorp Releases API in SQL
Let’s use progressively more complex examples to explore interesting CONNECT features.
The examples are based on the HashiCorp Releases API, which can be used to find out whether HashiCorp software needs to be upgraded.
Example 1: Querying an array of strings
Let’s start by creating a table that contains very simple results from a REST API call. It returns an array of strings. In MariaDB, we’ll want to see it as a one-column table, where each string is a row.
CREATE OR REPLACE TABLE hashicorp_product (
name VARCHAR(100) NOT NULL
)
ENGINE = CONNECT,
TABLE_TYPE = JSON,
HTTP = 'https://api.releases.hashicorp.com/v1/products'
;
- ENGINE: We’re specifying that the table must use the CONNECT engine, rather than the default engine (normally InnoDB).
- TABLE_TYPE: CONNECT can handle a big range of data sources and formats, so we need to specify which one should be used in this case: JSON.
- HTTP: For JSON tables, we can specify a URL. An HTTP call with the GET method is expected to return the data we want to query. HTTPS is supported, as shown in the example above. Querystrings are not used in our examples, but they are supported, too.
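Once created, the table behaves like any other: every SELECT triggers an HTTP call behind the scenes. For instance (the actual rows depend on the live API at query time, so no sample output is shown):

```sql
-- Each SELECT fetches the product list from the API
-- and exposes it as a one-column resultset.
SELECT name
FROM hashicorp_product
ORDER BY name
LIMIT 5;
```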
Example 2: Array of objects, using table discovery
The most common case is REST APIs returning an array of objects. In MariaDB, we want to see every object as a row.
Objects are key/value data structures, and each property can be mapped to a column. Therefore CONNECT is able to create a table even if we don’t specify the columns we want! Each property name will be used as a column name, and CONNECT will infer the right type to use. This functionality is called table discovery, and some other storage engines support it, too.
CREATE OR REPLACE TABLE consul_release
ENGINE = CONNECT,
TABLE_TYPE = JSON,
HTTP = 'https://api.releases.hashicorp.com/v1/releases/consul'
;
Now, let’s see the resulting table structure:
> SHOW CREATE TABLE consul_release \G
*************************** 1. row ***************************
Table: consul_release
Create Table: CREATE TABLE `consul_release` (
`builds_arch` char(5) NOT NULL `JPATH`='$.builds[0].arch',
`builds_os` char(6) NOT NULL `JPATH`='$.builds[0].os',
`builds_url` char(100) NOT NULL `JPATH`='$.builds[0].url',
`is_prerelease` tinyint(1) NOT NULL `JPATH`='$.is_prerelease',
`license_class` char(10) NOT NULL `JPATH`='$.license_class',
`name` char(6) NOT NULL `JPATH`='$.name',
`status_state` char(9) NOT NULL `JPATH`='$.status.state',
`status_timestamp_updated` char(24) NOT NULL `JPATH`='$.status.timestamp_updated',
`timestamp_created` char(24) NOT NULL `JPATH`='$.timestamp_created',
`timestamp_updated` char(24) NOT NULL `JPATH`='$.timestamp_updated',
`url_changelog` char(79) NOT NULL `JPATH`='$.url_changelog',
`url_docker_registry_dockerhub` char(52) NOT NULL `JPATH`='$.url_docker_registry_dockerhub',
`url_docker_registry_ecr` char(51) NOT NULL `JPATH`='$.url_docker_registry_ecr',
`url_license` char(53) NOT NULL `JPATH`='$.url_license',
`url_project_website` char(37) NOT NULL `JPATH`='$.url_project_website',
`url_release_notes` char(40) NOT NULL `JPATH`='$.url_release_notes',
`url_shasums` char(95) NOT NULL `JPATH`='$.url_shasums',
`url_shasums_signatures` char(99) NOT NULL `JPATH`='$.url_shasums_signatures[0]',
`url_source_repository` char(35) DEFAULT NULL `JPATH`='$.url_source_repository',
`version` char(19) NOT NULL `JPATH`='$.version'
) ENGINE=CONNECT DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_uca1400_ai_ci `TABLE_TYPE`='JSON' `HTTP`='https://api.releases.hashicorp.com/v1/releases/consul'
Interesting, isn’t it? Let’s look at the first column, as an example:
`builds_arch` char(5) NOT NULL `JPATH`='$.builds[0].arch',
We can see that every row-object contains a nested array called builds, which contains objects with these properties: arch, os, url. By default, the related columns are called builds_arch, builds_os and builds_url, and they contain the values from the first element of the builds array.
CONNECT uses a purpose-built language called JPath to specify this. If you know JSONPath you won’t find JPATH hard to understand.
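If the first element of the array is not the one we want, we can declare a column ourselves with an explicit path. A minimal sketch, assuming JPATH accepts an explicit array index as in the discovered definitions above (the table and column names are illustrative):

```sql
-- Hypothetical table reading the SECOND element of the builds array
-- (index 1), instead of the first one that discovery picks by default.
CREATE OR REPLACE TABLE consul_second_build (
    version VARCHAR(100) NOT NULL,
    arch VARCHAR(100) NOT NULL
        FIELD_FORMAT = '$.builds[1].arch'
)
ENGINE = CONNECT,
TABLE_TYPE = JSON,
HTTP = 'https://api.releases.hashicorp.com/v1/releases/consul';
```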
Example 3: Array of objects, specifying columns manually
Let’s use the above example again. But suppose that the table we want to build is different from the one created by default by MariaDB. We can run a statement like this:
CREATE OR REPLACE TABLE consul_release (
version VARCHAR(100) NOT NULL,
not_stable VARCHAR(100) NOT NULL
FIELD_FORMAT = 'is_prerelease',
state VARCHAR(100) NOT NULL
FIELD_FORMAT = 'status.state',
timestamp_created TIMESTAMP NOT NULL
)
ENGINE = CONNECT,
TABLE_TYPE = JSON,
HTTP = 'https://api.releases.hashicorp.com/v1/releases/consul'
;
Now, here’s what we did:
- We only specified a few columns. Some JSON properties will simply be ignored and won’t be part of the table.
- The version and timestamp_created columns have the same name as the matching properties, so there is no need to specify where their values should come from.
- The not_stable column has the same value as the is_prerelease property, which is not nested.
- The state column has the same values as the state property, nested in an object property called status.
Let’s query the table:
> SELECT * FROM consul_release;
+---------------------+------------+-----------+---------------------+
| version | not_stable | state | timestamp_created |
+---------------------+------------+-----------+---------------------+
| 1.16.5+ent.fips1402 | false | supported | 2024-01-23 20:40:10 |
| 1.16.5+ent | false | supported | 2024-01-23 20:40:05 |
| 1.16.5 | false | supported | 2024-01-23 20:31:17 |
| 1.15.9+ent | false | supported | 2024-01-23 20:28:28 |
| 1.17.2+ent.fips1402 | false | supported | 2024-01-23 20:26:59 |
| 1.17.2+ent | false | supported | 2024-01-23 20:26:53 |
| 1.15.9 | false | supported | 2024-01-23 20:23:38 |
| 1.17.2 | false | supported | 2024-01-23 20:23:21 |
| 1.17.1+ent.fips1402 | false | supported | 2023-12-14 23:16:04 |
| 1.17.1+ent | false | supported | 2023-12-14 23:15:57 |
+---------------------+------------+-----------+---------------------+
Note that only one status nested object exists (it’s not in an array).
Example 4: Expanding a nested array
Now let’s discuss a different case. As mentioned before, the builds property is an array. So, for every table row, an array exists with multiple values.
Suppose we want to expand it and change the logic: we want a row for each builds element. Values that are outside that array will be repeated multiple times.
CREATE OR REPLACE TABLE consul_build (
version VARCHAR(100) NOT NULL,
arch VARCHAR(100) NOT NULL
FIELD_FORMAT = 'builds.[*].arch',
os VARCHAR(100) NOT NULL
FIELD_FORMAT = 'builds.[*].os',
timestamp_created TIMESTAMP NOT NULL
)
ENGINE = CONNECT,
TABLE_TYPE = JSON,
HTTP = 'https://api.releases.hashicorp.com/v1/releases/consul'
;
As you can see, we specified builds.[*].arch. This is a JPATH syntax that wouldn’t work in JSONPath. It means that the builds array must be expanded. Here’s a query example:
> SELECT * FROM consul_build
ORDER BY LEFT(version, 6) DESC
FETCH FIRST 10 ROWS WITH TIES;
+---------------------+-------+---------+---------------------+
| version | arch | os | timestamp_created |
+---------------------+-------+---------+---------------------+
| 1.17.2+ent.fips1402 | amd64 | linux | 2024-01-23 20:26:59 |
| 1.17.2+ent.fips1402 | arm64 | linux | 2024-01-23 20:26:59 |
| 1.17.2+ent.fips1402 | amd64 | windows | 2024-01-23 20:26:59 |
| 1.17.2+ent | amd64 | darwin | 2024-01-23 20:26:53 |
| 1.17.2+ent | arm64 | darwin | 2024-01-23 20:26:53 |
| 1.17.2+ent | 386 | freebsd | 2024-01-23 20:26:53 |
| 1.17.2+ent | amd64 | freebsd | 2024-01-23 20:26:53 |
| 1.17.2+ent | 386 | linux | 2024-01-23 20:26:53 |
| 1.17.2+ent | amd64 | linux | 2024-01-23 20:26:53 |
| 1.17.2+ent | arm | linux | 2024-01-23 20:26:53 |
| 1.17.2+ent | arm64 | linux | 2024-01-23 20:26:53 |
| 1.17.2+ent | s390x | linux | 2024-01-23 20:26:53 |
| 1.17.2+ent | amd64 | solaris | 2024-01-23 20:26:53 |
| 1.17.2+ent | 386 | windows | 2024-01-23 20:26:53 |
| 1.17.2+ent | amd64 | windows | 2024-01-23 20:26:53 |
| 1.17.2 | amd64 | darwin | 2024-01-23 20:23:21 |
| 1.17.2 | arm64 | darwin | 2024-01-23 20:23:21 |
| 1.17.2 | 386 | freebsd | 2024-01-23 20:23:21 |
| 1.17.2 | amd64 | freebsd | 2024-01-23 20:23:21 |
| 1.17.2 | 386 | linux | 2024-01-23 20:23:21 |
| 1.17.2 | amd64 | linux | 2024-01-23 20:23:21 |
| 1.17.2 | arm | linux | 2024-01-23 20:23:21 |
| 1.17.2 | arm64 | linux | 2024-01-23 20:23:21 |
| 1.17.2 | amd64 | solaris | 2024-01-23 20:23:21 |
| 1.17.2 | 386 | windows | 2024-01-23 20:23:21 |
| 1.17.2 | amd64 | windows | 2024-01-23 20:23:21 |
+---------------------+-------+---------+---------------------+
If you don’t understand the WITH TIES syntax, see MariaDB: WITH TIES syntax.
As you can see, a row was returned for every element of builds. Builds of the same release share the same values for version and timestamp_created.
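With the expanded table, ordinary SQL aggregations apply. For example, here is one way to count how many builds each release ships (a sketch; the actual counts depend on the live API, so no sample output is shown):

```sql
-- One row per release, with the number of rows
-- that were expanded from its builds array.
SELECT version, COUNT(*) AS build_count
FROM consul_build
GROUP BY version
ORDER BY MAX(timestamp_created) DESC
LIMIT 5;
```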
Example 5: Transforming values
Sometimes the values returned by a remote API are not convenient for our applications. In this case, we may want to process and transform some values. This is something we can do with some MariaDB features that are not specific to CONNECT.
Let’s see an example first:
CREATE OR REPLACE TABLE consul_release (
version VARCHAR(100) NOT NULL,
state VARCHAR(100) NOT NULL
FIELD_FORMAT = 'status.state',
is_prerelease BOOL NOT NULL DEFAULT 0 INVISIBLE,
is_stable BOOL GENERATED ALWAYS AS (NOT is_prerelease) VIRTUAL,
timestamp_created TIMESTAMP NOT NULL
)
ENGINE = CONNECT,
TABLE_TYPE = JSON,
HTTP = 'https://api.releases.hashicorp.com/v1/releases/consul'
;
- is_prerelease is the column that contains the original value returned by the API. It is defined as an invisible column.
- is_prerelease has a default value. This is required for invisible columns, but it has no effect for CONNECT JSON tables.
- is_stable is a generated column, which contains the result of an expression (NOT is_prerelease).
SELECT * FROM consul_release will return the is_stable column, but not the is_prerelease column. It will still be possible to retrieve this column by naming it explicitly:
SELECT *, is_prerelease FROM consul_release;
Example 6: Querying a details API method with arguments
Many REST APIs have two types of GET methods:
- To retrieve an array of objects (list);
- To retrieve a single object (details).
Above we used two methods of the first type. Now we’re approaching a method of the second type.
And we have a problem: we have a single parametrised URL to call for any object we want to see. But for CONNECT there’s no such thing as parameters in URLs, so every API call should be handled by a different table. And we don’t want to have a CONNECT table for every release of every HashiCorp product.
What can we do about it? Well, at a low level we’ll actually have those tables. But we can hide the mechanism that creates and queries the table inside a stored procedure.
DELIMITER ||
CREATE OR REPLACE PROCEDURE select_consul_release(IN i_release VARCHAR(100))
NOT DETERMINISTIC
MODIFIES SQL DATA
BEGIN
DECLARE v_sql TEXT;
DECLARE v_table_name VARCHAR(64) DEFAULT CONCAT('consul_release_', CONNECTION_ID());
SET v_sql := SFORMAT('
CREATE OR REPLACE TABLE {} (
version VARCHAR(100) NOT NULL,
state VARCHAR(100) NOT NULL
FIELD_FORMAT = ''status.state'',
is_prerelease BOOL NOT NULL DEFAULT 0 INVISIBLE,
is_stable BOOL GENERATED ALWAYS AS (NOT is_prerelease) VIRTUAL,
timestamp_created TIMESTAMP NOT NULL
)
ENGINE = CONNECT,
TABLE_TYPE = JSON,
HTTP = ''https://api.releases.hashicorp.com/v1/releases/consul/{}''
;
', v_table_name, i_release);
EXECUTE IMMEDIATE v_sql;
EXECUTE IMMEDIATE CONCAT('SELECT * FROM ', v_table_name, ';');
EXECUTE IMMEDIATE CONCAT('DROP TABLE ', v_table_name, ';');
END;
||
DELIMITER ;
This procedure creates a CONNECT table from a REST URL that contains the desired argument (the release name). Then it runs a SELECT to return its contents, and destroys the table. If multiple connections use this procedure concurrently, they won’t interfere with each other, because the table name contains the current connection id.
Alternatively, you might want multiple connections to share existing CONNECT tables. To do so, use CREATE TABLE IF NOT EXISTS (rather than OR REPLACE) and use the release name as part of the table name (rather than the connection id).
Proof that it works:
> CALL select_consul_release('1.17.2');
+---------+-----------+-----------+---------------------+
| version | state | is_stable | timestamp_created |
+---------+-----------+-----------+---------------------+
| 1.17.2 | supported | 1 | 2024-01-23 20:23:21 |
+---------+-----------+-----------+---------------------+
These procedures won’t work in MySQL, because they use the following MariaDB-only features:
- The CONNECT engine, obviously;
- CREATE OR REPLACE TABLE, which is similar to DROP TABLE + CREATE TABLE, but atomic and less verbose;
- SFORMAT(), to compose strings by interpolation (see How to compose strings in MariaDB);
- EXECUTE IMMEDIATE, which has some limitations but can normally be used as a shortcut for PREPARE + EXECUTE + DEALLOCATE PREPARE.
Materialising a CONNECT table
We mentioned before that a CONNECT table can be materialised. This means that it becomes a regular table, which normally uses the InnoDB storage engine, and the remote data is written to the local database.
One-off materialisation
The simplest and fastest way to materialise a CONNECT table is the following:
ALTER TABLE table_name ENGINE = InnoDB;
RENAME TABLE table_name TO new_name;
I don’t like to give numbers that might depend on a lot of factors, but in most cases it will take less than 5 seconds, even for tables that contain a few million rows.
Incremental materialisation
Your logic might be more complex than this. Maybe you want to incrementally import data from a remote API every night. This means that you can’t create a new table every time.
The first time you import data, you will need to create the table with the current data. You can use the method shown above for a one-off materialisation.
Incremental imports can be done in this way:
INSERT INTO materialised_table
(column1, column2, column3)
SELECT column1, column2, column3 FROM connect_table
;
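If the API returns rows you have already imported, a plain INSERT ... SELECT would duplicate them. A sketch of a de-duplicating import, assuming version uniquely identifies a row and is covered by a unique key on the target table:

```sql
-- Re-import only the releases we don't have yet.
-- Requires a UNIQUE or PRIMARY KEY on materialised_table.version:
-- rows whose key already exists are silently skipped.
INSERT IGNORE INTO materialised_table
    (version, state, timestamp_created)
SELECT version, state, timestamp_created
FROM connect_table;
```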
JSON file
I left this method as the last one because it’s rarely useful. But you might want to know about it if you need a JSON file, and anyway it tells us something about how CONNECT works internally.
You might have noticed that all the above CONNECT tables use the JSON TABLE_TYPE. This table type was originally designed to use JSON files as relational tables. To do that, one has to specify a file name:
CREATE TABLE table_name
...
ENGINE = CONNECT,
TABLE_TYPE = JSON,
FILE_NAME = 'filename.json';
If the file exists in the data directory, it will be used. If not, it will be created.
But even if we didn’t specify a file name, one was created anyway. If you ran the CREATE TABLE statements above, you probably noticed warnings like this:
*************************** 1. row ***************************
Level: Warning
Code: 1105
Message: No file name. Table will use hashicorp_product.json
And if you look in the data directory, in the database subdirectory, you’ll see this:
# ls -l /var/lib/mysql/test/hashicorp_product.*
-rw-rw---- 1 mysql mysql 907 Jan 27 14:23 /var/lib/mysql/test/hashicorp_product.frm
-rw-rw---- 1 mysql mysql 0 Jan 27 14:23 /var/lib/mysql/test/hashicorp_product.json
Initially the file is empty. After we run a query, it will be filled with the output received from the REST API, regardless of any WHERE clause in our query:
# cat hashicorp_product.json
["athena-cli","atlas-upload-cli",...]
So when we query a CONNECT JSON table, CONNECT checks the HTTP table option. If it was specified, CONNECT calls that URL and populates (or overwrites) the JSON file. The rest of the operations are performed as if it were a regular file-based CONNECT table.
Limitations
CONNECT JSON tables have some important limitations:
- Only the GET method is supported. This is theoretically correct because only GET should be used to read data, but in practice some APIs might require a different method.
- Some APIs require an additional call for authorisation. CONNECT has no way to make an additional call.
- There is no way to specify an HTTP(S) header to add to the requests. This is often required for authorisation: you call an authorisation method, you receive back a token valid for a period of time, and you add that token to the next requests in the Authorization HTTP header. But these features are independent: tokens are not always passed in an HTTP header, and HTTP headers have other uses, too.
- There is no way to see the response HTTP code, nor any other response headers (useful for redirects and errors).
- If a token or an API key is written in the URL, it will be readable with the SHOW CREATE TABLE command or by querying the information_schema.TABLES table.
- Every URL is unique, even if it contains parameters. We can’t use the WHERE clause to modify the actual URL that is called.
- Some APIs have a usage limit. For example they might accept no more than 10 calls per hour, or 100 per day. While the contents of an API can be materialised as discussed above, it would be very handy if CONNECT could cache results of an API call for a period of time.
Some of these limitations can be removed or worked around by implementing the necessary features in stored procedures. I will demonstrate this in a future article.
Conclusions
The CONNECT storage engine allows easy data integration between MariaDB and other technologies. In this article we saw how to use CONNECT to query a REST API. We discussed several examples, progressively more complex, which cover the majority of the problems that you might encounter when working with CONNECT and a REST API.
Should you have questions, or problems with CONNECT, don’t hesitate to contact us for a consultation.
Federico Razzoli