You might know that we maintain a list of Awesome Innovative Databases on GitHub. At the time of this writing, the first item in the list of active projects is curious: Dolt. If English isn’t your mother tongue, you might not know that it is a synonymous for idiot or… git. Someone told me that probably Linus chosen the name git because of its assonance with get and didn’t consider that it is also an insult. But anyway, Dolt’s README states: Dolt is Git for Data!
I have played a bit with it to get an idea of what it can do, how well it works, and when it can be useful. I don’t have all the answers yet: before using it in a project I’d probably need to investigate a bit more. But here you can see the results of my tests and my early thoughts.
Setup
For a simple functional test I wanted a basic, standard setup. I used the Docker image, with its default configuration.
A first look at the architecture
Dolt uses the MySQL wire protocol, MySQL query dialect, and some MySQL-like features. You’ll see many similarities, don’t be surprised. A variant called DoltgreSQL exists, based on PostgreSQL wire protocol and dialect. But we won’t discuss it here.
To start dolt we run the dolt
binary or service. This is a binary that allows to run all the operations we need. It doesn’t come with any additional tool.
Once started, it uses the following files:
# lsof -p 7
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dolt 7 root cwd DIR 253,2 4096 4210780 /var/lib/dolt
dolt 7 root rtd DIR 0,63 4096 4210768 /
dolt 7 root txt REG 0,63 103071352 4210698 /usr/local/bin/dolt
dolt 7 root 0u CHR 1,3 0t0 5 /dev/null
dolt 7 root 1w FIFO 0,14 0t0 22734034 pipe
dolt 7 root 2w FIFO 0,14 0t0 22734035 pipe
dolt 7 root 3u REG 0,63 0 4210860 /tmp/c0d108e6-b640-4808-a706-3e8581777d9e
dolt 7 root 4u a_inode 0,15 0 168 [eventpoll]
dolt 7 root 5u a_inode 0,15 0 168 [eventfd]
dolt 7 root 6u IPv6 22734313 0t0 TCP *:mysql (LISTEN)
dolt 7 root 7u IPv4 22734319 0t0 TCP 624ef13dbfdd:54628->ec2-54-185-144-36.us-west-2.compute.amazonaws.com:https (ESTABLISHED)
I didn’t test DoltHub.
Dolt has some system databases:
mysql
(only a few tables);information_schema
(all tables).
MySQL commands that are used by clients are implemented, including:
SET SESSION
,SET GLOBAL
,SET SESSION
, SET,@@global
,@@session
,@@
;schema()
anddatabase()
funcitons;SHOW [SESSION | GLOBAL] VARIABLES
;SHOW [SESSION | GLOBAL] STATUS
;- SHOW commands (though some of them return an empty set).
Most server variables that I tested do nothing. Dolt has its own settings, though.
Most status variables don’t seem to be used. There are exceptions, like the Com_*
variables.
This seems to be enough. To test SQL queries, I used the built-in CLI client by starting it with dolt sql
. But I also tried the mariadb
CLI client and mycli, and they worked allright.
SQL support
Dolthub forked Vitess, and removed most of the code from it. Parts of Vitess are used to parse SQL queries, implement types, and more. Vitess is a MySQL-based distributed DBMS that implements sharding natively. But Dolt doesn’t use it for this purpose.
The execution of parsed SQL queries is implemented as a different project: go-mysql-server, which describes itself as a MySQL-compatible database engine written in pure Go. For each command, there is also a wrapper in Dolt.
Support is missing or problems were found at least for the following syntaxes:
- MariaDB specific features: the target here seems to be MySQL 8 compatibility;
- Window functions;
WITH
clause;LATERAL
joins;GROUP BY ... WITH ROLLUP
;- Multi-table
DELETE
s; - The
UNKNOWN
keyword; - Views are not writable;
HELP
command.
If you are interested in Dolt but any of these syntaxes is important for you, you may file an issue on GitHub.
Stored procedures, triggers and events are supported.
Details about missing features and bugs
These queries from the MariaDB KnowledgeBase return different results, in terms of row type (which results in a loss of information) and row order:
SELECT name, test, score, AVG(score) OVER (PARTITION BY test)
AS average_by_test FROM student;
SELECT name, test, score, AVG(score) OVER (PARTITION BY name)
AS average_by_name FROM student;
Queries that result in an infinite loop crash Dolt:
WITH RECURSIVE cte (depth, from_, to_) AS (
SELECT 0,1,1 UNION DISTINCT SELECT depth+1, t1.from_, t1.to_
FROM t1, cte WHERE t1.from_ = cte.to_
)
SELECT * FROM cte;
This LATERAL
join from the MySQL documentation fails saying that the salesperson table doesn’t exist (which is false):
SELECT
salesperson.name,
max_sale.amount,
max_sale.customer_name
FROM
salesperson,
-- find maximum size and customer at same time
LATERAL
(SELECT amount, customer_name
FROM all_sales
WHERE all_sales.salesperson_id = salesperson.id
ORDER BY amount DESC LIMIT 1)
AS max_sale;
Playing with data versioning
Let’s see how to version data with Dolt.
CLI commands and stored procedures
Dolt provides two ways to version data. We can use the dolt
binary with arguments that are the same as git
parameters. Or we can use Dolt system stored procedures, which accept the same arguments, passed as strings. Actually dolt
subcommands are just bridges to call the stored procedures. It is insecure and unpractical for an application to run shell commands in the database server, unless it’s local, so we’re going to use stored procedures instead.
Dolt stored procedures are written in Go, and can be found in the go/libraries/doltcore/sqle/dprocedures
directory. Go stored procedures are supported by go-mysql-server. It should be possible to easily add more Go procedures.
Basic git-like operations
Data is not versioned by default. To start versioning data, the first thing to do is to create a branch. Note that a branch exists inside a database.
When a database is created, it contains a branch called main. Let’s create a database, and see the automatically created branch:
CREATE DATABASE sales;
USE sales;
> SELECT * FROM dolt_branches \G
*************************** 1. row ***************************
name: main
hash: h720ftbdefm9elt8acr54qjkafmi75i3
latest_committer: Dolt System Account
latest_committer_email: do******@do*****.com
latest_commit_date: 2024-12-26 21:17:50
latest_commit_message: Initialize data repository
remote:
Now let’s create some data, let’s add it to the staging area, and commit:
CREATE TABLE seller (
id INTEGER UNSIGNED AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) NOT NULL
);
INSERT INTO seller (id, name) VALUES
(1, 'John Smith')
, (2, 'Hellen McColl')
, (3, 'May Carson')
;
CREATE TABLE sale (
id INTEGER UNSIGNED AUTO_INCREMENT PRIMARY KEY,
seller_id INTEGER UNSIGNED NOT NULL,
date DATE NOT NULL,
amount INTEGER UNSIGNED NOT NULL
);
INSERT INTO sale (seller_id, date, amount) VALUES
(1, '2024-06-01', 50)
, (1, '2024-06-02', 50)
, (2, '2024-06-01', 100)
, (2, '2024-06-02', 100)
, (2, '2024-06-02', 200)
, (3, '2024-06-01', 75)
, (3, '2024-06-02', 100)
, (3, '2024-06-02', 100)
;
> CALL dolt_add('-A');
+--------+
| status |
+--------+
| 0 |
+--------+
> CALL DOLT_COMMIT('-m', 'Initial data');
+----------------------------------+
| hash |
+----------------------------------+
| 69gje0coiqvg9rfr197e5usmk3nvfmcc |
+----------------------------------+
Let’s also see some stats (number of sales and amount earned by seller and date):
> SELECT
seller.name, sale.date,
COUNT(*) AS count, SUM(amount) AS total_amount
FROM seller
LEFT JOIN sale
ON seller.id = sale.seller_id
GROUP BY seller.name, sale.date
ORDER BY seller_id, date
;
+---------------+------------+-------+--------------+
| name | date | count | total_amount |
+---------------+------------+-------+--------------+
| John Smith | 2024-06-01 | 1 | 50 |
| John Smith | 2024-06-02 | 1 | 50 |
| Hellen McColl | 2024-06-01 | 1 | 100 |
| Hellen McColl | 2024-06-02 | 2 | 300 |
| May Carson | 2024-06-01 | 1 | 75 |
| May Carson | 2024-06-02 | 2 | 200 |
+---------------+------------+-------+--------------+
Now let’s delete one of the sellers. We’ll have to stage the changes again, then we can commit.
DELETE FROM seller WHERE id = 1;
DELETE FROM sale WHERE seller_id = 1;
> CALL dolt_add('-A');
+--------+
| status |
+--------+
| 0 |
+--------+
> CALL dolt_commit('-m', 'Remove John Smith');
+----------------------------------+
| hash |
+----------------------------------+
| j9rmqfdoi50dplc6r6c5rinsmbkmd13g |
+----------------------------------+
Let’s see the commits:
> SELECT * FROM dolt_commits ORDER BY date;
+----------------------------------+---------------------+----------------------+---------------------+----------------------------+
| commit_hash | committer | email | date | message |
+----------------------------------+---------------------+----------------------+---------------------+----------------------------+
| h720ftbdefm9elt8acr54qjkafmi75i3 | Dolt System Account | do******@do*****.com | 2024-12-26 21:17:50 | Initialize data repository |
| 69gje0coiqvg9rfr197e5usmk3nvfmcc | rt | rt@% | 2024-12-26 21:20:44 | Initial data |
| j9rmqfdoi50dplc6r6c5rinsmbkmd13g | rt | rt@% | 2024-12-26 21:25:56 | Remove John Smith |
+----------------------------------+---------------------+----------------------+---------------------+----------------------------+
The first commit was made automatically when we created the database, so the committer is Dolt System Account
. Later commits were made by rt
, which is the account I used to login. I haven’t configured an email because I don’t intend to use DoltHub, so the committer’s email was set automatically.
Let’s see how stats have changed after our last commit:
> SELECT
seller.name, sale.date,
COUNT(*) AS count, SUM(amount) AS total_amount
FROM seller
LEFT JOIN sale
ON seller.id = sale.seller_id
GROUP BY seller.name, sale.date
ORDER BY seller_id, date
;
+---------------+------------+-------+--------------+
| name | date | count | total_amount |
+---------------+------------+-------+--------------+
| Hellen McColl | 2024-06-01 | 1 | 100 |
| Hellen McColl | 2024-06-02 | 2 | 300 |
| May Carson | 2024-06-01 | 1 | 75 |
| May Carson | 2024-06-02 | 2 | 200 |
+---------------+------------+-------+--------------+
Then we can check how data changed in the seller
table:
> SELECT * FROM dolt_history_seller;
+----+---------------+----------------------------------+-----------+---------------------+
| id | name | commit_hash | committer | commit_date |
+----+---------------+----------------------------------+-----------+---------------------+
| 2 | Hellen McColl | j9rmqfdoi50dplc6r6c5rinsmbkmd13g | rt | 2024-12-26 21:25:56 |
| 3 | May Carson | j9rmqfdoi50dplc6r6c5rinsmbkmd13g | rt | 2024-12-26 21:25:56 |
| 1 | John Smith | 69gje0coiqvg9rfr197e5usmk3nvfmcc | rt | 2024-12-26 21:20:44 |
| 2 | Hellen McColl | 69gje0coiqvg9rfr197e5usmk3nvfmcc | rt | 2024-12-26 21:20:44 |
| 3 | May Carson | 69gje0coiqvg9rfr197e5usmk3nvfmcc | rt | 2024-12-26 21:20:44 |
+----+---------------+----------------------------------+-----------+---------------------+
We can run various queries to see how data evolved between two commits.
To find deleted rows:
-- deleted rows
SELECT old.*
FROM (
SELECT *
FROM dolt_history_seller
WHERE commit_hash = '69gje0coiqvg9rfr197e5usmk3nvfmcc'
) old
LEFT JOIN (
SELECT *
FROM dolt_history_seller
WHERE commit_hash = 'j9rmqfdoi50dplc6r6c5rinsmbkmd13g'
) new
ON old.id = new.id
WHERE new.id IS NULL
;
+----+------------+----------------------------------+-----------+---------------------+
| id | name | commit_hash | committer | commit_date |
+----+------------+----------------------------------+-----------+---------------------+
| 1 | John Smith | 69gje0coiqvg9rfr197e5usmk3nvfmcc | rt | 2024-12-26 21:20:44 |
+----+------------+----------------------------------+-----------+---------------------+
Get the number of rows per commit, and the number of added/deleted rows compared to the previous commit:
> SELECT
commit_id, row_count,
row_count - LAG (row_count, 1) OVER (ORDER BY commit_id) AS count_delta
FROM (
SELECT
CONCAT(commit_date, '.', commit_hash) AS commit_id,
COUNT(*) AS row_count
FROM dolt_history_seller
GROUP BY commit_id
ORDER BY commit_id
) v
;
+----------------------------------------------------------+-----------+-------------+
| commit_id | row_count | count_delta |
+----------------------------------------------------------+-----------+-------------+
| 2024-12-26 21:20:44.499.69gje0coiqvg9rfr197e5usmk3nvfmcc | 3 | NULL |
| 2024-12-26 21:25:56.011.j9rmqfdoi50dplc6r6c5rinsmbkmd13g | 2 | -1 |
+----------------------------------------------------------+-----------+-------------+
Recovering data
Soft and hard DROP DATABASE
With DROP DATABASE
, we perform a database safe drop:
DROP DATABASE sales;
But if we made at least one commit, we can still recover the database and all its data:
> CALL dolt_undrop('sales');
+--------+
| status |
+--------+
| 0 |
+--------+
> SHOW TABLES IN sales;
+-----------------+
| Tables_in_sales |
+-----------------+
| sale |
| seller |
+-----------------+
If we want to hard-drop a database, we can do it in this way:
DROP DATABASE sales;
CALL dolt_purge_dropped_databases();
Reverting a commit
We can revert the last commit, or a certain number of most recent commits, with dolt_revert()
:
> SELECT * FROM dolt_commits ORDER BY date;
+----------------------------------+---------------------+----------------------+---------------------+----------------------------+
| commit_hash | committer | email | date | message |
+----------------------------------+---------------------+----------------------+---------------------+----------------------------+
| jcg4ugb85u5nkn2fel3qalq4b3a4c7b5 | Dolt System Account | do******@do*****.com | 2024-12-28 04:09:04 | Initialize data repository |
| nobdoc5ifbo1gb1ssn9olcmlehc4abl5 | rt | rt@% | 2024-12-28 04:09:39 | Initial data |
| t4ihnlgv7cgsmpc39vg7hf4u311fss5r | rt | rt@% | 2024-12-28 04:11:16 | Remove John Smith |
| 0k0mhk21mljgbn5iqu2t374tkgvjo7kk | rt | rt@% | 2024-12-28 04:12:28 | Remove Hellen |
+----------------------------------+---------------------+----------------------+---------------------+----------------------------+
> CALL dolt_revert('HEAD~1');
+--------+
| status |
+--------+
| 0 |
+--------+
> SELECT * FROM dolt_commits ORDER BY date;
+----------------------------------+---------------------+----------------------+---------------------+----------------------------+
| commit_hash | committer | email | date | message |
+----------------------------------+---------------------+----------------------+---------------------+----------------------------+
| jcg4ugb85u5nkn2fel3qalq4b3a4c7b5 | Dolt System Account | do******@do*****.com | 2024-12-28 04:09:04 | Initialize data repository |
| nobdoc5ifbo1gb1ssn9olcmlehc4abl5 | rt | rt@% | 2024-12-28 04:09:39 | Initial data |
| t4ihnlgv7cgsmpc39vg7hf4u311fss5r | rt | rt@% | 2024-12-28 04:11:16 | Remove John Smith |
| 0k0mhk21mljgbn5iqu2t374tkgvjo7kk | rt | rt@% | 2024-12-28 04:12:28 | Remove Hellen |
| n003l5u55bge2q3eap5u9gfgsim7nmaf | rt | rt@% | 2024-12-28 04:16:12 | Revert "Remove John Smith" |
+----------------------------------+---------------------+----------------------+---------------------+----------------------------+
Similarly to git, to revert two commits specify HEAD~2
.
Undo uncommitted changes
We can undo uncommitted changes (reset to the last commit) with dolt_reset()
:
> SELECT * FROM seller;
+----+------------+
| id | name |
+----+------------+
| 1 | John Smith |
+----+------------+
> CALL dolt_reset('--hard');
+--------+
| status |
+--------+
| 0 |
+--------+
> SELECT * FROM seller;
+----+------------+
| id | name |
+----+------------+
| 1 | John Smith |
| 3 | May Carson |
+----+------------+
Importing and exporting data
It is unlikely that someone uses Dolt as their main or only database. It is more likely for a company to use MariaDB or MySQL for persistence, and move data to Dolt to analyse how data change over time, make predictions about the future, and even make hypothesis-based prediction. Predicted data, or data subject to hypothesis, can also be moved back to MariaDB or MySQL to be able to analyse it using more advanced SQL. Let’s see how we can exchange data between MariaDB/MySQL and Dolt.
Importing a MariaDB dump
I created the above data into MariaDB. Then I tried to import data from MariaDB to Dolt in this way:
mariadb -hdolt -urt --skip-ssl | mariadb-dump -pproot --databases sales --single-transaction
For some reason, the above command hangs instead of reporting an error. When I tried to use a file instead of piping mariadb-dump
into mariadb
, I got this error:
ERROR 1105 (HY000) at line 23: Unknown collation: utf8mb4_uca1400_ai_ci
This is the default collation in MariaDB, but Dolt doesn’t support it. So, I need to change the default collation to utf8mb3_unicode_ci
and make sure that the character set is utf8mb4
:
mariadb-dump -pproot --databases sales --single-transaction | \
sed -e 's/utf8mb4_uca1400_ai_ci/utf8mb3_unicode_ci/g' | \
sed -e 's/utf8mb3/utf8mb4/g' | \
mariadb -hdolt -urt --skip-ssl
--skip-ssl
isn’t necessary if you configured Dolt to use SSL, but I didn’t do so.
Now let’s verify that the tables were successfully recreated in Dolt:
# mariadb -hdolt -urt --skip-ssl -e "SHOW TABLES FROM sales;" -ss
sale
seller
It worked!
Replication
Dolt states it supports replication between Dolt servers, as well as using MySQL or MariaDB as a master, or as a replica. However, my attempts to use MariaDB 11.4 or MySQL 8.4 as a master failed. In the case of MariaDB, I got a received unknown event
error, probably because Dolt doesn’t understand MariaDB GTID format. In the case of MySQL, the value of Replica_IO_Running was stuck to Connecting
, even if I was able to connect MySQL via the client using replication credentials.
It’s possible that, if I dedicated more time to this, I would have been able to solve the problems. So the fact that I failed doesn’t mean that replication from MySQL/MariaDB to Dolt is impossible. Just be aware that it might not be as straightforward as the documentation seems to suggest.
MariaDB CONNECT
MariaDB CONNECT is a storage engine that allows to connect to remote data sources and read or write to them as if they were local tables. Examples of remote data sources include remote databases using ODBC or JDBC, as well as data files in many formats. See our articles about CONNECT.
For every supported data source, CONNECT has a table type. In our case, we’ll use the MySQL table type to connect to Dolt.
First, let’s connect to Dolt and create a use to allow CONNECT to query the sales
database:
CREATE USER 'mariadb'@'%' IDENTIFIED BY 'DoggyDogWorld';
GRANT ALL PRIVILEGES ON sales.* TO 'mariadb'@'%';
Here I saved you from a problem I encountered. The good practice would be to only grant the SELECT
and SHOW VIEW
privileges, the latter being necessary for the discovery of a view (more on this in a moment). Unfortunately, with these permissions Dolt returns an error. At least another permission is necessary in Dolt but, to save time for this simple test, I just granted the user all privileges.
Now let’s go back to MariaDB and create a CONNECT table:
CREATE DATABASE sales_bridge;
USE sales_bridge;
CREATE OR REPLACE TABLE sales_by_date
ENGINE = CONNECT
TABLE_TYPE = MYSQL
CONNECTION='mysql://mariadb:DoggyDogWorld@dolt/sales/sales_by_date'
;
This works for the simplest cases, where we only want to read from the main
branch. The CONNECTION option provides an URL for the view that we want to link to our CONNECT table. In most cases, we don’t need to specify the table columns: CONNECT will discover columns automatically. Specifying columns is only necessary where autodiscovery fails, when the column types are not optimal, or when we want to exclude some columns.
Let’s test our query:
> SELECT * FROM sales_by_date;
+------------+-------+--------------+
| date | count | total_amount |
+------------+-------+--------------+
| 2024-06-01 | 3 | 225 |
| 2024-06-02 | 5 | 550 |
+------------+-------+--------------+
It works! As anticipated, this query reads data from the main
branch.
But what if we want to create two tables, for different branches? In theory, all we have to do is to specify the database in the form database/branch
, for example sales/delete-smith
. But /
is a special character, and I couldn’t find a way to escape it. I also tried an older CONNECT syntax, which consists in passing the database credentials to option_list
, but the character still needed to be escaped, and I couldn’t do it. The solution I’ve found is to use SRCDEF
to pass a query, rather than just a table name. More precisely, I’m passing to it two queries: the USE
query to choose the default database and branch, and a SELECT
query to read the view data.
CREATE OR REPLACE TABLE sales_by_date__main
ENGINE = CONNECT
TABLE_TYPE = MYSQL
CONNECTION='mysql://mariadb:DoggyDogWorld@dolt/sales'
SRCDEF='SELECT * FROM `sales/main`.sales_by_date'
;
> SELECT * FROM sales_by_date__main;
+------------+-------+--------------+
| date | count | total_amount |
+------------+-------+--------------+
| 2024-06-01 | 3 | 225 |
| 2024-06-02 | 5 | 550 |
+------------+-------+--------------+
CREATE OR REPLACE TABLE sales_by_date__delete_smith
ENGINE = CONNECT
TABLE_TYPE = MYSQL
CONNECTION='mysql://mariadb:DoggyDogWorld@dolt/sales'
SRCDEF='SELECT * FROM `sales/delete-smith`.sales_by_date'
;
> SELECT * FROM sales_by_date__delete_smith;
+------------+-------+--------------+
| date | count | total_amount |
+------------+-------+--------------+
| 2024-06-01 | 2 | 175 |
| 2024-06-02 | 4 | 500 |
+------------+-------+--------------+
Here you can see how to use Dolt in practice for hypothesis testing:
- We have some base data in the
main
branch; - We have a view called
sales_by_date
, and potentially other views, showing some statistics we are interested in; - We have a
delete-smith
branch, where one of the sellers was deleted, as well as the sales he closed; - By switching between these two branches and querying
sales_by_date
, we can see how the removal of a seller affected sales. - CONNECT allows us to create a table for each branch, and join
sales_by_date
to other tables or views, using more advanced features and a more advanced SQL dialect.
Conclusions
In this article, we moved some first steps with Dolt. We learnt how to run basic operations, like creating branches and committing data changes. We couldn’t setup replication, but we imported data from a dump, and we were able to make MariaDB and Dolt “talk” to each other using CONNECT. The last part shows an interesting example on how to use MariaDB for data persistent, Dolt for hypothesis creation, and finally query Dolt data from MariaDB, taking advantage of its richer SQL support.
If there is interest, we can move forward and investigate more specific aspects:
- DoltgreSQL a Dolt variant that connects to PostgreSQL, instead of MySQL and MariaDB;
- Put together a library of queries to examine how data changes over time;
- How to create vector indexes in MariaDB from hypothesis that are stored in Dolt;
- How to use MindsDB to connect Dolt and run predictions based on hypothesis.
Let me remind you that we cover a variety of different databases, and are experts in fields like data integration. If you’re interested in understanding if Dolt fits your use case and can help you run complex analysis in a simple way, please contact us for a consultation.
Federico Razzoli
0 Comments