It’s January 2025, so it’s a good time to look back, and write my thoughts on the most important events and trends of 2024 in the database world.
Would you like to share your thoughts? I’d be happy to read your comments.
How does MariaDB compare to MySQL?
MariaDB Foundation sponsored benchmarks by a well-known, independent consultant, Mark Callaghan. Mark worked for Informix, Oracle, Google, Facebook, MongoDB, Rockset among others. The results of his benchmarks have been published in an article titled Sysbench on a medium server: MariaDB is faster than MySQL.
Marko Mäkelä from MariaDB published Recent InnoDB Enhancements that Boost Performance in MariaDB Server to explain some of the MariaDB improvements that made this result possible.
However, Mark Callaghan pointed out that a cause of MySQL disappointing performance in his benchmark is regression bugs. While he reported bugs to MySQL (as he’s been doing for many years now), he recently published another remarkable post: MySQL 8.0.40 does not fix the regressions I hoped it would fix.
It’s also worth pointing out MySQL bugs that shouldn’t have been in 8.0, which has been Generally Available since 8.0.11, and is the current bugfix version since 8.0.34. Of course, bugs can always make it to any version, but very bad bugs should be rare. Here’s my list of the worst bugs:
- Bug 115517, made private for unspecified reasons, prevents MySQL from starting if you have more than 8,000 tables (not uncommon if you’re a SaaS provider, maybe even a hosting provider). More details in PS-9306, on Percona Server bug tracker.
ALTER TABLE ... ALGORITHM=INSTANT
caused a table corruption that was impossible to recover. See this article from Percona website. Mentioned bugs were made private for unspecified reasons.- The version where this bug was introduced was silently removed by AWS. If you already used it, you could not have replicas with the same version.
- Bug 115608: Inplace ALTER TABLE might cause lost rows if concurrent purge is another dangerous bug, as it can cause data loss.
However, if you use HeatWave, MySQL cloud offer, you’ll benefit from a better quality assurance, better performance and some interesting features, such as vector indexes and JavaScript stored procedures.
Awesome MariaDB and Awesome Innovative Databases
In the past year, we started two Awesome lists that I want to tell you about, as you might find valuable information in them.
The Awesome MariaDB list is a project sponsored by the MariaDB Foundation. It consists of various sublists for different audiences:
- Awesome MariaDB for Database Administrators
- Awesome MariaDB for Developers
- Awesome MariaDB for Data Engineers
- Awesome MariaDB Internals and Plugins Development
You can read the announcement.
We started an Awesome Innovative Databases list. This is the result of the constant research we do at Vettabase, to find the best solutions for customers who need innovative technologies.
New versions
Let me highlight some great new releases we got in 2024.
Prometheus 3
As the announcement says, this new release took seven years! It has many exciting features, like Remote Writes API optimisations, richer interoperability with OpenTelemetry, and better (but still experimental) histograms.
However, I want to highlight a particular feature that shows the advantages of the Open Source model: PromLens. This technology was developed by PromLabs and Chronospher, and then donated to the Prometheus project. Thanks to this donation, Prometheus 3 has a fresher, more modern UI. The new interface is enabled by default, but in case of problems it’s still possible to use the old one. I don’t expect the old interface to be maintained forever.
Note that Prometheus 3 introduces breaking changes to the configuration and to PromQL.
PostgreSQL 17
PostgreSQL is expanding its user base, partly thanks to MySQL’s users loss. PostgreSQL 17 was released in September. Let me highlight the most important improvements it has in my opinion, and… the improvements it needs, and doesn’t have.
There are improvements to the IO layer, which make WAL logging and sequential scans faster. Vacuum has also been made faster.
Logical replication now has failover. There is a new tool, pg_createsubscriber, to convert an existing physical standby into a logical replica. Version upgrades using logical replication have been made simpler.
As much as I like PostgreSQL and I welcome PostgreSQL 17, I’d like to see part of its development following a different direction. Optimising vacuum, is an improvement, but vacuum should not be necessary at all. It’s the consequence of choices made in the 80’s, when PostgreSQL development started, and scalability wasn’t needed at all. But nowadays, PostgreSQL should invest more efforts in storage engines support, and develop a modern storage engine that updates rows in-place. This would eliminate the biggest reasons why vacuum is necessary.
Cassandra 5
Apache Cassandra 5.0 was released in September.
JDK version 17 is now supported.
The most important improvement is, in my opinion, Storage-Attached Indexing (SAI). They replace an older feature, Secondary Indexes, which should not be used in production. SAI indexes are local to each node. When the cluster bootstraps or a node is decommissioned, SAI indexes are streamed between nodes, they are not rebuilt by other nodes separately. For each SAI index, two separate structures are created: a Memtable index, and a SSTable index. Data in these structures may frequently conflict with each other, and conflicts are solved during the read path. Each SAI index contains the indexed columns and a reference to the primary key entries, as it happens for secondary indexes in other DBMSs. SAI indexes speed up most read operations against string and numeric data.
SAI indexes also support indexing of vector data, and are able to perform ANN searches.
Dynamic Data Masking (DDM) is also supported. Data is stored normally in Cassandra tables, but is obfuscated on the fly during SELECT
s. Several methods to obfuscate data are supported. This includes nullification, type-sensitive replacement, replacement with a constant string, or even partial obfuscation. A column obfuscation method is normally defined in the CREATE TABLE
statement. Querying the original data is still possible for users that have the UNMASK
privilege.
Various other optimisations were introduced, including Tier Memtables and Tier SSTables.
Databases and AI
The base concept of AI is vectors. Objects analysed by AI are represented as multi-dimensional vectors having an impressively high number of dimensions. I believe that vectors with a million dimensions are common. Imagine these objects as geometrical vectors, in a space with many dimensions. An ANN search takes one vector in input, and returns the closest vectors found. “Closest” is a vague term here, because there are a lot of functions that can be used to calculate the distance between two vectors: the Euclidean distance, the cosine distance, and many more. Some are apparently exotic, such as the Manhattan distance, which returns a walking distance between two objects as if they were located in Manhattan (or any other city with a similar map). But every function is the best for some domains.
In practice, vectors can be almost anything. For example:
- Vectors can be face images, and can be used to implement facial recognition.
- Vectors can be words, and can be used to implement agents or bots.
- Vectors can represent a user behaviour, or products, and can be used to implement a recommendation engine.
To support vector searches, a database needs to support the following features:
- A
VECTOR
datatype; - Vector indexes;
- Vector distance functions.
Some databases were built specifically to work with vectors. Examples are TileDB and Pinecone. Some of them might focus on features and optimisations that are important when we work with vectors. But vectors can very well be implemented in a relational database, or in any other database type. Here we’ll focus on existing databases supporting vectors.
pgvector and PostgresML
PostgreSQL doesn’t support any feature related to AI, but it has two great extensions that add AI capabilities: pgvector and PostgresML.
pgvector adds support for a VECTOR
type with size. Columns of this type can be indexed. It also adds some additional syntax and functions for various distance functions, like Euclidean (L2), inner product, cosine distance, L1, Hamming distance, and Jaccard distance.
PostgresML also adds support for vectors and indexes. But it’s a more complete solution, that is also able to generate the embeddings, train the model, and tune it after the training.
MariaDB vectors
MariaDB has a more limited vector feature set, and at the time of this writing it hasn’t released an LTS version with vectors. It supports a VECTOR type with size, which can be indexed. It supports functions for the Euclidean distance (L2) and the cosine distance. MariaDB vector indexes use a variation of HNSW.
MariaDB published a github repository that shows how MariaDB vectors can be used to easily develop RAG applications. The tool generates the embeddings from the previously trained documentation contents, using an OpenAI model. When you query the script with a request in English (natural language) it will find the closest embedding, it will send it to OpenAI along with the requests, and will display a natural language response generated by OpenAI.
Mark Callaghan made a benchmark that shows that MariaDB vectors performance is remarkably higher than pgvector’s performance. See the part 1 and part 2 of his post.
MySQL and HeatWave vectors
HeatWave, MySQL cloud offer, supports vector searches.
MySQL 9.0 Community Edition supports the VECTOR
type and some vector functions, but vectors cannot be indexed.
Cassandra vectors
Cassandra 5 implemented vectors as part of CASSANDRA-18504. The VECTOR
type has a size, and its elements have a type as well (integers or, more commonly, floats). To be able to make a vector search, we need to index a VECTOR
column, and we can do that thanks to SAIs (discussed earlier in this post). Vectors can also be passed to, or returned from, user-defined functions. Supported distance functions are Euclidean (L2), cosine and dot product.
To create a vector index and then use it:
CREATE INDEX vec_doc_text
ON document
USING 'SAI'
WITH OPTIONS = {
'similarity_function': 'DOT_PRODUCT'
}
;
SELECT doc_title, doc_text
FROM document
ORDER BY doc_text ANN OF [0.07, 0.08, 0.09, 0.12, ... ]
LIMIT 1
;
MindsDB
It’s also worth mentioning MindsDB, a metadatabase for machine learning that we support. MindsDB can be queried via SQL (MySQL or PostgreSQL protocol), a REST API, or the MongoDB query language. Data is retrieved from remote sources, such as databases, REST APIs or data files. A huge variety of data sources is supported. We can run queries about the future. Yes, I mean exactly this: we can ask for data that is not and cannot be there yet, such as our sales in the South East of Asia next year, aggregated by month. MindsDB will use machine learning models to make predictions, and will answer with a normal dataset. It has default models, but we can choose to use a specific model, including paid or free remote services. MindsDB supports many different models.
MindsDB releases are very frequent. In 2024, we contributed some usability improvements to the MySQL integration, in the context of our partnership with MindsDB. The MindsDB team has also implemented many improvements to MindsDB itself, and added many integrations. It’s worth noting that they have added integrations with many object and file storage services, such as S3, Azure Blob Storage, MS OneDrive, and DropBox.
License Changes
The DBMS license change saga started some years ago and it has not finished yet. Let’s see the most important changes this year.
Redis became proprietary
Redis license changed from BSD-3 to Redis Source Available License. Immediately, two forks were created: Valkey, which is under the umbrella of the Linux Foundation, and Redict. Redis Ltd has also acquired Speedb, whose website has disappeared.
Salvatore Sanfilippo, Redis’ author, works for Redis Ltd again. His return might be good for Redis reputation. In a HackerNews comment, Salvatore (nickname: antirez) clarified that he doesn’t oppose Redis Ltd choices, nor any Redis fork. He wrote that he has chosen the BSD-3 license to make commercial forks possible. He also highlighted the importance of choosing a license mindfully. Software authors should not use licenses that permit actions that they will complain about.
Elasticsearch is Open Source again
Elasticsearch made the opposite change. After switching to a source available license in 2021, Elasticsearch became open source again in 2024. Now they use the GNU AGPL 3 license.
OpenSearch is an Elasticsearch fork created by Amazon as a response to the Elasticsearch license change in 2021. Amazon announced that OpenSearch is being donated to the Linux Foundation.
ScyllaDB becomes source available
In December, ScyllaDB has announced important changes. They will only develop the ScyllaDB Enterprise edition. The Open Source (AGPL) edition is being discontinued. The new ScyllaDB releases will be covered by a source available license. The AGPL is not being completely abandoned, though: it will be used for Scylla Manager, which is currently source available.
About the GNU AGPL
The GNU AGPL is recognised as Open Source and Free Software. Though, interestingly, it solves the same problems as source available licenses, posing more or less the same type of restrictions: if you modify the code and make it available as a cloud service (or, more generally, via a network), you are distributing it, so you need to distribute the modified source code as AGPL 3. It essentially extends the meaning of distribution by including network usage.
The reason why AGPL is less frequently used than source available licenses is that there are doubts about its solidity, from a legal perspective. However it is used by several software projects, and Elacticsearch’ decision might bring new trust to this license.
What will we see in the new year?
Short answer: I don’t know. Just like those who tell you exactly what will happen in 2025, I don’t have paranormal powers! 🙂
But I can tell you what I, and Vettabase, will look at closely:
- In-database AI. I’m not convinced that the current data structure, distance functions, etc, are the same that we’ll use in five years. I want to look closely at innovative technologies and market needs.
- MariaDB performance improvements and new features. I don’t see a bright future for MySQL, especially in its Community Edition, and MariaDB is the most credible replacement for existing MySQL databases.
- Awesome innovative databases, as I already mentioned. Some of those technologies look promising. I’m not looking to replace existing battle-tested technologies, but we can integrate them with some of these new projects.
Federico Razzoli
0 Comments