Schedule Meeting

a

A Review of 2024 in the Database World

by | Jan 21, 2025 | Other Databases, Cassandra, MariaDB, MySQL, PostgreSQL

Need Help?  Click Here for Expert Support

It’s January 2025, so it’s a good time to look back, and write my thoughts on the most important events and trends of 2024 in the database world.

Would you like to share your thoughts? I’d be happy to read your comments.

How does MariaDB compare to MySQL?

MariaDB Foundation sponsored benchmarks by a well-known, independent consultant, Mark Callaghan. Mark worked for Informix, Oracle, Google, Facebook, MongoDB, Rockset among others. The results of his benchmarks have been published in an article titled Sysbench on a medium server: MariaDB is faster than MySQL.

Marko Mäkelä from MariaDB published Recent InnoDB Enhancements that Boost Performance in MariaDB Server to explain some of the MariaDB improvements that made this result possible.

However, Mark Callaghan pointed out that a cause of MySQL disappointing performance in his benchmark is regression bugs. While he reported bugs to MySQL (as he’s been doing for many years now), he recently published another remarkable post: MySQL 8.0.40 does not fix the regressions I hoped it would fix.

It’s also worth pointing out MySQL bugs that shouldn’t have been in 8.0, which has been Generally Available since 8.0.11, and is the current bugfix version since 8.0.34. Of course, bugs can always make it to any version, but very bad bugs should be rare. Here’s my list of the worst bugs:

  • Bug 115517, made private for unspecified reasons, prevents MySQL from starting if you have more than 8,000 tables (not uncommon if you’re a SaaS provider, maybe even a hosting provider). More details in PS-9306, on Percona Server bug tracker.
  • ALTER TABLE ... ALGORITHM=INSTANT caused a table corruption that was impossible to recover. See this article from Percona website. Mentioned bugs were made private for unspecified reasons.
    • The version where this bug was introduced was silently removed by AWS. If you already used it, you could not have replicas with the same version.
  • Bug 115608: Inplace ALTER TABLE might cause lost rows if concurrent purge is another dangerous bug, as it can cause data loss.

However, if you use HeatWave, MySQL cloud offer, you’ll benefit from a better quality assurance, better performance and some interesting features, such as vector indexes and JavaScript stored procedures.

Awesome MariaDB and Awesome Innovative Databases

In the past year, we started two Awesome lists that I want to tell you about, as you might find valuable information in them.

The Awesome MariaDB list is a project sponsored by the MariaDB Foundation. It consists of various sublists for different audiences:

You can read the announcement.

We started an Awesome Innovative Databases list. This is the result of the constant research we do at Vettabase, to find the best solutions for customers who need innovative technologies.

New versions

Let me highlight some great new releases we got in 2024.

Prometheus 3

As the announcement says, this new release took seven years! It has many exciting features, like Remote Writes API optimisations, richer interoperability with OpenTelemetry, and better (but still experimental) histograms.

However, I want to highlight a particular feature that shows the advantages of the Open Source model: PromLens. This technology was developed by PromLabs and Chronospher, and then donated to the Prometheus project. Thanks to this donation, Prometheus 3 has a fresher, more modern UI. The new interface is enabled by default, but in case of problems it’s still possible to use the old one. I don’t expect the old interface to be maintained forever.

Note that Prometheus 3 introduces breaking changes to the configuration and to PromQL.

PostgreSQL 17

PostgreSQL is expanding its user base, partly thanks to MySQL’s users loss. PostgreSQL 17 was released in September. Let me highlight the most important improvements it has in my opinion, and… the improvements it needs, and doesn’t have.

There are improvements to the IO layer, which make WAL logging and sequential scans faster. Vacuum has also been made faster.

Logical replication now has failover. There is a new tool, pg_createsubscriber, to convert an existing physical standby into a logical replica. Version upgrades using logical replication have been made simpler.

As much as I like PostgreSQL and I welcome PostgreSQL 17, I’d like to see part of its development following a different direction. Optimising vacuum, is an improvement, but vacuum should not be necessary at all. It’s the consequence of choices made in the 80’s, when PostgreSQL development started, and scalability wasn’t needed at all. But nowadays, PostgreSQL should invest more efforts in storage engines support, and develop a modern storage engine that updates rows in-place. This would eliminate the biggest reasons why vacuum is necessary.

Cassandra 5

Apache Cassandra 5.0 was released in September.

JDK version 17 is now supported.

The most important improvement is, in my opinion, Storage-Attached Indexing (SAI). They replace an older feature, Secondary Indexes, which should not be used in production. SAI indexes are local to each node. When the cluster bootstraps or a node is decommissioned, SAI indexes are streamed between nodes, they are not rebuilt by other nodes separately. For each SAI index, two separate structures are created: a Memtable index, and a SSTable index. Data in these structures may frequently conflict with each other, and conflicts are solved during the read path. Each SAI index contains the indexed columns and a reference to the primary key entries, as it happens for secondary indexes in other DBMSs. SAI indexes speed up most read operations against string and numeric data.

SAI indexes also support indexing of vector data, and are able to perform ANN searches.

Dynamic Data Masking (DDM) is also supported. Data is stored normally in Cassandra tables, but is obfuscated on the fly during SELECTs. Several methods to obfuscate data are supported. This includes nullification, type-sensitive replacement, replacement with a constant string, or even partial obfuscation. A column obfuscation method is normally defined in the CREATE TABLE statement. Querying the original data is still possible for users that have the UNMASK privilege.

Various other optimisations were introduced, including Tier Memtables and Tier SSTables.

Databases and AI

The base concept of AI is vectors. Objects analysed by AI are represented as multi-dimensional vectors having an impressively high number of dimensions. I believe that vectors with a million dimensions are common. Imagine these objects as geometrical vectors, in a space with many dimensions. An ANN search takes one vector in input, and returns the closest vectors found. “Closest” is a vague term here, because there are a lot of functions that can be used to calculate the distance between two vectors: the Euclidean distance, the cosine distance, and many more. Some are apparently exotic, such as the Manhattan distance, which returns a walking distance between two objects as if they were located in Manhattan (or any other city with a similar map). But every function is the best for some domains.

In practice, vectors can be almost anything. For example:

  • Vectors can be face images, and can be used to implement facial recognition.
  • Vectors can be words, and can be used to implement agents or bots.
  • Vectors can represent a user behaviour, or products, and can be used to implement a recommendation engine.

To support vector searches, a database needs to support the following features:

  • A VECTOR datatype;
  • Vector indexes;
  • Vector distance functions.

Some databases were built specifically to work with vectors. Examples are TileDB and Pinecone. Some of them might focus on features and optimisations that are important when we work with vectors. But vectors can very well be implemented in a relational database, or in any other database type. Here we’ll focus on existing databases supporting vectors.

pgvector and PostgresML

PostgreSQL doesn’t support any feature related to AI, but it has two great extensions that add AI capabilities: pgvector and PostgresML.

pgvector adds support for a VECTOR type with size. Columns of this type can be indexed. It also adds some additional syntax and functions for various distance functions, like Euclidean (L2), inner product, cosine distance, L1, Hamming distance, and Jaccard distance.

PostgresML also adds support for vectors and indexes. But it’s a more complete solution, that is also able to generate the embeddings, train the model, and tune it after the training.

MariaDB vectors

MariaDB has a more limited vector feature set, and at the time of this writing it hasn’t released an LTS version with vectors. It supports a VECTOR type with size, which can be indexed. It supports functions for the Euclidean distance (L2) and the cosine distance. MariaDB vector indexes use a variation of HNSW.

MariaDB published a github repository that shows how MariaDB vectors can be used to easily develop RAG applications. The tool generates the embeddings from the previously trained documentation contents, using an OpenAI model. When you query the script with a request in English (natural language) it will find the closest embedding, it will send it to OpenAI along with the requests, and will display a natural language response generated by OpenAI.

Mark Callaghan made a benchmark that shows that MariaDB vectors performance is remarkably higher than pgvector’s performance. See the part 1 and part 2 of his post.

MySQL and HeatWave vectors

HeatWave, MySQL cloud offer, supports vector searches.

MySQL 9.0 Community Edition supports the VECTOR type and some vector functions, but vectors cannot be indexed.

This makes MySQL 9l.0 Community Edition vectors unsuitable for any practical use.

Cassandra vectors

Cassandra 5 implemented vectors as part of CASSANDRA-18504. The VECTOR type has a size, and its elements have a type as well (integers or, more commonly, floats). To be able to make a vector search, we need to index a VECTOR column, and we can do that thanks to SAIs (discussed earlier in this post). Vectors can also be passed to, or returned from, user-defined functions. Supported distance functions are Euclidean (L2), cosine and dot product.

To create a vector index and then use it:

CREATE INDEX vec_doc_text
    ON document
    USING 'SAI'
    WITH OPTIONS = {
       'similarity_function': 'DOT_PRODUCT'
    }
;
SELECT doc_title, doc_text
    FROM document
    ORDER BY doc_text ANN OF [0.07, 0.08, 0.09, 0.12, ... ]
    LIMIT 1
;

MindsDB

It’s also worth mentioning MindsDB, a metadatabase for machine learning that we support. MindsDB can be queried via SQL (MySQL or PostgreSQL protocol), a REST API, or the MongoDB query language. Data is retrieved from remote sources, such as databases, REST APIs or data files. A huge variety of data sources is supported. We can run queries about the future. Yes, I mean exactly this: we can ask for data that is not and cannot be there yet, such as our sales in the South East of Asia next year, aggregated by month. MindsDB will use machine learning models to make predictions, and will answer with a normal dataset. It has default models, but we can choose to use a specific model, including paid or free remote services. MindsDB supports many different models.

MindsDB releases are very frequent. In 2024, we contributed some usability improvements to the MySQL integration, in the context of our partnership with MindsDB. The MindsDB team has also implemented many improvements to MindsDB itself, and added many integrations. It’s worth noting that they have added integrations with many object and file storage services, such as S3, Azure Blob Storage, MS OneDrive, and DropBox.

License Changes

The DBMS license change saga started some years ago and it has not finished yet. Let’s see the most important changes this year.

Redis became proprietary

Redis license changed from BSD-3 to Redis Source Available License. Immediately, two forks were created: Valkey, which is under the umbrella of the Linux Foundation, and Redict. Redis Ltd has also acquired Speedb, whose website has disappeared.

Salvatore Sanfilippo, Redis’ author, works for Redis Ltd again. His return might be good for Redis reputation. In a HackerNews comment, Salvatore (nickname: antirez) clarified that he doesn’t oppose Redis Ltd choices, nor any Redis fork. He wrote that he has chosen the BSD-3 license to make commercial forks possible. He also highlighted the importance of choosing a license mindfully. Software authors should not use licenses that permit actions that they will complain about.

Elasticsearch is Open Source again

Elasticsearch made the opposite change. After switching to a source available license in 2021, Elasticsearch became open source again in 2024. Now they use the GNU AGPL 3 license.

OpenSearch is an Elasticsearch fork created by Amazon as a response to the Elasticsearch license change in 2021. Amazon announced that OpenSearch is being donated to the Linux Foundation.

ScyllaDB becomes source available

In December, ScyllaDB has announced important changes. They will only develop the ScyllaDB Enterprise edition. The Open Source (AGPL) edition is being discontinued. The new ScyllaDB releases will be covered by a source available license. The AGPL is not being completely abandoned, though: it will be used for Scylla Manager, which is currently source available.

About the GNU AGPL

The GNU AGPL is recognised as Open Source and Free Software. Though, interestingly, it solves the same problems as source available licenses, posing more or less the same type of restrictions: if you modify the code and make it available as a cloud service (or, more generally, via a network), you are distributing it, so you need to distribute the modified source code as AGPL 3. It essentially extends the meaning of distribution by including network usage.

The reason why AGPL is less frequently used than source available licenses is that there are doubts about its solidity, from a legal perspective. However it is used by several software projects, and Elacticsearch’ decision might bring new trust to this license.

What will we see in the new year?

Short answer: I don’t know. Just like those who tell you exactly what will happen in 2025, I don’t have paranormal powers! 🙂

But I can tell you what I, and Vettabase, will look at closely:

  • In-database AI. I’m not convinced that the current data structure, distance functions, etc, are the same that we’ll use in five years. I want to look closely at innovative technologies and market needs.
  • MariaDB performance improvements and new features. I don’t see a bright future for MySQL, especially in its Community Edition, and MariaDB is the most credible replacement for existing MySQL databases.
  • Awesome innovative databases, as I already mentioned. Some of those technologies look promising. I’m not looking to replace existing battle-tested technologies, but we can integrate them with some of these new projects.

Federico Razzoli

All content in this blog is distributed under the CreativeCommons Attribution-ShareAlike 4.0 International license. You can use it for your needs and even modify it, but please refer to Vettabase and the author of the original post. Read more about the terms and conditions: https://creativecommons.org/licenses/by-sa/4.0/

About Federico Razzoli
Federico Razzoli is a database professional, with a preference for open source databases, who has been working with DBMSs since year 2000. In the past 20+ years, he served in a number of companies as a DBA, Database Engineer, Database Consultant and Software Developer. In 2016, Federico summarized his extensive experience with MariaDB in the “Mastering MariaDB” book published by Packt. Being an experienced database events speaker, Federico speaks at professional conferences and meetups and conducts database trainings. He is also a supporter and advocate of open source software. As the Director of Vettabase, Federico does business worldwide but prefers to do it from Scotland where he lives.

Recent Posts

The Mystery of ProxySQL Galera Writer Switchover!

The Mystery of ProxySQL Galera Writer Switchover!

Writer switchover issues with ProxySQL and Galera can feel like an unsolved puzzle. Recently, I encountered two strange behaviors that revealed underlying issues in how ProxySQL interacts with Galera clusters. In this post, I’ll walk through the unexpected behaviors I...

Enforcing strong passwords for MariaDB users

Enforcing strong passwords for MariaDB users

MariaDB users normally connect using a password. Weak passwords are a common security problem, especially when passwords are generated by humans. However, MariaDB comes with plugins that help validating passwords to make sure they are strong enough. This article is a...

Services

Need Help?  Click Here for Expert Support

0 Comments

Submit a Comment

Your email address will not be published. Required fields are marked *