Schedule Meeting

a

The benefits of MariaDB ColumnStore

by | Feb 22, 2024 | ColumnStore

Need Help?  Click Here for Expert Support

Last week Richard announced our projects on MariaDB ColumnStore. Take a look at his great post, if you didn’t already.

Since then, I’ve got some questions from customers, colleagues and friends: why did you guys decide to robustly invest into ColumnStore, and offer your ColumnStore services? After all, it’s not a new or trendy technology. So, why ColumnStore and not, for example, the latest NoSQL database?

The main reason is pretty simple: MariaDB ColumnStore can bring huge benefits to our customers. In this article I’m going to elaborate the main advantages of ColumnStore, from both a technical and a strategic perspective.

An analytics technology based on MariaDB

The first point is about avoiding the introduction of too many technologies in companies that, typically, already use too many technologies.

Just in case you don’t understand what I’m talking about… make a list of the technologies used in your company, or by your team. Include everything that plays a key role: databases, load balancers, caches, object storages, and so on. Each of them should have proper monitoring and alerts, automated backups, an upgrade policy, people able to troubleshoot in depth, inventory/documentation, and more. But the budget is often too low, the team is often too small, and in the end this just doesn’t happen.

ColumnStore is a technology for analytics and data warehouse that runs on top of MariaDB, which is an operational database. Using MariaDB for OLTP and MariaDB ColumnStore for OLAP allows the reuse of people skills. Several tools can be shared: monitoring, ProxySQL, Ansible roles and more. Data pipelines can be greatly improved by using MariaDB replication. The use of the MariaDB CONNECT engine can also simplify ETL processes from other data sources.

Distributed and scalable

MariaDB ColumnStore has a distributed, highly scalable architecture. ColumnStore architecture consists of the following parts:

  • ColumnStore nodes are MariaDB servers running the ColumnStore storage engine, which spawns several process responsible for both horizontal and vertical scalability.
    • When a node receives a query from a user, the ExeMgr process is responsible for evaluating its steps and distributing them across the cluster.
    • Other nodes’ PrimProc process will receive requests from the node that received the query, and perform part of the job. For example, when a big aggregation needs to be performed, typically each node aggregates a part of the data. The initiator node will receive all partial aggregation, and will run a final aggregation.
  • S3 and compatible services (like Google Storage or the open source MinIO) can be used to store data;
  • Shared storge can be used to achieve High-Availability (HA).

We can add Performance Nodes to improve a cluster IO capacity, and User Nodes to be able to run more complex queries at the same time.

This architecture allows us to store Petabytes of data (compressed, handled by Performance Nodes), and answer complex queries on billions of rows in seconds (User Nodes).

Columnar, massively parallel

Even on a single node MariaDB ColumnStore massively scales up taking full advantage of the CPUs.

As ColumnStore name suggests, it is a columnar technology. But this is a simplification. The traditional difference between row-based and columnar architectures is that the latter typically stores each table column in a different file, to allow fast aggregations and better compression. But, in order to scale better on each node, ColumnStore has a more complex storage design, in order to store big amounts of data while making sorting and grouping fast and reducing contention.

Column data is split into partitions, that is, logical blocks that contain a range of values. Partitions are stored in big units called extents. Typically, each file contains up to two extents from the same column. There are no indexes; instead, an extent map indicates the location of each partition and the lowest value it contains.

ColumnStore has a pool of threads that remain running even when not in use. When PrimProc receives a request, it will split the job into multiple parts and each thread will run one of these parts, in parallel. So jobs are split not just over multiple nodes, but even over multiple threads within each node.

If an S3-compatible service is used as main storage, all Performance Nodes have access to it, but a local cache exists on a shared storage device.

Whether S3 is used or not, a shared device allows high availability. Each node has a corresponding directory in the shared storage, but each directory is mounted on all nodes.

MariaDB ColumnStore is designed for intensive reads with occasional huge data updates, which is typical of OLAP databases. So the locking system is minimal, and designed to serve this type of usage without reducing scalability. So reads are not locking, and no operation blocks them (not even ALTER TABLEs). Writes acquire locks on whole tables, rather than locking every modified row.

Excellent performance

Let me stress this again:

  • Data size: ColumnStore can handle petabytes of compressed data on a shared storage;
  • Query speed: ColumnStore performs quite complex joins and aggregations in seconds, whereas regular MariaDB would take minutes, or in minutes, whereas regular MariaDB would take hours or days.

This is the result of MCS super scalable architecture illustrated above.

Cross-engine joins

Each MariaDB node sees ColumnStore as a regular storage engine. All it knows is that when data needs to be written into a ColumnStore table or read from it, MariaDB needs to call the proper methods of the ColumnStore storage engine API.

As a result, almost all MariaDB SQL syntaxes work on ColumnStore tables. There are some exceptions, but they’re not very relevant.

More importantly, an SQL statement can involve ColumStore tables and tables built with any other engine. This opens up new scenarios that would be impossible if you use different technologies for OLTP and for analytics. Some examples:

  • Use the CONNECT engine to import data from other DBMSs, data files, REST APIs, and more. Just run an INSERT SELECT statement from a cron job, skip the most complex parts of ETL processes.
  • Incrementally import data from a sharded OLTP database with SPIDER.
  • Store analytics data into ColumnStore tables designed to answer analytical queries. But when needed, you’ll also be able to join those tables with regular OLTP tables, replicated in real time from production. Say goodbye to “the data ingestion process failed again last night, we only have data from the past month”.
  • Star schemas or snowflake with ColumnStore fact tables and InnoDB dimensions. Dimensions can be temporal tables, and they can be updated in a fully transactional fashion.
  • Archive historical data with the S3 engine, and free up your online storage. If one day you need to query that data again, just run a regular SQL query. It will be slow, but much faster than opening a ticket for your data engineers to recreate a table with data that is archived in S3.
  • Use SPIDER to write back data from ColumnStore to your OLTP databases. This makes sense for statistical queries that would take hours or days on regular InnoDB tables.

Integrations

For the reasons explained above, technologies that integrate with MariaDB will also work with ColumnStore. Some relevant examples for data analysis professionals are:

  • All Desktop and Web UIs that support MariaDB will work with ColumnStore.
  • Any script that works with MariaDB will work with ColumnStore, unless you hit some ColumnStore limitations.
  • Stored procedures for MariaDB will work with ColumnStore.
  • Stored procedures for Oracle work with MariaDB (unless they use Oracle built-in packages or uncommon syntax). They will also work with ColumnStore.
  • MindsDB can use MariaDB ColumnStore as a data source.
  • ClickHouse is able to read data from MariaDB ColumnStore.

Open source license

ColumnStore (as part of the MariaDB community edition) is distributed under the terms of the GNU GPL, version 2. There are no license costs.

No lock-ins

MariaDB ColumnStore, community edition, won’t bind you to any particular vendor:

  • Vettabase offers MariaDB ColumnStore consulting and is part of the community, but we’re not involved in the product in any other way.
  • Our tools, including Docker and Vagrant images, do not include any lock in.
  • The community edition won’t force you to buy any service from the vendor, MariaDB plc.
  • MariaDB ColumnStore can run on-premise on Linux systems, and on any cloud provider as a non-managed solution (for example, you can install it on AWS Ec2, Google GCE, or Microsoft Azure VMs).
  • Thanks to the CONNECT engine, MariaDB ColumnStore integrates with technologies using open protocols like ODBC, JDBC or HTTP, and can import file formats including JSON, XML, CSV, HTML, as well as other more obscure and even, custom formats.

Conclusions

MariaDB ColumnStore is an outstanding solution for analytics. This is because of its peculiar features built to scale OLAP workloads, and because it’s based on MariaDB, one of the most widely used DBMSs for OLTP. And it’s open source, and free.

To begin with, you can read our unofficial documentation and try ColumnStore as a single node on your laptop with our Vagrant or Docker image.

Do you need help to evaluate ColumnStore for your specific use case? Do you need help to deploy and configure it? Do you need help with data integration? Or maybe a training for your data analysts? Contact us to discuss your needs!

Federico Razzoli

All content in this blog is distributed under the CreativeCommons Attribution-ShareAlike 4.0 International license. You can use it for your needs and even modify it, but please refer to Vettabase and the author of the original post. Read more about the terms and conditions: https://creativecommons.org/licenses/by-sa/4.0/

About Federico Razzoli
Federico Razzoli is a database professional, with a preference for open source databases, who has been working with DBMSs since year 2000. In the past 20+ years, he served in a number of companies as a DBA, Database Engineer, Database Consultant and Software Developer. In 2016, Federico summarized his extensive experience with MariaDB in the “Mastering MariaDB” book published by Packt. Being an experienced database events speaker, Federico speaks at professional conferences and meetups and conducts database trainings. He is also a supporter and advocate of open source software. As the Director of Vettabase, Federico does business worldwide but prefers to do it from Scotland where he lives.

Recent Posts

Coming up next, a ColumnStore webinar

Coming up next, a ColumnStore webinar

The 27th is fast approaching but you can still signup to our live webinar where we will be exploring the benefits of time series data using MariaDB ColumnStore. We will be jumping into a live demo with some example data and queries simulating high volume time series...

Let’s go, MariaDB ColumnStore at Vettabase!

Let’s go, MariaDB ColumnStore at Vettabase!

I have been an avid user of Columnar based storage engines for as long as I have been a DBA. The advantage is instantly obvious once you and your teams start to write or convert queries which previously took tens of minutes, are now returning in fractions of a second....

Services

Need Help?  Click Here for Expert Support

0 Comments

Submit a Comment

Your email address will not be published. Required fields are marked *