Schedule Meeting

a

Primary keys in Data Warehouses

by | Jul 8, 2024 | Data warehousing

Need Help?  Click Here for Expert Support

In my recent webinar Designing a schema for a data warehouse, I was asked a question that deserves a long answer. I recommended to use IDs generated directly in the data warehouse. In other words, even if you import customer data from a MySQL database and each customer has an ID, don’t import that ID in the data warehouse.

The question I was asked is one of the simplest and yet smartest questions that one can ask: why? The answer should be a bit more articulated, and that’s why I’m writing this article.

IDs can change

This might sound weird: IDs are supposed to never change. And if you use foreign keys, probably your foreign keys prevent IDs from changing (ON UPDATE RESTRICT).

But reality is often different from a textbook that describes a theoretical situation. Weird things happen, and one has to find dirty (pardon, creative) solutions. The most common example is when a MySQL AUTO_INCREMENT column reaches the maximum value. This might happen even if the column is BIGINT UNSIGNED, and not necessarily because you have billions of rows. And if it happens, you’ll find out that holes can’t be reused. The next incremental value must be higher than the highest value in the table. So, you will have to reassign the IDs starting from 1, avoiding holes.

In a case like this, if your data warehouse uses the IDs from MySQL, the rows won’t match anymore.

IDs can be replaced

What happens when an employee leaves the company? Probably, after some time, the corresponding row in the employee table is deleted. And the problem is that, as more time passes, her ID might be reused for another employee.

Different types and formats

As you probably know, rose is a rose is a rose is a rose. So you might think that an ID is an ID, and that’s all. Well, unfortunately ID is a vague term.

An ID can be in various forms:

  • A positive integer number;
  • A UUID… and it’s worth noting that there are many official and unofficial types of UUID;
  • An arbitrary string, in which case you should know its length and character set;
  • A combination of two or more values.

The same database could use different types of IDs in different tables. Some of these types might not be supported by our data warehouse technology. But even if all types are supported, in a data warehouse we need uniformity and efficiency: the same type should be used for all the primary keys, and this type should be relatively small and should not need to be parsed.

In a data warehouse, you want all your IDs to be 8 byte integers, regardless the ID types used in the OLTP database.

A business entity doesn’t match a database row

This is the most fundamental aspect. It is not merely about types or ETL problems, it is more about data definition.

Let’s use a dim_customer dimension as an example. Before designing our ETL, we need to ask the stakeholder: what is a customer? And, more specifically, what identifies a customer?

“Its MySQL ID” is not a valid answer. That is a technical characteristic that stakeholders shouldn’t even be able to see. They need to answer in terms of business logic. And the answer might vary, depending on our company characteristics and how the stakeholders think about the business. For example:

  • For a pizza delivery company, a customer can be identified by a device. This is theoretically incorrect, because different family members might make orders from the same device, and a person might have more than one device. But data analysts might not care about these things.
  • Another company in the same sector might see a customer as a combination of phone number and city where the food is delivered. This is theoretically wrong, because people travel. But data analysts might decide that, for the purpose of their statistics, it is a good idea to see the same person as different customers when she orders food from different cities.
  • A B2B service company might identify each customer by company name and sector. Companies that operate in different sectors will appear as multiple customers, and this will be on purpose.

Deciding what identifies a customer (or a supplier, or an order, or a marketing campaign) is a business problem. But it has consequences for the ETL process. In all the examples above, a customer won’t match a single ID in a MySQL primary key. So, again, a new ID should be generated in the data warehouse, and shouldn’t be imported from other sources.

Multiple sources of truth

Let’s forget for a moment that there isn’t, or there shouldn’t be, a 1:1 match between your OLTP rows and your rows in a data warehouse dimension. You might have another problem: the same logical entity might be represented in multiple databases. And that’s more common than you think. Here’s an example.

Most probably, your company uses an ORM. Leads found by sales representatives are only in the CRM. Customers have access to your customers portal, so they’re in the MySQL database used by the portal. They’re imported into the CRM once a month automatically, which means that newest customers are only in MySQL. But you want to have all customers in your data warehouse, including the most recent.

To make things worse, some details might be inconsistent between different sources. This might happen because the CRM automatically retrieves a company data incorrectly, because your sales representatives replace a phone number that was no longer active, or because your customers enter a wrong number to avoid being contacted. You have multiple versions of the truth, and establishing which one is correct can be quite complex. Sure, that is the purpose of Master Data Management, but such solutions are becoming increasingly less common.

If you import an ID to your data warehouse, you’ll have to choose which source of truth it comes from. This is conceptually wrong, and could even make ETL harder. Furthermore, if all sources of truth contain mistakes, you are conceptually associating a dimension to those mistakes.

Conclusions

Importing the IDs from OLTP databases to a data warehouse is a common mistake. Here we discussed the reasons why you shouldn’t do that. If you need help building a data warehouse, consider our services for Data Analysts.

Federico Razzoli

 

All content in this blog is distributed under the CreativeCommons Attribution-ShareAlike 4.0 International license. You can use it for your needs and even modify it, but please refer to Vettabase and the author of the original post. Read more about the terms and conditions: https://creativecommons.org/licenses/by-sa/4.0/

About Federico Razzoli
Federico Razzoli is a database professional, with a preference for open source databases, who has been working with DBMSs since year 2000. In the past 20+ years, he served in a number of companies as a DBA, Database Engineer, Database Consultant and Software Developer. In 2016, Federico summarized his extensive experience with MariaDB in the “Mastering MariaDB” book published by Packt. Being an experienced database events speaker, Federico speaks at professional conferences and meetups and conducts database trainings. He is also a supporter and advocate of open source software. As the Director of Vettabase, Federico does business worldwide but prefers to do it from Scotland where he lives.

Recent Posts

Data warehouse benefits and use cases

Data warehouse benefits and use cases

Recently I talked to many companies about this topic: why a data warehouse is essential to support company decisions, in all departments. I decided to summarise my points in this article, for future reference. The need for centralised information All company...

Services

Need Help?  Click Here for Expert Support

0 Comments

Submit a Comment

Your email address will not be published. Required fields are marked *