In my recent webinar Designing a schema for a data warehouse, I was asked a question that deserves a long answer. I recommended to use IDs generated directly in the data warehouse. In other words, even if you import customer data from a MySQL database and each customer has an ID, don’t import that ID in the data warehouse.
The question I was asked is one of the simplest and yet smartest questions that one can ask: why? The answer should be a bit more articulated, and that’s why I’m writing this article.
IDs can change
This might sound weird: IDs are supposed to never change. And if you use foreign keys, probably your foreign keys prevent IDs from changing (ON UPDATE RESTRICT
).
But reality is often different from a textbook that describes a theoretical situation. Weird things happen, and one has to find dirty (pardon, creative) solutions. The most common example is when a MySQL AUTO_INCREMENT
column reaches the maximum value. This might happen even if the column is BIGINT UNSIGNED
, and not necessarily because you have billions of rows. And if it happens, you’ll find out that holes can’t be reused. The next incremental value must be higher than the highest value in the table. So, you will have to reassign the IDs starting from 1, avoiding holes.
In a case like this, if your data warehouse uses the IDs from MySQL, the rows won’t match anymore.
IDs can be replaced
What happens when an employee leaves the company? Probably, after some time, the corresponding row in the employee
table is deleted. And the problem is that, as more time passes, her ID might be reused for another employee.
Different types and formats
As you probably know, rose is a rose is a rose is a rose. So you might think that an ID is an ID, and that’s all. Well, unfortunately ID is a vague term.
An ID can be in various forms:
- A positive integer number;
- A UUID… and it’s worth noting that there are many official and unofficial types of UUID;
- An arbitrary string, in which case you should know its length and character set;
- A combination of two or more values.
The same database could use different types of IDs in different tables. Some of these types might not be supported by our data warehouse technology. But even if all types are supported, in a data warehouse we need uniformity and efficiency: the same type should be used for all the primary keys, and this type should be relatively small and should not need to be parsed.
In a data warehouse, you want all your IDs to be 8 byte integers, regardless the ID types used in the OLTP database.
A business entity doesn’t match a database row
This is the most fundamental aspect. It is not merely about types or ETL problems, it is more about data definition.
Let’s use a dim_customer
dimension as an example. Before designing our ETL, we need to ask the stakeholder: what is a customer? And, more specifically, what identifies a customer?
“Its MySQL ID” is not a valid answer. That is a technical characteristic that stakeholders shouldn’t even be able to see. They need to answer in terms of business logic. And the answer might vary, depending on our company characteristics and how the stakeholders think about the business. For example:
- For a pizza delivery company, a customer can be identified by a device. This is theoretically incorrect, because different family members might make orders from the same device, and a person might have more than one device. But data analysts might not care about these things.
- Another company in the same sector might see a customer as a combination of phone number and city where the food is delivered. This is theoretically wrong, because people travel. But data analysts might decide that, for the purpose of their statistics, it is a good idea to see the same person as different customers when she orders food from different cities.
- A B2B service company might identify each customer by company name and sector. Companies that operate in different sectors will appear as multiple customers, and this will be on purpose.
Deciding what identifies a customer (or a supplier, or an order, or a marketing campaign) is a business problem. But it has consequences for the ETL process. In all the examples above, a customer won’t match a single ID in a MySQL primary key. So, again, a new ID should be generated in the data warehouse, and shouldn’t be imported from other sources.
Multiple sources of truth
Let’s forget for a moment that there isn’t, or there shouldn’t be, a 1:1 match between your OLTP rows and your rows in a data warehouse dimension. You might have another problem: the same logical entity might be represented in multiple databases. And that’s more common than you think. Here’s an example.
Most probably, your company uses an ORM. Leads found by sales representatives are only in the CRM. Customers have access to your customers portal, so they’re in the MySQL database used by the portal. They’re imported into the CRM once a month automatically, which means that newest customers are only in MySQL. But you want to have all customers in your data warehouse, including the most recent.
To make things worse, some details might be inconsistent between different sources. This might happen because the CRM automatically retrieves a company data incorrectly, because your sales representatives replace a phone number that was no longer active, or because your customers enter a wrong number to avoid being contacted. You have multiple versions of the truth, and establishing which one is correct can be quite complex. Sure, that is the purpose of Master Data Management, but such solutions are becoming increasingly less common.
If you import an ID to your data warehouse, you’ll have to choose which source of truth it comes from. This is conceptually wrong, and could even make ETL harder. Furthermore, if all sources of truth contain mistakes, you are conceptually associating a dimension to those mistakes.
Conclusions
Importing the IDs from OLTP databases to a data warehouse is a common mistake. Here we discussed the reasons why you shouldn’t do that. If you need help building a data warehouse, consider our services for Data Analysts.
Federico Razzoli
0 Comments