Last updated on 17 October 2021

(Image: Escher’s cube)
There have been endless discussions about whether NULL should be implemented in relational databases, and whether it should even be part of relational theory. My favourite quote on the subject comes from the person who helped Ted Codd popularise – and probably refine – his relational algebra:
I hate NULL.
Chris Date
I tend to hate it too, except that sometimes it is necessary. But my point is completely different from Date’s. I have nothing against the idea of NULL; I just think that it is badly designed in SQL, because NULL has two incompatible meanings and, depending on the context, the SQL standard and its implementations arbitrarily choose one of them.
Note: SQL:2011 (and probably other versions of the SQL standard) defines NULL as the absence of any data value. But this definition is not consistent with the semantics specified in the document itself.
The theory
Ted Codd didn’t propose anything corresponding to NULL in his first version of the relational algebra. In a later document, he proposed two markers. A marker is something that is placed where you’d expect a value, but is not a regular value. These markers were:
- I-MARKER: Non-applicable, absent value;
- A-MARKER: Applicable but unknown value.
This original proposal has very little to do with SQL’s design choice of merging the two markers into one. Later, Codd himself wrote something about NULL in his famous 12 rules:
Rule 3: Systematic treatment of null values:
Null values (distinct from the empty character string or a string of blank characters and distinct from zero or any other number) are supported in fully relational DBMS for representing missing information and inapplicable information in a systematic way, independent of data type.
But, as you can see, he only mentioned missing information and inapplicable information, so he didn’t seem to accept that NULL could include unknown values (unless he implied that a value is missing because it’s unknown, but the 12 rules don’t seem to state this).
Personally, I tend to believe that Codd’s original idea (two markers) would be a bit overcomplicated for most common practical cases. But it is consistent – actually, his purpose was to elaborate a mathematically consistent system, and he did.
SQL doesn’t have the original theory’s consistency, but it is more practical. Yet, when it comes to NULL, I think it should have chosen one of the following: implementing both markers, implementing only one of them with consistent semantics, or implementing none.
SQL’s inconsistency
Again: my point is that SQL merged two ideas into one and mixed their semantics in an inconsistent way, so the meaning of NULL depends on the context. Below I describe the behaviour of NULL in different contexts.
Comparisons
NULL = NULL returns NULL. In a previous article I explained how to properly compare values that could be NULL.
This makes perfect sense if NULL is an unknown value. Is an unknown value equal to another unknown value? The answer is unknown: they could be the same value or not.
But if NULL is a missing value, this is puzzling. Is a missing value identical to another missing value? In my opinion the answer should be yes. In any case, it certainly shouldn’t be a missing value itself.
However, consider this expression:
null_column >= 100 OR null_column < 100
It returns NULL. This defies the idea that comparisons make sense when NULL is an unknown value: whatever the unknown value is, one of the two conditions must be true. Comparisons are simply inconsistent in all cases.
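As a side note, here is a minimal sketch of a NULL-safe comparison; the table t and its columns a and b are hypothetical, and IS NOT DISTINCT FROM, while standard, is not universally supported (MySQL and MariaDB use the <=> operator instead):
SELECT * FROM t WHERE a = b;                    -- never matches rows where a or b is NULL
SELECT * FROM t WHERE a IS NOT DISTINCT FROM b; -- also matches rows where both are NULL
-- MySQL / MariaDB equivalent:
-- SELECT * FROM t WHERE a <=> b;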
NULL and UNKNOWN
For NULL-safe checks, the IS NULL and IS NOT NULL operators can be used. On boolean expressions they behave like IS UNKNOWN and IS NOT UNKNOWN, which suggests that NULL is an unknown value.
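A small sketch of the equivalence; the table t and its column a are hypothetical, and IS UNKNOWN requires a DBMS that supports boolean expressions (e.g. PostgreSQL or MySQL):
SELECT * FROM t WHERE a IS NULL;            -- rows where a is NULL
SELECT * FROM t WHERE (a = 10) IS UNKNOWN;  -- same rows: the comparison is UNKNOWN exactly when a is NULL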
Scalar expressions
Scalar expressions seem to consider NULL an unknown value (scalar subqueries are a different matter; see below). For example:
1 + NULL = NULL
If NULL were a missing value, it would behave exactly like 0, so 1 + NULL would return 1.
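A quick illustration (shown without a FROM clause, which some DBMSs, like Oracle, don’t allow); the COALESCE() call is just one way to explicitly force the “missing, behave like 0” interpretation when that is what we want:
SELECT 1 + NULL;               -- NULL: one operand is unknown, so the result is unknown
SELECT 1 + COALESCE(NULL, 0);  -- 1: here we explicitly decided to treat NULL as a missing value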
Aggregate functions
Aggregate functions are those that accept any number of values – one per row – and return exactly one value.
COUNT(*) returns the number of rows. If a row only contains NULLs, it still counts as one. So, in this context, NULL is considered an existing but unknown value.
COUNT(column) ignores NULLs. This means that, in this context, NULL is considered a missing value.
For a funny fact about COUNT() and NULL, see this old article by Baron Schwartz.
Other aggregate functions also treat NULL as a missing value – it is ignored by MIN(), MAX(), AVG(), etc.
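To make the difference concrete, here is a sketch assuming a hypothetical table t with a single column x containing the values 1, 2 and NULL:
SELECT COUNT(*) FROM t;  -- 3: the row whose x is NULL still counts as a row
SELECT COUNT(x) FROM t;  -- 2: NULLs are ignored, as if the value were missing
SELECT SUM(x)   FROM t;  -- 3: the NULL is ignored, not treated as 0
SELECT AVG(x)   FROM t;  -- 1.5: the average of the two non-NULL values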
GROUP BY, DISTINCT
GROUP BY and DISTINCT treat NULL as a regular value.
This could be consistent with both interpretations of NULL.
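With the same hypothetical table t(x), this time containing 1, NULL and NULL:
SELECT x, COUNT(*) FROM t GROUP BY x;  -- two groups: (1, 1) and (NULL, 2)
SELECT DISTINCT x FROM t;              -- two rows: 1 and NULL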
ORDER BY
Some DBMSs return NULLs before values, others return values first. I’d like to see more consistency, but in both cases it is just a practical choice that does not contradict any interpretation of NULL.
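Some DBMSs (for example PostgreSQL and Oracle, but not MySQL) let us choose explicitly:
SELECT x FROM t ORDER BY x NULLS LAST;   -- regular values first, NULLs at the end
SELECT x FROM t ORDER BY x NULLS FIRST;  -- NULLs before the regular values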
Outer joins
Left, right and full outer joins can return rows that have no match in the other table. When this happens, the columns from the other table are populated with NULLs.
This behaviour treats NULLs as absent values.
It is interesting to note that, while in most cases this is just fine, sometimes this behaviour is ambiguous. Consider the following query:
SELECT t1.a, t2.b
FROM t1 LEFT JOIN t2
ON t1.a = t2.b OR t2.b IS NULL;
In the results, it is impossible to distinguish an absent match in t2 from a match in t2 where the b column is NULL.
This shows that, even if NULL represents an absent value, this is not enough to handle joins properly, because it says nothing about why the value is missing.
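A common workaround is to also select a column of t2 that can never be NULL in a matching row, such as its primary key (here assumed to be called id, which is not part of the original example):
SELECT t1.a, t2.b,
       CASE WHEN t2.id IS NULL THEN 'no match' ELSE 'match' END AS match_status
FROM t1 LEFT JOIN t2
ON t1.a = t2.b OR t2.b IS NULL;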
Scalar subqueries
Scalar subqueries are nested queries that return one row consisting of one column. In practice, they return a single value, or NULL if no value was found. And this is the trap: normally you don’t know whether a subquery returned NULL because it found it in a table, or because it found nothing.
As mentioned for joins, at least one more marker would be needed to eliminate the ambiguity.
Subqueries in the form WHERE val IN (SELECT ...) return NULL if the value is not found but the subquery returns at least one NULL. Subqueries in the form WHERE EXISTS (SELECT ...) return TRUE even if the subquery only returns NULLs. This makes sense if NULL means an existing but unknown value.
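A sketch of the classic trap, assuming a DBMS that accepts boolean expressions in the select list (e.g. PostgreSQL or MySQL) and a hypothetical table t(x) containing 1, 2 and NULL:
SELECT 3 IN (SELECT x FROM t);      -- NULL, not FALSE: 3 might be equal to the unknown value
SELECT 3 NOT IN (SELECT x FROM t);  -- NULL as well, so a WHERE clause using it matches nothing
SELECT EXISTS (SELECT x FROM t WHERE x IS NULL);  -- TRUE: a row exists, even though its value is NULL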
Uniqueness
SELECT DISTINCT always treats NULL as a normal value. This seems to me coherent with considering NULL a missing value, but it is incoherent with the idea of treating it as an unknown value: two unknown values may or may not be identical.
Primary keys don’t allow NULL columns. Primary keys logically identify each row; this is one of the pillars of relational algebra. So this is consistent with treating NULL as an unknown value. But if the primary key consists of multiple columns, it should be possible to store NULL markers – as long as no row consists only of NULLs and uniqueness remains certain. Of course this is just a theoretical objection, since a practical implementation would probably not be desirable.
Most DBMSs allow inserting any number of NULLs into a UNIQUE index. This is consistent with treating NULL as a missing value: an absence doesn’t duplicate another absence.
Other DBMSs only allow one NULL per UNIQUE index. This is not consistent with either interpretation of NULL. Not with absences, as mentioned above. And not with unknown values: an unknown value could duplicate another unknown value, but it could equally well duplicate any known value – the chances are exactly the same. So NULL shouldn’t be allowed at all if the table contains more than one row.
Some DBMSs, like Db2, don’t allow any NULL values in UNIQUE indexes. This can be a good idea under any interpretation of NULL, as it’s the only way to be sure that no constraint is violated.
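A quick way to see which camp a given DBMS falls into (the table is of course hypothetical):
CREATE TABLE u (x INT UNIQUE);
INSERT INTO u (x) VALUES (NULL);
INSERT INTO u (x) VALUES (NULL);  -- accepted by e.g. PostgreSQL, MySQL and SQLite;
                                  -- rejected by DBMSs that allow at most one NULL per UNIQUE index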
Other meanings of NULL
NULL also has a few additional meanings:
- Any value;
- Default value;
- Placeholder.
Default
The meaning is “default” when we create a sort of “default row”. For example, we could have a table user_setting, with the columns user_id, setting_name and setting_value. When a user changes a preference, a row is created. But all preferences have a default value, which is used by all users who didn’t change it. The rows with the default values have user_id set to NULL.
This technique is not particularly efficient. With many DBMSs, it also prevents us from creating a primary key on (user_id, setting_name). It’s usually better to use a special value, like 0 or -1 – which is not special for the DBMS, but is special for us.
We can also use NULL with this meaning in stored procedures: when NULL is passed as an argument, the procedure assigns a default value.
Another example is variables. In most DBMSs that support some type of variables, an unset variable is NULL.
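A sketch of the “default row” lookup described above; the table and columns follow the example, while the setting name 'theme' and the LIMIT syntax are assumptions that won’t fit every DBMS:
-- Read the 'theme' setting for user 42, falling back to the default row (user_id IS NULL):
SELECT setting_value
FROM user_setting
WHERE setting_name = 'theme'
  AND (user_id = 42 OR user_id IS NULL)
ORDER BY user_id IS NULL  -- the user-specific row, if present, sorts before the default row
LIMIT 1;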
Any value
Sometimes a row represents a fact that is true for any value. You can translate it into English as “any object”, which is usually equivalent to “all objects”.
For example, your website may have contents that appear on multiple pages. You may have a table for those contents, which has a page_id column. Where this column is set to NULL, the content is included in all pages.
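A sketch of the corresponding query; the table name content and the page id 7 are assumptions based on the example:
-- Contents to show on page 7: page-specific rows plus the "any page" rows:
SELECT *
FROM content
WHERE page_id = 7 OR page_id IS NULL;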
Placeholder
Sometimes one or more attributes are not known at the time a row is inserted, so some applications use NULL as a placeholder. In this case the value that will eventually be inserted may be known to other applications, but not to the database itself; therefore we can consider this a special case of using NULL to indicate an unknown value.
With one caveat: in some situations we don’t know whether a value actually exists. A practical implication is that, in such cases, we shouldn’t rely on some of the features discussed above (COUNT(column), EXISTS…), because they assume that the value exists.
NULL and DDL
The way NULL is used in CREATE TABLE and ALTER TABLE statements is quite strange. This does not directly affect the semantics of NULL, but it is worth mentioning here.
NULL and NOT NULL
It is possible to specify NULL or NOT NULL to determine whether a column should allow NULLs or not. This is absolutely reasonable.
What I consider less reasonable is that, if we don’t specify this clause, NULLs are permitted. Whichever meaning you associate with NULL, that default doesn’t make sense for most columns.
Even worse, there is an exception: for columns that are part of the primary key, NOT NULL is the default. This is surely confusing for anyone who is not familiar with DDL.
In my opinion, NULL or NOT NULL should always be specified.
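A sketch of the defaults described above (details can vary slightly between DBMSs):
CREATE TABLE person (
    id   INT PRIMARY KEY,  -- implicitly NOT NULL, because it is part of the primary key
    name VARCHAR(100),     -- no clause specified: NULLs are allowed by default
    age  INT NOT NULL      -- explicit, which in my opinion should always be done
);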
[NOT] NULL and DEFAULT clauses
Columns may or may not have DEFAULT values. The DEFAULT value applies if no value was specified in the INSERT statement, while it has no effect during an UPDATE. I have absolutely nothing against this.
But what happens if no value is specified on INSERT for a column that doesn’t have a DEFAULT value? Here’s the funny part: it depends on whether the column can contain NULLs or not.
If it can, NULL is inserted. If it can’t, the client receives an error. In my opinion, this difference is confusing. It is also error prone: if a column is not specified, it is possible that the author of the query simply forgot it. The only exceptions should be the columns for which a DEFAULT was defined.
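A sketch of the difference; the table and its columns are hypothetical:
CREATE TABLE product (
    id    INT NOT NULL,
    name  VARCHAR(100),           -- nullable, no DEFAULT
    price DECIMAL(10,2) NOT NULL  -- NOT NULL, no DEFAULT
);
INSERT INTO product (id, price) VALUES (1, 9.99);  -- name silently becomes NULL
INSERT INTO product (id, name)  VALUES (2, 'x');   -- error: price has no DEFAULT and cannot be NULL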
About specific implementations
Every SQL implementation is different, and none of them implements more than a portion of the standard. A couple of them deserve a few words here.
All of the oddities below are documented and should be considered design choices, not bugs.
PostgreSQL
PostgreSQL has several markers, or special values. NULL is special even among them.
SELECT FLOAT8 'Infinity' = FLOAT8 'Infinity';
This expression returns true, but in my opinion this is wrong: some infinities are bigger than others. PostgreSQL allows us to talk about infinity, but it treats it like a regular number.
If you consider NULL as an absence, the result is funny. The number of points in a plane equals the number of points in a straight line, but two absences are not equal to each other.
Oracle
Oracle is fantastic in this respect. Sometimes it treats NULL as inconsistently as standard SQL does. Sometimes it treats it as… an empty string! Despite this, the documentation says:
Do not use null to represent a value of zero, because they are not equivalent.
I’m not going to dig further into this absurdity; for more information, see the Oracle documentation.
SQLite
SQLite is kind enough to let us decide whether NULL should be treated as a single value in SELECT DISTINCT or not. By default it is. To change this behaviour, we can change the NULL_ALWAYS_DISTINCT macro and recompile the code.