Redshift is an Amazon database technology available for AWS customers. It is based on PostgreSQL 8. The codebase was modified to develop a horizontally scalable SQL DBMS, suitable for analytics, and in particular Data WareHouse.
Many PostgreSQL features are not available in Redshift, particularly those that are not suitable for a database cluster or for analytics, and those that Amazon didn’t consider important enough to spend the necessary resources to maintain them. Some PostgreSQL features were added later, probably after being reimplemented by the Redshift team.
This article shows an incomplete list of PostgreSQL features that are missing in Redshift. It is mainly written for those who are familiar with PostgreSQL and are evaluating Redshift for analytics. Note that I’m trying to maintain this article up to date over time.
Main missing features
Let me stress it again: this is not meant to be a complete list.
The code
Redshift is not open source, nor source available.
PostgreSQL is open source, but its license does not disallow Amazon from using its codebase to build a cloud database.
Ownership
One great thing about PostgreSQL is that it is not developer or owned by a company. Instead, many organisations and individuals contribute to improve it and have interest in keeping it alive.
Redshift’s future appears solid because there is interest around it, and its owner is solid. Still, it depends on a single company. As a corner case, if Amazon’s plans change for any reason, Redshift may be retired at some point. And since it is not open source, no one would be able to keep it alive without an agreement with the vendor.
Latency
Amazon Redshift has a good throughput, and it can optimise some big queries with many joins by running them in parallel through the cluster.
However, the latency is far too high and unpredictable for OLTP workloads. If a simple query that would run in less than 0.1 on PostgreSQL is run against Redshift, it could very well take 0.5 seconds, or sometimes even 1 second. Data analysts don’t care about such a small wait, but for an OLTP workload it would be a disaster.
Redshift is only intended for analytics. Don’t try to use it for different purposes.
SELECT *
Redshift stores data in a columnar fashion. As a consequence, selecting all columns (SELECT *
) or many columns causes performance penalties.
Extensions
Redshift does not support the concept of extensions.
Data types
Redshift does not support certain types, like:
- range types;
json
– but there are JSON functions;xml
;bit
andbytea
;- network types;
UUID
.
User-defined types, pseudo-types and domains are also not supported.
Indexes
Redshift allows a DISTKEY
and a SORTKEY
for each table (very similar to Cassandra primary keys), but there are no secondary indexes.
Even for those keys, PostgreSQL ASC and DESC options, index types, partial indexes, expressional indexes, non-key columns are not available. Generated columns are not available.
Constraints
Constraints like primary key uniqueness, UNIQUE
constraints and foreign keys are not enforced. Instead, Redshift assumes that they are enforced by the client application, so for example:
SELECT DISTINCT pk
will return the same results asSELECT pk
.- If there is a primary key or a
UNIQUE
key,COUNT(DISTINCT *)
will return the same number asCOUNT(*)
.
This may cause queries to return wrong results.
CHECK
constraints, ASSERTION
s and triggers are not supported.
Inheritance
Not supported.
Foreign Data Wrappers
Foreign data wrappers are not supported in Redshift.
Though, it has external tables to use data in some specific formats stored in Amazon S3. External tables are populated from the results of a query. The UNLOAD
statement can be used to export data to Apache Parquet files in S3.
Transactions
Transactions have the following important limitations:
- The only supported isolation level is
SERIALIZABLE
; SAVEPOINT
s are not supported;TRUNCATE
andVACUUM
implicitly commit the current transaction.
But again, keep in mind that Redshift is designed for data warehousing. It allows to:
- Load data while users run queries;
- Load data concurrently into different tables;
- Alter a dimension structure and its data inside the same transaction.
SQL
The following statements (between others) are missing:
MERGE
;INSERT ... ON CONFLICT
;COPY
has less options and more specific semantics;RETURNING
;CHECKPOINT
;SECURITY LABEL
;LISTEN
;VALUES
.
Stored procedures and functions
- Stored procedures can only be written in PL/plSQL or Python. Other procedural languages are not supported.
- Stored functions are very limited in Redshift:
- Functions can only be a single
SELECT
or a Python script. Other languages, including PL/plSQL, are not supported. - Functions can’t read data from tables (they’re
SELECT
s without aFROM
clause). - Functions can only use positional arguments. There are no named arguments.
- Functions can only be a single
There are several other minor limitations for both procedures and functions. For example VARIADIC
arguments are not supported, and there is no COMMENT
clause.
Security features
Missing security features:
- PAM authentication;
- Column-level encryption at rest.
Conclusions
I used Redshift in the past and I will probably use it again. Let’s be clear: as a general rule, I prefer to use open source technologies, and in particular I don’t like to depend on a single vendor. But many organisations are already depending on Amazon, and don’t have any plans to change that.
Redshift is not easy because it’s in the cloud. It’s not something easy to switch to because it’s just good ol’ Postgres. You will need to learn things like how keys work and compression. Its costs are also potentially high (something not covered in this article).
All that said, would we recommend it? Sometimes we would. The main reason for that is integration with other Amazon services, where lock-ins are considered acceptable.
Federico Razzoli
0 Comments