Last updated on 12 March 2021
Optimizing read and write performance varies for every Postgres database server in a different environment. Even if we decide to increase the
parameter for startup A Postgres server, it may not be wise to do so for startup B.
In addition, we need some understanding of how the Postgres planner decides how to execute a specific query, in order to write an efficient SQL to boost read and write performance.
In this post, we’re looking at basic ways to enhance read and write performance for the Postgres server in different scenarios. The rules listed in this post are generic recommendations that you can apply to your Postgres servers, according to your own needs.
Before we take a look at how to improve read and write performance for a Postgres server, we’ll discuss some facts about databases and query execution components of PostgreSQL.
Database facts and query execution components
Some facts and some details about PostgreSQL query planner that we should know, in order to optimize our PostgreSQL servers.
Fact 1: Less Write, More Read
“A well-balanced memory cache makes your customers happy”
Let’s take a look at an imaginary example: Ticket Air, a ticketing platform uses PostgreSQL to save customers’ data. This fictional company relies on
, and `
` provided by Elasticsearch to build a faster search solution for their customers searching for the cheapest flight to destinations such as Tel-Aviv, Budapest and so on.
A customer who recently registered on ticketing.il is looking for the cheapest flight to Tel Aviv. In this case, we should not expect the application to perform WRITE operation on behalf of the customer.
The web application takes the customer’s query and searches through the flight tables via the price Column for affordable prices. A few instants later, the following results are displayed to the customer who queried for it.
Another set of customers requests for a similar query. If there is enough memory for the PostgreSQL cache to store the previous results, it is fetched directly from the cache without accessing the disk.
Fact 2: Database configuration is not fixed
Configuration settings of a PostgreSQL server in an aviation industry differ from that of social media site. The same parameters such as
, are utilized to configure a database server. However, the values assigned to these parameters to optimize WRITE/READ performance differ based on the core functionality of a web service or a web application.
Query execution components
- Parser: It examines or checks the syntax of a SQL string.
- Rewriter: In some instances, we prefer to query against a view instead of querying directly against tables. The function of the Rewriter component is to modify it to optimize the query.
- Planner: It is one of the most important components. It generates a plan to execute a specific query.
- Executor: It works together with the Planner component to execute the plan generated by the planner.
Analyzing the execution plan by the planner
The planner is responsible for coming up with a plan to execute a query. The planner decides the plan of execution based on the type of query. Knowing how the planner decides a plan for a specific query helps in writing efficient queries.
Assuming there are two tables for the national employee database. The
table stores the names of all employees working for both private and public companies in Israel whilst the
table stores the list of departments where every employee in the employee’s tables.
CREATE TABLE emp ( emp_id serial PRIMARY KEY, first_name varchar(50), last_name varchar(50) ); first_name | last_name | emp_id ------------+-----------+-------- Ben | Cohen | 1 Eitan | Lahav | 2 Dror | Hevroni | 3 Shira | Eilat | 4
CREATE TABLE department ( id serial PRIMARY KEY, dpt_name varchar(50), dpt_location varchar(50), CONSTRAINT fk_id FOREIGN KEY (id) REFERENCES emp (emp_id) ); id | dpt_name | dpt_location ----+-------------+-------------- 1 | accounting | chicago 2 | engineering | tel-aviv 3 | HR | eilat 4 | research | eilat
Now let’s find out what decision the planner will take to execute the following query using the
command displays the execution plan generated by the Postgres planner whilst the
actually executes the query in addition to displaying the execution plan.
Accessing a single table
EXPLAIN SELECT first_name from emp WHERE last_name = "Eilat"; EXPLAIN ANALYSE SELECT first_name FROM emp WHERE last_name = 'Eilat';
This query selects the
of every employee whose
Seq Scan on emp (cost=0.00..13.75 rows=2 width=118) Filter: ((last_name)::text = 'Eilat'::text)
Seq Scan on emp (cost=0.00..13.75 rows=2 width=118) (actual time=0.019..0.020 rows=1 loops=1) Filter: ((last_name)::text = 'Eilat'::text) Rows Removed by Filter: 3 Planning Time: 33.502 ms Execution Time: 0.053 ms
The result of both
commands show that in the first example the Postgres planner scans the
table using the sequential scan method.
Now let’s take another look at how the Postgres planner generates an execution plan for the second statement below:
EXPLAIN ANALYSE SELECT * FROM emp WHERE emp_id = 4; EXPLAIN SELECT * FROM emp WHERE emp_id = 4;
Result of the
Index Scan using emp_pkey on emp (cost=0.15..8.17 rows=1 width=240) (actual time=0.867..0.870 rows=1 loops=1) Index Cond: (emp_id = 4) Planning Time: 89.925 ms Execution Time: 12.548 ms
Result of the
Index Scan using emp_pkey on emp (cost=0.15..8.17 rows=1 width=240) Index Cond: (emp_id = 4)
In this case, the Postgres planner generated an execution plan for the statement using the Index scan.
Accessing Two Tables
Now let’s find out which method the Postgres planner is going to use when accessing data from two tables:
Assuming we issued a query like the one below to find employees whose department is located in Eilat with both
EXPLAIN SELECT first_name FROM emp, department WHERE emp.emp_id = department.id AND department.dpt_location = 'eilat';
Result of the
Hash Join (cost=13.78..27.57 rows=2 width=118) (actual time=0.051..0.054 rows=2 loops=1) Hash Cond: (emp.emp_id = department.id) -> Seq Scan on emp (cost=0.00..13.00 rows=300 width=122) (actual time=0.016..0.017 rows=4 loops=1) -> Hash (cost=13.75..13.75 rows=2 width=4) (actual time=0.021..0.021 rows=2 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 5kB -> Seq Scan on department (cost=0.00..13.75 rows=2 width=4) (actual time=0.011..0.014 rows=2 loops=1) Filter: ((dpt_location)::text = 'eilat'::text) Rows Removed by Filter: 2 Planning Time: 0.273 ms Execution Time: 0.106 ms
Result of the
Hash Join (cost=13.78..27.57 rows=2 width=118) Hash Cond: (emp.emp_id = department.id) -> Seq Scan on emp (cost=0.00..13.00 rows=300 width=122) -> Hash (cost=13.75..13.75 rows=2 width=4) -> Seq Scan on department (cost=0.00..13.75 rows=2 width=4) Filter: ((dpt_location)::text = 'eilat'::text)
From both results, the Postgres planner used the Hash join strategy to retrieve information from both the
table. In addition, the planner scanned both tables using the sequential scan method.
Based on how the Postgres planner generates the execution plan for our queries, we can rely on this knowledge to write efficient queries as well as optimize read performance.
Speed optimization “rules”
These rules will guide you in PostgreSQL servers optimization.
Rule 1: Always choose index only scan over index scan if possible
How do we know if a Postgres planner is going to perform an index scan? From the previous examples describing how the planner generates an execution plan for each query, we could easily tell if the planner would scan a table using Index only scan or not.
Any time we try to search on the indexed column(s), the planner makes use of the Index scan instead of the Sequential scan or Index scan.
Why Is Index only scan efficient than an index scan?
The purpose of Index only scan is to fetch all data from the index without visiting the heap at all. Index scan, however, fetches data from both the heap and the Index.
Fetching data from the index only means reading fewer data. Obviously reading less data is faster than reading more data.
Rule 2: Reduce the number of columns in a table if there are excess columns
Among the types of scans generated by the Postgres planner, there is one known as the sequential scan. The sequential scan works by scanning the entire table/relation as stored on the disk.
For the sake of practical evidence, let’s assume there are two tables/relations:
residing in the database
The emp table has t columns namely
whilst the user’s table has four columns namely
If we run the query statement
SELECT * FROM
both tables, the following results are returned.
Query plan for table emp:
Planning Time: 0.284 ms Execution Time: 0.042 ms (3 rows)
Query Plan for Table users:
Planning Time: 0.331 ms Execution Time: 0.046 ms (3 rows)
So it seems the execution time for the emp table is a bit lower than the user’s table. The execution time in this context means the time it took the server to run the query using the generated query plan by the Postgres planner.
In a real-world scenario where we have more than five columns or excess columns in a relation and users making concurrent requests to a particular table. What do you think will happen? Drop-in read performance.
How do we avoid it?
Solution: By reducing the number of columns in the table.
Rule 3: Shard wisely by choosing suitable sharding strategy
Sharding is the most preferred solution used by DB Reliability engineers to prevent a database server from crashing due to massive requests from clients. Startups building products or services for users to consume often make use of some sharding strategy to scale their services.
However if we create many shards without using the appropriate sharding strategy for our database infrastructure, read and write performance could be affected.
Rule 4: Use SSD disk for critical data files
An SSD disk has an access speed of 35 to 100 microseconds which is 100 times faster. This means faster speed for programs or services can access large data very quickly. Thus it is advisable to place mission-critical data files on SSD disks to increase both READ and WRITE performance.
Rule 5: Avoid using many indexes on a table
Assuming we have a table with six columns. Out of the six columns, five columns have been indexed. If a user updates records, all the affected indexes need to reflect the update. Thus, It affects performance for
Rule 6: Sufficient memory for caching
If there is enough memory for Postgres, frequently accessed data is read from the cache rather than from the disk. Even if you have placed these data files on SSD disks, it is advisable to increase the value for the
parameter higher than the default value to increase READ performance by caching more data. More memory cache, more data is cached.
Note: In a write-and-read intensive environment, it is advisable to increase the value of shared_buffer but not greater than 40% in order to leave adequate space for the operating system. So if you have database server with 3GB of RAM, you can set the the value to 30% or less. In addition, if you increased the value of the
parameter, you need to increase the value for the
parameter to allow longer period of retaining changed data before data is written to the data files.
However it is not always right to increase the value of both the shared_buffer parameter and the
parameter to enhance read and write performance.
Let’s say in an environment where a postgresql database is used to support workloads such as subscription services for users subscribing to a product to be released in the near future.
In this case, we expect the application to write to the database continually without making many read requests. Unlike the previous example where we increased the value to less than 40%, we can decide to lower the value of the
to at least 25% and rather increase the value of the
to a higher value .
WHY? Because we do not expect most of the users who subscribed to our service to make read requests to the database server. Some could decide to change their email address afterwards. So we increase the value of the max_wal_size to prevent frequent writing of data.
Although these steps are basic, there are very necessary when you intend to use your database infrastructure to support read and write-intensive workloads. If you fail to get the basics right, it would be difficult to use advanced techniques such as checkpointing, replication, and others to enhance read and write performance.
Why? Because checkpointing, replication, and other secondary methods rely on these basics steps to optimize read and write performance. For instance without SSD or insufficient memory cache, no matter how you have perfectly tuned checkpointing and whether you are using a mixture of async-sync replication methods to replicate data, the Postgres server takes some time to get back to clients.