Optimizing read and write performance is different for every Postgres database server, because every server runs in a different environment. Even if we decide to increase the `shared_buffers` parameter for startup A's Postgres server, it may not be wise to do the same for startup B's.
In addition, we need some understanding of how the Postgres planner decides to execute a specific query in order to write efficient SQL that boosts read and write performance.
In this post, we’re looking at basic ways to enhance read and write performance for the Postgres server in different scenarios. The rules listed in this post are generic recommendations that you can apply to your Postgres servers, according to your own needs.
Before we take a look at how to improve read and write performance for a Postgres server, we’ll discuss some facts about databases and query execution components of PostgreSQL.
Database facts and query execution components
Here are some facts and details about the PostgreSQL query planner that we should know in order to optimize our PostgreSQL servers.
Fact 1: Less Write, More Read
“A well-balanced memory cache makes your customers happy”
Let's take a look at an imaginary example: Ticket Air, a ticketing platform, uses PostgreSQL to store customers' data. This fictional company relies on the `span_query` and `term_query` queries provided by Elasticsearch to build a faster search solution for customers searching for the cheapest flight to destinations such as Tel Aviv, Budapest, and so on.
A customer who recently registered on ticketing.il is looking for the cheapest flight to Tel Aviv. In this case, we should not expect the application to perform a WRITE operation on behalf of the customer.
The web application takes the customer's query and searches the flight tables via the price column for affordable prices. Moments later, the results are displayed to the customer.
Another set of customers then issues a similar query. If there is enough memory for the PostgreSQL cache to store the previous results, they are fetched directly from the cache without accessing the disk.
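We can observe this caching directly. The `BUFFERS` option of `EXPLAIN` reports whether pages were served from PostgreSQL's shared buffers ("shared hit") or read in from disk ("read"). A minimal sketch, assuming a hypothetical `flights` table with a `price` column:
-- Run the same query twice; on the second run more pages should be
-- "shared hit" (served from the cache) rather than "read" (from disk).
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flights WHERE price < 200;
--   Buffers: shared read=42    <- first run: pages fetched from disk
--   Buffers: shared hit=42     <- second run: pages served from the cache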
Fact 2: Database configuration is not fixed
The configuration settings of a PostgreSQL server in the aviation industry differ from those of a social media site. The same parameters, such as `wal_buffers` and `wal_writer_delay`, are used to configure a database server. However, the values assigned to these parameters to optimize WRITE/READ performance differ based on the core functionality of the web service or web application.
Query execution components
- Parser: Checks the syntax of a SQL string.
- Rewriter: In some instances, we prefer to query a view instead of querying the underlying tables directly. The rewriter transforms the query against the view into an equivalent query against the underlying tables (see the sketch after this list).
- Planner: One of the most important components. It generates a plan to execute a specific query.
- Executor: Executes the plan generated by the planner.
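A minimal sketch of the rewriter at work, assuming the `emp` table defined later in this post and a hypothetical view over it:
-- A view over the emp table (hypothetical, for illustration):
CREATE VIEW eilat_emps AS
    SELECT first_name, last_name FROM emp WHERE last_name = 'Eilat';

-- The plan references the underlying table rather than the view, because
-- the rewriter expanded the view definition before planning:
EXPLAIN SELECT first_name FROM eilat_emps;
--   Seq Scan on emp ...
--     Filter: ((last_name)::text = 'Eilat'::text)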
Analyzing the execution plan by the planner
The planner is responsible for coming up with a plan to execute a query. The planner decides the plan of execution based on the type of query. Knowing how the planner decides a plan for a specific query helps in writing efficient queries.
Assume there are two tables for a national employee database. The `emp` table stores the names of all employees working for both private and public companies in Israel, whilst the `department` table stores the department of every employee in the `emp` table.
Employee table:
CREATE TABLE emp (
emp_id serial PRIMARY KEY,
first_name varchar(50),
last_name varchar(50)
);
first_name | last_name | emp_id
------------+-----------+--------
Ben | Cohen | 1
Eitan | Lahav | 2
Dror | Hevroni | 3
Shira | Eilat | 4
Department table:
CREATE TABLE department (
id serial PRIMARY KEY,
dpt_name varchar(50),
dpt_location varchar(50),
CONSTRAINT fk_id FOREIGN KEY (id) REFERENCES emp (emp_id)
);
id | dpt_name | dpt_location
----+-------------+--------------
1 | accounting | chicago
2 | engineering | tel-aviv
3 | HR | eilat
4 | research | eilat
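For reproducibility, the sample rows shown above could be loaded as follows (a sketch matching the displayed data; the serial columns generate the ids 1 through 4):
INSERT INTO emp (first_name, last_name) VALUES
    ('Ben', 'Cohen'),
    ('Eitan', 'Lahav'),
    ('Dror', 'Hevroni'),
    ('Shira', 'Eilat');

INSERT INTO department (dpt_name, dpt_location) VALUES
    ('accounting', 'chicago'),
    ('engineering', 'tel-aviv'),
    ('HR', 'eilat'),
    ('research', 'eilat');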
Now let's find out what decision the planner will take to execute the following query, using the `EXPLAIN` command or the `EXPLAIN ANALYZE` command.
Note: The `EXPLAIN` command displays the execution plan generated by the Postgres planner, whilst `EXPLAIN ANALYZE` actually executes the query in addition to displaying the execution plan.
Accessing a single table
First example
EXPLAIN SELECT first_name FROM emp WHERE last_name = 'Eilat';
EXPLAIN ANALYZE SELECT first_name FROM emp WHERE last_name = 'Eilat';
This query selects the `first_name` of every employee whose `last_name` is 'Eilat'.
Result of the `EXPLAIN` command:
Seq Scan on emp (cost=0.00..13.75 rows=2 width=118)
Filter: ((last_name)::text = 'Eilat'::text)
Result of the `EXPLAIN ANALYZE` command:
Seq Scan on emp (cost=0.00..13.75 rows=2 width=118) (actual time=0.019..0.020 rows=1 loops=1)
Filter: ((last_name)::text = 'Eilat'::text)
Rows Removed by Filter: 3
Planning Time: 33.502 ms
Execution Time: 0.053 ms
The results of both the `EXPLAIN` and `EXPLAIN ANALYZE` commands show that in the first example the Postgres planner scans the `emp` table using the sequential scan method.
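A sequential scan is expected here, since `last_name` has no index. If this lookup were frequent, an index (the name below is hypothetical) would let the planner consider an index scan instead:
CREATE INDEX idx_emp_last_name ON emp (last_name);
-- Note: on a four-row table the planner may still prefer a sequential
-- scan, because reading the whole table is cheaper than walking an index.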
Second example
Now let’s take another look at how the Postgres planner generates an execution plan for the second statement below:
EXPLAIN ANALYZE SELECT * FROM emp WHERE emp_id = 4;
EXPLAIN SELECT * FROM emp WHERE emp_id = 4;
Result of the `EXPLAIN ANALYZE` command:
Index Scan using emp_pkey on emp (cost=0.15..8.17 rows=1 width=240) (actual time=0.867..0.870 rows=1 loops=1)
Index Cond: (emp_id = 4)
Planning Time: 89.925 ms
Execution Time: 12.548 ms
Result of the `EXPLAIN` command:
Index Scan using emp_pkey on emp (cost=0.15..8.17 rows=1 width=240)
Index Cond: (emp_id = 4)
In this case, the Postgres planner generated an execution plan for the statement using an index scan, because `emp_id` is the primary key and is therefore backed by the `emp_pkey` index.
Accessing two tables
Now let's find out which method the Postgres planner is going to use when accessing data from two tables.
Assume we issue a query like the one below to find employees whose department is located in Eilat, with both the `EXPLAIN` and `EXPLAIN ANALYZE` commands:
EXPLAIN SELECT first_name FROM emp, department WHERE emp.emp_id = department.id AND department.dpt_location = 'eilat';
Result of the `EXPLAIN ANALYZE` command:
Hash Join (cost=13.78..27.57 rows=2 width=118) (actual time=0.051..0.054 rows=2 loops=1)
Hash Cond: (emp.emp_id = department.id)
-> Seq Scan on emp (cost=0.00..13.00 rows=300 width=122) (actual time=0.016..0.017 rows=4 loops=1)
-> Hash (cost=13.75..13.75 rows=2 width=4) (actual time=0.021..0.021 rows=2 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 5kB
-> Seq Scan on department (cost=0.00..13.75 rows=2 width=4) (actual time=0.011..0.014 rows=2 loops=1)
Filter: ((dpt_location)::text = 'eilat'::text)
Rows Removed by Filter: 2
Planning Time: 0.273 ms
Execution Time: 0.106 ms
Result of the `EXPLAIN` command:
Hash Join (cost=13.78..27.57 rows=2 width=118)
Hash Cond: (emp.emp_id = department.id)
-> Seq Scan on emp (cost=0.00..13.00 rows=300 width=122)
-> Hash (cost=13.75..13.75 rows=2 width=4)
-> Seq Scan on department (cost=0.00..13.75 rows=2 width=4)
Filter: ((dpt_location)::text = 'eilat'::text)
In both results, the Postgres planner used the hash join strategy to retrieve information from the `emp` table and the `department` table. In addition, the planner scanned both tables using the sequential scan method.
Apart from the hash join strategy, there are other join methods, such as the merge join and the nested loop, that the planner uses when accessing two tables.
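On a dataset this small, the planner's choice is driven mostly by its cost estimates. As a diagnostic technique (not a production setting), we can disable a join method for the current session to see which alternative the planner falls back to:
-- Temporarily forbid hash joins in this session; the planner must then
-- pick another method, such as a merge join or a nested loop:
SET enable_hashjoin = off;
EXPLAIN SELECT first_name
FROM emp, department
WHERE emp.emp_id = department.id AND department.dpt_location = 'eilat';
RESET enable_hashjoin;  -- restore the default afterwards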
Knowing how the Postgres planner generates execution plans for our queries, we can write more efficient queries and optimize read performance.
Speed optimization “rules”
These rules will guide you in optimizing your PostgreSQL servers.
Rule 1: Always choose an index-only scan over an index scan if possible
How do we know if the Postgres planner is going to perform an index-only scan? From the previous examples describing how the planner generates an execution plan for each query, we can tell which scan method the planner will choose for a table.
Any time we search on indexed column(s) and select only columns covered by the index, the planner can use an index-only scan instead of a sequential scan or a plain index scan.
Why is an index-only scan more efficient than an index scan?
An index-only scan fetches all the data it needs from the index, without visiting the heap at all. An index scan, however, fetches data from both the heap and the index.
Fetching data from the index alone means reading less data, and reading less data is obviously faster than reading more.
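A quick way to see the difference, reusing the `emp` table from earlier (the plans are illustrative; index-only scans also require the table's visibility map to be reasonably up to date):
-- Selecting only the indexed column qualifies for an index-only scan:
EXPLAIN SELECT emp_id FROM emp WHERE emp_id = 4;
--   Index Only Scan using emp_pkey on emp ...

-- Selecting unindexed columns forces a visit to the heap (index scan):
EXPLAIN SELECT * FROM emp WHERE emp_id = 4;
--   Index Scan using emp_pkey on emp ...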
Rule 2: Reduce the number of columns in a table if there are excess columns
Among the types of scans generated by the Postgres planner, there is one known as the sequential scan. The sequential scan works by scanning the entire table/relation as stored on the disk.
For the sake of practical evidence, let's assume there are two tables/relations, `emp` and `users`, residing in the database `demo`.
The `emp` table has two columns, namely `name` and `location`, whilst the `users` table has four columns, namely `first_name`, `middle_name`, `last_name`, and `location`.
If we run a `SELECT * FROM` query against both tables, the following results are returned.
Query plan for table emp (timing lines only):
Planning Time: 0.284 ms
Execution Time: 0.042 ms
Query plan for table users (timing lines only):
Planning Time: 0.331 ms
Execution Time: 0.046 ms
So the execution time for the `emp` table is a bit lower than that of the `users` table. The execution time in this context means the time it took the server to run the query using the plan generated by the Postgres planner.
Now imagine a real-world scenario where a relation has more than five columns, many of them unnecessary, and users make concurrent requests to that table. What do you think will happen? A drop in read performance.
How do we avoid it?
Solution: Reduce the number of columns in the table, as sketched below.
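One way to apply this rule, sketched with hypothetical table names: move rarely-read columns out of the hot table into a side table that is joined only when needed, so common scans read fewer bytes per row:
-- Hot table: the columns most queries actually need.
CREATE TABLE users_core (
    user_id serial PRIMARY KEY,
    first_name varchar(50),
    last_name varchar(50),
    location varchar(50)
);

-- Cold table: rarely-read columns, joined on demand.
CREATE TABLE users_extra (
    user_id int PRIMARY KEY REFERENCES users_core (user_id),
    middle_name varchar(50)
);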
Rule 3: Shard wisely by choosing a suitable sharding strategy
Sharding is the solution most preferred by database reliability engineers to prevent a database server from crashing under massive request volumes from clients. Startups building products or services for users often make use of some sharding strategy to scale their services.
However, if we create many shards without using the appropriate sharding strategy for our database infrastructure, read and write performance can suffer.
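Sharding across servers usually involves external tooling, but the choice of strategy is analogous to PostgreSQL's built-in declarative partitioning, sketched below with a hypothetical `bookings` table. Hash partitioning spreads point lookups evenly, whereas range partitioning would suit time-based queries better:
CREATE TABLE bookings (
    booking_id  bigint NOT NULL,
    customer_id bigint NOT NULL,
    booked_at   timestamptz NOT NULL
) PARTITION BY HASH (customer_id);

-- Four partitions; rows are routed by a hash of customer_id.
CREATE TABLE bookings_p0 PARTITION OF bookings FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE bookings_p1 PARTITION OF bookings FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE bookings_p2 PARTITION OF bookings FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE bookings_p3 PARTITION OF bookings FOR VALUES WITH (MODULUS 4, REMAINDER 3);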
Rule 4: Use SSD disks for critical data files
An SSD has an access latency of roughly 35 to 100 microseconds, which is on the order of 100 times faster than a spinning hard disk. This means programs and services can access large amounts of data very quickly. Thus it is advisable to place mission-critical data files on SSDs to increase both READ and WRITE performance.
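In PostgreSQL, tablespaces are the mechanism for placing data files on a particular disk. A sketch, assuming an SSD mounted at a hypothetical path (the directory must exist, be empty, and be owned by the postgres user):
-- Create a tablespace on the SSD, then move a hot table onto it:
CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd/pg_tblspc';
ALTER TABLE emp SET TABLESPACE fast_ssd;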
Rule 5: Avoid using many indexes on a table
Assume we have a table with six columns, five of which have been indexed. If a user updates a record, all the affected indexes need to reflect the update. This hurts performance for `INSERT`/`UPDATE` operations.
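To find candidate indexes to drop, the statistics view `pg_stat_user_indexes` reports how often each index has been scanned since statistics were last reset. Treat a zero count as a starting point for review rather than proof, since statistics resets and queries served by replicas can mislead:
-- Indexes that have never been used for a scan on this server:
SELECT schemaname, relname AS table_name, indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY schemaname, relname;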
Rule 6: Sufficient memory for caching
If there is enough memory for Postgres, frequently accessed data is read from the cache rather than from the disk. Even if you have placed these data files on SSDs, it is advisable to raise the `shared_buffers` parameter above its default value to increase READ performance by caching more data: the more memory available to the cache, the more data is cached.
Note: In a write-and-read-intensive environment, it is advisable to increase the value of `shared_buffers`, but to no more than 40% of RAM, in order to leave adequate space for the operating system. So if your database server has 3GB of RAM, you can set the value to 30% of RAM or less. In addition, if you increase the value of the `shared_buffers` parameter, you need to increase the value of the `max_wal_size` parameter to allow a longer period of retaining changed data before it is written to the data files.
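As a sketch for the 3GB server above (the values are illustrative starting points, not recommendations):
-- Persist the settings; shared_buffers only takes effect after a restart.
ALTER SYSTEM SET shared_buffers = '768MB';  -- ~25% of 3GB of RAM
ALTER SYSTEM SET max_wal_size = '2GB';      -- allow longer gaps between checkpoints
SELECT pg_reload_conf();  -- applies max_wal_size; shared_buffers still needs a restart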
However, it is not always right to increase the value of both the `shared_buffers` parameter and the `max_wal_size` parameter to enhance read and write performance.
Say a PostgreSQL database supports a workload such as a subscription service, where users sign up for a product to be released in the near future.
In this case, we expect the application to write to the database continually without making many read requests. Unlike the previous example, where we raised the value to just under 40%, we can decide to lower the value of `shared_buffers` to around 25% and instead raise `max_wal_size` to a higher value.
WHY? Because we do not expect most of the users who subscribed to our service to make read requests to the database server; some may merely change their email address afterwards. So we increase the value of `max_wal_size` to reduce how often checkpoints flush changed data to the data files.
Although these steps are basic, they are very necessary when you intend to use your database infrastructure to support read- and write-intensive workloads. If you fail to get the basics right, it will be difficult to use advanced techniques such as checkpoint tuning, replication, and others to enhance read and write performance.
Why? Because checkpointing, replication, and other secondary methods rely on these basic steps to optimize read and write performance. For instance, without SSDs or sufficient memory for caching, no matter how perfectly you have tuned checkpointing, and whether or not you use a mixture of async/sync replication methods, the Postgres server will still take longer to respond to clients.
Conclusions
If you want to optimise your PostgreSQL performance, consider our PostgreSQL Health Checks.
Michael Aboagye