Populating tables faster in PostgreSQL

Last updated on 18 September 2021

You have been assigned by the marketing team to load bulk data from a CSV file into a PostgreSQL table. How would you go about it?

The COPY command in PostgreSQL is the most preferred option here. Let’s look at how the COPY command can be used to load data from a CSV file into a PostgreSQL database.

We assume you already know how to create a database, table and also define the data types in PostgreSQL. If not, you can check the PostgreSQL documentation for detailed information.

Blixy loves the PostgreSQL COPY command.

How does the copy command in PostgreSQL Works?

The COPY command copies files from CSV files into tables in a PostgreSQL database or from tables to CSV files. It is the preferred method when you want to load bulk data into a PostgreSQL table.

This COPY command is different from the meta-command \copy which is used by the psql client to load bulk data into PostgreSQL tables. When loading a large dataset using the COPY command, this command should be executed on the PostgreSQL server hosting the target tables. In the case of the \copy used by the psql client, it should be executed from the host where the client resides.

There are two types of COPY command available in PostgreSQL. There are COPY TO and COPY FROM commands. COPY TO command is used to copy a table in PostgreSQL to a selected file whilst the COPY FROM command is used to copy records in a file to a table.

The COPY command supports different file formats such as csv, binary, and text.

However, there is a slight difference between the COPY TO command and the COPY FROM command in terms of transferring files from one source of storage to another.

For instance, when copying data from files to a Postgresql table there is no need to specify the absolute path of the target file but in a situation where you are copying data from Postgres tables to csv files, then you need to specify the absolute path of the target files.

Using the COPY FROM command

Now let’s look at how to use the COPY FROM command to copy data in CSV file format.

Apart from CSV file format, it supports other file formats such as text and binary.

Copy from CSV file to PostgreSQL table

To copy the details.csv file using the COPY FROM command, execute the following on your Linux terminal:

COPY details (FName, LName, DoB, Email)
FROM '/home/mikey/Downloads/details.csv'
DELIMITER ','
CSV HEADER;

The output below means the COPY FROM command has copied all 10 rows in the details.csv file into the details table.

COPY 10

The query below displays all rows which were available in the details.csv file before they were imported into the details table in PostgreSQL.

select * from details;
fname | lname | dob | email
--------+------------+------------+---------------------
John | Doe | 1995-01-05 | john@gmail.com
Jane | Doe | 1995-02-05 | jane.doe@gmail.com
Davies | Michel | 1990-04-05 | davies@gmail.com
Amas | Wick | 1998-03-08 | wick@gmail.com
Esther | Wigan | 1997-06-05 | wigan@ioamail.com
Frank | James | 1996-04-07 | james@gmail.com
Gordon | Mack | 1978-04-03 | ja@gmail.com
Femi | Akemidola | 1998-05-06 | akemi@gmail.com
Denis | Wayo | 1996-05-04 | wayo@gmail.com
Shola | Brown | 1995-04-02 | shola@gmail.com
(10 rows)

How long does it take COPY FROM command to import data?

If we want to find out how long it takes to import all the 10 rows inside the main.csv file into the batch table, we can use the following command to enable timing on the Postgres server as shown below:

\timing on

Try to import the 10 rows inside the main.csv file into the postgres table batch to find out how long it takes.

COPY batch (FName, City, Gender)
FROM '/home/mikey/Downloads/main.csv'
DELIMITER ','
CSV HEADER;

The output below shows that the COPY FROM command took 55 milliseconds to import 10 rows into the batch table.

COPY 10
Time: 55.664 ms

So the COPY FROM command is far more efficient when importing data into a PostgreSQL table in bulk rather than inserting rows one after another as shown below:

INSERT INTO memo (FName, City, Gender) VALUES ('mike', 'Accra', 'male');
INSERT 0 1
Time: 67.555 ms

Conclusions

Although by default, the COPY FROM command is optimized for bulk loading of data, we can make it more efficient by focusing on the following:

  • Increase maintenance_work_mem parameter: Increasing the value of this parameter increases the performance of the COPY FROM command if you are going to create indexes immediately after populating the table with the COPY command.
  • Increase checkpoint_segments : In PostgreSQL when checkpointing occurs, all dirty pages are flushed to the disk for persistence. Each checkpoint_segment is usually 16 megabytes. So in the case of loading bulk data, Postgres will checkpoint more frequently than the specified checkpoint frequency via checkpoint_timeout parameter.

Checkpointing frequently result in performance issues because of excess use of operating system resources. You can prevent frequent checkpointing by increasing checkpoint segment beyond 16 megabytes.

Finally, it is advisable to enable the autovacuum daemon to ensure that the query planner has up-to-date statistics about the table you are about to load bulk data. So in a situation where we have two large CSV files to import into a table in PostgreSQL, the query planner can rely on recent statistics about the table to make efficient decisions related to query planning.

Michael Aboagye

Did you like this article?

About Michael Aboagye

Michael is PostgreSQL consultant at Vettabase. He is specialised in PostgreSQL performance, security and automation.

Leave a Reply

Your email address will not be published. Required fields are marked *

*