You have been assigned by the marketing team to load bulk data from a CSV file into a PostgreSQL table. How would you go about it?
The COPY command in PostgreSQL is the preferred option here. Let's look at how the COPY command can be used to load data from a CSV file into a PostgreSQL database.
We assume you already know how to create a database and a table, and how to define data types in PostgreSQL. If not, you can check the PostgreSQL documentation for detailed information.
How does the COPY command in PostgreSQL work?
The COPY command copies data from CSV files into tables in a PostgreSQL database, or from tables into CSV files. It is the preferred method when you want to load bulk data into a PostgreSQL table.
The COPY command is different from the \copy meta-command, which is used by the psql client to load bulk data into PostgreSQL tables. When loading a large dataset with the COPY command, the file is read on the PostgreSQL server hosting the target tables. With the \copy meta-command, the file is read on the host where the psql client runs.
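For instance, when the CSV file lives on the machine running psql rather than on the database server, a client-side load could look like the following sketch (it assumes a details.csv file in the client's current directory and the details table created further below):
-- \copy reads the file on the client and streams it to the server
\copy details (FName, LName, DoB, Email) FROM 'details.csv' DELIMITER ',' CSV HEADER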
There are two variants of the COPY command in PostgreSQL: COPY TO and COPY FROM. The COPY TO command copies a PostgreSQL table to a file, whilst the COPY FROM command copies the records in a file into a table.
The COPY command supports different file formats such as CSV, binary, and text.
However, there is a slight difference between the COPY TO command and the COPY FROM command in terms of how file paths are specified.
When copying data from a file into a PostgreSQL table, you do not need to specify the absolute path of the source file, but when copying data from a PostgreSQL table to a CSV file, you must specify the absolute path of the target file.
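To illustrate the second case, exporting the details table used later in this article could look like the sketch below (the output file name details_export.csv is an assumption, and the absolute path must point to a location the PostgreSQL server process can write to):
-- COPY TO requires an absolute path for the output file on the database server
COPY details (FName, LName, DoB, Email)
TO '/home/mikey/Downloads/details_export.csv'
DELIMITER ','
CSV HEADER;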
Using the COPY FROM command
Now let's look at how to use the COPY FROM command to copy data in CSV format. Apart from CSV, it also supports other file formats such as text and binary.
Copy from CSV file to PostgreSQL table
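The example below assumes the details table already exists. A minimal sketch of a matching definition might look like this (the column names come from the COPY statement below, while the data types are assumptions):
-- Hypothetical table definition matching the columns in the COPY FROM example
CREATE TABLE details (
    FName VARCHAR(50),
    LName VARCHAR(50),
    DoB   DATE,
    Email VARCHAR(100)
);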
To copy the details.csv file into the details table using the COPY FROM command, execute the following in a psql session connected to the target database:
COPY details (FName, LName, DoB, Email)
FROM '/home/mikey/Downloads/details.csv'
DELIMITER ','
CSV HEADER;
The output below means the COPY FROM command has copied all 10 rows in the details.csv file into the details table.
COPY 10
The query below confirms that all the rows from the details.csv file have been imported into the details table in PostgreSQL.
select * from details;
fname | lname | dob | email
--------+------------+------------+---------------------
John | Doe | 1995-01-05 | jo**@gm***.com
Jane | Doe | 1995-02-05 | ja******@gm***.com
Davies | Michel | 1990-04-05 | da****@gm***.com
Amas | Wick | 1998-03-08 | wi**@gm***.com
Esther | Wigan | 1997-06-05 | wi***@io*****.com
Frank | James | 1996-04-07 | ja***@gm***.com
Gordon | Mack | 1978-04-03 | **@gm***.com
Femi | Akemidola | 1998-05-06 | ak***@gm***.com
Denis | Wayo | 1996-05-04 | wa**@gm***.com
Shola | Brown | 1995-04-02 | sh***@gm***.com
(10 rows)
How long does the COPY FROM command take to import data?
If we want to find out how long it takes to import the 10 rows inside the main.csv file into the batch table, we can enable query timing in psql with the following meta-command:
\timing on
Now import the 10 rows inside the main.csv file into the batch table to find out how long it takes:
COPY batch (FName, City, Gender)
FROM '/home/mikey/Downloads/main.csv'
DELIMITER ','
CSV HEADER;
The output below shows that the COPY FROM command took roughly 55 milliseconds to import 10 rows into the batch table.
COPY 10
Time: 55.664 ms
So the COPY FROM command is far more efficient for importing bulk data into a PostgreSQL table than inserting rows one after another, as shown below:
INSERT INTO memo (FName, City, Gender) VALUES ('mike', 'Accra', 'male');
INSERT 0 1
Time: 67.555 ms
Conclusions
Although the COPY FROM command is optimized for bulk loading of data by default, we can make it more efficient by focusing on the following:
- Increase the maintenance_work_mem parameter: increasing the value of this parameter improves the performance of the COPY FROM command if you are going to create indexes immediately after populating the table with the COPY command.
- Increase checkpoint_segments: in PostgreSQL, when checkpointing occurs, all dirty pages are flushed to the disk for persistence. Each checkpoint segment is usually 16 megabytes, so when loading bulk data, Postgres will checkpoint more frequently than the frequency specified via the checkpoint_timeout parameter.
Frequent checkpointing results in performance issues because of the excessive use of operating system resources. You can prevent frequent checkpointing by increasing checkpoint_segments so that more than 16 megabytes of WAL can accumulate between checkpoints.
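As a rough sketch, the settings above could be adjusted as follows before a large load (the values are illustrative assumptions to be tuned for your own hardware; note that from PostgreSQL 9.5 onwards, checkpoint_segments was replaced by max_wal_size):
-- Give index builds after the bulk load more memory (session-level setting).
SET maintenance_work_mem = '1GB';
-- Allow more WAL to accumulate between checkpoints (PostgreSQL 9.5+; needs superuser and a reload).
ALTER SYSTEM SET max_wal_size = '4GB';
SELECT pg_reload_conf();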
Finally, it is advisable to enable the autovacuum daemon to ensure that the query planner has up-to-date statistics about the table you are about to load bulk data into. So in a situation where we have two large CSV files to import into a table in PostgreSQL, the query planner can rely on recent statistics about the table to make efficient decisions related to query planning.
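If you would rather not wait for autovacuum to pick the table up after a bulk load, a manual ANALYZE refreshes the planner statistics immediately (a sketch using the details table from the earlier example):
-- Refresh planner statistics for the newly loaded table
ANALYZE details;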
If you need assistance with your PostgreSQL operations, consider our PostgreSQL Health Checks.
Michael Aboagye