Last updated on 18 September 2021
You have been assigned by the marketing team to load bulk data from a CSV file into a PostgreSQL table. How would you go about it?
command in PostgreSQL is the most preferred option here. Let’s look at how the
command can be used to load data from a CSV file into a PostgreSQL database.
We assume you already know how to create a database, table and also define the data types in PostgreSQL. If not, you can check the PostgreSQL documentation for detailed information.
How does the copy command in PostgreSQL Works?
command copies files from CSV files into tables in a PostgreSQL database or from tables to CSV files. It is the preferred method when you want to load bulk data into a PostgreSQL table.
command is different from the meta-command
which is used by the psql client to load bulk data into PostgreSQL tables. When loading a large dataset using the
command, this command should be executed on the PostgreSQL server hosting the target tables. In the case of the
used by the psql client, it should be executed from the host where the client resides.
There are two types of
command available in PostgreSQL. There are
command is used to copy a table in PostgreSQL to a selected file whilst the
command is used to copy records in a file to a table.
command supports different file formats such as csv, binary, and text.
However, there is a slight difference between the
command and the
command in terms of transferring files from one source of storage to another.
For instance, when copying data from files to a Postgresql table there is no need to specify the absolute path of the target file but in a situation where you are copying data from Postgres tables to csv files, then you need to specify the absolute path of the target files.
Using the COPY FROM command
Now let’s look at how to use the
command to copy data in CSV file format.
Apart from CSV file format, it supports other file formats such as text and binary.
Copy from CSV file to PostgreSQL table
To copy the
file using the
command, execute the following on your Linux terminal:
COPY details (FName, LName, DoB, Email) FROM '/home/mikey/Downloads/details.csv' DELIMITER ',' CSV HEADER;
The output below means the
command has copied all 10 rows in the
file into the
The query below displays all rows which were available in the
file before they were imported into the
table in PostgreSQL.
select * from details; fname | lname | dob | email --------+------------+------------+--------------------- John | Doe | 1995-01-05 | email@example.com Jane | Doe | 1995-02-05 | firstname.lastname@example.org Davies | Michel | 1990-04-05 | email@example.com Amas | Wick | 1998-03-08 | firstname.lastname@example.org Esther | Wigan | 1997-06-05 | email@example.com Frank | James | 1996-04-07 | firstname.lastname@example.org Gordon | Mack | 1978-04-03 | email@example.com Femi | Akemidola | 1998-05-06 | firstname.lastname@example.org Denis | Wayo | 1996-05-04 | email@example.com Shola | Brown | 1995-04-02 | firstname.lastname@example.org (10 rows)
How long does it take COPY FROM command to import data?
If we want to find out how long it takes to import all the 10 rows inside the
file into the
table, we can use the following command to enable timing on the Postgres server as shown below:
Try to import the 10 rows inside the
file into the postgres table
to find out how long it takes.
COPY batch (FName, City, Gender) FROM '/home/mikey/Downloads/main.csv' DELIMITER ',' CSV HEADER;
The output below shows that the
command took 55 milliseconds to import 10 rows into the
COPY 10 Time: 55.664 ms
command is far more efficient when importing data into a PostgreSQL table in bulk rather than inserting rows one after another as shown below:
INSERT INTO memo (FName, City, Gender) VALUES ('mike', 'Accra', 'male'); INSERT 0 1 Time: 67.555 ms
Although by default, the
command is optimized for bulk loading of data, we can make it more efficient by focusing on the following:
maintenance_work_memparameter: Increasing the value of this parameter increases the performance of the
COPY FROMcommand if you are going to create indexes immediately after populating the table with the
checkpoint_segments: In PostgreSQL when checkpointing occurs, all dirty pages are flushed to the disk for persistence. Each
checkpoint_segmentis usually 16 megabytes. So in the case of loading bulk data, Postgres will checkpoint more frequently than the specified checkpoint frequency via
Checkpointing frequently result in performance issues because of excess use of operating system resources. You can prevent frequent checkpointing by increasing checkpoint segment beyond 16 megabytes.
Finally, it is advisable to enable the
daemon to ensure that the query planner has up-to-date statistics about the table you are about to load bulk data. So in a situation where we have two large CSV files to import into a table in PostgreSQL, the query planner can rely on recent statistics about the table to make efficient decisions related to query planning.