You have been assigned by the marketing team to load bulk data from a CSV file into a PostgreSQL table. How would you go about it?
COPY command in PostgreSQL is the most preferred option here. Let’s look at how the
COPY command can be used to load data from a CSV file into a PostgreSQL database.
We assume you already know how to create a database, table and also define the data types in PostgreSQL. If not, you can check the PostgreSQL documentation for detailed information.
How does the copy command in PostgreSQL Works?
COPY command copies files from CSV files into tables in a PostgreSQL database or from tables to CSV files. It is the preferred method when you want to load bulk data into a PostgreSQL table.
COPY command is different from the meta-command
copy which is used by the psql client to load bulk data into PostgreSQL tables. When loading a large dataset using the
COPY command, this command should be executed on the PostgreSQL server hosting the target tables. In the case of the
copy used by the psql client, it should be executed from the host where the client resides.
There are two types of
COPY command available in PostgreSQL. There are
COPY TO and
COPY FROM commands.
COPY TO command is used to copy a table in PostgreSQL to a selected file whilst the
COPY FROM command is used to copy records in a file to a table.
COPY command supports different file formats such as csv, binary, and text.
However, there is a slight difference between the
COPY TO command and the
COPY FROM command in terms of transferring files from one source of storage to another.
For instance, when copying data from files to a Postgresql table there is no need to specify the absolute path of the target file but in a situation where you are copying data from Postgres tables to csv files, then you need to specify the absolute path of the target files.
Using the COPY FROM command
Now let’s look at how to use the
COPY FROM command to copy data in CSV file format.
Apart from CSV file format, it supports other file formats such as text and binary.
Copy from CSV file to PostgreSQL table
To copy the
details.csv file using the
COPY FROM command, execute the following on your Linux terminal:
COPY details (FName, LName, DoB, Email) FROM '/home/mikey/Downloads/details.csv' DELIMITER ',' CSV HEADER;
The output below means the
COPY FROM command has copied all 10 rows in the
details.csv file into the
The query below displays all rows which were available in the
details.csv file before they were imported into the
details table in PostgreSQL.
select * from details; fname | lname | dob | email --------+------------+------------+--------------------- John | Doe | 1995-01-05 | email@example.com Jane | Doe | 1995-02-05 | firstname.lastname@example.org Davies | Michel | 1990-04-05 | email@example.com Amas | Wick | 1998-03-08 | firstname.lastname@example.org Esther | Wigan | 1997-06-05 | email@example.com Frank | James | 1996-04-07 | firstname.lastname@example.org Gordon | Mack | 1978-04-03 | email@example.com Femi | Akemidola | 1998-05-06 | firstname.lastname@example.org Denis | Wayo | 1996-05-04 | email@example.com Shola | Brown | 1995-04-02 | firstname.lastname@example.org (10 rows)
How long does it take COPY FROM command to import data?
If we want to find out how long it takes to import all the 10 rows inside the
main.csv file into the
batch table, we can use the following command to enable timing on the Postgres server as shown below:
Try to import the 10 rows inside the
main.csv file into the postgres table
batch to find out how long it takes.
COPY batch (FName, City, Gender) FROM '/home/mikey/Downloads/main.csv' DELIMITER ',' CSV HEADER;
The output below shows that the
COPY FROM command took 55 milliseconds to import 10 rows into the
COPY 10 Time: 55.664 ms
COPY FROM command is far more efficient when importing data into a PostgreSQL table in bulk rather than inserting rows one after another as shown below:
INSERT INTO memo (FName, City, Gender) VALUES ('mike', 'Accra', 'male'); INSERT 0 1 Time: 67.555 ms
Although by default, the
COPY FROM command is optimized for bulk loading of data, we can make it more efficient by focusing on the following:
maintenance_work_memparameter: Increasing the value of this parameter increases the performance of the
COPY FROMcommand if you are going to create indexes immediately after populating the table with the
checkpoint_segments: In PostgreSQL when checkpointing occurs, all dirty pages are flushed to the disk for persistence. Each
checkpoint_segmentis usually 16 megabytes. So in the case of loading bulk data, Postgres will checkpoint more frequently than the specified checkpoint frequency via
Checkpointing frequently result in performance issues because of excess use of operating system resources. You can prevent frequent checkpointing by increasing checkpoint segment beyond 16 megabytes.
Finally, it is advisable to enable the
autovacuum daemon to ensure that the query planner has up-to-date statistics about the table you are about to load bulk data. So in a situation where we have two large CSV files to import into a table in PostgreSQL, the query planner can rely on recent statistics about the table to make efficient decisions related to query planning.
If you need assistance with your PostgreSQL operations, consider our PostgreSQL Health Checks.