pgpdf
Overview
| Package | Version | Category | License | Language |
|---|---|---|---|---|
pgpdf | 0.1.0 | TYPE | GPL-3.0 | C |
| ID | Extension | Bin | Lib | Load | Create | Trust | Reloc | Schema |
|---|---|---|---|---|---|---|---|---|
| 3530 | pgpdf | No | Yes | Yes | Yes | Yes | Yes | - |
| Related | pgjq pgjwt prefix semver unit pglite_fusion md5hash asn1oid |
|---|
Version
| Type | Repo | Version | PG Ver | Package | Deps |
|---|---|---|---|---|---|
| EXT | PIGSTY | 0.1.0 | 1817161514 | pgpdf | - |
| RPM | PIGSTY | 0.1.0 | 1817161514 | pgpdf_$v | - |
| DEB | PIGSTY | 0.1.0 | 1817161514 | postgresql-$v-pgpdf | - |
Build
You can build the RPM / DEB packages for pgpdf using pig build:
pig build pkg pgpdf # build RPM / DEB packages
Install
You can install pgpdf directly. First, make sure the PGDG and PIGSTY repositories are added and enabled:
pig repo add pgsql -u # Add repo and update cache
Install the extension using pig or apt/yum/dnf:
pig install pgpdf; # Install for current active PG version
pig ext install -y pgpdf -v 18 # PG 18
pig ext install -y pgpdf -v 17 # PG 17
pig ext install -y pgpdf -v 16 # PG 16
pig ext install -y pgpdf -v 15 # PG 15
pig ext install -y pgpdf -v 14 # PG 14
dnf install -y pgpdf_18 # PG 18
dnf install -y pgpdf_17 # PG 17
dnf install -y pgpdf_16 # PG 16
dnf install -y pgpdf_15 # PG 15
dnf install -y pgpdf_14 # PG 14
apt install -y postgresql-18-pgpdf # PG 18
apt install -y postgresql-17-pgpdf # PG 17
apt install -y postgresql-16-pgpdf # PG 16
apt install -y postgresql-15-pgpdf # PG 15
apt install -y postgresql-14-pgpdf # PG 14
Preload:
shared_preload_libraries = 'pgpdf';
Create Extension:
CREATE EXTENSION pgpdf;
Usage
The actual PDF parsing is done by poppler.
This allows you to work with PDFs in an ACID-compliant way. The usual alternative relies on external scripts or services which can easily make your data ingestion pipeline brittle and leave your raw data out-of-sync.
Download some PDFs.
wget https://wiki.postgresql.org/images/e/ea/PostgreSQL_Introduction.pdf -O /tmp/pgintro.pdf
wget https://pdfobject.com/pdf/sample.pdf -O /tmp/sample.pdf
You can create a pdf type, by casting either a text filepath or bytea column.
CREATE EXTENSION pgpdf;
SELECT '/tmp/pgintro.pdf'::pdf;
pdf
----------------------------------------------------------------------------------
PostgreSQL Introduction +
Digoal.Zhou +
7/20/2011Catalog +
PostgreSQL Origin
If you don’t have the PDF file in your filesystem, but have already stored its content in a bytea column, you can just cast it to pdf.
SELECT pg_read_binary_file('/tmp/pgintro.pdf')::bytea::pdf;
Examples
Create a table with a pdf column:
CREATE TABLE pdfs(name text primary key, doc pdf);
INSERT INTO pdfs VALUES ('pgintro', '/tmp/pgintro.pdf');
INSERT INTO pdfs VALUES ('pgintro', '/tmp/sample.pdf');
Parsing and validation should happen automatically. The files will be read from the disk only once!
The filepath should be accessible by the postgres process / user!
That’s different than the user running psql.
If you don’t understand what this means, as your DBA!
String Functions and Operators
Standard Postgres String Functions and Operators should work as usual:
SELECT 'Below is the PDF we received ' || '/tmp/pgintro.pdf'::pdf;
SELECT upper('/tmp/pgintro.pdf'::pdf::text);
SELECT name
FROM pdfs
WHERE doc::text LIKE '%Postgres%';
Full-Text Search (FTS)
You can also perform full-text search (FTS), since you can work on a pdf file like normal text.
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('postgres');
?column?
----------
t
(1 row)
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('oracle');
?column?
----------
f
(1 row)
Document similarity with pg_trgm
You can use pg_trgm to get the similarity between two documents:
CREATE EXTENSION pg_trgm;
SELECT similarity('/tmp/pgintro.pdf'::pdf::text, '/tmp/sample.pdf'::pdf::text);
Metadata
The following functions are available:
pdf_title(pdf) → textpdf_author(pdf) → textpdf_num_pages(pdf) → integerTotal number of pages in the document
pdf_page(pdf, integer) → textGet the i-th page as text
pdf_creator(pdf) → textpdf_keywords(pdf) → textpdf_metadata(pdf) → textpdf_version(pdf) → textpdf_subject(pdf) → textpdf_creation(pdf) → timestamppdf_modification(pdf) → timestamp
SELECT pdf_title('/tmp/pgintro.pdf');
pdf_title
-------------------------
PostgreSQL Introduction
(1 row)
SELECT pdf_author('/tmp/pgintro.pdf');
pdf_author
------------
周正中
(1 row)
Getting a subset of pages
SELECT pdf_num_pages('/tmp/pgintro.pdf');
pdf_num_pages
---------------
24
(1 row)
SELECT pdf_page('/tmp/pgintro.pdf', 1);
pdf_page
------------------------------
Catalog +
PostgreSQL Origin +
Layout +
Features +
Enterprise Class Attribute+
Case
(1 row)
SELECT pdf_subject('/tmp/pgintro.pdf');
pdf_subject
-------------
(1 row)
SELECT pdf_creation('/tmp/pgintro.pdf');
pdf_creation
--------------------------
Wed Jul 20 11:13:37 2011
(1 row)
SELECT pdf_modification('/tmp/pgintro.pdf');
pdf_modification
--------------------------
Wed Jul 20 11:13:37 2011
(1 row)
SELECT pdf_creator('/tmp/pgintro.pdf');
pdf_creator
------------------------------------
Microsoft® Office PowerPoint® 2007
(1 row)
SELECT pdf_metadata('/tmp/pgintro.pdf');
pdf_metadata
--------------
(1 row)
SELECT pdf_version('/tmp/pgintro.pdf');
pdf_version
-------------
PDF-1.5
(1 row)
Installation
Install poppler dependencies
Linux
sudo apt install -y libpoppler-glib-dev pkg-config
Homebrew/MacOS
brew install poppler pkgconf
cd /tmp
git clone https://github.com/Florents-Tselai/pgpdf.git
cd pgpdf
make
make install # may need sudo
After the installation, in a session:
CREATE EXTENSION pgpdf;
Docker
Get the Docker image with:
docker pull florents/pgpdf:pg17
This adds pgpdf to the Postgres image (replace 17 with your Postgres server version, and run it the same way).
Run the image in a container.
docker run --name pgpdf -p 5432:5432 -e POSTGRES_PASSWORD=pass florents/pgpdf:pg17
Through another terminal, connect to the running server (container).
PGPASSWORD=pass psql -h localhost -p 5432 -U postgres
Reading arbitrary binary data (PDF) into your database can pose security risks. Only use this for files you trust.
Feedback
Was this page helpful?
Thanks for the feedback! Please let us know how we can improve.
Sorry to hear that. Please let us know how we can improve.