pgpdf

PDF type with meta admin & Full-Text Search

Module:

PGEXT

Overview

PIGSTY 3rd Party Extension: pgpdf : PDF type with meta admin & Full-Text Search

Information

Extension ID: 3530
Extension Name: pgpdf
Package Name: pgpdf
Category: TYPE
License: GPLv3
Website: https://github.com/Florents-Tselai/pgpdf
Language: C
Extra Tags: N/A
Comment: N/A

Metadata

Latest Version: 0.1.0
Postgres Support: 17,16,15,14,13
Need Load: Explicit Loading Required
Need DDL: Need CREATE EXTENSION DDL
Relocatable: Can not install to arbitrary schema
Trusted: Trusted, Can be created by user with CREATE Privilege
Schemas: N/A
Requires: N/A

RPM / DEB

RPM Repo: PIGSTY
RPM Name: pgpdf_$v*
RPM Ver : 0.1.0
RPM Deps: N/A
DEB Repo: PIGSTY
DEB Name: postgresql-$v-pgpdf
DEB Ver : 0.1.0
DEB Deps: N/A

Availability

OS	Arch	PG17	PG16	PG15	PG14	PG13
`el8`	`x86_64`	`pgpdf_17` PIGSTY 0.1.0	`pgpdf_16` PIGSTY 0.1.0	`pgpdf_15` PIGSTY 0.1.0	`pgpdf_14` PIGSTY 0.1.0	`pgpdf_13` PIGSTY 0.1.0
`el8`	`aarch64`	`pgpdf_17` PIGSTY 0.1.0	`pgpdf_16` PIGSTY 0.1.0	`pgpdf_15` PIGSTY 0.1.0	`pgpdf_14` PIGSTY 0.1.0	`pgpdf_13` PIGSTY 0.1.0
`el9`	`x86_64`	`pgpdf_17` PIGSTY 0.1.0	`pgpdf_16` PIGSTY 0.1.0	`pgpdf_15` PIGSTY 0.1.0	`pgpdf_14` PIGSTY 0.1.0	`pgpdf_13` PIGSTY 0.1.0
`el9`	`aarch64`	`pgpdf_17` PIGSTY 0.1.0	`pgpdf_16` PIGSTY 0.1.0	`pgpdf_15` PIGSTY 0.1.0	`pgpdf_14` PIGSTY 0.1.0	`pgpdf_13` PIGSTY 0.1.0
`d12`	`x86_64`	`postgresql-17-pgpdf` PIGSTY 0.1.0	`postgresql-16-pgpdf` PIGSTY 0.1.0	`postgresql-15-pgpdf` PIGSTY 0.1.0	`postgresql-14-pgpdf` PIGSTY 0.1.0	`postgresql-13-pgpdf` PIGSTY 0.1.0
`d12`	`aarch64`	`postgresql-17-pgpdf` PIGSTY 0.1.0	`postgresql-16-pgpdf` PIGSTY 0.1.0	`postgresql-15-pgpdf` PIGSTY 0.1.0	`postgresql-14-pgpdf` PIGSTY 0.1.0	`postgresql-13-pgpdf` PIGSTY 0.1.0
`u22`	`x86_64`	`postgresql-17-pgpdf` PIGSTY 0.1.0	`postgresql-16-pgpdf` PIGSTY 0.1.0	`postgresql-15-pgpdf` PIGSTY 0.1.0	`postgresql-14-pgpdf` PIGSTY 0.1.0	`postgresql-13-pgpdf` PIGSTY 0.1.0
`u22`	`aarch64`	`postgresql-17-pgpdf` PIGSTY 0.1.0	`postgresql-16-pgpdf` PIGSTY 0.1.0	`postgresql-15-pgpdf` PIGSTY 0.1.0	`postgresql-14-pgpdf` PIGSTY 0.1.0	`postgresql-13-pgpdf` PIGSTY 0.1.0
`u24`	`x86_64`	`postgresql-17-pgpdf` PIGSTY 0.1.0	`postgresql-16-pgpdf` PIGSTY 0.1.0	`postgresql-15-pgpdf` PIGSTY 0.1.0	`postgresql-14-pgpdf` PIGSTY 0.1.0	`postgresql-13-pgpdf` PIGSTY 0.1.0
`u24`	`aarch64`	`postgresql-17-pgpdf` PIGSTY 0.1.0	`postgresql-16-pgpdf` PIGSTY 0.1.0	`postgresql-15-pgpdf` PIGSTY 0.1.0	`postgresql-14-pgpdf` PIGSTY 0.1.0	`postgresql-13-pgpdf` PIGSTY 0.1.0

Installation

Install pgpdf via the pig CLI tool:

pig ext install pgpdf

Install pgpdf via Pigsty playbook:

./pgsql.yml -t pg_extension -e '{"pg_extensions": ["pgpdf"]}' # -l <cls>

Install pgpdf RPM from YUM repo directly:

dnf install pgpdf_17*;
dnf install pgpdf_16*;
dnf install pgpdf_15*;
dnf install pgpdf_14*;
dnf install pgpdf_13*;

Install pgpdf DEB from APT repo directly:

apt install postgresql-17-pgpdf;
apt install postgresql-16-pgpdf;
apt install postgresql-15-pgpdf;
apt install postgresql-14-pgpdf;
apt install postgresql-13-pgpdf;

Extension pgpdf has to be loaded via shared_preload_libraries

shared_preload_libraries = 'pgpdf'; # add to pg cluster config

Create pgpdf extension on PostgreSQL cluster:

CREATE EXTENSION pgpdf;

Usage

The actual PDF parsing is done by poppler.

This allows you to work with PDFs in an ACID-compliant way. The usual alternative relies on external scripts or services which can easily make your data ingestion pipeline brittle and leave your raw data out-of-sync.

Download some PDFs.

wget https://wiki.postgresql.org/images/e/ea/PostgreSQL_Introduction.pdf -O /tmp/pgintro.pdf
wget https://pdfobject.com/pdf/sample.pdf -O /tmp/sample.pdf

You can create a pdf type, by casting either a text filepath or bytea column.

CREATE EXTENSION pgpdf;
SELECT '/tmp/pgintro.pdf'::pdf;

                                       pdf                                        
----------------------------------------------------------------------------------
 PostgreSQL Introduction                                                         +
 Digoal.Zhou                                                                     +
 7/20/2011Catalog                                                                +
  PostgreSQL Origin

If you don’t have the PDF file in your filesystem, but have already stored its content in a bytea column, you can just cast it to pdf.

SELECT pg_read_binary_file('/tmp/pgintro.pdf')::bytea::pdf;

Examples

Create a table with a pdf column:

CREATE TABLE pdfs(name text primary key, doc pdf);

INSERT INTO pdfs VALUES ('pgintro', '/tmp/pgintro.pdf');
INSERT INTO pdfs VALUES ('pgintro', '/tmp/sample.pdf');

Parsing and validation should happen automatically. The files will be read from the disk only once!

[!NOTE] The filepath should be accessible by the postgres process / user! That’s different than the user running psql. If you don’t understand what this means, as your DBA!

String Functions and Operators

Standard Postgres String Functions and Operators should work as usual:

SELECT 'Below is the PDF we received ' || '/tmp/pgintro.pdf'::pdf;

SELECT upper('/tmp/pgintro.pdf'::pdf::text);

SELECT name
FROM pdfs
WHERE doc::text LIKE '%Postgres%';

Full-Text Search (FTS)

You can also perform full-text search (FTS), since you can work on a pdf file like normal text.

SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('postgres');

 ?column? 
----------
 t
(1 row)

SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('oracle');

 ?column? 
----------
 f
(1 row)

Document similarity with `pg_trgm`

You can use pg_trgm to get the similarity between two documents:

CREATE EXTENSION pg_trgm;

SELECT similarity('/tmp/pgintro.pdf'::pdf::text, '/tmp/sample.pdf'::pdf::text);

Metadata

The following functions are available:

pdf_title(pdf) → text
pdf_author(pdf) → text
pdf_num_pages(pdf) → integer

Total number of pages in the document
pdf_page(pdf, integer) → text

Get the i-th page as text
pdf_creator(pdf) → text
pdf_keywords(pdf) → text
pdf_metadata(pdf) → text
pdf_version(pdf) → text
pdf_subject(pdf) → text
pdf_creation(pdf) → timestamp
pdf_modification(pdf) → timestamp

SELECT pdf_title('/tmp/pgintro.pdf');

        pdf_title        
-------------------------
 PostgreSQL Introduction
(1 row)

SELECT pdf_author('/tmp/pgintro.pdf');

 pdf_author 
------------
 周正中
(1 row)

Getting a subset of pages

SELECT pdf_num_pages('/tmp/pgintro.pdf');

 pdf_num_pages 
---------------
            24
(1 row)

SELECT pdf_page('/tmp/pgintro.pdf', 1);

           pdf_page           
------------------------------
 Catalog                     +
  PostgreSQL Origin         +
  Layout                    +
  Features                  +
  Enterprise Class Attribute+
  Case
(1 row)

SELECT pdf_subject('/tmp/pgintro.pdf');

 pdf_subject 
-------------
 
(1 row)

SELECT pdf_creation('/tmp/pgintro.pdf');

       pdf_creation       
--------------------------
 Wed Jul 20 11:13:37 2011
(1 row)

SELECT pdf_modification('/tmp/pgintro.pdf');

     pdf_modification     
--------------------------
 Wed Jul 20 11:13:37 2011
(1 row)

SELECT pdf_creator('/tmp/pgintro.pdf');

            pdf_creator             
------------------------------------
 Microsoft® Office PowerPoint® 2007
(1 row)

SELECT pdf_metadata('/tmp/pgintro.pdf');

 pdf_metadata 
--------------
 
(1 row)

SELECT pdf_version('/tmp/pgintro.pdf');

 pdf_version 
-------------
 PDF-1.5
(1 row)

Installation

Install poppler dependencies

Linux

sudo apt install -y libpoppler-glib-dev pkg-config

Homebrew/MacOS

brew install poppler pkgconf

cd /tmp
git clone https://github.com/Florents-Tselai/pgpdf.git
cd pgpdf
make
make install # may need sudo

After the installation, in a session:

CREATE EXTENSION pgpdf;

Docker

Get the Docker image with:

docker pull florents/pgpdf:pg17

This adds pgpdf to the Postgres image (replace 17 with your Postgres server version, and run it the same way).

Run the image in a container.

docker run --name pgpdf -p 5432:5432 -e POSTGRES_PASSWORD=pass florents/pgpdf:pg17

Through another terminal, connect to the running server (container).

PGPASSWORD=pass psql -h localhost -p 5432 -U postgres

[!WARNING] Reading arbitrary binary data (PDF) into your database can pose security risks. Only use this for files you trust.

Feedback

Was this page helpful?

Glad to hear it! Please tell us how we can improve.

Sorry to hear that. Please tell us how we can improve.

Last modified 2025-03-21: use global url (cc35107)

pgpdf

Module:

Categories:

Tags:

Overview

Information

Metadata

RPM / DEB

Availability

Installation

Usage

Examples

String Functions and Operators

Full-Text Search (FTS)

Document similarity with `pg_trgm`

Metadata

Installation

Docker

Feedback

pgpdf

Overview

Information

Metadata

RPM / DEB

Availability

Installation

Usage

Examples

String Functions and Operators

Full-Text Search (FTS)

Document similarity with pg_trgm

Metadata

Installation

Docker

Feedback

Document similarity with `pg_trgm`