pgpdf

PDF type with meta admin & Full-Text Search

Overview

PIGSTY 3rd Party Extension: pgpdf : PDF type with meta admin & Full-Text Search

Information

Metadata

  • Latest Version: 0.1.0
  • Postgres Support: 17,16,15,14,13
  • Need Load: Explicit Loading Required
  • Need DDL: Need CREATE EXTENSION DDL
  • Relocatable: Can not install to arbitrary schema
  • Trusted: Trusted, Can be created by user with CREATE Privilege
  • Schemas: N/A
  • Requires: N/A

RPM / DEB

  • RPM Repo: PIGSTY
  • RPM Name: pgpdf_$v*
  • RPM Ver : 0.1.0
  • RPM Deps: N/A
  • DEB Repo: PIGSTY
  • DEB Name: postgresql-$v-pgpdf
  • DEB Ver : 0.1.0
  • DEB Deps: N/A

Availability

OS Arch PG17 PG16 PG15 PG14 PG13
el8 x86_64 pgpdf_17
PIGSTY 0.1.0
pgpdf_16
PIGSTY 0.1.0
pgpdf_15
PIGSTY 0.1.0
pgpdf_14
PIGSTY 0.1.0
pgpdf_13
PIGSTY 0.1.0
el8 aarch64 pgpdf_17
PIGSTY 0.1.0
pgpdf_16
PIGSTY 0.1.0
pgpdf_15
PIGSTY 0.1.0
pgpdf_14
PIGSTY 0.1.0
pgpdf_13
PIGSTY 0.1.0
el9 x86_64 pgpdf_17
PIGSTY 0.1.0
pgpdf_16
PIGSTY 0.1.0
pgpdf_15
PIGSTY 0.1.0
pgpdf_14
PIGSTY 0.1.0
pgpdf_13
PIGSTY 0.1.0
el9 aarch64 pgpdf_17
PIGSTY 0.1.0
pgpdf_16
PIGSTY 0.1.0
pgpdf_15
PIGSTY 0.1.0
pgpdf_14
PIGSTY 0.1.0
pgpdf_13
PIGSTY 0.1.0
d12 x86_64 postgresql-17-pgpdf
PIGSTY 0.1.0
postgresql-16-pgpdf
PIGSTY 0.1.0
postgresql-15-pgpdf
PIGSTY 0.1.0
postgresql-14-pgpdf
PIGSTY 0.1.0
postgresql-13-pgpdf
PIGSTY 0.1.0
d12 aarch64 postgresql-17-pgpdf
PIGSTY 0.1.0
postgresql-16-pgpdf
PIGSTY 0.1.0
postgresql-15-pgpdf
PIGSTY 0.1.0
postgresql-14-pgpdf
PIGSTY 0.1.0
postgresql-13-pgpdf
PIGSTY 0.1.0
u22 x86_64 postgresql-17-pgpdf
PIGSTY 0.1.0
postgresql-16-pgpdf
PIGSTY 0.1.0
postgresql-15-pgpdf
PIGSTY 0.1.0
postgresql-14-pgpdf
PIGSTY 0.1.0
postgresql-13-pgpdf
PIGSTY 0.1.0
u22 aarch64 postgresql-17-pgpdf
PIGSTY 0.1.0
postgresql-16-pgpdf
PIGSTY 0.1.0
postgresql-15-pgpdf
PIGSTY 0.1.0
postgresql-14-pgpdf
PIGSTY 0.1.0
postgresql-13-pgpdf
PIGSTY 0.1.0
u24 x86_64 postgresql-17-pgpdf
PIGSTY 0.1.0
postgresql-16-pgpdf
PIGSTY 0.1.0
postgresql-15-pgpdf
PIGSTY 0.1.0
postgresql-14-pgpdf
PIGSTY 0.1.0
postgresql-13-pgpdf
PIGSTY 0.1.0
u24 aarch64 postgresql-17-pgpdf
PIGSTY 0.1.0
postgresql-16-pgpdf
PIGSTY 0.1.0
postgresql-15-pgpdf
PIGSTY 0.1.0
postgresql-14-pgpdf
PIGSTY 0.1.0
postgresql-13-pgpdf
PIGSTY 0.1.0

Installation

Install pgpdf via the pig CLI tool:

pig ext install pgpdf

Install pgpdf via Pigsty playbook:

./pgsql.yml -t pg_extension -e '{"pg_extensions": ["pgpdf"]}' # -l <cls>

Install pgpdf RPM from YUM repo directly:

dnf install pgpdf_17*; dnf install pgpdf_16*; dnf install pgpdf_15*; dnf install pgpdf_14*; dnf install pgpdf_13*;

Install pgpdf DEB from APT repo directly:

apt install postgresql-17-pgpdf; apt install postgresql-16-pgpdf; apt install postgresql-15-pgpdf; apt install postgresql-14-pgpdf; apt install postgresql-13-pgpdf;

Extension pgpdf has to be loaded via shared_preload_libraries

shared_preload_libraries = 'pgpdf'; # add to pg cluster config

Create pgpdf extension on PostgreSQL cluster:

CREATE EXTENSION pgpdf;

Usage

The actual PDF parsing is done by poppler.

This allows you to work with PDFs in an ACID-compliant way. The usual alternative relies on external scripts or services which can easily make your data ingestion pipeline brittle and leave your raw data out-of-sync.

Download some PDFs.

wget https://wiki.postgresql.org/images/e/ea/PostgreSQL_Introduction.pdf -O /tmp/pgintro.pdf wget https://pdfobject.com/pdf/sample.pdf -O /tmp/sample.pdf

You can create a pdf type, by casting either a text filepath or bytea column.

CREATE EXTENSION pgpdf; SELECT '/tmp/pgintro.pdf'::pdf;
pdf ---------------------------------------------------------------------------------- PostgreSQL Introduction + Digoal.Zhou + 7/20/2011Catalog + PostgreSQL Origin

If you don’t have the PDF file in your filesystem, but have already stored its content in a bytea column, you can just cast it to pdf.

SELECT pg_read_binary_file('/tmp/pgintro.pdf')::bytea::pdf;

Examples

Create a table with a pdf column:

CREATE TABLE pdfs(name text primary key, doc pdf); INSERT INTO pdfs VALUES ('pgintro', '/tmp/pgintro.pdf'); INSERT INTO pdfs VALUES ('pgintro', '/tmp/sample.pdf');

Parsing and validation should happen automatically. The files will be read from the disk only once!

[!NOTE] The filepath should be accessible by the postgres process / user! That’s different than the user running psql. If you don’t understand what this means, as your DBA!

String Functions and Operators

Standard Postgres String Functions and Operators should work as usual:

SELECT 'Below is the PDF we received ' || '/tmp/pgintro.pdf'::pdf;
SELECT upper('/tmp/pgintro.pdf'::pdf::text);
SELECT name FROM pdfs WHERE doc::text LIKE '%Postgres%';

Full-Text Search (FTS)

You can also perform full-text search (FTS), since you can work on a pdf file like normal text.

SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('postgres');
?column? ---------- t (1 row)
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('oracle');
?column? ---------- f (1 row)

Document similarity with pg_trgm

You can use pg_trgm to get the similarity between two documents:

CREATE EXTENSION pg_trgm; SELECT similarity('/tmp/pgintro.pdf'::pdf::text, '/tmp/sample.pdf'::pdf::text);

Metadata

The following functions are available:

  • pdf_title(pdf) → text

  • pdf_author(pdf) → text

  • pdf_num_pages(pdf) → integer

    Total number of pages in the document

  • pdf_page(pdf, integer) → text

    Get the i-th page as text

  • pdf_creator(pdf) → text

  • pdf_keywords(pdf) → text

  • pdf_metadata(pdf) → text

  • pdf_version(pdf) → text

  • pdf_subject(pdf) → text

  • pdf_creation(pdf) → timestamp

  • pdf_modification(pdf) → timestamp

SELECT pdf_title('/tmp/pgintro.pdf');
pdf_title ------------------------- PostgreSQL Introduction (1 row)
SELECT pdf_author('/tmp/pgintro.pdf');
pdf_author ------------ 周正中 (1 row)

Getting a subset of pages

SELECT pdf_num_pages('/tmp/pgintro.pdf');
pdf_num_pages --------------- 24 (1 row)
SELECT pdf_page('/tmp/pgintro.pdf', 1);
pdf_page ------------------------------ Catalog + PostgreSQL Origin + Layout + Features + Enterprise Class Attribute+ Case (1 row)
SELECT pdf_subject('/tmp/pgintro.pdf');
pdf_subject ------------- (1 row)
SELECT pdf_creation('/tmp/pgintro.pdf');
pdf_creation -------------------------- Wed Jul 20 11:13:37 2011 (1 row)
SELECT pdf_modification('/tmp/pgintro.pdf');
pdf_modification -------------------------- Wed Jul 20 11:13:37 2011 (1 row)
SELECT pdf_creator('/tmp/pgintro.pdf');
pdf_creator ------------------------------------ Microsoft® Office PowerPoint® 2007 (1 row)
SELECT pdf_metadata('/tmp/pgintro.pdf');
pdf_metadata -------------- (1 row)
SELECT pdf_version('/tmp/pgintro.pdf');
pdf_version ------------- PDF-1.5 (1 row)

Installation

Install poppler dependencies

Linux

sudo apt install -y libpoppler-glib-dev pkg-config

Homebrew/MacOS

brew install poppler pkgconf
cd /tmp git clone https://github.com/Florents-Tselai/pgpdf.git cd pgpdf make make install # may need sudo

After the installation, in a session:

CREATE EXTENSION pgpdf;

Docker

Get the Docker image with:

docker pull florents/pgpdf:pg17

This adds pgpdf to the Postgres image (replace 17 with your Postgres server version, and run it the same way).

Run the image in a container.

docker run --name pgpdf -p 5432:5432 -e POSTGRES_PASSWORD=pass florents/pgpdf:pg17

Through another terminal, connect to the running server (container).

PGPASSWORD=pass psql -h localhost -p 5432 -U postgres

[!WARNING] Reading arbitrary binary data (PDF) into your database can pose security risks. Only use this for files you trust.





Last modified 2025-03-21: use global url (cc35107)