pgpdf
Module:
Categories:
Overview
PIGSTY 3rd Party Extension: pgpdf
: PDF type with meta admin & Full-Text Search
Information
- Extension ID: 3530
- Extension Name:
pgpdf
- Package Name:
pgpdf
- Category:
TYPE
- License: GPLv3
- Website: https://github.com/Florents-Tselai/pgpdf
- Language: C
- Extra Tags: N/A
- Comment: N/A
Metadata
- Latest Version: 0.1.0
- Postgres Support:
17
,16
,15
,14
,13
- Need Load: Explicit Loading Required
- Need DDL: Need
CREATE EXTENSION
DDL - Relocatable: Can not install to arbitrary schema
- Trusted: Trusted, Can be created by user with
CREATE
Privilege - Schemas: N/A
- Requires: N/A
RPM / DEB
- RPM Repo: PIGSTY
- RPM Name:
pgpdf_$v*
- RPM Ver :
0.1.0
- RPM Deps: N/A
- DEB Repo: PIGSTY
- DEB Name:
postgresql-$v-pgpdf
- DEB Ver :
0.1.0
- DEB Deps: N/A
Packages
OS | Arch | PG17 | PG16 | PG15 | PG14 | PG13 |
---|---|---|---|---|---|---|
el8 |
x86_64 |
pgpdf_17 PIGSTY 0.1.0 |
pgpdf_16 PIGSTY 0.1.0 |
pgpdf_15 PIGSTY 0.1.0 |
pgpdf_14 PIGSTY 0.1.0 |
pgpdf_13 PIGSTY 0.1.0 |
el8 |
aarch64 |
pgpdf_17 PIGSTY 0.1.0 |
pgpdf_16 PIGSTY 0.1.0 |
pgpdf_15 PIGSTY 0.1.0 |
pgpdf_14 PIGSTY 0.1.0 |
pgpdf_13 PIGSTY 0.1.0 |
el9 |
x86_64 |
pgpdf_17 PIGSTY 0.1.0 |
pgpdf_16 PIGSTY 0.1.0 |
pgpdf_15 PIGSTY 0.1.0 |
pgpdf_14 PIGSTY 0.1.0 |
pgpdf_13 PIGSTY 0.1.0 |
el9 |
aarch64 |
pgpdf_17 PIGSTY 0.1.0 |
pgpdf_16 PIGSTY 0.1.0 |
pgpdf_15 PIGSTY 0.1.0 |
pgpdf_14 PIGSTY 0.1.0 |
pgpdf_13 PIGSTY 0.1.0 |
d12 |
x86_64 |
postgresql-17-pgpdf PIGSTY 0.1.0 |
postgresql-16-pgpdf PIGSTY 0.1.0 |
postgresql-15-pgpdf PIGSTY 0.1.0 |
postgresql-14-pgpdf PIGSTY 0.1.0 |
postgresql-13-pgpdf PIGSTY 0.1.0 |
d12 |
aarch64 |
postgresql-17-pgpdf PIGSTY 0.1.0 |
postgresql-16-pgpdf PIGSTY 0.1.0 |
postgresql-15-pgpdf PIGSTY 0.1.0 |
postgresql-14-pgpdf PIGSTY 0.1.0 |
postgresql-13-pgpdf PIGSTY 0.1.0 |
u22 |
x86_64 |
postgresql-17-pgpdf PIGSTY 0.1.0 |
postgresql-16-pgpdf PIGSTY 0.1.0 |
postgresql-15-pgpdf PIGSTY 0.1.0 |
postgresql-14-pgpdf PIGSTY 0.1.0 |
postgresql-13-pgpdf PIGSTY 0.1.0 |
u22 |
aarch64 |
postgresql-17-pgpdf PIGSTY 0.1.0 |
postgresql-16-pgpdf PIGSTY 0.1.0 |
postgresql-15-pgpdf PIGSTY 0.1.0 |
postgresql-14-pgpdf PIGSTY 0.1.0 |
postgresql-13-pgpdf PIGSTY 0.1.0 |
u24 |
x86_64 |
postgresql-17-pgpdf PIGSTY 0.1.0 |
postgresql-16-pgpdf PIGSTY 0.1.0 |
postgresql-15-pgpdf PIGSTY 0.1.0 |
postgresql-14-pgpdf PIGSTY 0.1.0 |
postgresql-13-pgpdf PIGSTY 0.1.0 |
u24 |
aarch64 |
postgresql-17-pgpdf PIGSTY 0.1.0 |
postgresql-16-pgpdf PIGSTY 0.1.0 |
postgresql-15-pgpdf PIGSTY 0.1.0 |
postgresql-14-pgpdf PIGSTY 0.1.0 |
postgresql-13-pgpdf PIGSTY 0.1.0 |
Installation
Install pgpdf
via the pig
CLI tool:
pig ext install pgpdf
Install pgpdf
via Pigsty playbook:
./pgsql.yml -t pg_extension -e '{"pg_extensions": ["pgpdf"]}' # -l <cls>
Install pgpdf
RPM from YUM repo directly:
dnf install pgpdf_17*;
dnf install pgpdf_16*;
dnf install pgpdf_15*;
dnf install pgpdf_14*;
dnf install pgpdf_13*;
Install pgpdf
DEB from APT repo directly:
apt install postgresql-17-pgpdf;
apt install postgresql-16-pgpdf;
apt install postgresql-15-pgpdf;
apt install postgresql-14-pgpdf;
apt install postgresql-13-pgpdf;
Extension pgpdf
has to be added to shared_preload_libraries
shared_preload_libraries = 'pgpdf'; # add to pg cluster config
Enable pgpdf
extension on PostgreSQL cluster:
CREATE EXTENSION pgpdf;
Usage
The actual PDF parsing is done by poppler.
This allows you to work with PDFs in an ACID-compliant way. The usual alternative relies on external scripts or services which can easily make your data ingestion pipeline brittle and leave your raw data out-of-sync.
Download some PDFs.
wget https://wiki.postgresql.org/images/e/ea/PostgreSQL_Introduction.pdf -O /tmp/pgintro.pdf
wget https://pdfobject.com/pdf/sample.pdf -O /tmp/sample.pdf
You can create a pdf
type, by casting either a text
filepath or bytea
column.
CREATE EXTENSION pgpdf;
SELECT '/tmp/pgintro.pdf'::pdf;
pdf
----------------------------------------------------------------------------------
PostgreSQL Introduction +
Digoal.Zhou +
7/20/2011Catalog +
PostgreSQL Origin
If you don’t have the PDF file in your filesystem, but have already stored its content in a bytea
column, you can just cast it to pdf
.
SELECT pg_read_binary_file('/tmp/pgintro.pdf')::bytea::pdf;
Examples
Create a table with a pdf
column:
CREATE TABLE pdfs(name text primary key, doc pdf);
INSERT INTO pdfs VALUES ('pgintro', '/tmp/pgintro.pdf');
INSERT INTO pdfs VALUES ('pgintro', '/tmp/sample.pdf');
Parsing and validation should happen automatically. The files will be read from the disk only once!
[!NOTE] The filepath should be accessible by the
postgres
process / user! That’s different than the user running psql. If you don’t understand what this means, as your DBA!
String Functions and Operators
Standard Postgres String Functions and Operators should work as usual:
SELECT 'Below is the PDF we received ' || '/tmp/pgintro.pdf'::pdf;
SELECT upper('/tmp/pgintro.pdf'::pdf::text);
SELECT name
FROM pdfs
WHERE doc::text LIKE '%Postgres%';
Full-Text Search (FTS)
You can also perform full-text search (FTS), since you can work on a pdf
file like normal text.
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('postgres');
?column?
----------
t
(1 row)
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('oracle');
?column?
----------
f
(1 row)
Document similarity with pg_trgm
You can use pg_trgm to get the similarity between two documents:
CREATE EXTENSION pg_trgm;
SELECT similarity('/tmp/pgintro.pdf'::pdf::text, '/tmp/sample.pdf'::pdf::text);
Metadata
The following functions are available:
-
pdf_title(pdf) → text
-
pdf_author(pdf) → text
-
pdf_num_pages(pdf) → integer
Total number of pages in the document
-
pdf_page(pdf, integer) → text
Get the i-th page as text
-
pdf_creator(pdf) → text
-
pdf_keywords(pdf) → text
-
pdf_metadata(pdf) → text
-
pdf_version(pdf) → text
-
pdf_subject(pdf) → text
-
pdf_creation(pdf) → timestamp
-
pdf_modification(pdf) → timestamp
SELECT pdf_title('/tmp/pgintro.pdf');
pdf_title
-------------------------
PostgreSQL Introduction
(1 row)
SELECT pdf_author('/tmp/pgintro.pdf');
pdf_author
------------
周正中
(1 row)
Getting a subset of pages
SELECT pdf_num_pages('/tmp/pgintro.pdf');
pdf_num_pages
---------------
24
(1 row)
SELECT pdf_page('/tmp/pgintro.pdf', 1);
pdf_page
------------------------------
Catalog +
PostgreSQL Origin +
Layout +
Features +
Enterprise Class Attribute+
Case
(1 row)
SELECT pdf_subject('/tmp/pgintro.pdf');
pdf_subject
-------------
(1 row)
SELECT pdf_creation('/tmp/pgintro.pdf');
pdf_creation
--------------------------
Wed Jul 20 11:13:37 2011
(1 row)
SELECT pdf_modification('/tmp/pgintro.pdf');
pdf_modification
--------------------------
Wed Jul 20 11:13:37 2011
(1 row)
SELECT pdf_creator('/tmp/pgintro.pdf');
pdf_creator
------------------------------------
Microsoft® Office PowerPoint® 2007
(1 row)
SELECT pdf_metadata('/tmp/pgintro.pdf');
pdf_metadata
--------------
(1 row)
SELECT pdf_version('/tmp/pgintro.pdf');
pdf_version
-------------
PDF-1.5
(1 row)
Installation
Install poppler dependencies
Linux
sudo apt install -y libpoppler-glib-dev pkg-config
Homebrew/MacOS
brew install poppler pkgconf
cd /tmp
git clone https://github.com/Florents-Tselai/pgpdf.git
cd pgpdf
make
make install # may need sudo
After the installation, in a session:
CREATE EXTENSION pgpdf;
Docker
Get the Docker image with:
docker pull florents/pgpdf:pg17
This adds pgpdf to the Postgres image (replace 17
with your Postgres server version, and run it the same way).
Run the image in a container.
docker run --name pgpdf -p 5432:5432 -e POSTGRES_PASSWORD=pass florents/pgpdf:pg17
Through another terminal, connect to the running server (container).
PGPASSWORD=pass psql -h localhost -p 5432 -U postgres
[!WARNING] Reading arbitrary binary data (PDF) into your database can pose security risks. Only use this for files you trust.
Feedback
Was this page helpful?
Glad to hear it! Please tell us how we can improve.
Sorry to hear that. Please tell us how we can improve.