pg_tokenizer

Tokenizers for full-text search

Overview

| Package | Version | Category | License | Language |
|---|---|---|---|---|
| pg_tokenizer | 0.1.1 | FTS | Apache-2.0 | Rust |

| ID | Extension | Bin | Lib | Load | Create | Trust | Reloc | Schema |
|---|---|---|---|---|---|---|---|---|
| 2160 | pg_tokenizer | No | Yes | No | Yes | Yes | No | tokenizer_catalog |

Related: pg_search, pgroonga, pg_bigm, zhparser, pgroonga_database, pg_bestmatch, vchord_bm25, pg_trgm

PG18 fix by Vonng

Version

| Type | Repo | Version | PG Ver | Package | Deps |
|---|---|---|---|---|---|
| EXT | PIGSTY | 0.1.1 | 18, 17, 16, 15, 14 | pg_tokenizer | - |
| RPM | PIGSTY | 0.1.1 | 18, 17, 16, 15, 14 | pg_tokenizer_$v | - |
| DEB | PIGSTY | 0.1.1 | 18, 17, 16, 15, 14 | postgresql-$v-pg-tokenizer | - |
| OS / PG | PG18 | PG17 | PG16 | PG15 | PG14 |
|---|---|---|---|---|---|
| el8.x86_64 |  |  |  |  |  |
| el8.aarch64 |  |  |  |  |  |
| el9.x86_64 |  |  |  |  |  |
| el9.aarch64 |  |  |  |  |  |
| el10.x86_64 |  |  |  |  |  |
| el10.aarch64 |  |  |  |  |  |
| d12.x86_64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| d12.aarch64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| d13.x86_64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| d13.aarch64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| u22.x86_64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| u22.aarch64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| u24.x86_64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| u24.aarch64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |

Build

You can build the RPM / DEB packages for pg_tokenizer using pig build:

pig build pkg pg_tokenizer         # build RPM / DEB packages

Install

You can install pg_tokenizer directly. First, make sure the PGDG and PIGSTY repositories are added and enabled:

pig repo add pgsql -u          # Add repo and update cache

Install the extension using pig or apt/yum/dnf:

pig install pg_tokenizer           # Install for current active PG version
pig ext install -y pg_tokenizer -v 18  # PG 18
pig ext install -y pg_tokenizer -v 17  # PG 17
pig ext install -y pg_tokenizer -v 16  # PG 16
pig ext install -y pg_tokenizer -v 15  # PG 15
pig ext install -y pg_tokenizer -v 14  # PG 14
dnf install -y pg_tokenizer_18       # PG 18
dnf install -y pg_tokenizer_17       # PG 17
dnf install -y pg_tokenizer_16       # PG 16
dnf install -y pg_tokenizer_15       # PG 15
dnf install -y pg_tokenizer_14       # PG 14
apt install -y postgresql-18-pg-tokenizer   # PG 18
apt install -y postgresql-17-pg-tokenizer   # PG 17
apt install -y postgresql-16-pg-tokenizer   # PG 16
apt install -y postgresql-15-pg-tokenizer   # PG 15
apt install -y postgresql-14-pg-tokenizer   # PG 14
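
After installation, you can check that the server sees the extension. This uses the standard `pg_available_extensions` catalog view, which is not specific to pg_tokenizer:

```sql
-- Confirm pg_tokenizer is installed and see its default version
SELECT name, default_version
FROM pg_available_extensions
WHERE name = 'pg_tokenizer';
```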

Create Extension:

CREATE EXTENSION pg_tokenizer;

Usage

GitHub: tensorchord/pg_tokenizer.rs

pg_tokenizer is a PostgreSQL extension that provides tokenizers for full-text search. It is designed to work with VectorChord-BM25 for native BM25 ranking index support.

Quick Start

CREATE EXTENSION pg_tokenizer;

-- Create a tokenizer using the LLMLingua2 model
SELECT create_tokenizer('tokenizer1', $$
model = "llmlingua2"
$$);

-- Tokenize text
SELECT tokenize('PostgreSQL is a powerful, open-source object-relational database system. It has over 15 years of active development.', 'tokenizer1');

Tokenizer Models

pg_tokenizer supports multiple tokenizer models for different languages and use cases:

| Model | Language | Description |
|---|---|---|
| llmlingua2 | English | BERT-based tokenizer from LLMLingua2 |
| jieba | Chinese | Jieba Chinese text segmentation |
| lindera/ipadic | Japanese | Lindera tokenizer with IPADIC dictionary |
| Custom models | Any | User-trained models for domain-specific text |

Creating Tokenizers

-- English tokenizer
SELECT create_tokenizer('en_tokenizer', $$
model = "llmlingua2"
$$);

-- Chinese tokenizer
SELECT create_tokenizer('zh_tokenizer', $$
model = "jieba"
$$);

-- Japanese tokenizer
SELECT create_tokenizer('ja_tokenizer', $$
model = "lindera/ipadic"
$$);

Tokenizing Text

-- Tokenize English text
SELECT tokenize('full text search in PostgreSQL', 'en_tokenizer');

-- Tokenize Chinese text
SELECT tokenize('PostgreSQL是一个强大的数据库系统', 'zh_tokenizer');
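
The Japanese tokenizer created in the previous section works the same way. The sample string below is illustrative; it assumes the `ja_tokenizer` defined above exists:

```sql
-- Tokenize Japanese text with the Lindera/IPADIC tokenizer
SELECT tokenize('全文検索は便利な機能です', 'ja_tokenizer');
```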

Text Analyzer

pg_tokenizer also provides text analyzer functionality that combines tokenization with additional text processing steps. For detailed text analyzer usage, refer to the Text Analyzer documentation.
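
As a sketch of what such a pipeline looks like: the upstream project configures text analyzers with a TOML document, similar to tokenizers. The filter names below (`unicode_segmentation`, `to_lowercase`, `stopwords`, `stemmer`) follow the upstream README and may differ between versions, so check the Text Analyzer documentation before relying on them:

```sql
-- Illustrative sketch: a text analyzer combining Unicode word
-- segmentation with lowercasing, stopword removal, and stemming.
SELECT create_text_analyzer('my_analyzer', $$
pre_tokenizer = "unicode_segmentation"
[[character_filters]]
to_lowercase = {}
[[token_filters]]
stopwords = "nltk_english"
[[token_filters]]
stemmer = "english_porter2"
$$);
```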

Integration with VectorChord-BM25

pg_tokenizer is typically used together with VectorChord-BM25 for full BM25 ranking support:

CREATE EXTENSION IF NOT EXISTS pg_tokenizer CASCADE;
CREATE EXTENSION IF NOT EXISTS vchord_bm25 CASCADE;

-- Create a tokenizer
SELECT create_tokenizer('my_tokenizer', $$
model = "llmlingua2"
$$);

-- Tokenize text into bm25vectors for indexing and search
SELECT tokenize('your search query', 'my_tokenizer');
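
An end-to-end flow might look like the sketch below. The names `bm25vector`, `bm25_ops`, `to_bm25query`, and the `<&>` distance operator come from the VectorChord-BM25 documentation and may differ between versions; the table and index names are hypothetical:

```sql
-- Illustrative sketch: store tokenized passages as bm25vectors,
-- index them, and rank results with the BM25 distance operator.
CREATE TABLE documents (
    id        SERIAL PRIMARY KEY,
    passage   TEXT,
    embedding bm25vector
);

UPDATE documents SET embedding = tokenize(passage, 'my_tokenizer');

CREATE INDEX documents_embedding_bm25
    ON documents USING bm25 (embedding bm25_ops);

SELECT id, passage
FROM documents
ORDER BY embedding <&> to_bm25query('documents_embedding_bm25',
                                    tokenize('elephant', 'my_tokenizer'))
LIMIT 10;
```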

Documentation

For more details, see the upstream documentation at GitHub: tensorchord/pg_tokenizer.rs
Last Modified 2026-03-12: add pg extension catalog (95749bf)