# pg_tokenizer

## Overview

| Package | Version | Category | License | Language |
|---|---|---|---|---|
| pg_tokenizer | 0.1.1 | FTS | Apache-2.0 | Rust |
| ID | Extension | Bin | Lib | Load | Create | Trust | Reloc | Schema |
|---|---|---|---|---|---|---|---|---|
| 2160 | pg_tokenizer | No | Yes | No | Yes | Yes | No | tokenizer_catalog |
| Related |
|---|
| pg_search, pgroonga, pg_bigm, zhparser, pgroonga_database, pg_bestmatch, vchord_bm25, pg_trgm |
PG18 fix by Vonng
## Version

| Type | Repo | Version | PG Ver | Package | Deps |
|---|---|---|---|---|---|
| EXT | PIGSTY | 0.1.1 | 18, 17, 16, 15, 14 | pg_tokenizer | - |
| RPM | PIGSTY | 0.1.1 | 18, 17, 16, 15, 14 | pg_tokenizer_$v | - |
| DEB | PIGSTY | 0.1.1 | 18, 17, 16, 15, 14 | postgresql-$v-pg-tokenizer | - |
## Build

You can build the RPM / DEB packages for pg_tokenizer with `pig build`:

```bash
pig build pkg pg_tokenizer   # build RPM / DEB packages
```
## Install

You can install pg_tokenizer directly. First, make sure the PGDG and PIGSTY repositories are added and enabled:

```bash
pig repo add pgsql -u   # add repositories and update cache
```
Install the extension using pig, or with apt/yum/dnf:

```bash
pig install pg_tokenizer                # install for the current active PG version
pig ext install -y pg_tokenizer -v 18   # PG 18
pig ext install -y pg_tokenizer -v 17   # PG 17
pig ext install -y pg_tokenizer -v 16   # PG 16
pig ext install -y pg_tokenizer -v 15   # PG 15
pig ext install -y pg_tokenizer -v 14   # PG 14
```

```bash
dnf install -y pg_tokenizer_18   # PG 18
dnf install -y pg_tokenizer_17   # PG 17
dnf install -y pg_tokenizer_16   # PG 16
dnf install -y pg_tokenizer_15   # PG 15
dnf install -y pg_tokenizer_14   # PG 14
```

```bash
apt install -y postgresql-18-pg-tokenizer   # PG 18
apt install -y postgresql-17-pg-tokenizer   # PG 17
apt install -y postgresql-16-pg-tokenizer   # PG 16
apt install -y postgresql-15-pg-tokenizer   # PG 15
apt install -y postgresql-14-pg-tokenizer   # PG 14
```

Then create the extension:

```sql
CREATE EXTENSION pg_tokenizer;
```
## Usage
pg_tokenizer is a PostgreSQL extension that provides tokenizers for full-text search. It is designed to work with VectorChord-bm25 for native BM25 ranking index support.
### Quick Start

```sql
CREATE EXTENSION pg_tokenizer;

-- Create a tokenizer using the LLMLingua2 model
SELECT create_tokenizer('tokenizer1', $$
model = "llmlingua2"
$$);

-- Tokenize text
SELECT tokenize('PostgreSQL is a powerful, open-source object-relational database system. It has over 15 years of active development.', 'tokenizer1');
```
### Tokenizer Models

pg_tokenizer supports multiple tokenizer models for different languages and use cases:

| Model | Language | Description |
|---|---|---|
| llmlingua2 | English | BERT-based tokenizer from LLMLingua2 |
| jieba | Chinese | Jieba Chinese text segmentation |
| lindera/ipadic | Japanese | Lindera tokenizer with the IPADIC dictionary |
| Custom models | Any | User-trained models for domain-specific text |
### Creating Tokenizers

```sql
-- English tokenizer
SELECT create_tokenizer('en_tokenizer', $$
model = "llmlingua2"
$$);

-- Chinese tokenizer
SELECT create_tokenizer('zh_tokenizer', $$
model = "jieba"
$$);

-- Japanese tokenizer
SELECT create_tokenizer('ja_tokenizer', $$
model = "lindera/ipadic"
$$);
```
### Tokenizing Text

```sql
-- Tokenize English text
SELECT tokenize('full text search in PostgreSQL', 'en_tokenizer');

-- Tokenize Chinese text
SELECT tokenize('PostgreSQL是一个强大的数据库系统', 'zh_tokenizer');
```
### Text Analyzer
pg_tokenizer also provides text analyzer functionality that combines tokenization with additional text processing steps. For detailed text analyzer usage, refer to the Text Analyzer documentation.
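As a rough illustration, the analysis steps can be layered into the same TOML config passed to `create_tokenizer`. The specific keys below (`pre_tokenizer`, `character_filters`, `token_filters` and their values) are assumptions for the sake of the sketch, not confirmed API; consult the Text Analyzer documentation for the actual schema:

```sql
-- Hypothetical sketch: the config keys below are assumed, and may not match
-- the real Text Analyzer schema -- verify against the pg_tokenizer docs.
SELECT create_tokenizer('analyzed_tokenizer', $$
pre_tokenizer = "unicode_segmentation"  -- split on Unicode word boundaries (assumed key)
[[character_filters]]
to_lowercase = {}                       -- lowercase input before tokenizing (assumed key)
[[token_filters]]
stemmer = "english_porter2"             -- stem tokens after splitting (assumed key)
$$);
```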
### Integration with VectorChord-BM25

pg_tokenizer is typically used together with VectorChord-BM25 for full BM25 ranking support:

```sql
CREATE EXTENSION IF NOT EXISTS pg_tokenizer CASCADE;
CREATE EXTENSION IF NOT EXISTS vchord_bm25 CASCADE;

-- Create a tokenizer
SELECT create_tokenizer('my_tokenizer', $$
model = "llmlingua2"
$$);

-- Tokenize text into bm25vectors for indexing and search
SELECT tokenize('your search query', 'my_tokenizer');
```
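An end-to-end workflow might look like the following sketch. The table and index names are made up here, and the `bm25vector` type, `bm25` index access method, `<&>` operator, and `to_bm25query` function are taken from the VectorChord-BM25 documentation; their exact signatures may differ between versions, so check its current README:

```sql
-- Sketch only: vchord_bm25 names (bm25vector, bm25 AM, <&>, to_bm25query)
-- follow the VectorChord-BM25 docs and may vary across versions.
CREATE TABLE documents (id SERIAL PRIMARY KEY, passage TEXT, embedding bm25vector);

INSERT INTO documents (passage) VALUES ('PostgreSQL is a powerful database system.');

-- Tokenize each passage into a bm25vector
UPDATE documents SET embedding = tokenize(passage, 'my_tokenizer');

-- Build a BM25 ranking index over the vectors
CREATE INDEX documents_embedding_bm25 ON documents USING bm25 (embedding bm25_ops);

-- Rank documents against a tokenized query (lower distance = better match)
SELECT id, passage
FROM documents
ORDER BY embedding <&> to_bm25query('documents_embedding_bm25',
                                    tokenize('powerful database', 'my_tokenizer'))
LIMIT 10;
```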
## Documentation

For more details, see the full documentation.