vchord_bm25

A PostgreSQL extension for the BM25 ranking algorithm

Overview

Package      Version  Category  License   Language
vchord_bm25  0.3.0    FTS       AGPL-3.0  Rust

ID    Extension    Bin  Lib  Load  Create  Trust  Reloc  Schema
2150  vchord_bm25  No   Yes  Yes   Yes     No     No     bm25_catalog

Related: vector, vchord, pg_search, pg_bestmatch, vectorscale, zhparser, pg_tokenizer, pgroonga

Version

Type  Repo    Version  PG Ver              Package                    Deps
EXT   PIGSTY  0.3.0    18, 17, 16, 15, 14  vchord_bm25                -
RPM   PIGSTY  0.3.0    18, 17, 16, 15, 14  vchord_bm25_$v             -
DEB   PIGSTY  0.3.0    18, 17, 16, 15, 14  postgresql-$v-vchord-bm25  -

OS / PG       PG18          PG17          PG16          PG15          PG14
el8.x86_64
el8.aarch64
el9.x86_64
el9.aarch64
el10.x86_64
el10.aarch64
d12.x86_64
d12.aarch64
d13.x86_64    PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0
d13.aarch64   PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0
u22.x86_64    PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0
u22.aarch64   PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0
u24.x86_64    PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0
u24.aarch64   PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0  PIGSTY 0.3.0

Build

You can build the RPM / DEB packages for vchord_bm25 using pig build:

pig build pkg vchord_bm25         # build RPM / DEB packages

Install

You can install vchord_bm25 directly. First, make sure the PGDG and PIGSTY repositories are added and enabled:

pig repo add pgsql -u          # Add repo and update cache

Install the extension using pig or apt/yum/dnf:

pig install vchord_bm25;          # Install for current active PG version
pig ext install -y vchord_bm25 -v 18  # PG 18
pig ext install -y vchord_bm25 -v 17  # PG 17
pig ext install -y vchord_bm25 -v 16  # PG 16
pig ext install -y vchord_bm25 -v 15  # PG 15
pig ext install -y vchord_bm25 -v 14  # PG 14
dnf install -y vchord_bm25_18       # PG 18
dnf install -y vchord_bm25_17       # PG 17
dnf install -y vchord_bm25_16       # PG 16
dnf install -y vchord_bm25_15       # PG 15
dnf install -y vchord_bm25_14       # PG 14
apt install -y postgresql-18-vchord-bm25   # PG 18
apt install -y postgresql-17-vchord-bm25   # PG 17
apt install -y postgresql-16-vchord-bm25   # PG 16
apt install -y postgresql-15-vchord-bm25   # PG 15
apt install -y postgresql-14-vchord-bm25   # PG 14

Preload (set in postgresql.conf; changing it requires a server restart):

shared_preload_libraries = 'vchord_bm25'

Create Extension:

CREATE EXTENSION vchord_bm25;

Usage

GitHub: tensorchord/VectorChord-bm25

VectorChord-BM25 is a PostgreSQL extension for the BM25 ranking algorithm, implemented via Block-WeakAnd algorithms. It is designed to work together with pg_tokenizer for customized text tokenization.
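
To give a feel for what block-level pruning buys, here is a minimal Python sketch of the idea behind Block-WeakAnd-style retrieval: documents are grouped into blocks, each block carries an upper bound on the score any of its documents can reach, and a block is skipped when that bound cannot beat the current k-th best score. This is a simplified illustration, not the extension's implementation. The block layout, the `top_k_with_block_pruning` helper, and the precomputed per-document scores are all assumptions made for the example; real implementations typically derive the bound from per-term block maxima stored in the index.

```python
import heapq

def top_k_with_block_pruning(blocks, k):
    """Return (top-k (score, doc_id) pairs, best first; number of blocks skipped).

    `blocks` is a list of dicts {"max_score": float, "docs": [(doc_id, score), ...]}
    where max_score is an upper bound on every score in that block.
    In this sketch a higher score means a more relevant document.
    """
    heap = []      # min-heap holding the best k (score, doc_id) pairs seen so far
    skipped = 0
    for block in blocks:
        # The threshold is the k-th best score so far (-inf until the heap is full).
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        if block["max_score"] <= threshold:
            skipped += 1          # no document in this block can enter the top k
            continue
        for doc_id, score in block["docs"]:
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True), skipped

# Hypothetical blocks with made-up scores.
blocks = [
    {"max_score": 5.0, "docs": [(1, 5.0), (2, 3.5)]},
    {"max_score": 2.0, "docs": [(3, 2.0), (4, 0.5)]},  # bound 2.0 cannot beat 3.5
    {"max_score": 4.0, "docs": [(5, 1.0), (6, 4.0)]},
]
top, skipped = top_k_with_block_pruning(blocks, k=2)
print(top, skipped)  # -> [(5.0, 1), (4.0, 6)] 1
```

The second block is never scanned, which is where the savings come from on large posting lists.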

Architecture

The extension comprises three main components:

  1. Tokenizer: Converts text into bm25vector (sparse vectors storing vocabulary IDs and term frequencies)
  2. bm25vector: A custom data type for storing tokenized text
  3. bm25vector indexes: Accelerate search and ranking operations
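
As a rough mental model of the first two components, the Python sketch below turns a token stream into a sparse mapping from vocabulary ID to term frequency, which is the kind of information the list above says a bm25vector stores. The vocabulary, the `to_sparse_vector` helper, and the handling of unknown tokens are assumptions for illustration only; the extension's actual storage format is not described here.

```python
from collections import Counter

def to_sparse_vector(tokens, vocab):
    """Map tokens to {vocabulary_id: term_frequency}, dropping unknown tokens."""
    ids = [vocab[t] for t in tokens if t in vocab]
    return dict(sorted(Counter(ids).items()))

# Hypothetical vocabulary; a real tokenizer model supplies its own IDs.
vocab = {"postgresql": 0, "is": 1, "a": 2, "database": 3, "system": 4}
tokens = "postgresql is a database a relational database system".split()
vector = to_sparse_vector(tokens, vocab)
print(vector)  # -> {0: 1, 1: 1, 2: 2, 3: 2, 4: 1}
```

Only the IDs that occur are stored, so the representation stays small even when the vocabulary is large.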

Quick Start

-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS pg_tokenizer CASCADE;
CREATE EXTENSION IF NOT EXISTS vchord_bm25 CASCADE;

-- Create a tokenizer (e.g., LLMLingua2 for English)
SELECT create_tokenizer('tokenizer1', $$
model = "llmlingua2"
$$);

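-- Create a table with text content
CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  passage TEXT,
  embedding bm25vector
);

-- Insert a few sample passages (illustrative data)
INSERT INTO documents (passage) VALUES
  ('PostgreSQL is an open-source relational database system.'),
  ('Full-text search ranks documents by relevance to a query.'),
  ('BM25 is a ranking function used by search engines.');

-- Tokenize text passages into bm25vectors
UPDATE documents SET embedding = tokenize(passage, 'tokenizer1');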

-- Create a BM25 index
CREATE INDEX documents_embedding_bm25 ON documents USING bm25 (embedding bm25_ops);

-- Query with BM25 ranking
SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', tokenize('search query', 'tokenizer1')) AS score
FROM documents
ORDER BY score
LIMIT 10;

Note: BM25 scores in VectorChord-BM25 are negative, with more negative scores indicating greater relevance. This is why the query uses a plain ascending ORDER BY score: the most relevant rows sort first.
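
The sign convention is easier to see with numbers. The Python sketch below computes a textbook Okapi BM25 score and negates it, so that an ascending sort puts the best match first. The formula variant, the defaults `k1=1.2` and `b=0.75`, and the toy corpus are assumptions for illustration; the source does not state which BM25 variant or parameters the extension uses.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, doc_freq, n_docs,
               k1=1.2, b=0.75):
    """Textbook Okapi BM25: sum of idf * saturated, length-normalized tf."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        n = doc_freq[term]
        idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
        norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / norm
    return score

# Toy corpus of already-tokenized documents (made-up data).
docs = {
    1: ["postgresql", "is", "a", "database", "system"],
    2: ["cats", "sit", "on", "mats"],
    3: ["a", "database", "stores", "data", "in", "a", "database"],
}
n_docs = len(docs)
avg_len = sum(len(d) for d in docs.values()) / n_docs
doc_freq = Counter(term for d in docs.values() for term in set(d))

query = ["database", "system"]
negated = {
    doc_id: -bm25_score(query, Counter(d), len(d), avg_len, doc_freq, n_docs)
    for doc_id, d in docs.items()
}
ranked = sorted(negated, key=negated.get)  # ascending: most negative first
print(ranked)  # -> [1, 3, 2]
```

Document 1 matches both query terms and ends up with the most negative value, so it sorts first under a plain ascending order.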

The <&> Operator

The <&> operator computes the BM25 relevance score between a stored bm25vector and a query. Queries must be wrapped in to_bm25query(), which takes the index name and the tokenized query:

-- Basic search query
-- to_bm25query(index_name, tokenized_query)
SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', tokenize('database system', 'tokenizer1')) AS score
FROM documents
ORDER BY score
LIMIT 10;

Language Support

VectorChord-BM25 supports multiple languages through different tokenizer configurations:

Language  Approach                                  Model/Pre-tokenizer
English   Pre-trained model                         model = "llmlingua2" or model = "bert_base_uncased"
Chinese   Custom model with Jieba pre-tokenizer     [pre_tokenizer.jieba]
Japanese  Custom model with Lindera pre-tokenizer   Lindera with IPADIC dictionary
Custom    User-trained models via text analyzers    create_custom_model_tokenizer_and_trigger()

Chinese Text Search Example

Chinese text requires a custom model with a Jieba pre-tokenizer (not a pre-trained model):

-- Create a text analyzer with Jieba pre-tokenizer
SELECT create_text_analyzer('zh_text_analyzer', $$
[pre_tokenizer.jieba]
$$);

-- Create a custom model tokenizer that trains on your corpus
SELECT create_custom_model_tokenizer_and_trigger(
    tokenizer_name => 'zh_tokenizer',
    model_name => 'zh_model',
    text_analyzer_name => 'zh_text_analyzer',
    table_name => 'documents',
    source_column => 'passage',
    target_column => 'embedding'
);

Custom Tokenizer Models

For domain-specific terminology, you can create text analyzers with stopwords, stemming, and other filters, then train custom models on your corpus using create_custom_model_tokenizer_and_trigger().
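
Conceptually, a text analyzer followed by custom model training amounts to normalizing text and then assigning an ID to every token that survives. The Python sketch below illustrates that flow with a stopword filter only (stemming is left out for brevity). The stopword list, the `analyze` and `train_vocab` helpers, and the sample corpus are hypothetical and do not reflect pg_tokenizer's configuration syntax or internals.

```python
STOPWORDS = {"the", "a", "of", "is"}  # hypothetical stopword list

def analyze(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    tokens = (w.strip(".,;:!?") for w in text.lower().split())
    return [t for t in tokens if t and t not in STOPWORDS]

def train_vocab(corpus):
    """Assign IDs in first-seen order to every token that survives analysis."""
    vocab = {}
    for text in corpus:
        for token in analyze(text):
            vocab.setdefault(token, len(vocab))
    return vocab

# Made-up domain corpus.
corpus = ["The stent is a mesh tube.", "Placement of the stent."]
vocab = train_vocab(corpus)
print(vocab)  # -> {'stent': 0, 'mesh': 1, 'tube': 2, 'placement': 3}
```

Because the vocabulary is built from your own corpus, domain terms such as "stent" get their own IDs instead of being split or dropped by a general-purpose model.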

Comparison with Alternatives

Feature            VectorChord-BM25          PostgreSQL tsvector + ts_rank
Ranking algorithm  BM25                      Term-frequency based (ts_rank uses no corpus-wide IDF)
Custom tokenizers  Yes (via pg_tokenizer)    Limited to built-in configs
Index type         Dedicated BM25 index      GIN index
Native PostgreSQL  Yes (extension)           Built-in
Language support   Extensible via models     Via text search configs

Last Modified 2026-03-12: add pg extension catalog (95749bf)