vchord_bm25
Overview
| Package | Version | Category | License | Language |
|---|---|---|---|---|
| vchord_bm25 | 0.3.0 | FTS | AGPL-3.0 | Rust |
| ID | Extension | Bin | Lib | Load | Create | Trust | Reloc | Schema |
|---|---|---|---|---|---|---|---|---|
| 2150 | vchord_bm25 | No | Yes | Yes | Yes | No | No | bm25_catalog |
| Related |
|---|
| vector, vchord, pg_search, pg_bestmatch, vectorscale, zhparser, pg_tokenizer, pgroonga |
Version
| Type | Repo | Version | PG Ver | Package | Deps |
|---|---|---|---|---|---|
| EXT | PIGSTY | 0.3.0 | 18,17,16,15,14 | vchord_bm25 | - |
| RPM | PIGSTY | 0.3.0 | 18,17,16,15,14 | vchord_bm25_$v | - |
| DEB | PIGSTY | 0.3.0 | 18,17,16,15,14 | postgresql-$v-vchord-bm25 | - |
Build
You can build the RPM / DEB packages for vchord_bm25 using pig build:
pig build pkg vchord_bm25 # build RPM / DEB packages
Install
You can install vchord_bm25 directly. First, make sure the PGDG and PIGSTY repositories are added and enabled:
pig repo add pgsql -u # Add repo and update cache
Install the extension using pig or apt/yum/dnf:
pig install vchord_bm25 # Install for current active PG version
pig ext install -y vchord_bm25 -v 18 # PG 18
pig ext install -y vchord_bm25 -v 17 # PG 17
pig ext install -y vchord_bm25 -v 16 # PG 16
pig ext install -y vchord_bm25 -v 15 # PG 15
pig ext install -y vchord_bm25 -v 14 # PG 14
dnf install -y vchord_bm25_18 # PG 18
dnf install -y vchord_bm25_17 # PG 17
dnf install -y vchord_bm25_16 # PG 16
dnf install -y vchord_bm25_15 # PG 15
dnf install -y vchord_bm25_14 # PG 14
apt install -y postgresql-18-vchord-bm25 # PG 18
apt install -y postgresql-17-vchord-bm25 # PG 17
apt install -y postgresql-16-vchord-bm25 # PG 16
apt install -y postgresql-15-vchord-bm25 # PG 15
apt install -y postgresql-14-vchord-bm25 # PG 14
Preload:
shared_preload_libraries = 'vchord_bm25';
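If the server already preloads other libraries, append to the comma-separated list rather than replacing it; a full server restart is required for the setting to take effect. A sketch, assuming pg_tokenizer (which this extension pairs with, and which also requires preloading) is in use:

```
# postgresql.conf -- order within the list does not matter
shared_preload_libraries = 'pg_tokenizer, vchord_bm25'
```

After restarting, `SHOW shared_preload_libraries;` should list both entries.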
Create Extension:
CREATE EXTENSION vchord_bm25;
Usage
VectorChord-BM25 is a PostgreSQL extension for the BM25 ranking algorithm, implemented with the Block-WeakAnd algorithm. It is designed to work together with pg_tokenizer for customized text tokenization.
Architecture
The extension comprises three main components:
- Tokenizer: Converts text into bm25vector (sparse vectors storing vocabulary IDs and term frequencies)
- bm25vector: A custom data type for storing tokenized text
- bm25vector index: Accelerates search and ranking operations
Quick Start
-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS pg_tokenizer CASCADE;
CREATE EXTENSION IF NOT EXISTS vchord_bm25 CASCADE;
-- Create a tokenizer (e.g., LLMLingua2 for English)
SELECT create_tokenizer('tokenizer1', $$
model = "llmlingua2"
$$);
-- Create a table with text content
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
passage TEXT,
embedding bm25vector
);
-- Tokenize text passages into bm25vectors
UPDATE documents SET embedding = tokenize(passage, 'tokenizer1');
-- Create a BM25 index
CREATE INDEX documents_embedding_bm25 ON documents USING bm25 (embedding bm25_ops);
-- Query with BM25 ranking
SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', tokenize('search query', 'tokenizer1')) AS score
FROM documents
ORDER BY score
LIMIT 10;
Note: BM25 scores in VectorChord-BM25 are negative, with more negative scores indicating greater relevance.
The <&> Operator
The <&> operator computes the BM25 relevance score between a stored bm25vector and a query bm25vector. Queries must be wrapped in to_bm25query(), which takes the index name and the tokenized query:
-- Basic search query
-- to_bm25query(index_name, tokenized_query)
SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', tokenize('database system', 'tokenizer1')) AS score
FROM documents
ORDER BY score
LIMIT 10;
Language Support
VectorChord-BM25 supports multiple languages through different tokenizer configurations:
| Language | Approach | Model/Pre-tokenizer |
|---|---|---|
| English | Pre-trained model | model = "llmlingua2" or model = "bert_base_uncased" |
| Chinese | Custom model with Jieba pre-tokenizer | [pre_tokenizer.jieba] |
| Japanese | Custom model with Lindera pre-tokenizer | Lindera with IPADIC dictionary |
| Custom | User-trained models via text analyzers | create_custom_model_tokenizer_and_trigger() |
Chinese Text Search Example
Chinese text requires a custom model with a Jieba pre-tokenizer (not a pre-trained model):
-- Create a text analyzer with Jieba pre-tokenizer
SELECT create_text_analyzer('zh_text_analyzer', $$
[pre_tokenizer.jieba]
$$);
-- Create a custom model tokenizer that trains on your corpus
SELECT create_custom_model_tokenizer_and_trigger(
tokenizer_name => 'zh_tokenizer',
model_name => 'zh_model',
text_analyzer_name => 'zh_text_analyzer',
table_name => 'documents',
source_column => 'passage',
target_column => 'embedding'
);
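Once the trigger is in place, newly inserted rows are tokenized automatically, so the rest of the workflow mirrors the Quick Start. A sketch with made-up sample rows (column and index names follow the examples above):

-- Insert Chinese passages; the trigger populates "embedding" automatically
INSERT INTO documents (passage) VALUES
    ('PostgreSQL 是一个功能强大的开源数据库'),
    ('全文检索支持多种语言');

-- Index and query exactly as in the Quick Start, using the Chinese tokenizer
CREATE INDEX documents_embedding_bm25 ON documents USING bm25 (embedding bm25_ops);

SELECT id, passage,
       embedding <&> to_bm25query('documents_embedding_bm25',
                                  tokenize('开源数据库', 'zh_tokenizer')) AS score
FROM documents
ORDER BY score
LIMIT 5;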
Custom Tokenizer Models
For domain-specific terminology, you can create text analyzers with stopwords, stemming, and other filters, then train custom models on your corpus using create_custom_model_tokenizer_and_trigger().
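A minimal sketch of such an analyzer, assuming pg_tokenizer's TOML-style configuration (the filter names here follow its documentation; verify them against the pg_tokenizer version you have installed):

-- A text analyzer with lowercasing, stopword removal, and stemming
SELECT create_text_analyzer('en_text_analyzer', $$
pre_tokenizer = "unicode_segmentation"
[[character_filters]]
to_lowercase = {}
[[token_filters]]
stopwords = "nltk_english"
[[token_filters]]
stemmer = "english_porter2"
$$);

-- Train a custom model on the corpus and keep it in sync via trigger
SELECT create_custom_model_tokenizer_and_trigger(
    tokenizer_name     => 'en_tokenizer',
    model_name         => 'en_model',
    text_analyzer_name => 'en_text_analyzer',
    table_name         => 'documents',
    source_column      => 'passage',
    target_column      => 'embedding'
);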
Comparison with Alternatives
| Feature | VectorChord-BM25 | PostgreSQL tsvector + ts_rank |
|---|---|---|
| Ranking algorithm | BM25 | tf-idf variant |
| Custom tokenizers | Yes (via pg_tokenizer) | Limited to built-in configs |
| Index type | Dedicated BM25 index | GIN index |
| Native PostgreSQL | Yes (extension) | Built-in |
| Language support | Extensible via models | Via text search configs |