pg_bestmatch

Generate BM25 sparse vector inside PostgreSQL

Module:

PGEXT

Overview

PIGSTY 3rd Party Extension: pg_bestmatch : Generate BM25 sparse vector inside PostgreSQL

Information

Extension ID: 2140
Extension Name: pg_bestmatch
Package Name: pg_bestmatch
Category: FTS
License: Apache-2.0
Website: https://github.com/tensorchord/pg_bestmatch.rs
Language: Rust
Extra Tags: pgrx
Comment:

Metadata

Latest Version: 0.0.1
Postgres Support: 17,16,15,14,13
Need Load: Explicit Loading Required
Need DDL: Need CREATE EXTENSION DDL
Relocatable: Can be installed into other schemas
Trusted: Untrusted, Require Superuser to Create
Schemas: bm_catalog
Requires: N/A

RPM / DEB

RPM Repo: PIGSTY
RPM Name: pg_bestmatch_$v
RPM Ver : 0.0.1
RPM Deps: N/A
DEB Repo: PIGSTY
DEB Name: postgresql-$v-pg-bestmatch
DEB Ver : 0.0.1
DEB Deps: N/A

Availability

OS	Arch	PG17	PG16	PG15	PG14	PG13
`el8`	`x86_64`	`pg_bestmatch_17` PIGSTY 0.0.1	`pg_bestmatch_16` PIGSTY 0.0.1	`pg_bestmatch_15` PIGSTY 0.0.1	`pg_bestmatch_14` PIGSTY 0.0.1	`pg_bestmatch_13` PIGSTY 0.0.1
`el8`	`aarch64`	`pg_bestmatch_17` PIGSTY 0.0.1	`pg_bestmatch_16` PIGSTY 0.0.1	`pg_bestmatch_15` PIGSTY 0.0.1	`pg_bestmatch_14` PIGSTY 0.0.1	`pg_bestmatch_13` PIGSTY 0.0.1
`el9`	`x86_64`	`pg_bestmatch_17` PIGSTY 0.0.1	`pg_bestmatch_16` PIGSTY 0.0.1	`pg_bestmatch_15` PIGSTY 0.0.1	`pg_bestmatch_14` PIGSTY 0.0.1	`pg_bestmatch_13` PIGSTY 0.0.1
`el9`	`aarch64`	`pg_bestmatch_17` PIGSTY 0.0.1	`pg_bestmatch_16` PIGSTY 0.0.1	`pg_bestmatch_15` PIGSTY 0.0.1	`pg_bestmatch_14` PIGSTY 0.0.1	`pg_bestmatch_13` PIGSTY 0.0.1
`d12`	`x86_64`	`postgresql-17-pg-bestmatch` PIGSTY 0.0.1	`postgresql-16-pg-bestmatch` PIGSTY 0.0.1	`postgresql-15-pg-bestmatch` PIGSTY 0.0.1	`postgresql-14-pg-bestmatch` PIGSTY 0.0.1	`postgresql-13-pg-bestmatch` PIGSTY 0.0.1
`d12`	`aarch64`	`postgresql-17-pg-bestmatch` PIGSTY 0.0.1	`postgresql-16-pg-bestmatch` PIGSTY 0.0.1	`postgresql-15-pg-bestmatch` PIGSTY 0.0.1	`postgresql-14-pg-bestmatch` PIGSTY 0.0.1	`postgresql-13-pg-bestmatch` PIGSTY 0.0.1
`u22`	`x86_64`	`postgresql-17-pg-bestmatch` PIGSTY 0.0.1	`postgresql-16-pg-bestmatch` PIGSTY 0.0.1	`postgresql-15-pg-bestmatch` PIGSTY 0.0.1	`postgresql-14-pg-bestmatch` PIGSTY 0.0.1	`postgresql-13-pg-bestmatch` PIGSTY 0.0.1
`u22`	`aarch64`	`postgresql-17-pg-bestmatch` PIGSTY 0.0.1	`postgresql-16-pg-bestmatch` PIGSTY 0.0.1	`postgresql-15-pg-bestmatch` PIGSTY 0.0.1	`postgresql-14-pg-bestmatch` PIGSTY 0.0.1	`postgresql-13-pg-bestmatch` PIGSTY 0.0.1
`u24`	`x86_64`	`postgresql-17-pg-bestmatch` PIGSTY 0.0.1	`postgresql-16-pg-bestmatch` PIGSTY 0.0.1	`postgresql-15-pg-bestmatch` PIGSTY 0.0.1	`postgresql-14-pg-bestmatch` PIGSTY 0.0.1	`postgresql-13-pg-bestmatch` PIGSTY 0.0.1
`u24`	`aarch64`	`postgresql-17-pg-bestmatch` PIGSTY 0.0.1	`postgresql-16-pg-bestmatch` PIGSTY 0.0.1	`postgresql-15-pg-bestmatch` PIGSTY 0.0.1	`postgresql-14-pg-bestmatch` PIGSTY 0.0.1	`postgresql-13-pg-bestmatch` PIGSTY 0.0.1

Installation

Install pg_bestmatch via the pig CLI tool:

pig ext install pg_bestmatch

Install pg_bestmatch via Pigsty playbook:

./pgsql.yml -t pg_extension -e '{"pg_extensions": ["pg_bestmatch"]}' # -l <cls>

Install pg_bestmatch RPM from YUM repo directly:

dnf install pg_bestmatch_17;
dnf install pg_bestmatch_16;
dnf install pg_bestmatch_15;
dnf install pg_bestmatch_14;
dnf install pg_bestmatch_13;

Install pg_bestmatch DEB from APT repo directly:

apt install postgresql-17-pg-bestmatch;
apt install postgresql-16-pg-bestmatch;
apt install postgresql-15-pg-bestmatch;
apt install postgresql-14-pg-bestmatch;
apt install postgresql-13-pg-bestmatch;

Extension pg_bestmatch has to be loaded via shared_preload_libraries

shared_preload_libraries = 'pg_bestmatch'; # add to pg cluster config

Create pg_bestmatch extension on PostgreSQL cluster:

CREATE EXTENSION pg_bestmatch;

Usage

How does it work?

Create an BM25 statistics based on your document set by bm25_create(table_name, column_name, statistic_name);. It will create a materilized view to record the stats.
Generate document sparse vector by bm25_document_to_svector(statistic_name, passage)
For query, generate query sparse vector bm25_query_to_svector(statistic_name, query)
Calculate the score by dot product between the query sparse vector and the document sparse vector
Currently we use huggingface tokenizer with bert-base-uncased vocabulary set to tokenize words. Might support more configuration on tokenizer in the future.

Install

CREATE EXTENSION pg_bestmatch;
SET search_path TO public, bm_catalog;

Example

Here is an example workflow demonstrating the usage of this extension with the example of Stanford LoCo benchmark.

Load the dataset. Here is a script for you if you want to experience pg_bestmatch with the dataset.

wget https://huggingface.co/api/datasets/hazyresearch/LoCoV1-Documents/parquet/default/test/0.parquet -O documents.parquet
wget https://huggingface.co/api/datasets/hazyresearch/LoCoV1-Queries/parquet/default/test/0.parquet -O queries.parquet

import pandas as pd
from sqlalchemy import create_engine
import numpy as np
from psycopg2.extensions import register_adapter, AsIs

def adapter_numpy_float64(numpy_float64):
    return AsIs(numpy_float64)

def adapter_numpy_int64(numpy_int64):
    return AsIs(numpy_int64)

def adapter_numpy_float32(numpy_float32):
    return AsIs(numpy_float32)

def adapter_numpy_int32(numpy_int32):
    return AsIs(numpy_int32)

def adapter_numpy_array(numpy_array):
    return AsIs(tuple(numpy_array))

register_adapter(np.float64, adapter_numpy_float64)
register_adapter(np.int64, adapter_numpy_int64)
register_adapter(np.float32, adapter_numpy_float32)
register_adapter(np.int32, adapter_numpy_int32)
register_adapter(np.ndarray, adapter_numpy_array)

db_url = "postgresql://localhost:5432/pg_bestmatch_test"
engine = create_engine(db_url)

def load_documents():
    df = pd.read_parquet("documents.parquet")
    df.to_sql("documents", engine, if_exists='replace', index=False)

def load_queries():
    df = pd.read_parquet("queries.parquet")
    df['answer_pids'] = df['answer_pids'].apply(lambda x: str(x[0]))    
    df.to_sql("queries", engine, if_exists='replace', index=False)

load_documents()
load_queries()

Create BM25 statistics for the documents table.

SELECT bm25_create('documents', 'passage', 'documents_passage_bm25', 0.75, 1.2);

Add an embedding column to the documents and queries tables and update the embeddings for documents and queries.

ALTER TABLE documents ADD COLUMN embedding svector; -- for pgvecto.rs users
ALTER TABLE documents ADD COLUMN embedding sparsevec; -- for pgvector users

UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage)::svector; -- for pgvecto.rs users
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage, 'pgvector')::sparsevec; -- for pgvector users

(Optional) Create a vector index on the sparse vector column.

CREATE INDEX ON documents USING vectors (embedding svector_dot_ops); -- for pgvecto.rs users
CREATE INDEX ON documents USING ivfflat (embedding sparsevec_ip_ops); -- for pgvector users

Perform a vector search to find the most relevant documents for each query.

ALTER TABLE queries ADD COLUMN embedding svector; -- for pgvecto.rs users
ALTER TABLE queries ADD COLUMN embedding sparsevec; -- for pgvector users

UPDATE queries SET embedding = bm25_query_to_svector('documents_passage_bm25', query)::svector; -- for pgvecto.rs users
UPDATE queries SET embedding = bm25_query_to_svector('documents_passage_bm25', query, 'pgvector')::sparsevec; -- for pgvector users

SELECT sum((array[answer_pids] = array(SELECT pid FROM documents WHERE queries.dataset = documents.dataset ORDER BY queries.embedding <#> documents.embedding LIMIT 1))::int) FROM queries;

This workflow showcases how to leverage BM25 text queries and vector search in PostgreSQL using this extension. The Top 1 recall of BM25 on this dataset is 0.77. If you reproduce the result, your operations are correct.

Comparison with pg_search

pg_bestmatch.rs only provides methods for generating sparse vectors and does not support index-based search (which can be achieved by pgvecto.rs or pgvector).
pg_search performs BM25 retrieval via the external tantivy engine, which may have limitations when combined with transactions, filters, or JOIN operations. Since pg_bestmatch.rs is entirely native to Postgres, it offers full compatibility with these operations inside postgres.

Reference

tokenize
- Description: Tokenizes an input string into individual tokens.
- Example:
```
SELECT tokenize('i have an apple'); -- result: {i,have,an,apple}
```
bm25_create
- Description: Creates BM25 statistics for a specified table and column.
- Usage:
```
SELECT bm25_create('documents', 'passage', 'documents_passage_bm25');
```
- Parameters:
  - table_name: Name of the table.
  - column_name: Name of the column.
  - stat_name: Name of the BM25 statistics.
  - b: BM25 parameter (default 0.75).
  - k: BM25 parameter (default 1.2).
bm25_refresh
- Description: Updates the BM25 statistics to reflect any changes in the underlying data.
- Usage:
```
SELECT bm25_refresh('documents_passage_bm25');
```
- Parameters:
  - stat_name: Name of the BM25 statistics to update.
bm25_drop
- Description: Deletes the BM25 statistics for a specified table and column.
- Usage:
```
SELECT bm25_drop('documents_passage_bm25');
```
- Parameters:
  - stat_name: Name of the BM25 statistics to delete.
bm25_document_to_svector
- Description: Converts document text into a sparse vector representation.
- Usage:
```
SELECT bm25_document_to_svector('documents_passage_bm25', 'document_text');
```
- Parameters:
  - stat_name: Name of the BM25 statistics.
  - document_text: The text of the document.
  - style: Emits pgvecto.rs-style sparse vector or pgvector-style sparse vector.
bm25_query_to_svector
- Description: Converts query text into a sparse vector representation.
- Usage:
```
SELECT bm25_query_to_svector('documents_passage_bm25', 'We begin, as always, with the text.');
```
- Parameters:
  - stat_name: Name of the BM25 statistics.
  - query_text: The text of the query.
  - style: Emits pgvecto.rs-style sparse vector or pgvector-style sparse vector.

Feedback

Was this page helpful?

Glad to hear it! Please tell us how we can improve.

Sorry to hear that. Please tell us how we can improve.

Last modified 2025-05-24: add blogpost (6b938c7)