pg_bestmatch
Overview
| Package | Version | Category | License | Language |
|---|---|---|---|---|
pg_bestmatch | 0.0.2 | FTS | Apache-2.0 | Rust |
| ID | Extension | Bin | Lib | Load | Create | Trust | Reloc | Schema |
|---|---|---|---|---|---|---|---|---|
| 2140 | pg_bestmatch | No | Yes | Yes | Yes | No | No | bm_catalog |
| Related | vector pg_search vchord_bm25 vchord vectorscale zhparser pg_tokenizer vectorize |
|---|
manual updated pgrx by Vonng
Version
| Type | Repo | Version | PG Ver | Package | Deps |
|---|---|---|---|---|---|
| EXT | PIGSTY | 0.0.2 | 1817161514 | pg_bestmatch | - |
| RPM | PIGSTY | 0.0.2 | 1817161514 | pg_bestmatch_$v | - |
| DEB | PIGSTY | 0.0.2 | 1817161514 | postgresql-$v-pg-bestmatch | - |
Build
You can build the RPM / DEB packages for pg_bestmatch using pig build:
pig build pkg pg_bestmatch # build RPM / DEB packages
Install
You can install pg_bestmatch directly. First, make sure the PGDG and PIGSTY repositories are added and enabled:
pig repo add pgsql -u # Add repo and update cache
Install the extension using pig or apt/yum/dnf:
pig install pg_bestmatch; # Install for current active PG version
pig ext install -y pg_bestmatch -v 18 # PG 18
pig ext install -y pg_bestmatch -v 17 # PG 17
pig ext install -y pg_bestmatch -v 16 # PG 16
pig ext install -y pg_bestmatch -v 15 # PG 15
pig ext install -y pg_bestmatch -v 14 # PG 14
dnf install -y pg_bestmatch_18 # PG 18
dnf install -y pg_bestmatch_17 # PG 17
dnf install -y pg_bestmatch_16 # PG 16
dnf install -y pg_bestmatch_15 # PG 15
dnf install -y pg_bestmatch_14 # PG 14
apt install -y postgresql-18-pg-bestmatch # PG 18
apt install -y postgresql-17-pg-bestmatch # PG 17
apt install -y postgresql-16-pg-bestmatch # PG 16
apt install -y postgresql-15-pg-bestmatch # PG 15
apt install -y postgresql-14-pg-bestmatch # PG 14
Preload:
shared_preload_libraries = 'pg_bestmatch';
Create Extension:
CREATE EXTENSION pg_bestmatch;
Usage
- repo: https://github.com/tensorchord/pg_bestmatch.rs
- benchmark: https://hazyresearch.stanford.edu/blog/2024-05-20-m2-bert-retrieval
How does it work?
- Create an BM25 statistics based on your document set by
bm25_create(table_name, column_name, statistic_name);. It will create a materilized view to record the stats. - Generate document sparse vector by
bm25_document_to_svector(statistic_name, passage) - For query, generate query sparse vector
bm25_query_to_svector(statistic_name, query) - Calculate the score by dot product between the query sparse vector and the document sparse vector
- Currently we use huggingface tokenizer with
bert-base-uncasedvocabulary set to tokenize words. Might support more configuration on tokenizer in the future.
Install
CREATE EXTENSION pg_bestmatch;
SET search_path TO public, bm_catalog;
Example
Here is an example workflow demonstrating the usage of this extension with the example of Stanford LoCo benchmark.
- Load the dataset. Here is a script for you if you want to experience
pg_bestmatchwith the dataset.
wget https://huggingface.co/api/datasets/hazyresearch/LoCoV1-Documents/parquet/default/test/0.parquet -O documents.parquet
wget https://huggingface.co/api/datasets/hazyresearch/LoCoV1-Queries/parquet/default/test/0.parquet -O queries.parquet
import pandas as pd
from sqlalchemy import create_engine
import numpy as np
from psycopg2.extensions import register_adapter, AsIs
def adapter_numpy_float64(numpy_float64):
return AsIs(numpy_float64)
def adapter_numpy_int64(numpy_int64):
return AsIs(numpy_int64)
def adapter_numpy_float32(numpy_float32):
return AsIs(numpy_float32)
def adapter_numpy_int32(numpy_int32):
return AsIs(numpy_int32)
def adapter_numpy_array(numpy_array):
return AsIs(tuple(numpy_array))
register_adapter(np.float64, adapter_numpy_float64)
register_adapter(np.int64, adapter_numpy_int64)
register_adapter(np.float32, adapter_numpy_float32)
register_adapter(np.int32, adapter_numpy_int32)
register_adapter(np.ndarray, adapter_numpy_array)
db_url = "postgresql://localhost:5432/pg_bestmatch_test"
engine = create_engine(db_url)
def load_documents():
df = pd.read_parquet("documents.parquet")
df.to_sql("documents", engine, if_exists='replace', index=False)
def load_queries():
df = pd.read_parquet("queries.parquet")
df['answer_pids'] = df['answer_pids'].apply(lambda x: str(x[0]))
df.to_sql("queries", engine, if_exists='replace', index=False)
load_documents()
load_queries()
- Create BM25 statistics for the
documentstable.
SELECT bm25_create('documents', 'passage', 'documents_passage_bm25', 0.75, 1.2);
- Add an embedding column to the
documentsandqueriestables and update the embeddings for documents and queries.
ALTER TABLE documents ADD COLUMN embedding svector; -- for pgvecto.rs users
ALTER TABLE documents ADD COLUMN embedding sparsevec; -- for pgvector users
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage)::svector; -- for pgvecto.rs users
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage, 'pgvector')::sparsevec; -- for pgvector users
- (Optional) Create a vector index on the sparse vector column.
CREATE INDEX ON documents USING vectors (embedding svector_dot_ops); -- for pgvecto.rs users
CREATE INDEX ON documents USING ivfflat (embedding sparsevec_ip_ops); -- for pgvector users
- Perform a vector search to find the most relevant documents for each query.
ALTER TABLE queries ADD COLUMN embedding svector; -- for pgvecto.rs users
ALTER TABLE queries ADD COLUMN embedding sparsevec; -- for pgvector users
UPDATE queries SET embedding = bm25_query_to_svector('documents_passage_bm25', query)::svector; -- for pgvecto.rs users
UPDATE queries SET embedding = bm25_query_to_svector('documents_passage_bm25', query, 'pgvector')::sparsevec; -- for pgvector users
SELECT sum((array[answer_pids] = array(SELECT pid FROM documents WHERE queries.dataset = documents.dataset ORDER BY queries.embedding <#> documents.embedding LIMIT 1))::int) FROM queries;
This workflow showcases how to leverage BM25 text queries and vector search in PostgreSQL using this extension. The Top 1 recall of BM25 on this dataset is 0.77. If you reproduce the result, your operations are correct.
Comparison with pg_search
pg_bestmatch.rsonly provides methods for generating sparse vectors and does not support index-based search (which can be achieved by pgvecto.rs or pgvector).pg_searchperforms BM25 retrieval via the externaltantivyengine, which may have limitations when combined with transactions, filters, or JOIN operations. Sincepg_bestmatch.rsis entirely native to Postgres, it offers full compatibility with these operations inside postgres.
Reference
tokenize- Description: Tokenizes an input string into individual tokens.
- Example:
SELECT tokenize('i have an apple'); -- result: {i,have,an,apple}
bm25_create- Description: Creates BM25 statistics for a specified table and column.
- Usage:
SELECT bm25_create('documents', 'passage', 'documents_passage_bm25'); - Parameters:
table_name: Name of the table.column_name: Name of the column.stat_name: Name of the BM25 statistics.b: BM25 parameter (default 0.75).k: BM25 parameter (default 1.2).
bm25_refresh- Description: Updates the BM25 statistics to reflect any changes in the underlying data.
- Usage:
SELECT bm25_refresh('documents_passage_bm25'); - Parameters:
stat_name: Name of the BM25 statistics to update.
bm25_drop- Description: Deletes the BM25 statistics for a specified table and column.
- Usage:
SELECT bm25_drop('documents_passage_bm25'); - Parameters:
stat_name: Name of the BM25 statistics to delete.
bm25_document_to_svector- Description: Converts document text into a sparse vector representation.
- Usage:
SELECT bm25_document_to_svector('documents_passage_bm25', 'document_text'); - Parameters:
stat_name: Name of the BM25 statistics.document_text: The text of the document.style: Emitspgvecto.rs-style sparse vector orpgvector-style sparse vector.
bm25_query_to_svector- Description: Converts query text into a sparse vector representation.
- Usage:
SELECT bm25_query_to_svector('documents_passage_bm25', 'We begin, as always, with the text.'); - Parameters:
stat_name: Name of the BM25 statistics.query_text: The text of the query.style: Emitspgvecto.rs-style sparse vector orpgvector-style sparse vector.
Feedback
Was this page helpful?
Thanks for the feedback! Please let us know how we can improve.
Sorry to hear that. Please let us know how we can improve.