BM25 for Python: Achieving high performance while simplifying dependencies with BM25S

Community Article Published July 9, 2024

Summary: In this blog post, we examine pros and cons of popular lexical search libraries, and introduce BM25S⚡, a new library that implements BM25 for fast lexical search in Python and achieve up to 500x speedup compared to the most popular Python-based library. We discuss how to use it with 🤗 Hugging Face Hub.

If you want more technical details, take a look at the technical report!

What is lexical search?

Lexical search is a type of search where a query is compared against a collection of text documents (as short as a sentence or as long as an entire article) by directly matching the words in the query. For example, given a query like "What is a cat?", you would look for documents that contains one or more of the words: "What", "is", "a", "cat". You can assign a score to each document based on how important each word is (by assigning a score to each word), and how many words are matched (e.g. by summing up the scores).

The best known algorithm for lexical search is BM25, an algorithm that was developed by Stephen Robertson and his collaborators in the 1990s. It is used as the backbone for highly scalable commercial search engines like Elasticsearch (the Python client receives 24M downloads per month), and has been implemented in Python (rank-bm25), where it is widely used as well.

Elastic Rank-BM25
image/png image/png

Both Elasticsearch and Rank-BM25 are built with different goals in mind.

Elasticsearch is able to achieve high performance and scalability, allowing you to search among billions of documents in real-time by leveraging multi-node setups. However, to get started, you need to set up a web server that runs Apache Lucene in Java, and access it via a separate web client (the elasticsearch library that can be installed with pip).

Elastic Server setup Elastic Client setup
image/png image/png

On the other hand, Rank-BM25 is significantly more accessible in Python, as it solely Numpy and only needs to be installed with pip, avoiding having to set up a server. Moreover, the implementation is fairly readable and you can use your own tokenizer. However, it is unable to achieve the same performance in terms of queries/seconds when you use it for larger corpora (e.g. exceeding 1 million documents).

Enter BM25S ⚡: a fast but low-dependency implementation of BM25

As a way to achieve high performance without having to rely on external dependencies, I recently introduced BM25S, a new lexical search library that leverages scipy's sparse matrices to achieve orders of magnitudes faster search compared to Rank-BM25, all while staying in the Python ecosystem (which means you can install it with a simple pip install bm25s).

On a single node, in the single-threaded and multicore setting (4 threads), it achieves performance on par with Elasticsearch (benchmarked on 5 popular datasets from BEIR):

image/png

You can learn more about it at the following links:

However, all you need to get started is this code snippet:

# pip install bm25s
import bm25s

# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

# Create the BM25 model and index the corpus
retriever = bm25s.BM25(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))

# Query the corpus and get top-k results
query = "does the fish purr like a cat?"
results, scores = retriever.retrieve(bm25s.tokenize(query), k=2)

# Let's see what we got!
doc, score = results[0, 0], scores[0, 0]
print(f"Rank {i+1} (score: {score:.2f}): {doc}")

If you want more advanced examples, take a look at the README, including stemming, stopwords and customization.

🤗 Hugging Face integrations

BM25S is tightly integrated with Huggingface Hub, allowing you to easily save to, and load from the hub. First, simply install the library:

pip install huggingface_hub

Now, you can create an index and save with BM25HF.save_to_hub:

import bm25s
from bm25s.hf import BM25HF

# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

retriever = BM25HF(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))

# Set your username and token
user = "your-username"
retriever.save_to_hub(f"{user}/bm25s-animals")

If you visit huggingface.co/{user}/bm25s-animals, you will find your newly uploaded BM25S index!

Now, to use it, simply load it with BM25HF.load_from_hub:

import bm25s
from bm25s.hf import BM25HF

# Load the index
retriever = BM25HF.load_from_hub("{user}/bm25s-animals", load_corpus=True)

# You can retrieve now
query = "a cat is a feline"

docs, scores = retriever.retrieve(bm25s.tokenize(query), k=2)

Selecting a BM25 variant

A recent paper about BM25 that I really like is "Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants" by Chris Kamphuis & co-authors.: it succinctly and formally present multiple popular variants of BM25 and advocate for researchers to explicitly define the variants they implement.

Based on the paper, BM25S offers the following variants:

  • Original (method="robertson")
  • ATIRE (method="atire")
  • BM25L (method="bm25l")
  • BM25+ (method="bm25+")
  • Lucene (method="lucene") - default

You can change this by simply specifying BM25(method="<preferred variant>").

How does it achieve the speedup?

More precisely, it computes all possible word-level relevance scores for every document in the corpus and store them in a sparse matrix (this idea is inspired by Jack Morris's bm25-pt). Then, given a query you can sum up the relevant tokens to get a relevance score for each document - this happens via simple scipy.sparse operations, which can also be implemented in Numpy (allowing us to leverage memory-mapped arrays).

For BM25 variants that do not have a sparse matrix, we shift the score by a constant (which can be easily calculated based on the relevance score assigne when a word does not occur in a document); this allows us to store the shifted score in a sparse matrix, and recompute the exact relevance score during retrieval.

Does BM25S replace other libraries?

BM25S does not replace other implementations, but complements: Elasticsearch is capable of scaling to multiple nodes, can be interfaced in Java and allows incremental index updates; Rank-BM25 allows you to change parameters during search; Pyserini includes both dense and hybrid search algorithms and many reproducibility tools; BM25-PT can be used on GPUs via PyTorch. Thus, it is important to try multiple implementations, and settle on the one that best suits your needs.

Community

Sign up or log in to comment