Evaluation on MIRACL Japanese
These models were not trained on the MIRACL training data.
Model | nDCG@10 | Recall@1000 | Recall@5 | Recall@30 |
---|---|---|---|---|
BM25 | 0.369 | 0.931 | - | - |
splade-japanese | 0.405 | 0.931 | 0.406 | 0.663 |
splade-japanese-efficient | 0.408 | 0.954 | 0.419 | 0.718 |
splade-japanese-v2 | 0.580 | 0.967 | 0.629 | 0.844 |
splade-japanese-v2-doc | 0.478 | 0.930 | 0.514 | 0.759 |
splade-japanese-v3 | 0.604 | 0.979 | 0.647 | 0.877 |
*The 'splade-japanese-v2-doc' model does not require a query encoder during inference.
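One common way to run a document-only setup is sketched below. It is an assumption for illustration, not necessarily this model's exact pipeline. Documents are weighted offline, and the query is treated as a plain set of tokens, so the score is the sum of the document weights for tokens that occur in the query. The function name `score_doc_only` and the toy weights are hypothetical.

```python
def score_doc_only(query_tokens, doc_weights):
    """Sum the document-side weights of the distinct tokens present in the query."""
    # The query contributes no learned weights: each distinct token counts once.
    return sum(doc_weights.get(tok, 0.0) for tok in set(query_tokens))

# Hypothetical, hand-written weights standing in for an offline-encoded document.
doc_weights = {"筑波": 2.0, "大学": 1.5, "研究": 1.0}
# Hypothetical tokenized query; "何" is absent from the document and adds nothing.
query_tokens = ["筑波", "大学", "何"]

score = score_doc_only(query_tokens, doc_weights)
print(score)
```

Because only tokenization is needed at query time, no neural forward pass runs per query in this kind of setup.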
Evaluation on hotchpotch/JQaRA
Model | NDCG@10 | MRR@10 | NDCG@100 | MRR@100 |
---|---|---|---|---|
splade-japanese-v3 | 0.505 | 0.772 | 0.7 | 0.775 |
JaColBERTv2 | 0.585 | 0.836 | 0.753 | 0.838 | |
JaColBERT | 0.549 | 0.811 | 0.730 | 0.814 | |
bge-m3+all | 0.576 | 0.818 | 0.745 | 0.820 | |
bge-m3+dense | 0.539 | 0.785 | 0.721 | 0.788 |
m-e5-large | 0.554 | 0.799 | 0.731 | 0.801 | |
m-e5-base | 0.471 | 0.727 | 0.673 | 0.731 | |
m-e5-small | 0.492 | 0.729 | 0.689 | 0.733 | |
GLuCoSE | 0.308 | 0.518 | 0.564 | 0.527 | |
sup-simcse-ja-base | 0.324 | 0.541 | 0.572 | 0.550 | |
sup-simcse-ja-large | 0.356 | 0.575 | 0.596 | 0.583 | |
fio-base-v0.1 | 0.372 | 0.616 | 0.608 | 0.622 |
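For readers unfamiliar with the metrics in the tables above, here is a minimal sketch of MRR@k and NDCG@k for a single query. It assumes the common linear-gain form of DCG and computes the ideal ranking from the labels in the given list only; official evaluation scripts may differ on both points. The function names and the toy relevance list are hypothetical, and reported scores are averages over all queries.

```python
import math

def mrr_at_k(relevance, k):
    """Reciprocal rank of the first relevant item within the top k (0.0 if none)."""
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def dcg_at_k(relevance, k):
    """Discounted cumulative gain with linear gain and a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance[:k], start=1))

def ndcg_at_k(relevance, k):
    """DCG normalized by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels of one ranked result list (1 = relevant).
relevance = [0, 1, 0, 1]
mrr = mrr_at_k(relevance, 10)
ndcg = ndcg_at_k(relevance, 10)
print(mrr, ndcg)
```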
To try the model, run the code below to inspect the term expansion and term weights it produces for a query or document.
Install the required tokenizer dependencies first:
!pip install fugashi ipadic unidic-lite
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
import numpy as np

model = AutoModelForMaskedLM.from_pretrained("aken12/splade-japanese-v3")
tokenizer = AutoTokenizer.from_pretrained("aken12/splade-japanese-v3")
# Map vocabulary ids back to token strings.
vocab_dict = {v: k for k, v in tokenizer.get_vocab().items()}

def encode_query(query):  # max length: 32 tokens for queries, 180 for passages
    query = tokenizer(query, return_tensors="pt")
    output = model(**query, return_dict=True).logits
    # SPLADE pooling: log(1 + ReLU(logits)), masked by attention, then max over the sequence.
    output, _ = torch.max(torch.log(1 + torch.relu(output)) * query['attention_mask'].unsqueeze(-1), dim=1)
    return output

with torch.no_grad():
    model_output = encode_query(query="筑波大学では何の研究が行われているか?")

reps = model_output
# Indices of vocabulary entries with a non-zero weight.
idx = torch.nonzero(reps[0], as_tuple=False)

dict_splade = {}
for i in idx:
    token_value = reps[0][i[0]].item()
    if token_value > 0:
        token = vocab_dict[int(i[0])]
        dict_splade[token] = float(token_value)

# Print the expanded tokens, highest weight first.
sorted_dict_splade = sorted(dict_splade.items(), key=lambda item: item[1], reverse=True)
for token, value in sorted_dict_splade:
    print(token, value)
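A token-to-weight dictionary like `dict_splade` can be used for retrieval by taking the dot product between a query representation and a document representation. The sketch below illustrates that scoring step with hand-written weights. The helper `sparse_dot` and the toy values are hypothetical, and a real system would typically use an inverted index rather than pairwise comparison.

```python
def sparse_dot(query_rep, doc_rep):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict so the cost depends on the shorter vector.
    if len(query_rep) > len(doc_rep):
        query_rep, doc_rep = doc_rep, query_rep
    return sum(w * doc_rep[t] for t, w in query_rep.items() if t in doc_rep)

# Hypothetical weights; in practice these would come from the encoder output.
query_rep = {"筑波": 2.0, "大学": 1.0, "研究": 0.5}
doc_rep = {"筑波": 1.5, "研究": 2.0, "学生": 1.0}

similarity = sparse_dot(query_rep, doc_rep)
print(similarity)
```

Only tokens present in both representations contribute, which is why expansion terms added by the model can help match documents that do not share surface words with the query.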