Model Summary

NLLB-SigLIP-MRL combines the text encoder from the NLLB model with the image encoder from the SigLIP model, which extends the model's capabilities to the 201 languages of Flores-200. This version was trained with a variation of Matryoshka Representation Learning, so it can produce embeddings of sizes [32, 64, 128, 256, 512] in addition to the original 1152. Based on the benchmarks below, embeddings of sizes 256 and 512 preserve 90%+ of the full embedding quality.
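
To illustrate the general idea behind Matryoshka-style embeddings, here is a minimal sketch in plain Python. It assumes the standard formulation, where a smaller embedding is the leading prefix of the full vector, re-normalized to unit length. This model uses a variation of that method, so the sketch is not a description of its internals. The helper names and the toy 8-dimensional vectors are hypothetical.

```python
import math


def truncate_embedding(embedding, size):
    """Keep the first `size` dimensions and re-normalize to unit length."""
    if not 0 < size <= len(embedding):
        raise ValueError("size must be between 1 and the embedding length")
    prefix = embedding[:size]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return list(prefix)
    return [x / norm for x in prefix]


def cosine_similarity(a, b):
    """Dot product of two unit-length vectors of equal size."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for real model outputs (illustrative values only).
image_embedding = [0.9, 0.1, -0.3, 0.2, 0.05, -0.1, 0.0, 0.02]
text_embedding = [0.8, 0.2, -0.25, 0.1, -0.05, 0.1, 0.03, 0.0]

for size in (2, 4, 8):
    img = truncate_embedding(image_embedding, size)
    txt = truncate_embedding(text_embedding, size)
    print(size, round(cosine_similarity(img, txt), 4))
```

Smaller embeddings reduce storage and speed up similarity search, at the cost of some retrieval quality.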

[Figure: benchmark results]

With the full-size embedding, the model sets a new state of the art for multilingual image and text retrieval on both XTD10 and Crossmodal-3600.

| Dataset | Image retrieval R@1, avg | Text retrieval R@1, avg | Image retrieval R@5, avg | Text retrieval R@5, avg | Image retrieval R@10, avg | Text retrieval R@10, avg |
|---|---|---|---|---|---|---|
| Crossmodal-3600 | 0.6079 | 0.5741 | 0.8333 | 0.8174 | 0.8922 | 0.8816 |
| XTD10 | 0.6997 | 0.6433 | 0.8988 | 0.8848 | 0.9503 | 0.9449 |
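
For readers unfamiliar with the R@K columns, the following is a minimal sketch of how Recall@K is commonly computed from a query-by-candidate similarity matrix. It assumes each query has exactly one correct candidate, located at the same index. The function name and the toy scores are hypothetical, and this is not the evaluation code behind the numbers above.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct candidate is among the top-k results.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for 3 text queries against 3 images (illustrative values only).
similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct image ranked first
    [0.4, 0.2, 0.7],  # query 1: correct image ranked third
    [0.2, 0.6, 0.5],  # query 2: correct image ranked second
]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(similarity, k):.4f}")
```

With text queries as rows, this measures image retrieval. Transposing the matrix (for example with `list(zip(*similarity))`) gives the text retrieval direction.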

How to use

Variable resolutions

To use the model with variable embedding sizes:

!pip install -U transformers open_clip_torch
from transformers import AutoModel
from PIL import Image
import requests
import torch

model = AutoModel.from_pretrained("visheratin/nllb-siglip-mrl-large", device="cpu", trust_remote_code=True)

image_path = "https://huggingface.co/spaces/jjourney1125/swin2sr/resolve/main/samples/butterfly.jpg"
image = Image.open(requests.get(image_path, stream=True).raw)

class_options = ["бабочка", "butterfly", "kat"]
class_langs = ["rus_Cyrl", "eng_Latn", "afr_Latn"]

image_logits, text_logits = model.get_logits(
    images=[image],
    texts=class_options,
    langs=class_langs,
    resolution=512,  # embedding size (32, 64, 128, 256 or 512); set `None` to use the original size
)

print(torch.softmax(image_logits, dim=1))
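
The last line above turns the image-to-text logits into one probability per candidate label. As a rough sketch of what that step does and how to read off the predicted label, here is the same computation in plain Python. The logit values are placeholders chosen for illustration, not output from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class_options = ["бабочка", "butterfly", "kat"]
placeholder_logits = [2.0, 2.5, -1.0]  # illustrative values, NOT model output

probs = softmax(placeholder_logits)
best = max(range(len(probs)), key=lambda i: probs[i])
print(class_options[best], round(probs[best], 4))
```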

OpenCLIP

This model is also integrated into OpenCLIP, so you can use it like any other OpenCLIP model:

!pip install -U open_clip_torch
from open_clip import create_model_from_pretrained, get_tokenizer
from PIL import Image
import requests
import torch

model, transform = create_model_from_pretrained("nllb-clip-large-siglip", "mrl", device="cuda")

tokenizer = get_tokenizer("nllb-clip-large-siglip")

class_options = ["бабочка", "butterfly", "kat"]
class_langs = ["rus_Cyrl", "eng_Latn", "afr_Latn"]

text_inputs = []
for i in range(len(class_options)):
    tokenizer.set_language(class_langs[i])
    text_inputs.append(tokenizer(class_options[i]))
text_inputs = torch.stack(text_inputs).squeeze(1).to("cuda")

image_path = "https://huggingface.co/spaces/jjourney1125/swin2sr/resolve/main/samples/butterfly.jpg"
image = Image.open(requests.get(image_path, stream=True).raw)

image_inputs = transform(image).unsqueeze(0).to("cuda")

with torch.inference_mode():
    logits_per_image, logits_per_text = model.get_logits(image_inputs, text_inputs)

print(logits_per_image.softmax(dim=-1))

Acknowledgements

I thank ML Collective for providing Google Cloud compute resources.
