max_seq_length
What is the max_seq_length of this model?
https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0#using-huggingface-transformers
the large model code says max_length=512,
https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0#using-huggingface-transformers
but the medium model code says max_length=8192.
Is it right that their max_seq_lengths are different?
I believe this is correct; it's based on the maximum sequence lengths of the respective base models.
You mean 512?
Yes, large should have a maximum sequence length of 512 tokens, and medium a maximum sequence length of 8192. Folks from Snowflake should be able to confirm.
But when I run the following code:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")
print(model.max_seq_length)
I get 8192.
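One way to see where a value like this might come from is to compare the two limits a checkpoint can carry: the tokenizer's `model_max_length` and the config's `max_position_embeddings`. The sketch below is only a rough illustration; `summarize_limits` and `load_and_summarize` are hypothetical helper names, and the numbers in the offline demo are made up rather than read from either Snowflake model.

```python
from types import SimpleNamespace


def summarize_limits(tokenizer, config):
    """Collect the two length limits that can disagree for one checkpoint."""
    return {
        # Tokenizer-side limit (normally populated from tokenizer_config.json).
        "tokenizer_model_max_length": getattr(tokenizer, "model_max_length", None),
        # Model-side limit: number of rows in the position embedding table.
        "config_max_position_embeddings": getattr(config, "max_position_embeddings", None),
    }


def load_and_summarize(model_id):
    """Fetch tokenizer and config (no weights) from the Hub and summarize them."""
    # Imported lazily so summarize_limits works without transformers installed.
    from transformers import AutoConfig, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    config = AutoConfig.from_pretrained(model_id)
    return summarize_limits(tokenizer, config)


# Offline demo with stand-in objects; these values are illustrative only.
demo = summarize_limits(
    SimpleNamespace(model_max_length=8192),
    SimpleNamespace(max_position_embeddings=514),
)
print(demo)
```

With network access, `load_and_summarize("Snowflake/snowflake-arctic-embed-l-v2.0")` should return the real values for the large model, which can then be compared against what `model.max_seq_length` reports.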
Oh, you're right. That's due to https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0/blob/main/tokenizer_config.json#L50
cc @spacemanidol @pxyu
I'm pretty sure it's not possible for an XLM-RoBERTa fine-tune to exceed 512 tokens unless the positional embedding matrix has been extended.
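For context, here is a minimal pure-Python sketch of why a stock XLM-RoBERTa checkpoint tops out at 512 tokens. It assumes the RoBERTa-style convention that position ids for real tokens start at `padding_idx + 1`, together with the stock XLM-R values `max_position_embeddings=514` and `padding_idx=1`. Both function names are hypothetical, and the 8194-row table is an assumed example of an extended checkpoint, not a value checked against the Snowflake configs.

```python
def roberta_position_ids(input_ids, padding_idx=1):
    """RoBERTa-style position ids: pads keep padding_idx, real tokens count up from padding_idx + 1."""
    position_ids = []
    seen = 0  # number of non-padding tokens seen so far
    for token_id in input_ids:
        if token_id == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(seen + padding_idx)
    return position_ids


def max_usable_tokens(max_position_embeddings, padding_idx=1):
    """Longest non-padded input whose position ids all fit in the embedding table."""
    # Valid rows are 0 .. max_position_embeddings - 1, and the last token of an
    # n-token input receives position id n + padding_idx.
    return max_position_embeddings - 1 - padding_idx


print(roberta_position_ids([0, 5, 6, 2, 1, 1]))  # two trailing pad tokens
print(max_usable_tokens(514))   # stock XLM-R table size
print(max_usable_tokens(8194))  # assumed size of an extended table
```

Under these assumptions, a 514-row table leaves room for 512 tokens, and a table extended to 8194 rows would leave room for 8192.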
Never mind, looks like they can actually process ~6k tokens. This is the shape of the token embeddings for 2 queries: torch.Size([2, 6005, 1024]). Perhaps the max. sequence length is actually 8192 - apologies for the confusion, I'll let the Snowflake team answer.
Both models handle 8192 tokens. We use the adjusted version of XLM-R provided by the BGE team (BAAI/bge-m3-retromae), which has been extended for 8k context support, so the normal XLM-R rules don't apply, haha. Let me get a fix in for the erroneous large model example code!
Updated in the README, so closing.