max_seq_length

#3
by yjoonjang - opened

What is the max_seq_length of this model?
https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0#using-huggingface-transformers
the large model code says max_length=512,
https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0#using-huggingface-transformers
but the medium model code says max_length=8192.

Is it right that their max_seq_lengths are different?

I believe this is correct, it's based on the maximum sequence lengths of the respective base models.

> I believe this is correct, it's based on the maximum sequence lengths of the respective base models.

You mean 512?

Yes, large should have a maximum sequence length of 512 tokens, and medium a maximum sequence length of 8192. Folks from Snowflake should be able to confirm.

But when I run the following code:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")
print(model.max_seq_length)

I get 8192.

Oh, you're right. That's due to https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0/blob/main/tokenizer_config.json#L50
cc @spacemanidol @pxyu I'm pretty sure it's not possible for an XLM-RoBERTa finetune to exceed 512 tokens unless you've updated the positional embedding matrix.

Never mind, it looks like the model can actually process ~6k tokens. This is the shape of the token embeddings for 2 queries: torch.Size([2, 6005, 1024]). Perhaps the max. sequence length is actually 8192. Apologies for the confusion, I'll let the Snowflake team answer.

Both models handle 8192. We use the adjusted version of XLM-R provided by the BGE team (BAAI/bge-m3-retromae), which has been extended for 8k context support, so the normal XLM-R rules don't apply, haha. Let me get a fix in for the erroneous large model example code!
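
For readers wondering why a stock XLM-R checkpoint stops at 512 tokens while an extended one can reach 8192, here is a rough sketch of the arithmetic. It assumes the RoBERTa-style convention that position ids for real tokens start at pad_token_id + 1. The function names are hypothetical and the config values (514, 8194, pad_token_id=1) are assumptions for illustration rather than values read from the model repositories.

```python
# Sketch only: the numbers passed in below are assumed, illustrative config
# values, not values loaded from any Hugging Face repository.

def usable_positions(max_position_embeddings: int, pad_token_id: int = 1) -> int:
    """Number of real tokens a RoBERTa-style position table can address.

    RoBERTa-style models number real tokens starting at pad_token_id + 1,
    so the first pad_token_id + 1 rows are never used for real tokens.
    """
    return max_position_embeddings - (pad_token_id + 1)


def effective_max_seq_length(
    model_max_length: int,
    max_position_embeddings: int,
    pad_token_id: int = 1,
) -> int:
    """The smaller of the tokenizer limit and the position-table capacity."""
    return min(model_max_length, usable_positions(max_position_embeddings, pad_token_id))


# A tokenizer that claims 8192 on top of a 514-row position table is still
# limited to 512 tokens; an extended 8194-row table supports the full 8192.
stock = effective_max_seq_length(model_max_length=8192, max_position_embeddings=514)
extended = effective_max_seq_length(model_max_length=8192, max_position_embeddings=8194)
print(stock, extended)  # 512 8192
```

Under these assumptions, the tokenizer setting alone cannot raise the limit; the position embedding table has to be extended as well, which is what the reply above describes for the BGE-adjusted base model.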


Updated in the README, so closing.

spacemanidol changed discussion status to closed
