context-size is 32k right?
The config.json indicates:
"max_position_embeddings": 32768,
and that, I have to imagine, just about settles the matter. But I noticed that the blog https://sea-sailor.github.io/blog/sailor2/ says: "Currently, Sailor2 supports a context length of up to 4K tokens".
I can't imagine that Qwen2ForCausalLM models could be intended to modify max_position_embeddings by some constant, so it's got to be 32k. If I am wrong, please do let me know more details.
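For anyone who wants to check what a checkpoint declares, here is a minimal sketch that reads max_position_embeddings out of a config.json. The JSON text below is a trimmed, illustrative excerpt (only the two fields mentioned in this thread), and the helper name declared_context_length is made up for the example.

```python
import json

# Trimmed, illustrative excerpt of a config.json. Only the fields discussed
# in this thread are shown; a real file contains many more keys.
CONFIG_TEXT = """
{
  "architectures": ["Qwen2ForCausalLM"],
  "max_position_embeddings": 32768
}
"""


def declared_context_length(config_text: str) -> int:
    """Return the max_position_embeddings value declared in a config.json string."""
    config = json.loads(config_text)
    return int(config["max_position_embeddings"])


print(declared_context_length(CONFIG_TEXT))
```

Note that this only reports what the config declares; it says nothing about the sequence lengths the model was actually trained on.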
Yes, you're right. The current version of Sailor2 supports a context length of only 4K. This configuration was directly inherited from the Qwen2.5 version purely for simplicity and ease of use. However, we recommend using only 4k input for this version.
We are currently preparing the 32K version of Sailor2-1B, as well as the 128K versions of Sailor2-8B and Sailor2-20B, which are expected to be released by the end of this month. Please stay tuned.
haha but that is still confusing. max_position_embeddings is context length. There are no special configuration parameters that would modify it, and the default for Qwen2ForCausalLM (which is the type this model is set to) is an unmodified max_position_embeddings. To my knowledge, and you can read this on the model's card here on hf, even the 0.5B Qwen2.5 model supports the full 32k -- and 8k max tokens (output).
Did you always train using only 4k of the 32k context length? (I mean, you didn't provide a change to the base model's config, so it is currently configured to allow 32k context size -- but if the training didn't exercise it, I guess not all the weights were trained as a consequence. If that is so, maybe users should set max_position_embeddings to 4k, at least in the chat model, where the context will casually grow as the user continues to interact with it in many applications.)
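To make the suggestion concrete, here is a rough sketch of capping the declared context length in a local copy of the config. It works on a plain dict as you would get from json.load; the helper name cap_context_length and the example dict are illustrative assumptions, not something shipped with the model.

```python
import copy


def cap_context_length(config: dict, limit: int = 4096) -> dict:
    """Return a copy of the config whose max_position_embeddings is at most `limit`."""
    capped = copy.deepcopy(config)
    # If the key is missing, fall back to the limit itself.
    current = capped.get("max_position_embeddings", limit)
    # Only ever lower the value; never raise it above what the config declares.
    capped["max_position_embeddings"] = min(current, limit)
    return capped


# Illustrative config with only the fields discussed in this thread.
original = {"architectures": ["Qwen2ForCausalLM"], "max_position_embeddings": 32768}
patched = cap_context_length(original)

print(patched["max_position_embeddings"])
print(original["max_position_embeddings"])
```

The original dict is left untouched, so the patched version can be written to a separate file and compared before use.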
ref: https://arxiv.org/pdf/2412.15115 table 1
Thanks for pointing that out! It’s much clearer to explicitly state this in the config.json. I’ve updated the configuration by setting max_position_embeddings=4096 across all Sailor2 models.
Meanwhile, we’ve conducted tests on the RULER benchmark. The results indicate that Sailor2-1B can still achieve reasonable performance with an 8K context length, even though it was trained on chunks of only 4K. This means users can process longer contexts beyond 4K, depending on the inference framework (e.g., hf/vllm) and its behavior (warning-but-still-generate or cutoff-then-stop-generate).
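Since the behavior past the limit depends on the framework, one option is to enforce the budget on the application side. Below is a minimal sketch, assuming you already have per-turn token counts from your tokenizer; trim_history, the 512-token output reserve, and the sample counts are all hypothetical choices for illustration.

```python
from typing import List, Tuple


def trim_history(
    turns: List[Tuple[str, int]],
    context_limit: int = 4096,
    reserve_for_output: int = 512,
) -> List[Tuple[str, int]]:
    """Keep the most recent (text, token_count) turns that fit in the prompt budget."""
    budget = context_limit - reserve_for_output
    kept: List[Tuple[str, int]] = []
    used = 0
    # Walk from the newest turn backwards and stop at the first one that does not fit.
    for text, n_tokens in reversed(turns):
        if used + n_tokens > budget:
            break
        kept.append((text, n_tokens))
        used += n_tokens
    kept.reverse()  # restore chronological order
    return kept


# Hypothetical chat history: (turn text, token count).
history = [("turn 1", 2000), ("turn 2", 1500), ("turn 3", 1000), ("turn 4", 900)]

print(trim_history(history))                      # 4K budget drops the oldest turn
print(trim_history(history, context_limit=8192))  # 8K budget keeps everything
```

Dropping whole turns from the oldest end is just one policy; summarizing older turns or truncating inside a turn are alternatives with different trade-offs.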
The Sailor-1B-0108 version is still under training. Hopefully, it will recover the long-context capability comparable to Qwen2.5-0.5B.
Many thanks for your valuable suggestions and contributions to Sailor2!
wow, that improvement is promising! :)
Thank you for attending to my doubts. This makes it easier for me to comfortably make the importance matrix that I wanted to make for quantizing this model. I think it may help people feel more confident with the model in practice as well.
ps I see what you mean, it looks like there was not so much loss even at 8k -- that benchmark is the average in the target languages? anyway, I can't wait to see Sailor 2.1!
Thanks for your encouragement!
In this figure, we evaluate three models on RULER (currently supporting only English). While RULER may not be a perfect solution, it provides some indication of the models' long-context capabilities.
https://arxiv.org/pdf/2404.06654
We are also exploring ways to evaluate long-context capabilities for other SEA languages. However, finding (or building) a suitable benchmark for such testing has proven challenging.
In the future, we may consider developing a new benchmark to address this gap.