Weird gibberish when using suggested template
I'm using a recent version of Llama.cpp (exact build noted below).
I have a really weird issue where running the model with the default template returns complete garbage almost 100% of the time.
See the following:
sighs My brain is buffer at the moment. But I usually use my! taps head Oh! Never mind, I'm actually using. checks notes on on! on and the! the the! and the. andand the. and And the-
And
is it? or is it not? to be? or not to be? that is the question. that's the question. that is. the question. that! is. the! question. that's! that is. that's! that are. the! that! are. the! that! are.-
At first I thought it was something wrong on my end. I reinstalled everything related to CUDA, rebuilt Llama.cpp, double-checked all of my configurations, disabled all samplers, and even requantized the model myself.
After a good 10 hours of trying to figure out what was wrong, I loaded up the original 27B-it model from Google, and (aside from template changes) it worked perfectly fine without any adjustments. So just for the hell of it, I used the default Google template with this version of the model, and that seems to have resolved the issue. The only remaining problem is that the model still uses the ChatML EOT token; however, formatting the chat history with the <end_of_turn> and <start_of_turn> tokens instead of the ChatML ones allows the model to generate coherent output without issues.
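For reference, this is roughly what my workaround looks like when building the prompt by hand (a minimal sketch, not my exact code; the helper name is just for illustration):

```python
# Sketch: format chat history with Gemma-style turn tokens
# instead of the ChatML <|im_start|>/<|im_end|> ones.
def format_gemma_prompt(messages):
    """messages: list of {"role": "user" | "assistant", "content": str} dicts."""
    prompt = ""
    for msg in messages:
        # Gemma-style templates use "model" for the assistant role.
        role = "model" if msg["role"] == "assistant" else "user"
        prompt += f"<start_of_turn>{role}\n{msg['content']}<end_of_turn>\n"
    # End on an open model turn so generation continues from there.
    prompt += "<start_of_turn>model\n"
    return prompt

print(format_gemma_prompt([{"role": "user", "content": "Hello!"}]))
```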
I don't know why this is, but it seemed worth bringing up. I'm assuming there's still going to be some quality loss by doing this, but given the alternative (above) it doesn't seem like I have many options.
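(Side note: llama.cpp also ships a built-in Gemma template, so passing `--chat-template gemma` to llama-cli/llama-server should apply the `<start_of_turn>`/`<end_of_turn>` formatting without hand-rolling the prompt. I haven't confirmed that on this exact build, so take it as a suggestion rather than a verified fix.)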
At least Q8_0, Q6_K, and Q4_K_M are affected. I haven't tried any of the other ones, but the issue appears on all of those quants, both the ones provided here and new quants I created specifically for testing.
Llama.cpp version b4439
Tried Q6_K with llama.cpp (081b29): works fine with both regular ChatML (auto-detected by llama.cpp) and the ChatML template suggested in the readme. Also tried reverting to the commit above, but no repro was possible.