Recent change to the rstrip property of special tokens

#59
by xxhansh - opened

Hi, a recent commit introduced a breaking change to Phi-3's tokenizer by adding rstrip options to the special tokens: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/commit/4eea1a7b25a14b098aab569599563c37443312cb
With this change, however, the tokenizer is no longer lossless. For example:

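from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
tokens = tokenizer.encode("<|user|>\n<|end|>\n<|assistant|>")
print(tokenizer.decode(tokens))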

will output <|user|><|end|><|assistant|>, with the newlines dropped. IMO this causes a lot of trouble (for example, when drawing ASCII art). Is this change meant to align the tokenizer with how text was formatted during training? If so, I think it would be better to implement it by changing chat_template to a newline-free format, since intuitively this belongs to the chat serialization step, not the text tokenization step.
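To make the round-trip problem concrete without downloading the model, here is a minimal sketch. It uses a toy character-level tokenizer, not Phi-3's real one: `toy_encode`, `toy_decode`, and `is_lossless` are hypothetical names, and the `rstrip` flag only imitates the documented behavior of dropping whitespace to the right of a special token.

```python
import re

# Special tokens taken from the example above.
SPECIAL_TOKENS = ["<|user|>", "<|end|>", "<|assistant|>"]
_SPECIAL_RE = re.compile("(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")")


def toy_encode(text, rstrip):
    """Toy tokenizer (assumption, not the real Phi-3 tokenizer).

    Special tokens become single pieces; all other text becomes one piece
    per character. With rstrip=True, whitespace immediately to the right
    of a special token is dropped, imitating the rstrip option.
    """
    pieces = []
    strip_next = False
    for part in _SPECIAL_RE.split(text):
        if part in SPECIAL_TOKENS:
            pieces.append(part)
            strip_next = rstrip
        else:
            if strip_next:
                part = part.lstrip()  # whitespace after a special token is lost
            pieces.extend(part)  # one piece per character
            strip_next = False
    return pieces


def toy_decode(pieces):
    """Concatenate the pieces back into text."""
    return "".join(pieces)


def is_lossless(encode, decode, text):
    """Return True if decode(encode(text)) reproduces text exactly."""
    return decode(encode(text)) == text


text = "<|user|>\n<|end|>\n<|assistant|>"
print(is_lossless(lambda t: toy_encode(t, rstrip=False), toy_decode, text))
print(is_lossless(lambda t: toy_encode(t, rstrip=True), toy_decode, text))
print(repr(toy_decode(toy_encode(text, rstrip=True))))
```

The same `is_lossless` check could be run against a real tokenizer by passing its own encode and decode callables, though other settings (for example an automatically added BOS token) may also affect the comparison.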

Microsoft org

I agree, this is quite problematic for any use case that iteratively walks between text and tokens (e.g. in guidance: https://github.com/guidance-ai/guidance).
