Suggested tokenizer changes by Unsloth.ai
Oh hey @gugarosa :) Hopefully these fixes are all correct! Also wrote up a blog post about it here: https://unsloth.ai/blog/phi4
If you need any help, ask away!
What I don't understand is the choice of <|dummy_87|> for the padding token. What is the purpose of the <|dummy_x|> special tokens?
Thanks @danielhanchen. The blog post was really helpful; we are just running some extra tests to ensure that no capability is lost, but everything is looking good so far.
@dkleine
Since we padded the vocabulary size to a multiple of 64 (for better performance), we had to add a set of unused tokens to it, which ended up being called "dummy" tokens. These tokens were not used during pre-training or fine-tuning, but they can later be repurposed to add new functionality to the model, for example a <|im_retrieve|> token or something else.
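For anyone curious, a minimal sketch of the padding arithmetic (the numbers below are purely illustrative, not the exact Phi-4 counts):

```python
import math

def padded_vocab_size(true_vocab_size: int, multiple: int = 64) -> int:
    # Round the vocabulary up to the next multiple of `multiple`;
    # the extra slots become the unused <|dummy_x|> tokens.
    return math.ceil(true_vocab_size / multiple) * multiple

# Illustrative numbers only:
print(padded_vocab_size(100_261))             # -> 100288
print(padded_vocab_size(100_261) - 100_261)   # -> 27 dummy slots
```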
The choice of <|dummy_87|> was purely arbitrary, probably because it was the last token in the vocabulary; it could have been any other dummy token. It could even be replaced by another string, say <|im_pad|>, since it has never been used before.
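As a rough sketch of how an unused slot can be repurposed as the pad token via the transformers API (this assumes the "microsoft/phi-4" repo id and that <|dummy_87|> is registered as an added special token; actually renaming the string, e.g. to <|im_pad|>, would additionally require editing the tokenizer files):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-4")

# Point pad_token at the unused slot; no embedding resize is needed because
# the token id already exists in the (padded) vocabulary.
tok.pad_token = "<|dummy_87|>"
print(tok.pad_token, tok.pad_token_id)
```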
Was the model trained using the same token for both bos and eos? If so, how can modifying one now not disrupt the model's performance, given that these tokens define sequence boundaries and altering them could cause premature stopping, incoherent generation, misaligned embeddings, and degraded task performance? @danielhanchen mentioned better metrics for unsloth/phi-4, but do those metrics capture premature stopping, incoherent generation, etc.?