Some weights of the model checkpoint at susnato/phi-2 were not used when initializing PhiForCausalLM

#4
by g-ronimo - opened
model_path="susnato/phi-2"
model = AutoModelForCausalLM.from_pretrained(
    model_path, 
    torch_dtype=torch.bfloat16, 
    device_map="auto",
)

Model outputs only garbage. What's wrong here?
Using latest transformers git pulled an hour ago

full error message:

Some weights of the model checkpoint at susnato/phi-2 were not used when initializing PhiForCausalLM: ['model.layers.0.self_attn.query_key_value.bias', 'model.layers.0.self_attn.query_key_value.weight', 'model.layers.1.self_attn.query_key_value.bias', 'model.layers.1.self_attn.query_key_value.weight', 'model.layers.10.self_attn.query_key_value.bias', 'model.layers.10.self_attn.query_key_value.weight', 'model.layers.11.self_attn.query_key_value.bias', 'model.layers.11.self_attn.query_key_value.weight', 'model.layers.12.self_attn.query_key_value.bias', 'model.layers.12.self_attn.query_key_value.weight', 'model.layers.13.self_attn.query_key_value.bias', 'model.layers.13.self_attn.query_key_value.weight', 'model.layers.14.self_attn.query_key_value.bias', 'model.layers.14.self_attn.query_key_value.weight', 'model.layers.15.self_attn.query_key_value.bias', 'model.layers.15.self_attn.query_key_value.weight', 'model.layers.16.self_attn.query_key_value.bias', 'model.layers.16.self_attn.query_key_value.weight', 'model.layers.17.self_attn.query_key_value.bias', 'model.layers.17.self_attn.query_key_value.weight', 'model.layers.18.self_attn.query_key_value.bias', 'model.layers.18.self_attn.query_key_value.weight', 'model.layers.19.self_attn.query_key_value.bias', 'model.layers.19.self_attn.query_key_value.weight', 'model.layers.2.self_attn.query_key_value.bias', 'model.layers.2.self_attn.query_key_value.weight', 'model.layers.20.self_attn.query_key_value.bias', 'model.layers.20.self_attn.query_key_value.weight', 'model.layers.21.self_attn.query_key_value.bias', 'model.layers.21.self_attn.query_key_value.weight', 'model.layers.22.self_attn.query_key_value.bias', 'model.layers.22.self_attn.query_key_value.weight', 'model.layers.23.self_attn.query_key_value.bias', 'model.layers.23.self_attn.query_key_value.weight', 'model.layers.24.self_attn.query_key_value.bias', 'model.layers.24.self_attn.query_key_value.weight', 'model.layers.25.self_attn.query_key_value.bias', 'model.layers.25.self_attn.query_key_value.weight', 'model.layers.26.self_attn.query_key_value.bias', 'model.layers.26.self_attn.query_key_value.weight', 'model.layers.27.self_attn.query_key_value.bias', 'model.layers.27.self_attn.query_key_value.weight', 'model.layers.28.self_attn.query_key_value.bias', 'model.layers.28.self_attn.query_key_value.weight', 'model.layers.29.self_attn.query_key_value.bias', 'model.layers.29.self_attn.query_key_value.weight', 'model.layers.3.self_attn.query_key_value.bias', 'model.layers.3.self_attn.query_key_value.weight', 'model.layers.30.self_attn.query_key_value.bias', 'model.layers.30.self_attn.query_key_value.weight', 'model.layers.31.self_attn.query_key_value.bias', 'model.layers.31.self_attn.query_key_value.weight', 'model.layers.4.self_attn.query_key_value.bias', 'model.layers.4.self_attn.query_key_value.weight', 'model.layers.5.self_attn.query_key_value.bias', 'model.layers.5.self_attn.query_key_value.weight', 'model.layers.6.self_attn.query_key_value.bias', 'model.layers.6.self_attn.query_key_value.weight', 'model.layers.7.self_attn.query_key_value.bias', 'model.layers.7.self_attn.query_key_value.weight', 'model.layers.8.self_attn.query_key_value.bias', 'model.layers.8.self_attn.query_key_value.weight', 'model.layers.9.self_attn.query_key_value.bias', 'model.layers.9.self_attn.query_key_value.weight']
- This IS expected if you are initializing PhiForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing PhiForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of PhiForCausalLM were not initialized from the model checkpoint at susnato/phi-2 and are newly initialized: ['model.layers.0.self_attn.k_proj.bias', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.0.self_attn.q_proj.bias', 'model.layers.0.self_attn.q_proj.weight', 'model.layers.0.self_attn.v_proj.bias', 'model.layers.0.self_attn.v_proj.weight', 'model.layers.1.self_attn.k_proj.bias', 'model.layers.1.self_attn.k_proj.weight', 'model.layers.1.self_attn.q_proj.bias', 'model.layers.1.self_attn.q_proj.weight', 'model.layers.1.self_attn.v_proj.bias', 'model.layers.1.self_attn.v_proj.weight', 'model.layers.10.self_attn.k_proj.bias', 'model.layers.10.self_attn.k_proj.weight', 'model.layers.10.self_attn.q_proj.bias', 'model.layers.10.self_attn.q_proj.weight', 'model.layers.10.self_attn.v_proj.bias', 'model.layers.10.self_attn.v_proj.weight', 'model.layers.11.self_attn.k_proj.bias', 'model.layers.11.self_attn.k_proj.weight', 'model.layers.11.self_attn.q_proj.bias', 'model.layers.11.self_attn.q_proj.weight', 'model.layers.11.self_attn.v_proj.bias', 'model.layers.11.self_attn.v_proj.weight', 'model.layers.12.self_attn.k_proj.bias', 'model.layers.12.self_attn.k_proj.weight', 'model.layers.12.self_attn.q_proj.bias', 'model.layers.12.self_attn.q_proj.weight', 'model.layers.12.self_attn.v_proj.bias', 'model.layers.12.self_attn.v_proj.weight', 'model.layers.13.self_attn.k_proj.bias', 'model.layers.13.self_attn.k_proj.weight', 'model.layers.13.self_attn.q_proj.bias', 'model.layers.13.self_attn.q_proj.weight', 'model.layers.13.self_attn.v_proj.bias', 'model.layers.13.self_attn.v_proj.weight', 'model.layers.14.self_attn.k_proj.bias', 'model.layers.14.self_attn.k_proj.weight', 'model.layers.14.self_attn.q_proj.bias', 'model.layers.14.self_attn.q_proj.weight', 'model.layers.14.self_attn.v_proj.bias', 'model.layers.14.self_attn.v_proj.weight', 'model.layers.15.self_attn.k_proj.bias', 'model.layers.15.self_attn.k_proj.weight', 'model.layers.15.self_attn.q_proj.bias', 'model.layers.15.self_attn.q_proj.weight', 'model.layers.15.self_attn.v_proj.bias', 'model.layers.15.self_attn.v_proj.weight', 'model.layers.16.self_attn.k_proj.bias', 'model.layers.16.self_attn.k_proj.weight', 'model.layers.16.self_attn.q_proj.bias', 'model.layers.16.self_attn.q_proj.weight', 'model.layers.16.self_attn.v_proj.bias', 'model.layers.16.self_attn.v_proj.weight', 'model.layers.17.self_attn.k_proj.bias', 'model.layers.17.self_attn.k_proj.weight', 'model.layers.17.self_attn.q_proj.bias', 'model.layers.17.self_attn.q_proj.weight', 'model.layers.17.self_attn.v_proj.bias', 'model.layers.17.self_attn.v_proj.weight', 'model.layers.18.self_attn.k_proj.bias', 'model.layers.18.self_attn.k_proj.weight', 'model.layers.18.self_attn.q_proj.bias', 'model.layers.18.self_attn.q_proj.weight', 'model.layers.18.self_attn.v_proj.bias', 'model.layers.18.self_attn.v_proj.weight', 'model.layers.19.self_attn.k_proj.bias', 'model.layers.19.self_attn.k_proj.weight', 'model.layers.19.self_attn.q_proj.bias', 'model.layers.19.self_attn.q_proj.weight', 'model.layers.19.self_attn.v_proj.bias', 'model.layers.19.self_attn.v_proj.weight', 'model.layers.2.self_attn.k_proj.bias', 'model.layers.2.self_attn.k_proj.weight', 'model.layers.2.self_attn.q_proj.bias', 'model.layers.2.self_attn.q_proj.weight', 'model.layers.2.self_attn.v_proj.bias', 'model.layers.2.self_attn.v_proj.weight', 'model.layers.20.self_attn.k_proj.bias', 'model.layers.20.self_attn.k_proj.weight', 'model.layers.20.self_attn.q_proj.bias', 'model.layers.20.self_attn.q_proj.weight', 'model.layers.20.self_attn.v_proj.bias', 'model.layers.20.self_attn.v_proj.weight', 'model.layers.21.self_attn.k_proj.bias', 'model.layers.21.self_attn.k_proj.weight', 'model.layers.21.self_attn.q_proj.bias', 'model.layers.21.self_attn.q_proj.weight', 'model.layers.21.self_attn.v_proj.bias', 'model.layers.21.self_attn.v_proj.weight', 'model.layers.22.self_attn.k_proj.bias', 'model.layers.22.self_attn.k_proj.weight', 'model.layers.22.self_attn.q_proj.bias', 'model.layers.22.self_attn.q_proj.weight', 'model.layers.22.self_attn.v_proj.bias', 'model.layers.22.self_attn.v_proj.weight', 'model.layers.23.self_attn.k_proj.bias', 'model.layers.23.self_attn.k_proj.weight', 'model.layers.23.self_attn.q_proj.bias', 'model.layers.23.self_attn.q_proj.weight', 'model.layers.23.self_attn.v_proj.bias', 'model.layers.23.self_attn.v_proj.weight', 'model.layers.24.self_attn.k_proj.bias', 'model.layers.24.self_attn.k_proj.weight', 'model.layers.24.self_attn.q_proj.bias', 'model.layers.24.self_attn.q_proj.weight', 'model.layers.24.self_attn.v_proj.bias', 'model.layers.24.self_attn.v_proj.weight', 'model.layers.25.self_attn.k_proj.bias', 'model.layers.25.self_attn.k_proj.weight', 'model.layers.25.self_attn.q_proj.bias', 'model.layers.25.self_attn.q_proj.weight', 'model.layers.25.self_attn.v_proj.bias', 'model.layers.25.self_attn.v_proj.weight', 'model.layers.26.self_attn.k_proj.bias', 'model.layers.26.self_attn.k_proj.weight', 'model.layers.26.self_attn.q_proj.bias', 'model.layers.26.self_attn.q_proj.weight', 'model.layers.26.self_attn.v_proj.bias', 'model.layers.26.self_attn.v_proj.weight', 'model.layers.27.self_attn.k_proj.bias', 'model.layers.27.self_attn.k_proj.weight', 'model.layers.27.self_attn.q_proj.bias', 'model.layers.27.self_attn.q_proj.weight', 'model.layers.27.self_attn.v_proj.bias', 'model.layers.27.self_attn.v_proj.weight', 'model.layers.28.self_attn.k_proj.bias', 'model.layers.28.self_attn.k_proj.weight', 'model.layers.28.self_attn.q_proj.bias', 'model.layers.28.self_attn.q_proj.weight', 'model.layers.28.self_attn.v_proj.bias', 'model.layers.28.self_attn.v_proj.weight', 'model.layers.29.self_attn.k_proj.bias', 'model.layers.29.self_attn.k_proj.weight', 'model.layers.29.self_attn.q_proj.bias', 'model.layers.29.self_attn.q_proj.weight', 'model.layers.29.self_attn.v_proj.bias', 'model.layers.29.self_attn.v_proj.weight', 'model.layers.3.self_attn.k_proj.bias', 'model.layers.3.self_attn.k_proj.weight', 'model.layers.3.self_attn.q_proj.bias', 'model.layers.3.self_attn.q_proj.weight', 'model.layers.3.self_attn.v_proj.bias', 'model.layers.3.self_attn.v_proj.weight', 'model.layers.30.self_attn.k_proj.bias', 'model.layers.30.self_attn.k_proj.weight', 'model.layers.30.self_attn.q_proj.bias', 'model.layers.30.self_attn.q_proj.weight', 'model.layers.30.self_attn.v_proj.bias', 'model.layers.30.self_attn.v_proj.weight', 'model.layers.31.self_attn.k_proj.bias', 'model.layers.31.self_attn.k_proj.weight', 'model.layers.31.self_attn.q_proj.bias', 'model.layers.31.self_attn.q_proj.weight', 'model.layers.31.self_attn.v_proj.bias', 'model.layers.31.self_attn.v_proj.weight', 'model.layers.4.self_attn.k_proj.bias', 'model.layers.4.self_attn.k_proj.weight', 'model.layers.4.self_attn.q_proj.bias', 'model.layers.4.self_attn.q_proj.weight', 'model.layers.4.self_attn.v_proj.bias', 'model.layers.4.self_attn.v_proj.weight', 'model.layers.5.self_attn.k_proj.bias', 'model.layers.5.self_attn.k_proj.weight', 'model.layers.5.self_attn.q_proj.bias', 'model.layers.5.self_attn.q_proj.weight', 'model.layers.5.self_attn.v_proj.bias', 'model.layers.5.self_attn.v_proj.weight', 'model.layers.6.self_attn.k_proj.bias', 'model.layers.6.self_attn.k_proj.weight', 'model.layers.6.self_attn.q_proj.bias', 'model.layers.6.self_attn.q_proj.weight', 'model.layers.6.self_attn.v_proj.bias', 'model.layers.6.self_attn.v_proj.weight', 'model.layers.7.self_attn.k_proj.bias', 'model.layers.7.self_attn.k_proj.weight', 'model.layers.7.self_attn.q_proj.bias', 'model.layers.7.self_attn.q_proj.weight', 'model.layers.7.self_attn.v_proj.bias', 'model.layers.7.self_attn.v_proj.weight', 'model.layers.8.self_attn.k_proj.bias', 'model.layers.8.self_attn.k_proj.weight', 'model.layers.8.self_attn.q_proj.bias', 'model.layers.8.self_attn.q_proj.weight', 'model.layers.8.self_attn.v_proj.bias', 'model.layers.8.self_attn.v_proj.weight', 'model.layers.9.self_attn.k_proj.bias', 'model.layers.9.self_attn.k_proj.weight', 'model.layers.9.self_attn.q_proj.bias', 'model.layers.9.self_attn.q_proj.weight', 'model.layers.9.self_attn.v_proj.bias', 'model.layers.9.self_attn.v_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Hi, @g-ronimo , yes because of the latest commit, the weights order is now changed to seperate q,k,v layers instead of a concatenated qkv layer. I will fix this ASAP. :)

Hi @g-ronimo , I wasn't able to look into updating the weights but will try to do that tomorrow, sorry for the delay.

i'm just glad you're on it πŸ‘ thank you

@susnato hii dude, has the bug been fixed?

Hi @g-ronimo , @g-h-chen , I apologise for the huge delay.
I have updated the checkpoint and it should work now. Please install the latest transformers version 4.38.0.dev0 by running this command -

pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers

After updating the library, it should work as expected.

Please let me know if you are facing any issues with it.

works! πŸ‘Œ thank you

Thanks a lot @g-ronimo for the feedback! closing the issue since it is fixed.

susnato changed discussion status to closed

thank you! one last stupid question, what's the difference now between your model and microsoft/phi-2 ?
Is it that we can load microsoft/phi-2 only with trust_remote_code=True while yours works with the HF modeling files?

Hi @g-ronimo , If you load the latest commit from microsoft/phi-2 now, then it's the same. So, you can use any of these.

Actually some months ago I contributed phi (1 and 1.5) to HF transformers library so Arthur told me to also add phi-2 (mainly to reorder the weights, update configs and add tests) and upload the updated weights here so that anyone who wants to use phi-2 with the library model can use it along with all the features that come with the library.
So, I did that and now we have successfully transferred the weights to microsoft/phi-2 and these repos are same now. It was a temporary fix.

Feel free to ask any question πŸ€—, I will try to answer them according to my knowledge.

Sign up or log in to comment