Number of parameters
For my understanding, why is it called a 7B model if it has 8.54B parameters in the safetensors?
@HugoLaurencon I think they are trying to compete with Mistral-7B, so they are fudging the name to make it seem smaller than it actually is, because the more popular model is a 7B. If Google has a better explanation, they can chime in here.
(Disclaimer: I'm not from the Gemma development team, and this explanation is to the best of my understanding.) The model itself contains close to 7B parameters. However, the number you see on the model page on HF also includes the embedding layer, which adds to the overall number (but is not strictly counted as part of the model size). If you look at the Mistral 7B model, it also shows slightly more than 7B parameters on its HF page. However, the vocabulary for Gemma 7B is much larger (~8x), which results in a larger number of params shown on HF.
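To make the arithmetic concrete, here is a rough sketch of how much the embedding table contributes. The totals (8.54B and 7.24B) come from this thread; the vocabulary sizes, hidden sizes, and whether the input and output embeddings are tied are my assumptions based on the published model configs, so please check them against each model's `config.json` before relying on the result.

```python
def embedding_params(vocab_size: int, hidden_size: int, tied: bool = True) -> int:
    """Parameters in the token embedding table.

    If the output head is NOT tied to the input embeddings, a second
    matrix of the same shape is counted as well.
    """
    table = vocab_size * hidden_size
    return table if tied else 2 * table


# Totals as quoted in this thread (rounded, in parameters).
GEMMA_TOTAL = 8.54e9
MISTRAL_TOTAL = 7.24e9

# ASSUMPTIONS (not stated in this thread): config values and weight tying.
GEMMA_VOCAB, GEMMA_HIDDEN, GEMMA_TIED = 256_000, 3_072, True
MISTRAL_VOCAB, MISTRAL_HIDDEN, MISTRAL_TIED = 32_000, 4_096, False

gemma_emb = embedding_params(GEMMA_VOCAB, GEMMA_HIDDEN, GEMMA_TIED)
mistral_emb = embedding_params(MISTRAL_VOCAB, MISTRAL_HIDDEN, MISTRAL_TIED)

for name, total, emb in [
    ("Gemma 7B", GEMMA_TOTAL, gemma_emb),
    ("Mistral 7B", MISTRAL_TOTAL, mistral_emb),
]:
    # Report the embedding share and what is left once it is subtracted.
    print(f"{name}: embeddings ~{emb / 1e9:.2f}B, "
          f"non-embedding ~{(total - emb) / 1e9:.2f}B")
```

Under these assumptions the Gemma embedding table alone is roughly 0.79B parameters, which leaves about 7.75B for the rest of the model, while Mistral's embeddings account for only about 0.26B. So the embeddings explain a large part of the gap, though the non-embedding count still lands above 7B.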
Ok, thanks! I'll leave this open in case people want to look at it, but feel free to close.
CodeGemma (https://goo.gle/codegemma) uses the term "size class".
I think it's better to represent it as an 8B model. Yes, Mistral-7B has slightly more than 7B parameters (7.24B), but we call it 7B because we round it off, i.e. we round the fractional part to the nearest whole number. It's a similar case with Llama-2, which has fewer than 7B parameters (6.7B, iirc) but is rounded off to 7B. :)
See related discussion at https://huggingface.co/google/gemma-7b/discussions/24#65d68271d22ae470a08c7629
Hi @HugoLaurencon, sorry for the late response. Many of those are embedding parameters, which we often do not count in the total parameter count for papers and releases. With respect to the emerging 7B class of open models, we've targeted the same use cases as other models in the 7B class from a hardware and software compatibility standpoint, so it should be strictly transferable for many, if not all, 7B-class use cases.
Thank you.