Same procedure as last time?
Sort of.
This model is a block-expanded danube2, built with the Llama Pro method of training (or fine-tuning) only the expanded blocks. To do this on limited hardware I had to expand by 2 layers per step, from the original 24 to 32. At least, that was the original plan. With the 32-layer model I used BAdam to do a "once over" with most of the datasets I also used to expand the model. While it is a faux full fine-tune, it isn't really that different from the Llama Pro method, i.e. layerwise insertion of data.
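To make the "2 layers per step" idea concrete, here is a minimal sketch of where new blocks could land in one expansion step. It assumes the interleaving described in the Llama Pro paper (split the stack into equal groups and append a copy of each group's last layer, with its output projections zeroed so it starts out as an identity block). The function names are made up for illustration, and the positions actually used for this model may differ.

```python
def expanded_layout(num_original: int, num_new: int) -> list[tuple[int, bool]]:
    """Return (source_layer, is_new) for every position in the expanded stack.

    Assumption: the original layers are split into `num_new` equal groups and
    a copy of each group's last layer is inserted right after that group.
    """
    if num_new <= 0 or num_original % num_new != 0:
        raise ValueError("num_original must be divisible by num_new")
    group_size = num_original // num_new
    layout = []
    for layer in range(num_original):
        layout.append((layer, False))      # original layer, kept frozen
        if (layer + 1) % group_size == 0:
            layout.append((layer, True))   # inserted copy, the only part trained
    return layout


def trainable_positions(layout: list[tuple[int, bool]]) -> list[int]:
    """Indices in the expanded stack that hold newly inserted blocks."""
    return [pos for pos, (_, is_new) in enumerate(layout) if is_new]


layout = expanded_layout(24, 2)
print(len(layout))                  # 26
print(trainable_positions(layout))  # [12, 25]
```

Repeating the same call with 26, 28 and 30 as the starting size walks through the remaining steps up to 32.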
I have a feeling that Llama3 and other well-trained models feel better because of markdown (formatting), personality (friendliness), and prompt compliance (preferenceness... I guess). Thus I have used Llama3 8B, WizardLM2, and Hermes 2 Pro Mistral to generate training data for this model.
To ensure that the full 8k context window could be utilised this time, I filtered openhermes, Synthia, LongAlpaca, and MathInstruct for entries with a token count between 2k and 8k, to DoRA, QLoRA, and BAdam the context window into submission. One time, elsewhere, even with lm_head as an additional target, and twice with embed_tokens.
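The length filter can be sketched roughly like this. The "text" field, the exact bounds (2048 and 8192 for "2k" and "8k") and the whitespace counter are all assumptions for illustration; in practice you would pass in the model's real tokenizer as the counting function.

```python
from typing import Callable, Iterable, Iterator


def filter_by_token_count(
    rows: Iterable[dict],
    count_tokens: Callable[[str], int],
    min_tokens: int = 2048,   # assumed meaning of "2k"
    max_tokens: int = 8192,   # assumed meaning of "8k"
) -> Iterator[dict]:
    """Yield only the rows whose text falls inside the token window."""
    for row in rows:
        n = count_tokens(row["text"])  # "text" is a hypothetical field name
        if min_tokens <= n <= max_tokens:
            yield row


def whitespace_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


rows = [
    {"text": "short"},
    {"text": "word " * 3000},
    {"text": "word " * 9000},
]
kept = list(filter_by_token_count(rows, whitespace_count))
print(len(kept))  # 1
```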
The astute among you may notice the extra special tokens like the fim and thought tokens. NinjaMouse has not been trained to use those.. Yet! Also: This is actually 34 layers. Surprise!
Here's the thing with the 2 extra layers compared to my first model. When I trained NinjaMouse2 with 32 layers I noticed that the grad_norm value would behave strangely on layers 3 and 27. The last layer before the expansion used to be 27, while 3 is a mystery. I decided to use mergekit to copy layer 3 and insert it beside the original, and to copy layer 27 and insert it at the end or top (the new 33, all 0-indexed), depending on your perspective.
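The resulting layer order can be worked out with a few lines of index arithmetic. The sketch below expresses the copy as three half-open layer ranges, which is roughly how a mergekit passthrough merge is described. The helper names are hypothetical and this is not the actual config that was used.

```python
def passthrough_slices(num_layers: int = 32, beside: int = 3, at_end: int = 27):
    """Half-open (start, end) ranges over the 32-layer model's indices."""
    return [
        (0, beside + 1),        # layers 0..3
        (beside, num_layers),   # layer 3 again, then 4..31
        (at_end, at_end + 1),   # a copy of layer 27 on top
    ]


def layer_order(slices):
    """Flatten the ranges into the source index used at each new position."""
    return [i for start, end in slices for i in range(start, end)]


slices = passthrough_slices()
order = layer_order(slices)
print(len(order))          # 34
print(order[3], order[4])  # 3 3
print(order[33])           # 27
```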
The procedure
24 -> 26
- LDJnr/Capybara
- m-a-p/Code-Feedback
- m-a-p/CodeFeedback-Filtered-Instruction
- WRN non enhanced
- abacusai/SystemChat
26 -> 28
- toolcall 10k
- migtissera/Synthia-v1.3
- TIGER-Lab/MathInstruct
28 -> 30
- glaiveai/glaive-code-assistant
- hiyouga/glaive-function-calling-v2-sharegpt
- Weyaxi/sci-datasets (w/o code feedback instruct, mathinstruct, camel)
30 -> 32
- jondurbin/airoboros-3.2
- teknium/openhermes
- WRN enhanced
- garage-bAInd/Open-Platypus
- vicgalle/alpaca-gpt4
Post tuning
Self-reward with a teacher is what this approach can be confidently called. I wish there were a distilled version of that name, but I am coming up blank.
I have any model generate a bunch of prompts that a teacher model answers with gusto (the chosen column), and then have NinjaMouse2 also answer them (as the rejects). BAM. Skibidibi doo. Have I made these DPO datasets? No. But the prompts, their evaluations, along with responses of its own, responses from better models, and evaluations of both of them are included in the training. You can find the dataset here.
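As a rough picture of the pairing step described above, here is a minimal sketch. The teacher and student are plain callables standing in for real model calls, and the column names other than "chosen" are assumptions; the evaluation columns mentioned above are left out.

```python
from typing import Callable


def build_preference_rows(
    prompts: list[str],
    teacher: Callable[[str], str],
    student: Callable[[str], str],
) -> list[dict]:
    """Pair each prompt with a teacher answer (chosen) and a student answer (rejected)."""
    return [
        {"prompt": p, "chosen": teacher(p), "rejected": student(p)}
        for p in prompts
    ]


# Stand-ins for real generation calls, purely for illustration.
def fake_teacher(prompt: str) -> str:
    return f"A thorough answer to: {prompt}"


def fake_student(prompt: str) -> str:
    return f"A shorter answer to: {prompt}"


pairs = build_preference_rows(["What is a ninja mouse?"], fake_teacher, fake_student)
print(pairs[0]["chosen"])
```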
Ollama
I have quantised this model and made it available through LM Studio and Ollama in Q4_K_M and Q6_K.
ollama run trollek/ninjamouse2:34l-q4_K_M
or
ollama run trollek/ninjamouse2:34l-q6_K
Quantizations
@cgus has done a great job with the quants. Reducing models from 16bit to ~2, and every bit in between, is Numberwang and much appreciated.
- GGUF iMatrix: cgus/NinjaMouse2-2.5B-v0.1-iMat-GGUF
- Exllamav2: cgus/NinjaMouse2-2.5B-v0.1-exl2
- GGUF: trollek/NinjaMouse2-2.5B-v0.1-GGUF
Notes
License
To use this model you agree to use it like Spider-man: Apache 2.0 + White Rabbit Neo (below)
You agree not to use the Model or Derivatives of the Model:
- In any way that violates any applicable national or international law or regulation or infringes upon the lawful rights and interests of any third party;
- For military use in any way;
- For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
- To generate or disseminate verifiably false information and/or content with the purpose of harming others;
- To generate or disseminate inappropriate content subject to applicable regulatory requirements;
- To generate or disseminate personal identifiable information without due authorization or for unreasonable use;
- To defame, disparage or otherwise harass others;
- For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation;
- For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics;
- To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
- For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories.
Template
I made this (OpenChatML-like) template for LLaMA Factory and added it to the bottom of LLama-Factory/src/llmtuner/data/template.py:
_register_template(
    name="ninja_chatml",
    format_user=StringFormatter(slots=["<|im_start|>user\n{{content}}\n<|im_end|>\n"]),  # Works
    format_assistant=StringFormatter(slots=["<|im_start|>assistant\n{{content}}\n<|im_end|>", {"eos_token"}]),  # Works
    format_system=StringFormatter(slots=["<|im_start|>system\n{{content}}\n<|im_end|>\n"]),  # NinjaMouse does not like BOS!
    format_function=FunctionFormatter(slots=["<|im_start|>assistant\n<tool_call>\n{\"name\":\"{{name}}\", \"arguments\":{{arguments}}}\n</tool_call>\n<|im_end|>", {"eos_token"}]),  # Works
    format_observation=StringFormatter(slots=["<|im_start|>tool\n<tool_response>\n{{content}}\n</tool_response>\n<|im_end|>\n"]),  # Works
    format_separator=EmptyFormatter(slots=["\n"]),  # It makes sense to keep this a new line instead of </s> and apply the eos token directly
    format_tools=ToolFormatter(tool_format="open_chatml"),
)
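To see what a conversation looks like once it passes through these slots, here is a small standalone sketch that does not need LLaMA Factory. It only concatenates the per-role strings from the template above; it ignores format_separator and however the library splits turns during training, and it assumes the EOS token is </s>.

```python
EOS_TOKEN = "</s>"  # assumption: the tokenizer's EOS string

# Per-role strings copied from the slots in the template above.
ROLE_TEMPLATES = {
    "system": "<|im_start|>system\n{content}\n<|im_end|>\n",
    "user": "<|im_start|>user\n{content}\n<|im_end|>\n",
    "assistant": "<|im_start|>assistant\n{content}\n<|im_end|>" + EOS_TOKEN,
    "tool": "<|im_start|>tool\n<tool_response>\n{content}\n</tool_response>\n<|im_end|>\n",
}


def render(messages: list[dict]) -> str:
    """Concatenate the formatted turns; no BOS token is prepended."""
    return "".join(
        ROLE_TEMPLATES[m["role"]].format(content=m["content"]) for m in messages
    )


chat = [
    {"role": "system", "content": "You are NinjaMouse."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Squeak."},
]
print(render(chat))
```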
To format the tools I have added the following code to formatter.py in the same folder.
# At the top
# The name must match the one used in ToolFormatter.apply below
OPEN_CHATML_TOOL_PROMPT = (
    "\n<tools>\n"
    "{function_description}\n"
    "</tools>\n"
)

# I only added the elif
@dataclass
class ToolFormatter(Formatter):
    def __post_init__(self):
        if self.tool_format is None:
            raise ValueError("Tool format was not found.")

    def apply(self, **kwargs) -> SLOTS:
        content = kwargs.pop("content")
        try:
            tools = json.loads(content)
            if not len(tools):
                return [""]

            if self.tool_format == "default":
                return [default_tool_formatter(tools)]
            elif self.tool_format == "open_chatml":  # This right here
                return [OPEN_CHATML_TOOL_PROMPT.format(function_description=json.dumps(tools, ensure_ascii=False, indent=4))]  # I used 4 but OpenChatML has 2
            else:
                raise NotImplementedError
        except Exception:
            # Any failure (bad JSON, unknown format) yields an empty tool prompt
            return [""]

    def extract(self, content: str) -> Union[str, Tuple[str, str]]:
        if self.tool_format == "default":
            return default_tool_extractor(content)
        else:
            raise NotImplementedError
Model specs
MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32009, 2560, padding_idx=0)
    (layers): ModuleList(
      (0-33): 34 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear(in_features=2560, out_features=2560, bias=False)
          (k_proj): Linear(in_features=2560, out_features=640, bias=False)
          (v_proj): Linear(in_features=2560, out_features=640, bias=False)
          (o_proj): Linear(in_features=2560, out_features=2560, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=2560, out_features=6912, bias=False)
          (up_proj): Linear(in_features=2560, out_features=6912, bias=False)
          (down_proj): Linear(in_features=6912, out_features=2560, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
    (norm): MistralRMSNorm()
  )
  (lm_head): Linear(in_features=2560, out_features=32009, bias=False)
)
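The "2.5B" in the model name can be checked against the shapes printed above. This is a back-of-the-envelope count that assumes each RMSNorm holds one weight per hidden unit, the rotary embedding has no trainable parameters, and lm_head is not tied to embed_tokens.

```python
VOCAB, HIDDEN, KV_DIM, FFN, LAYERS = 32009, 2560, 640, 6912, 34

# Attention: q_proj and o_proj are HIDDEN x HIDDEN, k_proj and v_proj are HIDDEN x KV_DIM.
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM
# MLP: gate_proj, up_proj and down_proj each hold HIDDEN x FFN weights.
mlp = 3 * HIDDEN * FFN
# Two RMSNorm layers per block, one weight per hidden unit (assumption).
norms = 2 * HIDDEN

per_layer = attention + mlp + norms
embeddings = VOCAB * HIDDEN
lm_head = VOCAB * HIDDEN  # assumption: not tied to embed_tokens
final_norm = HIDDEN

total_params = LAYERS * per_layer + embeddings + lm_head + final_norm
print(per_layer)     # 69473280
print(total_params)  # 2525980160
```

That lands at roughly 2.53 billion parameters, of which the two duplicated layers account for about 139 million.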