Mergoo: Efficiently Build Your Own MoE LLM
Leeroo Team: Majid Yazdani, Alireza Mohammadshahi, Arshad Shaikh
We've recently developed mergoo, a library for easily merging multiple LLM experts and efficiently training the merged LLM. With mergoo, you can efficiently integrate the knowledge of different generic or domain-specific LLM experts.
In mergoo, you can:
- Easily merge multiple open-source LLMs
- Apply many merging methods: Mixture-of-Experts, Mixture-of-Adapters, and Layer-wise merging
- Efficiently train a MoE without starting from scratch
- Compatible with Hugging Face 🤗 Models and Trainers
Introduction
mergoo has been designed to build a reliable and transparent pipeline for merging the knowledge of various LLM experts, whether they are generic or domain-specific. It incorporates a range of integration techniques, such as mixture-of-experts, mixture-of-adapters, and layer-wise merging, offering flexibility to LLM builders. Once the merged LLM is built, you can further fine-tune it for your specific use cases with Hugging Face 🤗 trainers such as SFTTrainer, PEFT, Trainer, and more.
In the following sections, we walk through two examples: building a merged LLM from fully fine-tuned LLMs using MoE, and creating a mixture-of-adapters LLM from LoRA fine-tuned experts.
Mixture of Fully Fine-tuned LLMs
Following Branch-Train-MiX, domain-specific LLM experts can be integrated by bringing their feed-forward parameters together as experts in Mixture-of-Experts (MoE) layers and averaging the remaining parameters. The MoE layers can later be fine-tuned to learn token-level routing.
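The idea is easiest to see on a toy state dict: the feed-forward weights of every expert are kept side by side as MoE experts, while all remaining weights are averaged. Below is a minimal, illustrative sketch of that merge rule, not mergoo's internal implementation (btx_merge and the toy experts are hypothetical):
import torch

def btx_merge(experts, ffn_keys=("gate_proj", "up_proj", "down_proj")):
    merged = {}
    for key in experts[0]:
        if any(k in key for k in ffn_keys):
            # feed-forward weights are kept side by side as MoE experts
            merged[key] = [e[key].clone() for e in experts]
        else:
            # everything else (attention, embeddings, norms) is averaged
            merged[key] = torch.stack([e[key] for e in experts]).mean(dim=0)
    return merged

# two toy "experts" with identical architecture
expert_a = {"attn.q_proj": torch.randn(4, 4), "mlp.up_proj": torch.randn(8, 4)}
expert_b = {"attn.q_proj": torch.randn(4, 4), "mlp.up_proj": torch.randn(8, 4)}
merged = btx_merge([expert_a, expert_b])
print(len(merged["mlp.up_proj"]), merged["attn.q_proj"].shape)  # 2 FFN experts, averaged attention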
As a sample, we integrate the following domain-specific LLM experts:
- Base Model: meta-llama/Llama-2-7b-hf
- Code Expert: codellama/CodeLlama-7b-Python-hf
- WikiChat Expert: stanford-oval/Llama-2-7b-WikiChat-fused
Specify the config for merging:
config = {
    "model_type": "llama",
    "num_experts_per_tok": 2,  # number of experts routed per token
    "experts": [
        {
            "expert_name": "base_expert",
            "model_id": "meta-llama/Llama-2-7b-hf"
        },
        {
            "expert_name": "expert_1",
            "model_id": "codellama/CodeLlama-7b-Python-hf"
        },
        {
            "expert_name": "expert_2",
            "model_id": "stanford-oval/Llama-2-7b-WikiChat-fused"
        }
    ],
    # feed-forward projections that become MoE (routed) layers
    "router_layers": [
        "gate_proj",
        "up_proj",
        "down_proj"
    ],
}
Then, build the checkpoint of the merged expert, and save it:
import torch
from mergoo.compose_experts import ComposeExperts
model_id = "mergoo_llama_code_wikichat"
expertmerger = ComposeExperts(config, torch_dtype=torch.float16)
expertmerger.compose()
expertmerger.save_checkpoint(model_id)
Next, the checkpoint of the merged LLM is loaded and further fine-tuned on the Python Code Instructions dataset:
from mergoo.models.modeling_llama import LlamaForCausalLM
import torch
import datasets
import random
from trl import SFTTrainer
from transformers import TrainingArguments
# load the composed checkpoint
model = LlamaForCausalLM.from_pretrained(
    "mergoo_llama_code_wikichat",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)  # 'gate' / router layers are untrained, so a loading warning will appear for them
# load the train dataset
dataset = datasets.load_dataset("iamtarun/python_code_instructions_18k_alpaca")['train']
dataset = dataset['prompt']
random.shuffle(dataset)
train_dataset = datasets.Dataset.from_dict(dict(prompt=dataset[:-1000]))
eval_dataset = datasets.Dataset.from_dict(dict(prompt=dataset[-1000:]))
# specify training arguments
trainer_args = TrainingArguments(
    output_dir="checkpoints/llama_moe",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=1e-5,
    save_total_limit=1,
    num_train_epochs=1,
    eval_steps=5000,
    logging_strategy="steps",
    logging_steps=25,
    gradient_accumulation_steps=4,
    bf16=True
)
trainer = SFTTrainer(
    model,
    args=trainer_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="prompt",
)
# start training
trainer.train()
Then you can push the trained model to the Hugging Face Hub (please do!):
model.push_to_hub("mergoo_llama_code_wikichat_trained")
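As a quick sanity check, you can also generate with the fine-tuned model directly. A minimal sketch, assuming the tokenizer of the base meta-llama/Llama-2-7b-hf model (the snippet above does not load a tokenizer):
from transformers import AutoTokenizer

# assumption: all merged experts share the tokenizer of the Llama-2 base model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
prompt = "Write a Python function that returns the n-th Fibonacci number."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))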
mergoo also supports mistral-, bert-, and phi3-based experts.
Mixture of Adapters
mergoo facilitates the merging of multiple adapters (LoRA) into a unified MoE-style architecture. This is achieved by applying gating and routing layers on top of fine-tuned LoRAs.
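Conceptually, each linear layer keeps the shared base weight, while a small router decides which LoRA adapters contribute for each token. The following is an illustrative sketch of that routing pattern, not mergoo's internal implementation (the MoELoRALinear module and its attribute names are hypothetical):
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, n_adapters=3, r=8, top_k=2):
        super().__init__()
        self.base = base          # shared base projection
        self.top_k = top_k
        self.router = nn.Linear(base.in_features, n_adapters, bias=False)
        self.lora_A = nn.ModuleList(nn.Linear(base.in_features, r, bias=False) for _ in range(n_adapters))
        self.lora_B = nn.ModuleList(nn.Linear(r, base.out_features, bias=False) for _ in range(n_adapters))

    def forward(self, x):
        out = self.base(x)
        gate = self.router(x).softmax(dim=-1)         # token-level routing scores
        topk = gate.topk(self.top_k, dim=-1).indices  # top-k adapters per token
        for i, (A, B) in enumerate(zip(self.lora_A, self.lora_B)):
            picked = (topk == i).any(dim=-1, keepdim=True).float()  # tokens where adapter i is selected
            out = out + picked * gate[..., i:i + 1] * B(A(x))
        return out

layer = MoELoRALinear(nn.Linear(16, 16))
print(layer(torch.randn(2, 5, 16)).shape)  # torch.Size([2, 5, 16])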
To build a mixture-of-adapters LLM:
- Collect a pool of fine-tuned adapters (LoRA) with the same base model
- Apply mergoo to create a MoE-style merged expert
- Fine-tune the merged expert on your downstream task
For instance, the following Predibase LoRA adapters, all fine-tuned on mistralai/Mistral-7B-v0.1, can be merged for the customer support domain: predibase/customer_support, predibase/customer_support_accounts, and predibase/customer_support_orders. Specify the config and build the merged checkpoint:
import torch
from mergoo.compose_experts import ComposeExperts

model_id = "mergoo_customer_support_moe_lora"
config = {
    "model_type": "mistral",
    "num_experts_per_tok": 2,
    "base_model": "mistralai/Mistral-7B-v0.1",  # shared base model of all LoRA experts
    "experts": [
        {
            "expert_name": "adapter_1",
            "model_id": "predibase/customer_support"
        },
        {
            "expert_name": "adapter_2",
            "model_id": "predibase/customer_support_accounts"
        },
        {
            "expert_name": "adapter_3",
            "model_id": "predibase/customer_support_orders"
        }
    ],
}
expertmerger = ComposeExperts(config, torch_dtype=torch.bfloat16)
expertmerger.compose()
expertmerger.save_checkpoint(model_id)
Note: when the candidate experts for merging are LoRA fine-tuned, each expert_name should start with adapter_.
You can further fine-tune the merged expert as described in the Mixture of Fully Fine-tuned LLMs section above.
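For reference, here is a minimal loading sketch for that step. It assumes mergoo exposes a Mistral class analogous to the Llama one used above (mergoo.models.modeling_mistral) and that router parameters carry "gate" in their names; under those assumptions, you can also choose to train only the routers and keep the adapters frozen:
import torch
from mergoo.models.modeling_mistral import MistralForCausalLM  # assumed analogue of modeling_llama

# load the merged MoE-LoRA checkpoint built above
model = MistralForCausalLM.from_pretrained(
    "mergoo_customer_support_moe_lora",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# optional: freeze every parameter whose name does not contain "gate",
# so that only the router layers are trained
for name, param in model.named_parameters():
    if "gate" not in name:
        param.requires_grad_(False)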
Conclusion
mergoo can be used to reliably and transparently integrate the knowledge of multiple experts. It supports several integration techniques, including mixture-of-experts, mixture-of-adapters (MoE-LoRA), and layer-wise merging. The merged LLM can then be further fine-tuned on a downstream task to yield a reliable expert.
Learn More
For a deeper dive into mergoo, please check our repository.
At Leeroo, our vision is to democratize AI for individuals everywhere, and open-sourcing is the foundational step toward realizing this vision. Join Leeroo, where we can work together to make this vision a reality!
LinkedIn, Discord, X, Website.