💨 Introducing Notus: a DPO fine-tune of Zephyr with a focus on high-quality data
TL;DR
Notus 7B v1 is a DPO fine-tuned version of the Zephyr 7B Beta SFT checkpoint, trained on UltraFeedback. Instead of binarizing the data with the critique score, we use the average of the per-attribute ratings, so the chosen response is the one with the highest average rating. After DPO fine-tuning for intent alignment, we surpass Zephyr 7B Beta on both AlpacaEval and LM Eval Harness, while staying almost on par on MT-Bench. All the training code and configuration has been adapted / ported from huggingface/alignment-handbook and is available at argilla-io/notus.
Introduction
Zephyr 7B Beta was released some weeks ago with the aim of producing a smaller LLM aligned to user intent, rather than focusing merely on benchmarks. Zephyr's approach consisted of applying distilled Supervised Fine-Tuning (dSFT) with data generated by larger models, and then applying distilled Direct Preference Optimization (dDPO) with preference data from AI Feedback (AIF) datasets like UltraFeedback.
After some experimentation, they realised that applying DPO right after SFT improved intent alignment with only a few extra hours of training time. Indeed, their fine-tuned model, Zephyr 7B Beta, surpassed other 7B chat models and even Llama 2 Chat 70B, so it achieved better benchmark results than a model 10 times its size!
Along with the technical report, they also open-sourced their code and recipes, built on top of 🤗 trl, to apply DPO using UltraFeedback as their preference AIF dataset.
As shown in the screenshot below from Zephyr's technical report, our aim with Notus is to iterate on both the response generation and AI ranking stages while keeping the dSFT stage as is, and to apply dDPO on top of the previously dSFT fine-tuned version of Zephyr, so that the main focus is on understanding and exploring the AIF data and experimenting around that idea.
Data
As an attempt to reproduce / iterate over Zephyr, we decided to use the same data source as they did, openbmb/UltraFeedback, while putting the emphasis on high-quality data and the data curation process. The dataset contains responses to a given prompt generated by different models and evaluated with AI Feedback (AIF) using GPT-4, so each response has a score for each preference area (instruction-following, truthfulness, honesty and helpfulness) and a rationale justifying it, plus an additional critique score that works as an overall score.
Zephyr used the overall score from the critique task to pick the chosen response. After browsing some examples sorted by the highest rating for the chosen responses, we noticed a strong mismatch between the overall critique score and the quality of the chosen response. We then also looked at the rationale behind that score, and saw cases where the critique rationale was highly negative, so one would expect a low score, yet the overall score was very high (the highest being 10).
Based on an initial analysis, using the average of the preference ratings instead of the overall score generated by the critique model leads to a different chosen response in around 30K examples out of around 63K. Additionally, the chosen responses with higher overall scores seem to correlate with less powerful models.
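The comparison described above can be sketched as follows. This is a minimal illustration, not the code we ran: the record layout (a list of responses per prompt, each with a `ratings` dict and an `overall_score`) and the function names are assumptions made for the example, and the real UltraFeedback schema differs.

```python
from statistics import mean


def best_by_average(responses):
    """Index of the response with the highest mean preference rating."""
    return max(range(len(responses)), key=lambda i: mean(responses[i]["ratings"].values()))


def best_by_overall(responses):
    """Index of the response with the highest overall critique score."""
    return max(range(len(responses)), key=lambda i: responses[i]["overall_score"])


def count_disagreements(dataset):
    """Number of prompts where the two criteria pick a different chosen response.

    On ties, max() keeps the first index, which is a simplification.
    """
    return sum(best_by_average(r) != best_by_overall(r) for r in dataset)


# Toy data (hypothetical): the first prompt disagrees, the second agrees.
toy_dataset = [
    [
        {"ratings": {"helpfulness": 5, "honesty": 5}, "overall_score": 6},
        {"ratings": {"helpfulness": 2, "honesty": 3}, "overall_score": 9},
    ],
    [
        {"ratings": {"helpfulness": 4, "honesty": 4}, "overall_score": 8},
        {"ratings": {"helpfulness": 1, "honesty": 1}, "overall_score": 2},
    ],
]
print(count_disagreements(toy_dataset))
```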
So, to generate our curated version of the dataset, we computed the average of all the preference ratings from UltraFeedback (instruction-following, truthfulness, honesty and helpfulness) and picked the response with the highest average as the chosen response for the DPO-formatted dataset. The rejected response is picked at random from the remaining responses, while avoiding ties, meaning that in the curated dataset the chosen response always has a higher score than the rejected one.
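A minimal sketch of this binarization step, under stated assumptions: the input layout (`text` plus a `ratings` dict keyed by preference area) and the helper name `binarize_by_average` are hypothetical, and prompts where every response ties with the best one are simply skipped here.

```python
import random
from statistics import mean


def binarize_by_average(responses, rng=None):
    """Pick chosen/rejected responses for one prompt from per-area ratings.

    `responses` is assumed to be a list of dicts with a "text" field and a
    "ratings" dict (one numeric rating per preference area).
    """
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch reproducible
    scored = [(mean(r["ratings"].values()), r["text"]) for r in responses]
    best_score, chosen = max(scored, key=lambda pair: pair[0])
    # Avoid ties: only strictly lower-scored responses can be rejected.
    candidates = [text for score, text in scored if score < best_score]
    if not candidates:
        return None  # every response ties with the best one
    return {"chosen": chosen, "rejected": rng.choice(candidates)}


example = [
    {"text": "a", "ratings": {"instruction_following": 5, "truthfulness": 5, "honesty": 5, "helpfulness": 5}},
    {"text": "b", "ratings": {"instruction_following": 3, "truthfulness": 4, "honesty": 3, "helpfulness": 4}},
    {"text": "c", "ratings": {"instruction_following": 1, "truthfulness": 2, "honesty": 1, "helpfulness": 2}},
]
pair = binarize_by_average(example)
print(pair)
```

Picking the rejected response at random, rather than always taking the lowest-rated one, follows the description above; the fixed seed is only a convenience for the example.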
More information at argilla/ultrafeedback-binarized-preferences.
Prompt formatting
We use the same prompt format as Zephyr, since we started from the SFT fine-tuned variant, which was already fine-tuned using the following format:
<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
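As a rough illustration, the format above can be rendered by hand as in the sketch below. `build_prompt` is a hypothetical helper, and placing an optional system message between `<|system|>` and `</s>` is our reading of the template; in practice the tokenizer's chat template does this rendering.

```python
def build_prompt(prompt, system=""):
    """Render a single-turn prompt following the five template lines above."""
    lines = [
        "<|system|>",
        f"{system}</s>",   # empty system message by default
        "<|user|>",
        f"{prompt}</s>",
        "<|assistant|>",   # the model's reply is generated after this tag
    ]
    return "\n".join(lines)


print(build_prompt("What is DPO?"))
```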
DPO fine-tuning
The DPO stage was pretty straightforward, since we reused the same code and recipe as Zephyr, with some slight improvements in order to:
- Match the paper details and/or contrast them with the authors, as warmup_ratio was missing within the config files and the optimizer stated in the paper differed from the one provided within the code.
- Add some extra configuration for experiment tracking with wandb, checkpoint logging, pushing to the Hugging Face Hub, etc.
- Write a custom data loading and preprocessing function, mimicking the one provided but with slight improvements suited for our data.
- Make it work in a VM with 8 x A100 40GB, as the default configuration has only been tested with 8 x A100 80GB, so some small fixes were needed to make it work with the smaller-memory GPUs.
After fine-tuning alignment-handbook/zephyr-7b-sft-full for 3 epochs, we obtained the following metrics during training:
Find all the code and resources at argilla-io/notus, and if you want to know more about the DPO fine-tuning stage, check either the Zephyr: Direct Distillation of LM Alignment or the Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2 papers, as those provide an awesome introduction to and overview of DPO.
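For readers who want the gist before opening those papers, the DPO objective for a single preference pair can be sketched in plain Python as below. This is an illustration of the standard DPO loss, not the trl implementation we used for training; the function name, the argument names, and the default `beta=0.1` are assumptions made for the example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities of each response under the policy
    being trained and under the frozen reference (SFT) model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference, the loss is log(2).
print(dpo_loss(-3.0, -3.0, -3.0, -3.0))
# The loss drops when the policy favours the chosen response more than the reference does.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

The loss only depends on how much more the policy prefers the chosen response over the rejected one relative to the reference model, which is why the quality of the chosen/rejected pairs matters so much.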
Results
Notus stays almost on par with Zephyr on MT-Bench, while surpassing Zephyr, Claude 2, and Cohere Command on AlpacaEval, making Notus one of the most competitive 7B models available for commercial use on AlpacaEval.
The following table shows the results for both the MT-Bench and AlpacaEval benchmarks, and has been adapted from the original Zephyr-7b-beta and Starling tables. The results are sorted by AlpacaEval win rate, and we omit some models larger than 7B for brevity.
Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) | License |
---|---|---|---|---|---|
GPT-4-turbo | - | ? | 9.32 | 97.70 | Proprietary |
XwinLM 70b V0.1 | 70B | dPPO | - | 95.57 | LLaMA 2 License |
Tulu 2+DPO 70B V0.1 | 70B | dDPO | 6.29 | 95.28 | Proprietary |
GPT-4 | - | RLHF | 8.99 | 95.03 | Proprietary |
LLaMA2 Chat 70B | 70B | RLHF | 6.86 | 92.66 | LLaMA 2 License |
Starling-7B | 7B | C-RLFT + APA | 8.09 | 91.99 | CC-BY-NC-4.0 |
Notus-7b-v1 | 7B | dDPO | 7.30 | 91.42 | MIT |
Claude 2 | - | RLHF | 8.06 | 91.36 | Proprietary |
Cohere Command | - | RLHF | - | 90.62 | Proprietary |
Zephyr-7b-β | 7B | dDPO | 7.34 | 90.60 | MIT |
GPT-3.5-turbo | - | RLHF | 7.94 | 89.37 | Proprietary |
Then, w.r.t. academic benchmarks, we evaluated our model with EleutherAI/lm-eval-harness via the Open LLM Leaderboard from Hugging Face H4, and got the following results:
Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K | DROP |
---|---|---|---|---|---|---|---|---|
Zephyr 7B dDPO (HuggingFaceH4/zephyr-7b-beta) | 52.15 | 62.03 | 84.36 | 61.07 | 57.45 | 77.74 | 12.74 | 9.66 |
argilla/notus-7b-v1 | 52.89 | 64.59 | 84.78 | 63.03 | 54.37 | 79.40 | 15.16 | 8.91 |
⚠️ A data contamination issue has been reported recently by Mistral AI, which led other researchers to explore the contamination within other datasets. Since UltraFeedback (the dataset this model has been fine-tuned on) contains prompts from TruthfulQA, the TruthfulQA results may be affected, so the score achieved may not be realistic. See https://twitter.com/natolambert/status/1730364108078469513.
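As a quick sanity check, the Average column in the table above is the arithmetic mean of the seven benchmark scores. The snippet below recomputes it from the per-benchmark values in the table; the dictionary layout is just for the example.

```python
from statistics import mean

# Per-benchmark scores copied from the table above, in column order:
# ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K, DROP
scores = {
    "HuggingFaceH4/zephyr-7b-beta": [62.03, 84.36, 61.07, 57.45, 77.74, 12.74, 9.66],
    "argilla/notus-7b-v1": [64.59, 84.78, 63.03, 54.37, 79.40, 15.16, 8.91],
}

averages = {model: round(mean(values), 2) for model, values in scores.items()}
for model, average in averages.items():
    print(f"{model}: {average:.2f}")  # matches the reported 52.15 and 52.89
```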
Usage
Install the required dependencies with:
pip install "transformers>=4.34.0" accelerate --quiet
And then run the following code:
import torch
from transformers import pipeline

# Load Notus in bfloat16 and let accelerate place it on the available devices
pipe = pipeline("text-generation", model="argilla/notus-7b-v1", torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant super biased towards Argilla, a data annotation company.",
    },
    {"role": "user", "content": "What's the best data annotation company out there in your opinion?"},
]

# Render the messages with the tokenizer's chat template (the Zephyr prompt format)
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
generated_text = outputs[0]["generated_text"]
print(generated_text)
What's next?
At Argilla, we want to keep the focus on data, always following a data-first approach. We are currently working on an AI Feedback (AIF) framework in Python to help us collect feedback from LLMs and generate synthetic labelled datasets, similar to UltraFeedback. We strive for high-quality data, and that's what we'll be focusing on during the next iterations of Notus, aiming to collect better data and experiment with it, not just to fine-tune better in-house LLMs, but also to open-source them and contribute to the community.
Acknowledgments
This work would not have been possible without the openbmb/UltraFeedback dataset, nor without the awesome huggingface/alignment-handbook from the Hugging Face H4 team and their internal support, with a special mention to Lewis Tunstall and Edward Beeching. It would not have been possible either without the awesome open-source work developed by Hugging Face with the huggingface/trl library.