Reward Model pretrained on openai/webgpt_comparisons

Reward model fine-tuned from an existing pretrained model.

Things that align with the original papers

  • Overfits easily using rank loss

  • Small learning rate
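
To make "rank loss" concrete, here is a minimal sketch. It assumes the loss is the pairwise log-sigmoid loss commonly used for reward models, which this card does not spell out; the function names `pairwise_rank_loss` and `pairwise_accuracy` are illustrative and not taken from the training code.

```python
import math

import torch
import torch.nn.functional as F


def pairwise_rank_loss(chosen_rewards, rejected_rewards):
    """Assumed form of the rank loss: -log(sigmoid(r_chosen - r_rejected)), averaged."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


def pairwise_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the preferred answer gets the higher reward."""
    return (chosen_rewards > rejected_rewards).float().mean().item()


# A model that scores both answers identically sits at the chance-level loss, ln 2.
zeros = torch.zeros(4)
baseline_loss = pairwise_rank_loss(zeros, zeros).item()

# Toy rewards (made up for illustration): 2 of 3 pairs are ranked correctly.
chosen = torch.tensor([1.0, 0.0, 2.0])
rejected = torch.tensor([0.0, 1.0, 1.0])
toy_accuracy = pairwise_accuracy(chosen, rejected)
```

Under this assumed loss, a model that cannot tell the two answers apart scores ln 2 ≈ 0.693, which is a useful baseline when reading the validation losses reported further down.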

Differences from the papers

  • Small models perform poorly, likely due to a lack of world knowledge: validation accuracy does not even reach 60%. The OpenAI RM had 6B parameters.

  • Trained using an 80-20 train-validation split with torch AMP (automatic mixed precision)
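
The split and mixed-precision setup can be sketched roughly as follows. This is not the actual training script: the toy tensors, the `nn.Linear` stand-in for the transformer, the learning rate value, and the `train_step` helper are all assumptions made so the example runs on its own, and it reuses the assumed pairwise log-sigmoid loss.

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, random_split

torch.manual_seed(0)

# Toy stand-in for the comparison data: 100 (chosen, rejected) feature pairs.
dataset = TensorDataset(torch.randn(100, 8), torch.randn(100, 8))

# 80-20 train-validation split.
n_train = int(0.8 * len(dataset))
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])

device = "cuda" if torch.cuda.is_available() else "cpu"
# float16 autocast targets CUDA; bfloat16 is used on CPU so the sketch still runs.
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

model = nn.Linear(8, 1).to(device)  # hypothetical stand-in for the reward model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR, value assumed
# Gradient scaling only matters for float16 on CUDA; it is a no-op when disabled.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))


def train_step(batch_chosen, batch_rejected):
    """One optimisation step with the forward pass under autocast."""
    optimizer.zero_grad()
    with torch.autocast(device_type=device, dtype=amp_dtype):
        r_chosen = model(batch_chosen.to(device)).squeeze(-1)
        r_rejected = model(batch_rejected.to(device)).squeeze(-1)
    # Compute the loss in float32 for numerical stability.
    diff = r_chosen.float() - r_rejected.float()
    loss = -nn.functional.logsigmoid(diff).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


batch_chosen, batch_rejected = dataset[:16]
step_loss = train_step(batch_chosen, batch_rejected)
```

In a real run the batches would come from a `DataLoader` over `train_set`, with `val_set` held out for the accuracy and loss figures.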

Other models I tried

  • bloomz-560m : its embedding size is not worth the training cost, since this dataset contains only English prompts

  • gpt2-large : not stable

  • gpt2-base : not stable

Performance on validation split

model           val acc (%)   val loss (rank loss)
roberta-base    56.21         0.71
roberta-large   57.89         0.67
electra-base    57.02         0.70
electra-large   58.75         0.69

Tensorboard logs are located under runs/

Note:

  • You will have to rescale this model's output so that the mean reward equals 0
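
One simple way to do this, shown as a sketch only, is to subtract a mean estimated on a reference set of rewards. Reading the note as mean-centering is an assumption, and `center_rewards` and its `reference_rewards` argument are hypothetical names, not part of this model's code.

```python
from statistics import fmean


def center_rewards(rewards, reference_rewards=None):
    """Shift rewards so that the reference set (default: the rewards themselves) has mean 0."""
    reference = rewards if reference_rewards is None else reference_rewards
    offset = fmean(reference)
    return [r - offset for r in rewards]


# Made-up raw scores for illustration.
raw = [1.0, 2.0, 3.0]
centered = center_rewards(raw)
```

Estimating the offset once on a held-out set and reusing it keeps rewards comparable across batches, whereas centering each batch separately would make a sample's reward depend on its neighbours.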