Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Can LLMs serve as reliable judges ⚖️?
We aim to identify the right metrics for evaluating Judge LLMs and to understand their sensitivity to prompt guidelines, engineering, and specificity. With this paper, we want to raise caution ⚠️ against blindly using LLMs as a proxy for human judgment.
Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes
arXiv link - https://arxiv.org/abs/2406.12624
Tweet Summary - https://x.com/iamsingh96aman/status/1804148173008703509
Key findings -
🏆 Top Performers: Only GPT-4 and Llama-3 70B shine among the 9 judge models. However, they still fall short of inter-human annotator agreement.
📊 Evaluation Metric: Scores assigned by judges with 80%+ alignment with humans can still be 20 points apart! Cohen's kappa is a superior metric (see the first sketch after this list).
⚖️ Ranking vs. Scoring: The judge most aligned on scores is not necessarily the most discriminative. In some cases, judges with low alignment, such as Contains (lexical match) and JudgeLM-7B, outperform stronger judges at ranking models, because their biases are more systematic (see the second sketch after this list).
🧩 Leniency: Judge LLMs tend to be lenient rather than strict.
🎭 Vulnerability: Judge LLMs can be easily tricked by controlled responses like "Yes," "Sure," and "I don't know."
🎯 Controllability: It's not easy to steer large models, while smaller models get confused when prompts add too much detail.
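
A minimal sketch of the metric point above (hypothetical verdicts, not from the paper; assumes scikit-learn is installed): raw percent agreement can look high when a lenient judge meets a skewed label distribution, while the chance-corrected Cohen's kappa stays low.

# Hypothetical binary verdicts (1 = answer judged correct, 0 = incorrect) on 20 responses.
from sklearn.metrics import cohen_kappa_score

human = [1] * 16 + [0] * 4     # human annotators accept 80% of responses
judge = [1] * 19 + [0] * 1     # a lenient LLM judge accepts nearly everything

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa_score(human, judge)

print(f"percent agreement: {agreement:.2f}")  # 0.85 -- looks reassuring
print(f"Cohen's kappa:     {kappa:.2f}")      # ~0.35 -- much weaker once chance agreement is removed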
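
A second sketch (hypothetical scores, SciPy assumed) of the ranking-vs-scoring point: a judge with a large but systematic leniency bias can be 20+ points off in absolute scores and still rank models exactly as humans do.

# Hypothetical per-model pass rates (%) under human review vs. a uniformly lenient judge.
from scipy.stats import spearmanr

human_scores = [62.0, 55.0, 48.0, 30.0]
judge_scores = [83.0, 76.0, 70.0, 52.0]   # systematically about 21-22 points too generous

gap = max(abs(h - j) for h, j in zip(human_scores, judge_scores))
rho, _ = spearmanr(human_scores, judge_scores)

print(f"largest score gap: {gap:.0f} points")    # 22 points off in absolute terms
print(f"Spearman rank correlation: {rho:.2f}")   # 1.00 -- the model ranking is preserved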