Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Community Article Published June 24, 2024

Can LLMs serve as reliable judges ⚖️?

We aim to identify the right metrics for evaluating judge LLMs and to understand how sensitive they are to prompt guidelines, engineering, and specificity. With this paper, we want to caution ⚠️ against blindly using LLMs as a proxy for human judgment.

Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes

Arxiv link - https://arxiv.org/abs/2406.12624

Tweet Summary - https://x.com/iamsingh96aman/status/1804148173008703509

Key findings -

🌟 Top Performers: Only GPT-4 and Llama-3 70B shine among 9 judge models. However, they still fall short of inter-human annotator agreement.

📊 Evaluation Metric: Judges whose percent alignment with humans exceeds 80% can still assign scores that are 20 points apart! Cohen's kappa is a superior metric.
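
To see why percent alignment can flatter a judge, here is a minimal sketch in Python. The labels are made-up toy data (not from the paper), and the helper names `percent_agreement` and `cohens_kappa` are our own. A judge that marks every answer as correct reaches 80% agreement with the human labels here, yet its Cohen's kappa is 0 because that agreement is exactly what chance would give.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which the two annotators give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of each annotator's label frequencies.
    p_e = sum(counts_a[label] * counts_b[label]
              for label in set(a) | set(b)) / (n * n)
    if p_e == 1:
        return 1.0  # degenerate case: both annotators always use one label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: 1 = "correct", 0 = "incorrect".
human = [1] * 80 + [0] * 20   # human annotator
judge = [1] * 100             # a judge that accepts everything

print(percent_agreement(human, judge))  # 0.8
print(cohens_kappa(human, judge))       # 0.0
```

In practice a library routine such as scikit-learn's `cohen_kappa_score` would do the same job; the hand-written version is only meant to make the chance correction visible.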

⚖️ Ranking vs scoring: The judge most aligned in scores is not necessarily the most discriminative. In some cases, judges with low alignment, such as Contains (lexical match) and JudgeLM-7B, outperform better-aligned judges at ranking models, because their biases are more systematic.
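
A small sketch of how a systematically biased judge can still rank well, again in Python with invented numbers. We use Spearman's rank correlation as one common way to compare rankings; this post does not say which measure the authors used, so treat that choice, the scores, and the helper names as assumptions. Judge A inflates every score by roughly 15 points but preserves the order, while judge B is closer on average yet swaps two models.

```python
def ranks(scores):
    """Rank positions (1 = highest score); assumes there are no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def mean_abs_error(x, y):
    """Average absolute gap between two lists of scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical accuracy scores (in points) for five exam-taker models.
human_scores = [70, 62, 55, 48, 30]
judge_a = [86, 77, 71, 62, 47]  # inflated by about 15 points, order preserved
judge_b = [68, 57, 59, 50, 33]  # close in value, but models 2 and 3 swapped

for name, scores in [("judge A", judge_a), ("judge B", judge_b)]:
    print(name,
          "mean abs error:", mean_abs_error(human_scores, scores),
          "spearman:", spearman_rho(human_scores, scores))
```

Judge A has the larger score error but a perfect rank correlation, which is the pattern described above: a consistent bias shifts scores without disturbing the ordering.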

🧩 Leniency: Judge LLMs tend to be more lenient than strict.
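
One simple way to check for leniency is to split a judge's disagreements with humans into false positives (the judge accepts an answer humans rejected) and false negatives (the judge rejects an answer humans accepted). The sketch below uses invented labels and a hypothetical helper, `error_breakdown`; it is not the paper's procedure, only an illustration of the idea.

```python
def error_breakdown(human, judge):
    """Count (false positives, false negatives) of the judge against humans."""
    false_pos = sum(h == 0 and j == 1 for h, j in zip(human, judge))
    false_neg = sum(h == 1 and j == 0 for h, j in zip(human, judge))
    return false_pos, false_neg


# Hypothetical toy data: 1 = "correct", 0 = "incorrect".
human = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
judge = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0]

false_pos, false_neg = error_breakdown(human, judge)
print("false positives:", false_pos)  # 3
print("false negatives:", false_neg)  # 1
```

A judge whose false positives clearly outnumber its false negatives errs on the lenient side.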

🎭 Vulnerability: Judge LLMs can be easily tricked by controlled responses like "Yes," "Sure," and "I don't know."

🎯 Controllability: It's not easy to steer large models, while smaller models get confused when too much detail is added.
