It feels to me like benchmarks only represent a tiny portion of what we actually use and want LLMs for, and I doubt I'm alone in that sentiment.
Beyond this, the actual evaluations of responses from models are extremely strict and often use even rudimentary NLP techniques when, at this point, we have LLMs themselves that are more than capable at evaluating and scoring responses.
It feels like we've made great strides in the quality of LLMs themselves, but almost no change in the quality of how we benchmark.
If you have any ideas for how benchmarks could be a better assessment of an LLM, or know of good research papers that tackle this challenge, please share!