AI capabilities may be overstated by flawed benchmarks, study finds

You know all those reports about artificial intelligence models successfully passing the bar exam or achieving a Ph.D. level of intelligence? It looks like we should start taking those with a grain of salt. A new study from researchers at the Oxford Internet Institute suggests that many of the popular measurement tools used to assess AI performance are unreliable and misleading.
The investigators looked at 445 different benchmark tests used by industry and academic outfits to evaluate everything from reasoning skills to coding ability. Experts reviewed each measurement method and found indications that the results these tests produce may be inaccurate.
The main problem, the researchers found, is that many benchmarks are not valid measurements of their intended targets. In other words, a benchmark that claims to measure a specific skill may not actually capture that skill, so a high score doesn't necessarily reflect a model's real capabilities.
For example, the investigators point to the Grade School Math 8K (GSM8K) benchmark, which measures a model's performance on grade-school-level math word problems designed to test multi-step mathematical reasoning. GSM8K is advertised as being useful for assessing the informal reasoning ability of large language models.
But the researchers say the test doesn't tell you whether the model is actually reasoning. As the study's lead author told NBC News, if you ask a first grader what two plus five is and they say seven, that is the correct answer, but it doesn't by itself mean they are doing the math.
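To make that point concrete, here is a minimal sketch of how a GSM8K-style item is typically scored: the grader only checks the final number in the response against the reference answer, which is exactly why a correct score doesn't prove the model reasoned its way there. The sample item and the extraction helper below are illustrative assumptions, not material from the benchmark or the study.

```python
import re

# One GSM8K-style item (illustrative, not drawn from the real dataset).
item = {
    "question": "Lena has 3 boxes with 8 apples each. She gives away 5 apples. "
                "How many apples does she have left?",
    "answer": "3 * 8 = 24 apples. 24 - 5 = 19. #### 19",
}

def extract_final_number(text: str):
    """Pull the last number out of a model response or a '#### <n>' reference answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

# Scoring of this kind only compares final numbers, so a correct answer alone
# can't distinguish genuine reasoning from memorization or a lucky guess.
model_response = "She starts with 24 apples and ends with 19."
is_correct = extract_final_number(model_response) == extract_final_number(item["answer"])
print(is_correct)  # True
```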
In the study, the researchers showed that GSM8K scores have increased over time, which may indicate that models are getting better at this type of reasoning. But it can also point to contamination, which happens when benchmark questions make their way into a model's training data and the model effectively "memorizes" the answers instead of reasoning them out. When the researchers tested the same models on a new set of benchmark questions, they saw notable drops in performance.
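As a toy illustration of that held-out-questions check, the sketch below compares a fake "model" that has memorized the original questions against a fresh set it has never seen. The questions, the stand-in model, and the resulting numbers are all made up for illustration and are not from the study.

```python
import random

def accuracy(model, items):
    """Fraction of items where the model's answer matches the reference answer."""
    correct = sum(model(question) == answer for question, answer in items)
    return correct / len(items)

# Hypothetical data: items the model may have seen during training versus
# newly written questions of similar difficulty that it cannot have memorized.
original_items = [("What is 2 + 5?", "7"), ("What is 9 - 4?", "5")]
fresh_items = [("What is 3 + 6?", "9"), ("What is 8 - 2?", "6")]

def toy_model(question):
    # Stand-in for a real model call: answers the original questions from
    # "memory" and guesses on anything unfamiliar.
    memorized = {"What is 2 + 5?": "7", "What is 9 - 4?": "5"}
    return memorized.get(question, str(random.randint(0, 9)))

gap = accuracy(toy_model, original_items) - accuracy(toy_model, fresh_items)
print(f"performance drop on fresh questions: {gap:.0%}")
# A large, consistent drop is one sign of contamination rather than genuine reasoning.
```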
While this study is among the largest reviews of AI measurement tools, it isn't the first to suggest that these benchmarks may not be all they're sold as. Last year, researchers at Stanford analyzed several popular AI model benchmark tests and found a large difference in quality between them, noting that many benchmarks fall short in how they are implemented.
If nothing else, the study is a good reminder that these benchmarks, while often well-intentioned and designed to provide an accurate analysis of a model's capabilities, can end up amounting to little more than corporate marketing.


