Is 90 percent accuracy good enough for a search robot?

Google doesn't much like this test. Google spokesperson Ned Adriance told the Times that the company believes SimpleQA contains incorrect information. Google's own model evaluations often rely on a similar test called SimpleQA Verified, which uses a smaller set of questions that have been more thoroughly vetted. "This study has serious holes," Adriance said. "It doesn't reflect what people are actually searching on Google."

Benchmark problems

Evaluating new AI models sometimes feels more like art than science, which is part of the problem. Every company has its own preferred way of demonstrating what a model can do, and the non-deterministic nature of gen AI can make it hard to verify anything. These robots can get a factual question right and then completely miss it if you rerun the query immediately. Oumi even uses AI tools to run its assessments, and those models can hallucinate, too.

The other wrinkle is that AI Overviews isn't a single monolithic model. Google told Ars Technica that it uses the "right model" for each query. While AI Overviews would get the best answers from always running Gemini 3.1 Pro, that's slow and expensive. To load things promptly on a search page, the overview uses faster Gemini Flash models when possible (which appears to be most of the time).

Google's response to this report is telling. In the realm of AI factuality, 9 out of 10 isn't even that bad. Google has recently published benchmarks for new model releases featuring factuality scores in the range of 60 to 80 percent, though those tests are run without tools like web search. Grounding an AI with more data, like the wealth of human knowledge on the Internet, does make it more accurate than the naked model itself. However, the truth is in the blue links somewhere, and AI Overviews encourages people to accept its sometimes inaccurate summaries instead of checking those sources manually.

While Google says the Times' results don't match what people see, you have to wonder how the company could even know that. You've probably seen mistakes in AI Overviews. We all have, because that's just how generative AI works. As Google itself reminds you at the bottom of every overview: "AI can make mistakes, so double-check responses."