
Is 90 percent accuracy good enough for a search robot?
Google doesn't much like this test. Google spokesperson Ned Adriance tells the Times that Google believes SimpleQA contains incorrect information. Its model evaluations often rely on a similar test called SimpleQA Verified, which uses a smaller set of questions that have been more thoroughly vetted. "This study has serious holes," Adriance told the Times. "It doesn't reflect what people are actually searching on Google."
Benchmark problems
Evaluating new AI models sometimes feels more like art than science, which is part of the problem. Every company has its own preferred way of demonstrating what a model can do, and the non-deterministic nature of gen AI can make it hard to verify anything. These robots can get a factual question right and then completely miss it if you rerun the query immediately. Oumi even uses AI tools to run its assessments, and those models can hallucinate, too.
The other wrinkle is that AI Overviews isn't a single monolithic model. Google told Ars Technica that it uses the "right model" for each query. While AI Overviews would get the best answers from always running Gemini 3.1 Pro, that's slow and expensive. To load things promptly on a search page, the overview uses faster Gemini Flash models when possible (which appears to be most of the time).
Google's response to this report is telling. In the realm of AI factuality, 9 out of 10 isn't even that bad. Google has recently published benchmarks for new model releases featuring measurements of factuality in the range of 60 to 80 percent, though these tests are run without tools like web search. Grounding an AI with more data, like the wealth of human knowledge on the Internet, does make it more accurate than the naked model itself. However, the truth is in the blue links somewhere, and AI Overviews encourages people to accept its sometimes inaccurate summaries instead of checking those sources manually.
While Google says the Times' results don't match what people see, you have to wonder how the company could even know that. You've probably seen mistakes in AI Overviews. We all have, because that's just how generative AI works. As Google itself reminds you at the bottom of every overview:
"AI can make mistakes, so double-check responses."
Ryan Whitwam is a senior technology reporter at Ars Technica, covering the ways Google, AI, and mobile technology continue to change the world. Over his 20-year career, he's written for Android Police, ExtremeTech, Wirecutter, NY Times, and more. He has reviewed more phones than most people will ever own. Follow him on Bluesky,
Comment: As the author recommends, one could exert their brain, search for pages on the needed topic, then read and synthesize. Voila! Actual learning. AI is really just a souped-up search engine, as even Google acknowledges.