https://www.fastcompany.com/91515360/ar ... -benchmark
Chollet believes OpenAI spent “tens of millions” on compute in 2025 to train models specifically for ARC-AGI-2, using publicly available ARC puzzle samples to generate additional training data. “What this amounts to is preemptive brute forcing … by trying to guess in advance every possible task,” he says.
At any rate, the tactics worked: top scores rose to 40–50% by December 2025, Knoop says.
“I expect the same will happen with ARC-3, but with ARC-3 it’s going to be harder,” Chollet says. “It’ll be more expensive.”
I predict the old pattern will repeat: LLMs will start scoring high on ARC-AGI-3, True Believers will point to it as proof that AGI is a few seconds from being created, Constant Cynics will scoff and point out all the examples of those LLMs lacking intelligence, Chollet will say the tech companies figured out how to game the test, and he will release ARC-AGI-4.
In a bad timeline, this pattern and the associated arguing and hand-waving go on for the next 25 years.