OpenAI’s latest o3 model has achieved a breakthrough that has surprised the artificial intelligence research community. o3 scored an unprecedented 75.7% on the notoriously difficult ARC-AGI benchmark under standard compute conditions, with a high-compute version reaching 87.5%.
While this achievement on ARC-AGI is impressive, it does not prove that the code for artificial general intelligence (AGI) has been cracked.
The Abstraction and Reasoning Corpus
The ARC-AGI benchmark is based on the Abstraction and Reasoning Corpus (ARC), which tests an AI system’s ability to adapt to novel tasks and demonstrate fluid intelligence. ARC consists of a set of visual puzzles that require understanding of basic concepts such as objects, boundaries, and spatial relationships. While humans can easily solve ARC puzzles given only a few demonstrations, current AI systems struggle with them, and ARC has long been considered one of the most challenging benchmarks in AI.
ARC is designed so that it cannot be gamed by training models on hundreds of thousands of examples in the hope of covering every possible combination of puzzles.
The benchmark consists of a public training set of 400 simple examples, complemented by a public evaluation set of 400 more challenging puzzles that assess how well AI systems generalize. The ARC-AGI Challenge also includes private and semi-private test sets of 100 puzzles each that are not released publicly; they are used to gauge candidate AI systems without the risk of the puzzles leaking and contaminating future systems through prior exposure. Additionally, the competition places limits on the amount of compute participants can use, to ensure that the puzzles are not solved by brute force.
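To make the task format concrete, here is a minimal, hypothetical sketch of how an ARC-style puzzle is structured: a few demonstration input/output grids plus a held-out test input, with a candidate rule judged by whether it reproduces every demonstration. The grids and the rule below are invented for illustration and are far simpler than real ARC puzzles.

```python
# Hypothetical sketch (not real ARC data): a task is a few demonstration
# input/output grids plus a held-out test input. Grids are small 2-D arrays
# of color indices 0-9; the solver must infer the transformation from the
# demonstrations alone.

Grid = list[list[int]]

task = {
    "train": [  # demonstration pairs
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 2], [0, 2]], "output": [[0, 0], [2, 0]]},
    ],
    "test": [{"input": [[3, 0], [3, 3]]}],
}

def candidate_rule(grid: Grid) -> Grid:
    """Guessed rule: toggle each cell between background (0) and the
    grid's single foreground color."""
    fg = next(c for row in grid for c in row if c != 0)
    return [[0 if c == fg else fg for c in row] for row in grid]

def fits_demos(rule, task) -> bool:
    """A candidate rule is accepted only if it reproduces every demo."""
    return all(rule(p["input"]) == p["output"] for p in task["train"])

print(fits_demos(candidate_rule, task))  # True: the rule explains both demos
```

Because the private test puzzles are unseen, a solver has to infer rules like this one at test time rather than memorize them during training.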
A breakthrough in solving novel tasks
o1-preview and o1 scored a maximum of 32% on ARC-AGI. Another method, developed by researcher Jeremy Berman, used a hybrid approach combining Claude 3.5 Sonnet with genetic algorithms and a code interpreter to achieve 53%, the highest score before o3.
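The article gives no implementation details of Berman’s system, but the genetic-algorithm idea can be sketched in toy form: maintain a population of candidate solutions, keep the fittest, and mutate them to produce the next generation. The bit-string target and fitness function below are invented stand-ins; the real system evolved LLM-generated programs rather than bit strings.

```python
# Toy genetic algorithm, for illustration only. TARGET and fitness() are
# invented stand-ins for "a program that solves the puzzle".
import random

TARGET = [1, 0, 1, 1, 0, 1, 0, 0]

def fitness(cand):
    # How many positions match the target solution.
    return sum(a == b for a, b in zip(cand, TARGET))

def mutate(cand, rate=0.2):
    # Flip each bit with probability `rate`.
    return [bit ^ (random.random() < rate) for bit in cand]

def evolve(pop_size=20, generations=50):
    random.seed(0)  # deterministic demo
    pop = [[random.randint(0, 1) for _ in TARGET] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # keep the fittest half
        pop = parents + [mutate(random.choice(parents)) for _ in parents]
    return max(pop, key=fitness)
```

In Berman’s published description the "mutation" step was performed by the LLM rewriting candidate solution code, with a code interpreter checking each candidate against the demonstration pairs.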
In a blog post, François Chollet, the creator of ARC, described o3’s performance as “a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models.”
It must be noted that throwing more compute at previous-generation models did not produce these results. For context, it took four years for models to progress from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. While we do not know much about o3’s architecture, we can be fairly confident that it is not orders of magnitude larger than its predecessors.
“This is not just an incremental improvement, but a real breakthrough, marking a qualitative change in AI capabilities compared to the previous limitations of LLMs,” Chollet wrote. “o3 is a system that can adapt to tasks it has never encountered before, possibly approaching human-level performance in the ARC-AGI domain.”
It’s worth noting that o3’s performance on ARC-AGI comes at a steep cost. In the low-compute configuration, the model spends $17 to $20 and 33 million tokens to solve each puzzle, while in the high-compute configuration it uses roughly 172 times more processing power and billions of tokens per problem. However, as inference costs continue to fall, these figures can be expected to become more reasonable.
A new paradigm in LLM reasoning?
The key to solving novel problems is what Chollet and other scientists call “program synthesis.” A thinking system should be able to develop small programs for solving very specific problems, then combine those programs to tackle more complex ones. Classic language models have absorbed a great deal of data and contain a rich set of internal programs, but they lack compositionality, which prevents them from solving puzzles beyond the scope of their training.
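As a minimal illustration of program synthesis in this sense, one can enumerate short compositions of primitive operations and keep the first composition that reproduces every demonstration pair. The primitives and task below are invented for illustration and are far simpler than anything ARC requires.

```python
# Toy program synthesis: search over compositions of invented grid primitives
# until one pipeline explains all the demonstrations.
from itertools import product

def flip_h(g):
    return tuple(tuple(reversed(row)) for row in g)

def flip_v(g):
    return tuple(reversed(g))

def transpose(g):
    return tuple(zip(*g))

PRIMITIVES = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}

def synthesize(demos, max_depth=3):
    """Return the first pipeline of primitives consistent with all demos."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def program(g, names=names):
                for n in names:
                    g = PRIMITIVES[n](g)
                return g
            if all(program(i) == o for i, o in demos):
                return names  # the synthesized "program", as a step list
    return None

demos = [(((1, 2), (3, 4)), ((2, 1), (4, 3)))]
print(synthesize(demos))  # -> ('flip_h',)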
Unfortunately, there is very little information about how o3 works under the hood, and scientists’ opinions differ. Chollet speculates that o3 uses a form of program synthesis that combines chain-of-thought (CoT) reasoning with a search mechanism and a reward model that evaluates and refines solutions as the model generates tokens. This is similar to what open-source reasoning models have been exploring in recent months.
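Chollet’s speculation can be sketched in toy form as best-of-n sampling: draw several candidate reasoning chains and keep the one a reward model scores highest. `generate_cot` and `reward` below are hypothetical stand-ins, not real OpenAI APIs.

```python
# Toy best-of-n search over chain-of-thought candidates, scored by a
# stand-in reward model. All functions here are invented for illustration.
import random

def generate_cot(prompt: str, seed: int) -> str:
    # Stand-in for sampling one reasoning chain from an LLM.
    random.seed(seed)
    return f"{prompt} :: chain-{random.randint(0, 999)}"

def reward(chain: str) -> float:
    # Stand-in for an evaluator/reward model scoring a chain.
    return len(chain) % 7  # arbitrary deterministic score for the demo

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n candidate chains and keep the highest-scoring one."""
    candidates = [generate_cot(prompt, s) for s in range(n)]
    return max(candidates, key=reward)
```

Real systems interleave generation and evaluation rather than scoring only finished chains, but the select-by-reward structure is the part Chollet is speculating about.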
Other scientists, such as Nathan Lambert of the Allen Institute for AI, suggest that “o1 and o3 may actually just be forward passes from a single language model.” On the day of the o3 announcement, Nat McAleese, a researcher at OpenAI, posted on X that o1 was “just an LLM trained with RL. o3 is powered by further scaling up RL beyond o1.”
On the same day, Denny Zhou of Google DeepMind’s reasoning team called the combination of search and current reinforcement-learning approaches a “dead end.”
“The most beautiful thing about LLM reasoning is that the thought process is generated in an autoregressive way, rather than relying on search (e.g., MCTS) over the generation space, whether through a well-finetuned model or a carefully designed prompt,” he posted on X.
While the details of how o3 reasons might seem trivial compared with the breakthrough on ARC-AGI, they could very well define the next paradigm shift in LLM training. There is an ongoing debate about whether the laws of scaling LLMs through training data and compute have hit a wall, and whether test-time scaling depends on better training data or on different inference architectures may determine the next path forward.
Not AGI
The name ARC-AGI is misleading, and it leads some people to equate solving it with achieving AGI. However, Chollet emphasizes that “ARC-AGI is not an acid test for AGI.”
“Passing ARC-AGI is not the same as achieving AGI, and honestly, I don’t think o3 is AGI yet,” he writes. “o3 still fails at some very easy tasks, indicating fundamental differences from human intelligence.”
Furthermore, he notes that o3 cannot learn these skills on its own; it relies on external verifiers during inference and on human-labeled reasoning chains during training.
Other scientists have pointed out caveats in OpenAI’s reported results. For example, the model was fine-tuned on the ARC training set to achieve its state-of-the-art scores. “The solver should not need much specific ‘training,’ either on the domain itself or on each specific task,” writes scientist Melanie Mitchell.
To test whether these models possess the kind of abstraction and reasoning the ARC benchmark was designed to measure, Mitchell suggests “testing whether these systems can adapt to variants of specific tasks or to reasoning tasks using the same concepts, but in domains other than ARC.”
Chollet and his team are currently working on a new benchmark that challenges o3, potentially reducing its score to below 30% even with a large compute budget. Meanwhile, humans would be able to solve 95% of its puzzles without any training.
“You’ll know AGI is here when it becomes simply impossible to create tasks that are easy for regular humans but difficult for artificial intelligence,” writes Chollet.