What is Reasoning in AI?
A Conversation with an AI Expert
We had a chance to sit down with Dr. Lucian Galescu, Openstream.ai's Director of Dialogue Research, to ask him about Reasoning in Artificial Intelligence, LLMs, and Hallucinations, and for his help in explaining these ideas to people who are not familiar with them.
Q: WHAT IS REASONING IN AI?
A: Reasoning is the process by which conclusions are methodically drawn from premises or evidence. If the premises are true and the reasoning steps (inferences) are sound, the conclusion is also guaranteed, or at least very likely, to be true. For example, from the premises “all birds have feathers” and “a robin is a bird,” sound reasoning guarantees the conclusion that robins have feathers.
Q: CAN LLMS REASON?
A: The short answer is no, they cannot. LLMs have no ability to ascertain whether anything they generate is factually true. Further, they are not specifically trained to follow any sort of rigorous process for drawing conclusions from premises.
While they may have seen correct facts and correct inferences in their training data, those are merely incidental — there is no guarantee that incorrect facts and incorrect inferences were not also present.
Most mainstream LLMs from OpenAI, Google, and others have been trained on the vastness of the internet, and the internet is not infallible.
Q: BUT IT SEEMS LIKE EVERY DAY THERE’S ANOTHER PAPER ON ARXIV ABOUT LLMS BEING ABLE TO REASON! ARE THEY ALL WRONG?
A: Yes and no. In the early days of LLMs, people believed they had “emergent” capabilities, that is, that they were capable of performing tasks they had not been specifically trained to do. (Few mention “emergence” anymore.) One such supposed capability was being able to perform reasoning tasks, such as solving simple math problems. In time, new and larger LLMs proved to be better at such tasks. Yet it became clear that, despite the hype, they were just not very good at them, and the mistakes they made were often quite basic, which demonstrated that whatever “reasoning” capability LLMs were using was not sufficiently general and certainly not rigorous.
To take a very simple example, even adding two numbers proved to be a difficult task. We learn in school to add single digits, and then we learn how adding numbers of any length can be decomposed into a series of smaller addition problems involving just digits. With sufficient attention to detail to avoid execution errors, we can correctly perform any addition. But LLMs do not learn such rules by themselves, nor are they able to perform this kind of task decomposition.
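To make that concrete, here is a minimal Python sketch of the schoolbook procedure (the function name and example numbers are ours, purely for illustration): the full addition is decomposed into single-digit additions plus carries, exactly the kind of explicit rule-following that LLMs do not learn on their own.

```python
# Schoolbook addition: decompose the problem into single-digit additions plus carries.

def schoolbook_add(a: str, b: str) -> str:
    """Add two non-negative integers given as decimal strings, digit by digit."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)   # align the columns
    carry = 0
    digits = []
    # Work from the rightmost (least significant) column to the leftmost.
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry   # a single-digit addition problem
        digits.append(str(total % 10))      # the digit we write down
        carry = total // 10                 # the carry for the next column
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

print(schoolbook_add("987654321", "123456789"))  # -> 1111111110
```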
However, once a challenge has been put out, researchers are lured to meet it. So they tried to find ways to constrain LLMs to mimic the processes by which humans reason. Hence new methods have been proposed, with names such as Chain of Thought, Tree of Thought, Graph of Thought, etc., which try to get answers to problems not in one flash, as the original LLMs did, but by generating solutions step by step, considering alternatives, and so on. Such methods did, indeed, result in better performance.
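As a rough illustration of the idea behind these methods, here is what the difference between direct prompting and Chain-of-Thought prompting might look like; the question and the prompt wording are our own, and no particular LLM API is assumed.

```python
# Chain-of-Thought illustration: instead of asking for the answer "in one flash",
# the prompt nudges the model to spell out intermediate steps first.

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

# Direct prompting: the model must commit to an answer immediately.
direct_prompt = f"{question}\nAnswer:"

# Chain-of-Thought prompting: the model is asked to generate the solution
# step by step before stating a final answer.
cot_prompt = (
    f"{question}\n"
    "Let's think step by step, writing down each intermediate result, "
    "and only then state the final answer."
)
```

Prompts of the second kind tend to elicit step-by-step solutions and better benchmark scores, though, as Dr. Galescu notes, the improvement has proven brittle.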
But they’re still not very good. Much like with the simple arithmetic example above, they do not learn a higher-level theory of how to solve problems, so they are still constrained by an approximate memory of what they had seen in the training data. And there just aren’t great numbers of fully fleshed-out, step-by-step solutions out there for all the problems people might want to solve.
In fact, it was not hard for other researchers to demonstrate the brittleness of these methods: simple changes to the way problems were presented, or increasing the complexity of the problems so that more steps were required for the solution to be found (without requiring any new knowledge) led to precipitous drops in performance from the published results.
Q: OK, BUT IT SEEMS LLMS ARE GETTING EVER BETTER; IF THEY CAN’T SOLVE SOME PROBLEMS NOW, THE NEXT VERSION (OR THE ONE AFTER THAT) WILL, NO?
A: It is hard to infer from any published measure of progress (e.g., performance on some benchmark) whether it translates into better performance on any particular real-life problem. Recently, “large reasoning models”, such as OpenAI’s o1, have been unveiled; they are advertised as “thinking” before answering. Internally, o1 generates chains of thought (CoT) before producing an answer.
However, those chains of thought are not inspectable — in fact, OpenAI prohibits anyone from even trying to surface them. Therefore, this model is no more transparent than any regular LLM about its internal “thinking”, though it does tend to produce step-by-step solutions more often.
Still, the model is marred by the same problems as the original CoT method. Its increased performance stems from having been trained on additional, “tailored” data; we don’t know what that data looks like, but it likely contains step-by-step solutions to many problems in math and science. Anyone trying to solve these kinds of problems will be better off using o1 than any other models, though there are still no guarantees that the model won’t fail.
But for the rest of us, who have other reasoning problems, o1 and its successors are unlikely to always solve them correctly. The problem is really with the architecture of these models: they are not generic problem solvers, and throwing ever more (synthetic) data at them is not going to change that. o1 is also very slow and expensive to use!
Q: SO, IS AI JUST NOT CAPABLE OF REASONING?
A: Actually, that is not at all the case. Before the advent of neural networks and the current transformer-based architectures, which are what most people think of when they use the term “AI”, the field of Artificial Intelligence spent decades tackling the problem of automated reasoning. There are many formalisms and tools available for carrying out sound inferences based on symbolic representations.
They can correctly solve practically all real-life reasoning problems one might encounter, with no special hardware; while speed depends on the complexity of the problem, most can be solved in fractions of a second. Also, many such tools are open-source.
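As one concrete illustration, here is a tiny problem handed to Z3, an open-source SMT solver, through its Python bindings; the problem and its encoding are ours, chosen only to convey the flavor of such tools.

```python
# Requires the open-source z3-solver package (pip install z3-solver).
from z3 import Int, Solver, sat

x, y = Int("x"), Int("y")

s = Solver()
s.add(x + y == 30)   # "two numbers sum to 30"
s.add(x - y == 4)    # "their difference is 4"

if s.check() == sat:      # the solver proves the constraints satisfiable...
    m = s.model()
    print(m[x], m[y])     # ...and returns the solution: 17 13
```

The answer is derived by sound inference over the constraints, not guessed, and on a problem this size it comes back in a fraction of a second.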
Q: WELL, THEN, WHY AREN’T WE ALL USING THOSE TOOLS FOR REASONING?
A: Symbolic reasoners require that problems be set up in specialized languages. Most people are just not trained to use them. One can pose a problem to an LLM using natural language and the answer — correct or not — will also come out in easy-to-understand language.
Symbolic reasoners aren’t equipped to deal with the ambiguities of natural language; they require strict adherence to the formal language of mathematical logic. The solution to a problem will be a sequence of formulas in the same formal language, which can be easily understood by experts who know what those formulas mean, but will be incomprehensible to anyone else.
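To give a feel for what such a formal encoding looks like, here is the classic “all men are mortal” syllogism written in first-order logic; this is our own illustration, not the input syntax of any particular reasoner.

```latex
% Premises entail the conclusion: every man is mortal, Socrates is a man,
% therefore Socrates is mortal.
\forall x\,\bigl(\mathit{Man}(x) \rightarrow \mathit{Mortal}(x)\bigr),\;\;
\mathit{Man}(\mathit{socrates})
\;\vdash\; \mathit{Mortal}(\mathit{socrates})
```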
Q: OK, SO LLMS CAN HANDLE LANGUAGE REASONABLY WELL, BUT ARE BAD AT REASONING; SYMBOLIC REASONERS ARE VERY GOOD AT REASONING, BUT THEY ARE BAD AT LANGUAGE. IS THE SITUATION HOPELESS?
A: Not at all! LLMs have been shown to be fairly competent at mapping natural language to code. The formal languages used by symbolic reasoners are a kind of code. So LLMs can be used to translate problems from natural language to symbolic representations, then symbolic reasoners can perform their magic and deliver guaranteed solutions, which can then be translated back to natural language using LLMs.
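A high-level sketch of that pipeline might look as follows; all function names here are hypothetical placeholders rather than real APIs, and the point is only the division of labor between the neural and symbolic components.

```python
# Translate-solve-translate pipeline (placeholder functions, for illustration only).

def nl_to_formal(problem_text: str) -> str:
    """LLM step: translate a natural-language problem into a formal encoding."""
    ...

def solve_symbolically(formal_problem: str) -> str:
    """Symbolic step: run a sound reasoner (e.g., an SMT solver or theorem prover)."""
    ...

def formal_to_nl(formal_solution: str) -> str:
    """LLM step: translate the formal solution back into plain language."""
    ...

def answer(problem_text: str) -> str:
    formal_problem = nl_to_formal(problem_text)            # neural: language in
    formal_solution = solve_symbolically(formal_problem)   # symbolic: guaranteed inference
    return formal_to_nl(formal_solution)                   # neural: language out
```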
Q: SO THE TWO APPROACHES CAN WORK TOGETHER?
A: Systems that work this way are called neuro-symbolic systems. This is a new paradigm for creating conversational systems, in which some parts are implemented using neural components and other parts use symbolic representations.
Q: HOW DO THESE DIFFERENT PARTS WORK TOGETHER IN A NEURO-SYMBOLIC SYSTEM?
A: There isn’t a single architecture for neuro-symbolic systems. For example, one might carry on a chat with a neural, LLM-based system; when a problem needs to be solved, it is translated into a symbolic representation and dispatched to a symbolic reasoner, and the LLM then takes the result, translates it back into language, and continues the conversation. This architecture is sometimes called “LLM with tools”, since the reasoner is used by the LLM as a specialized tool to perform a specific kind of computation (the architecture is amenable to using not one but many such tools), but the dialogue is, in fact, carried by the LLM. This kind of architecture is well suited for general question-answering, where the problems to be solved are local and can typically be handled in one call.
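In code, the “LLM with tools” pattern might be sketched roughly as below; again, the function names are placeholders rather than any real framework’s API.

```python
# "LLM with tools" sketch: the LLM carries the dialogue, but when it decides a
# problem needs solving, it hands the problem to a symbolic reasoner.

def llm_respond(history: list[str]) -> dict:
    """Placeholder LLM call: returns either a reply or a request to use a tool."""
    ...

def symbolic_reasoner(formal_problem: str) -> str:
    """Placeholder symbolic tool: returns a provably correct result."""
    ...

def chat_turn(history: list[str], user_msg: str) -> str:
    history.append(user_msg)
    step = llm_respond(history)
    if step.get("tool") == "reasoner":                # the LLM asks for the tool
        result = symbolic_reasoner(step["problem"])   # one self-contained call
        history.append(f"[tool result] {result}")
        step = llm_respond(history)                   # the LLM phrases the answer
    history.append(step["reply"])
    return step["reply"]
```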
In another architecture, the conversation is handled by a symbolic dialogue engine, which is inherently capable of reasoning, but the translation back and forth between language and the dialogue engine’s internal representation is handled by an LLM; that may include resolving ambiguities and/or supplying general commonsense knowledge to the reasoner (using the same LLM or a different, fine-tuned model). Such an architecture is particularly well suited for goal-oriented conversations, where the system needs to develop long-term plans for solving the user’s problem(s), and may require a lot of back-and-forth with the user to understand what the goal is, obtain the proper inputs, and so on. Openstream uses both of these architectures.
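The second architecture might be sketched, equally roughly, as follows; these are placeholder functions for illustration only, not a description of Openstream’s actual implementation.

```python
# Symbolic-dialogue-engine sketch: a symbolic engine owns the conversation, its
# plans, and its reasoning, while an LLM only translates between natural
# language and the engine's internal representation.

class SymbolicDialogueEngine:
    """Placeholder: plans toward the user's goal and reasons over dialogue state."""
    def next_act(self, parsed_input: dict) -> dict: ...

def llm_parse(user_msg: str) -> dict:
    """Placeholder LLM call: map an utterance to the engine's internal representation."""
    ...

def llm_generate(engine_act: dict) -> str:
    """Placeholder LLM call: render the engine's next dialogue act as natural language."""
    ...

def dialogue_turn(engine: SymbolicDialogueEngine, user_msg: str) -> str:
    parsed = llm_parse(user_msg)      # neural: language -> symbols
    act = engine.next_act(parsed)     # symbolic: reasoning and planning
    return llm_generate(act)          # neural: symbols -> language
```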