▲Is chain-of-thought AI reasoning a mirage?seangoedecke.com
173 points by ingve 23 hours ago | 149 comments
tomhow 10 hours ago [-]
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens - https://news.ycombinator.com/item?id=44872850 - Aug 2025 (130 comments)
Terr_ 4 hours ago [-]
A regular "chatting" LLM is a document generator incrementally extending a story about a conversation between a human and a robot... And through that lens, I've been thinking "chain of thought" seems like basically the same thing, but with a film noir styling twist.

The LLM is trained to include an additional layer of "unspoken" text in the document, a source of continuity which substitutes for how the LLM has no other memories or goals to draw from.

"The capital of Assyria? Those were dangerous questions, especially in this kind of town. But rent was due, and the bottle in my drawer was empty. I took the case."

areeh 2 hours ago [-]
Oh wow, now I want a chain-of-thought rewriter that makes the combined chat and CoT follow this style.
LudwigNagasena 19 hours ago [-]
> The first is that reasoning probably requires language use. Even if you don’t think AI models can “really” reason - more on that later - even simulated reasoning has to be reasoning in human language.

That is an unreasonable assumption. In the case of LLMs it seems wasteful to transform a point from latent space into a random token and lose information. In fact, I think in the near future it will be the norm for MLLMs to "think" and "reason" without outputting a single "word".

> Whether AI reasoning is “real” reasoning or just a mirage can be an interesting question, but it is primarily a philosophical question. It depends on having a clear definition of what “real” reasoning is, exactly.

It is not a "philosophical" (by which the author probably meant "practically inconsequential") question. If the whole reasoning business is just rationalization of pre-computed answers, or simply a means to do some computations because every token provides only a fixed amount of computation to update the model's state, then it doesn't make much sense to focus on improving the quality of chain-of-thought output from a human POV.

safety1st 18 hours ago [-]
I'm pretty much a layperson in this field, but I don't understand why we're trying to teach a stochastic text transformer to reason. Why would anyone expect that approach to work?

I would have thought the more obvious approach would be to couple it to some kind of symbolic logic engine. It might transform plain language statements into fragments conforming to a syntax which that engine could then parse deterministically. This is the Platonic ideal of reasoning that the author of the post pooh-poohs, I guess, but it seems to me to be the whole point of reasoning; reasoning is the application of logic in evaluating a proposition. The LLM might be trained to generate elements of the proposition, but it's too random to apply logic.

_diyar 13 hours ago [-]
We expect this approach to work because it's currently the best working approach. Nothing else comes close.

Using symbolic language is a good idea in theory, but in practice it doesn't scale as well as auto-regression + RL.

The IMO results of DeepMind illustrate this well: In 2024, they solved it using AlphaProof and AlphaGeometry, using the Lean language as a formal symbolic logic[1]. In 2025 they performed better and faster by just using a fancy version of Gemini, only using natural language[2].

[1] https://deepmind.google/discover/blog/ai-solves-imo-problems...

[2] https://deepmind.google/discover/blog/advanced-version-of-ge...

Note: I agree with the notion of the parent comment that letting the models reason in latent space might make sense, but that's where I'm out of my depth.

safety1st 8 hours ago [-]
Very interesting stuff, thanks!
gmadsen 56 minutes ago [-]
Because what can be embedded in billions of parameters is highly unintuitive to common sense and an active area of research. We do it because it works.

One other point: the Platonic ideal of reasoning is not even an approximation of human reason. The idea that you take away emotion and end up with Spock is a fantasy. All neuroscience and psychology research points to the necessary and strong coupling of actions/thoughts with emotions. You don't have a functional system with just logical deduction; at a very basic level it is not functional.

shannifin 18 hours ago [-]
Problem is, even with symbolic logic, reasoning is not completely deterministic. Whether one can get from a given set of axioms to a given proposition is sometimes undecidable.
bubblyworld 6 hours ago [-]
I don't think this is really a problem. The general problem of finding a proof from some axioms to some formula is undecidable (in e.g. first order logic). But that doesn't tell you anything about specific cases, in the same way that we can easily tell whether some specific program halts, like this one:

"return 1"

shannifin 14 minutes ago [-]
True, I was rather pointing out that being able to parse symbolic language deterministically doesn't imply that we could then "reason" deterministically in general; the reasoning would still need to involve some level of stochasticity. Whether or not that's a problem in practice depends on specifics.
wonnage 5 hours ago [-]
My impression of LLM “reasoning” is that it works more like guardrails. Perhaps the space of possible responses to the initial prompt is huge and doesn’t exactly match any learned information. All the text generated during reasoning is high strength. So placing it in the context should hopefully guide answer generation towards something reasonable.

It’s the same idea as manually listing a bunch of possibly-useful facts in the prompt, but the LLM is able to generate plausible sounding text itself.

I feel like this relates to why LLM answers tend to be verbose too, it needs to put the words out there in order to stay coherent.

vmg12 18 hours ago [-]
Solutions to some of the hardest problems I've had have only come after a night of sleep or when I'm out on a walk and I'm not even thinking about the problem. Maybe what my brain was doing was something different from reasoning?
andybak 18 hours ago [-]
This is a very important point and mostly absent from the conversation.

We have many words that almost mean the same thing or can mean many different things - and conversations about intelligence and consciousness are riddled with them.

tempodox 16 hours ago [-]
> This is a very important point and mostly absent from the conversation.

That's because when humans are mentioned at all in the context of coding with “AI”, it's mostly as bad and buggy simulations of those perfect machines.

jojobas 9 hours ago [-]
At the very least intermediate points of one's reasoning are grounded in reality.
kazinator 19 hours ago [-]
Not all reasoning requires language. Symbolic reasoning uses language.

Real-time spatial reasoning like driving a car and not hitting things does not seem linguistic.

Figuring out how to rotate a cabinet so that it will clear through a stairwell also doesn't seem like it requires language, only to communicate the solution to someone else (where language can turn into a hindrance, compared to a diagram or model).

llllm 18 hours ago [-]
Pivot!
kazinator 16 hours ago [-]
Can we be Friends?
kromem 6 hours ago [-]
Latent space reasoners are a thing, and honestly we're probably already seeing emergent latent space reasoners starting to end up embedded into the weights as new models train on extensive reasoning synthetics.

If Othello-GPT can build a board in latent space given just the moves, can an exponentially larger transformer build a reasoner in their latent space given a significant number of traces?

limaoscarjuliet 17 hours ago [-]
> In fact, I think in the near future it will be the norm for MLLMs to "think" and "reason" without outputting a single "word".

It will be outputting something, as this is the only way it can get more compute - output a token, then all context + the next token is fed through the LLM again. It might not be presented to the user, but that's a different story.
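
(A minimal sketch of that loop; generate_step here is a hypothetical stand-in for one forward pass of the model over the full context.)

  def decode(generate_step, prompt_tokens, eos, max_new=16):
      context = list(prompt_tokens)
      for _ in range(max_new):
          next_token = generate_step(context)  # all the compute happens in this call
          context.append(next_token)           # the emitted token is the only state carried forward
          if next_token == eos:
              break
      return context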

pornel 14 hours ago [-]
You're looking at this from the perspective of what would make sense for the model to produce. Unfortunately, what really dictates the design of the models is what we can train the models with (efficiently, at scale). The output is then roughly just the reverse of the training. We don't even want AI to be an "autocomplete", but we've got tons of text, and a relatively efficient method of training on all prefixes of a sentence at the same time.
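
(For a concrete picture of "training on all prefixes at the same time", a minimal sketch assuming PyTorch; the embedding and linear layer are toy stand-ins for a causally-masked transformer.)

  import torch
  import torch.nn.functional as F

  vocab, d = 100, 32
  tokens = torch.randint(0, vocab, (1, 8))  # a toy "sentence" of 8 token ids
  emb = torch.nn.Embedding(vocab, d)
  lm_head = torch.nn.Linear(d, vocab)

  h = emb(tokens)                 # stand-in for a causally-masked transformer
  logits = lm_head(h)             # a next-token prediction at every position

  # Shift by one: position t is trained to predict token t+1, so every prefix
  # of the sentence contributes to the loss in a single forward pass.
  loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab),
                         tokens[:, 1:].reshape(-1))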

There have been experiments with preserving embedding vectors of the tokens exactly without loss caused by round-tripping through text, but the results were "meh", presumably because it wasn't the input format the model was trained on.

It's conceivable that models trained on some vector "neuralese" that is completely separate from text would work better, but it's a catch 22 for training: the internal representations don't exist in a useful sense until the model is trained, so we don't have anything to feed into the models to make them use them. The internal representations also don't stay stable when the model is trained further.

potsandpans 15 hours ago [-]
> It is not a "philosophical" (by which the author probably meant "practically inconsequential") question.

I didn't take it that way. I suppose it depends on whether or not you believe philosophy is legitimate

Terr_ 4 hours ago [-]
> I suppose it depends on whether or not you believe philosophy is legitimate

The only way to declare philosophy illegitimate is to be using legitimate philosophy, so... :p

esafak 9 hours ago [-]
It is not obvious that a continuous space is better for thinking than a discrete one.
AbrahamParangi 10 hours ago [-]
You're not allowed to say that it's not reasoning without distinguishing what is reasoning. Absent a strict definition that the models fail and that some other reasoner passes, it is entirely philosophical.
sdenton4 13 minutes ago [-]
"entirely philosophical"

I don't think this means what you think it means... Philosophers (at least up to Wittgenstein) love constructing and arguing about definitions.

emorning4 18 hours ago [-]
[dead]
brunokim 19 hours ago [-]
I'm unconvinced by the article's criticisms, given they also rely on feelings and few citations.

> I appreciate that research has to be done on small models, but we know that reasoning is an emergent capability! (...) Even if you grant that what they’re measuring is reasoning, I am profoundly unconvinced that their results will generalize to a 1B, 10B or 100B model.

A fundamental part of applied research is simplifying a real-world phenomenon to better understand it. Dismissing the finding that, with this many parameters and for such a simple problem, the LLM can't perform out of distribution, just because the model isn't big enough, undermines the very value of independent research. Tomorrow another model with double the parameters may or may not show the same behavior, but that finding will be built on top of this one.

Also, how do _you_ know that reasoning is emergent, and not rationalising on top of a compressed version of the web stored in 100B parameters?

ActionHank 19 hours ago [-]
I think that when you are arguing logic and reason with a group who became really attached to the term vibe-coding you've likely already lost.
mirekrusin 6 hours ago [-]
Feels like running a psychology experiment on fruit flies because it's cheaper, and extrapolating the results to humans because they're almost the same thing, just smaller.

I'm sorry, but the only hallucination here is that of the authors. Does it really need to be said again that interesting results happen only when you scale up?

This whole effort would be interesting if they had done that and plotted the results while scaling something up.

NitpickLawyer 22 hours ago [-]
Finally! A good take on that paper. I saw that Ars Technica article posted everywhere; most of the comments are full of confirmation bias, and almost all of them miss the fine print - it was tested on a 4-layer-deep toy model. It's nice to read a post that actually digs deeper and offers perspectives on what might be a good finding vs. what just warrants more research.
stonemetal12 20 hours ago [-]
> it was tested on a 4 layer deep toy model

How do you see that impacting the results? It is the same algorithm just on a smaller scale. I would assume a 4 layer model would not be very good, but does reasoning improve it? Is there a reason scale would impact the use of reasoning?

azrazalea_debt 20 hours ago [-]
A lot of current LLM work is basically emergent behavior. They use a really simple core algorithm and scale it up, and interesting things happen. You can read some of Anthropic's recent papers to see some of this. For example: they didn't expect LLMs could "lookahead" when writing poetry. However, when they actually went in and watched what was happening (there are details on how this "watching" works on their blog/in their studies), they found the LLM actually was planning ahead! That's emergent behavior: they didn't design it to do that, it just started doing it due to the complexity of the model.

If (BIG if) we ever do see actual AGI, it is likely to work like this. It's unlikely we're going to make AGI by designing some grand Cathedral of perfect software, it is more likely we are going to find the right simple principles to scale big enough to have AGI emerge. This is similar.

mrspuratic 17 hours ago [-]
On that topic, it seems backwards to me: intelligence is not emergent behaviour of language, rather the opposite.
danans 8 hours ago [-]
Perception and interpretation can very much be influenced by language (Sapir-Whorf hypothesis), so to the extent that perception and interpretation influence intelligence, it's not clear that the relationship is only in one direction.
zekica 2 hours ago [-]
Am I the exception? When thinking I don't conceptualize things in words - the compression would be too lossy. Maybe because I'm fluent in three languages (one Germanic, one Romance, one Slavic)?
archaeans 4 hours ago [-]
"It would be naïve to imagine that any analysis of experience is dependent on pattern expressed in language."

- Sapir

It's hard to take these discussions on cognition and intelligence seriously when there is so much lossy compression going on.

pinoy420 20 hours ago [-]
[dead]
NitpickLawyer 20 hours ago [-]
There's prior research that finds a connection between model depth and "reasoning" ability - https://arxiv.org/abs/2503.03961

A depth of 4 is very small. It is very much a toy model. It's ok to research this, and maybe someone will try it out on larger models, but it's totally not ok to lead with the conclusion, based on this toy model, IMO.

okasaki 20 hours ago [-]
Human babies are the same algorithm as adults.
mirekrusin 6 hours ago [-]
This analogy would mean a very large model that hasn't finished training yet.

A tiny model like this is more like doing a study on fruit flies and extrapolating the results to humans.

archaeans 4 hours ago [-]
Every argument about LLMs that is a variant of "humans are the same" is self-defeating, because it assumes a level of understanding of human cognition and the human brain that doesn't really exist outside of the imagination of people with a poor understanding of neuroscience.
okasaki 4 hours ago [-]
Some humans never attain intelligence beyond early childhood.
modeless 20 hours ago [-]
"The question [whether computers can think] is just as relevant and just as meaningful as the question whether submarines can swim." -- Edsger W. Dijkstra, 24 November 1983
griffzhowl 20 hours ago [-]
I don't agree with the parallel. Submarines can move through water - whether you call that swimming or not isn't an interesting question, and doesn't illuminate the function of a submarine.

With thinking or reasoning, there's not really a precise definition of what it is, but we nevertheless know that currently LLMs and machines more generally can't reproduce many of the human behaviours that we refer to as thinking.

The question of what tasks machines can currently accomplish is certainly meaningful, if not urgent, and the reason LLMs are getting so much attention now is that they're accomplishing tasks that machines previously couldn't do.

To some extent there might always remain a question about whether we call what the machine is doing "thinking" - but that's the uninteresting verbal question. To get at the meaningful questions we might need a more precise or higher resolution map of what we mean by thinking, but the crucial element is what functions a machine can perform, what tasks it can accomplish, and whether we call that "thinking" or not doesn't seem important.

Maybe that was even Dijkstra's point, but it's hard to tell without context...

modeless 18 hours ago [-]
It is strange that you started your comment with "I don't agree". The rest of the comment demonstrates that you do agree.
griffzhowl 15 hours ago [-]
To be more clear about why I disagree the cases are parallel:

We know how a submarine moves through water, whether it's "swimming" isn't an interesting question.

We don't know to what extent a machine can reproduce the cognitive functions of a human. There are substantive and significant questions about whether or to what extent a particular machine or program can reproduce human cognitive functions.

So I might have phrased my original comment badly. It doesn't matter if we use the word "thinking" or not, but it does matter if a machine can reproduce the human cognitive functions, and if that's what we mean by the question whether a machine can think, then it does matter.

modeless 14 hours ago [-]
"We know how it moves" is not the reason the question of whether a submarine swims is not interesting. It's because the question is mainly about the definition of the word "swim" rather than about capabilities.

> if that's what we mean by the question whether a machine can think

That's the issue. The question of whether a machine can think (or reason) is a question of word definitions, not capabilities. The capabilities questions are the ones that matter.

griffzhowl 14 hours ago [-]
> The capabilities questions are the ones that matter.

Yes, that's what I'm saying. I also think there's a clear sense in which asking whether machines can think is a question about capabilities, even though we would need a more precise definition of "thinking" to be able to answer it.

So that's how I'd sum it up: we know the capabilities of submarines, and whether we say they're swimming or not doesn't answer any further question about those capabilities. We don't know the capabilities of machines; the interesting questions are about what they can do, and one (imprecise) way of asking that question is whether they can think

modeless 13 hours ago [-]
> I also think there's a clear sense in which asking whether machines can think is a question about capabilities, even though we would need a more precise definition of "thinking" to be able to answer it.

The second half of the sentence contradicts the first. It can't be a clear question about capabilities without widespread agreement on a more rigorous definition of the word "think". Dijkstra's point is that the debate about word definitions is irrelevant and a distraction. We can measure and judge capabilities directly.

griffzhowl 12 hours ago [-]
> Dijkstra's point is that the debate about word definitions is irrelevant and a distraction.

Agreed, and I've made this point a few times, so it's ironic we're going back and forth about this.

> The second half of the sentence contradicts the first.

I'm not saying the question is clear. I'm saying there's clearly an interpretation of it as a question about capabilities.

wizzwizz4 19 hours ago [-]
https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD898... provides the context. I haven't re-read it in the last month, but I'm pretty sure you've correctly identified Dijkstra's point.
archaeans 4 hours ago [-]
And the often missed caveat is that we should only care about whether the software does what it is supposed to do.

Under that light, LLMs are just buggy and have been for years. Where is the LLM that does what it says it should do? "Hallucination" and "do they reason" are distractions. They fail. They're buggy.

mdp2021 20 hours ago [-]
But the topic here is whether some techniques are progressive or not

(with a curious parallel about whether some paths in thought are dead-ends - the unproductive focus mentioned in the article).

draw_down 20 hours ago [-]
[dead]
hungmung 19 hours ago [-]
Chain of thought is just a way of trying to squeeze more juice out of the lemon of LLMs; I suspect we're at the stage of running up against diminishing returns and we'll have to move to different foundational models to see any serious improvement.
archaeans 4 hours ago [-]
The so-called "scaling laws" are expressing diminishing returns.

How was "if we grow the resources used exponentially, errors decrease linearly" ever seen as a good sign?

js8 18 hours ago [-]
I think an LLM's chain of thought is reasoning. When trained, the LLM sees a lot of examples like "All men are mortal. Socrates is a man." followed by "Therefore, Socrates is mortal." This causes the transformer to learn the rule that "All A are B. C is A." is often followed by "Therefore, C is B.", and so it can apply this logical rule predictively. (I have converted the example from latent space to human language for clarity.)
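
(A toy sketch of that rule applied mechanically, purely to illustrate the pattern; the tuple encoding is an arbitrary choice for the example.)

  # "All A are B. C is A." => "Therefore, C is B.", applied as a mechanical rule.
  facts = {("all", "men", "mortal"), ("is", "Socrates", "men")}

  def apply_rule(facts):
      derived = set()
      for kind1, a, b in facts:
          for kind2, c, a2 in facts:
              if kind1 == "all" and kind2 == "is" and a2 == a:
                  derived.add(("is", c, b))
      return derived

  print(apply_rule(facts))  # {('is', 'Socrates', 'mortal')}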

Unfortunately, sometimes the LLM also learns "All A are C. All B are C." is followed by "Therefore, A is B.", due to bad examples in the training data. (More insidiously, it might learn this rule only in a special case.)

So it learns some logic rules but not consistently. This lack of consistency will cause it to fail on larger problems.

I think NNs (transformers) could be great in heuristic suggesting which valid logical rules (could be even modal or fuzzy logic) to apply in order to solve a certain formalized problem, but not so great at coming up with the logic rules themselves. They could also be great at transforming the original problem/question from human language into some formal logic, that would then be resolved using heuristic search.

handoflixue 9 hours ago [-]
Humans are also notoriously bad at this, so we have plenty of evidence that this lack of consistency does indeed cause failures on larger problems.
js8 7 hours ago [-]
Yes, humans fail at this; that's why we need technology that doesn't simply emulate humans, but tries to be more reliable than us.
stonemetal12 20 hours ago [-]
When using AI they say "Context is King". "Reasoning" models are using the AI to generate context. They are not reasoning in the sense of logic or philosophy. Mirage, or whatever you want to call it, it is rather unlike what people mean when they use the term reasoning. Calling it reasoning is up there with calling output people don't like "hallucinations".
adastra22 20 hours ago [-]
You are making the same mistake OP is calling out. As far as I can tell “generating context” is exactly what human reasoning is too. Consider the phrase “let’s reason this out” where you then explore all options in detail, before pronouncing your judgement. Feels exactly like what the AI reasoner is doing.
stonemetal12 20 hours ago [-]
"let's reason this out" is about gathering all the facts you need, not just noting down random words that are related. The map is not the terrain, words are not facts.
adastra22 10 hours ago [-]
So when you say “let’s reason this out” you open up Wikipedia or reference textbooks and start gathering facts? I mean that’s great, but I certainly don’t. Most of the time “gathering facts” means recalling relevant info from memory. Which is roughly what the LLM is doing, no?
energy123 20 hours ago [-]
Performance is proportional to the number of reasoning tokens. How to reconcile that with your opinion that they are "random words"?
kelipso 20 hours ago [-]
Technically, "random" can still have probabilities associated with it. In casual speech, random means equal probabilities, or that we don't know the probabilities. But for LLM token output, the model does estimate the probabilities.
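
(A tiny sketch of the distinction, for concreteness:)

  import numpy as np

  logits = np.array([2.0, 1.0, 0.1])                    # model scores for 3 candidate tokens
  probs = np.exp(logits) / np.exp(logits).sum()         # softmax -> estimated probabilities

  greedy = int(np.argmax(probs))                        # deterministic: always the top token
  sampled = int(np.random.choice(len(probs), p=probs))  # random, but far from uniform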
energy123 10 hours ago [-]
Greedy decoding isn't random.
blargey 19 hours ago [-]
s/random/statistically-likely/g

Reducing the distance of each statistical leap improves “performance” since you would avoid failure modes that are specific to the largest statistical leaps, but it doesn’t change the underlying mechanism. Reasoning models still “hallucinate” spectacularly even with “shorter” gaps.

ikari_pl 19 hours ago [-]
What's wrong with statistically likely?

If I ask you what's 2+2, there's a single answer I consider much more likely than others.

Sometimes, words are likely because they are grounded in ideas and facts they represent.

blargey 18 hours ago [-]
> Sometimes, words are likely because they are grounded in ideas and facts they represent.

Yes, and other times they are not. I think the failure modes of a statistical model of a communicative model of thought are unintuitive enough without any added layers of anthropomorphization, so there remains some value in pointing it out.

ThrowawayTestr 18 hours ago [-]
Have you read the chain of thought output from reasoning models? That's not what it does.
CooCooCaCha 19 hours ago [-]
Reasoning is also about processing facts.
kelipso 20 hours ago [-]
No, people make logical connections, make inferences, make sure all of it fits together without logical errors, etc.
adastra22 10 hours ago [-]
How do they do that? Specifically, how? Moment by moment what does that look like? Usually it involves e.g. making a statement and “noticing” a contradiction in that statement. Very similar to how an LLM reasons.

I think a lot of people here think people reason like a mathematical theorem prover, like some sort of platonic ideal rationalist. That’s not how real brains work though.

pixl97 20 hours ago [-]
These people you're talking about must be rare online, as human communication is pretty rife with logical errors.
mdp2021 19 hours ago [-]
Since that November in which this technology boomed we have been much too often reading "people also drink from puddles", as if it were standard practice.

That we implement skills, not deficiencies, is a basic concept that is getting to such a level of needed visibility it should probably be inserted in the guidelines.

We implement skills, not deficiencies.

kelipso 19 hours ago [-]
You shouldn’t be basing your entire worldview around the lowest common denominator. All kinds of writers like blog writers, novelists, scriptwriters, technical writers, academics, poets, lawyers, philosophers, mathematicians, and even teenage fan fiction writers do what I said above routinely.
viccis 19 hours ago [-]
>As far as I can tell “generating context” is exactly what human reasoning is too.

This was the view of Hume (humans as bundles of experience who just collect information and make educated guesses for everything). Unfortunately, it leads to philosophical skepticism, in which you can't ground any knowledge absolutely, as it's all just justified by some knowledge you got from someone else, which also came from someone else, etc., and eventually you can't actually justify any knowledge that isn't directly a result of experience (the concept of "every effect has a cause" is a classic example).

There have been plenty of epistemological responses to this viewpoint, with Kant's view, of humans doing a mix of "gathering context" (using our senses) but also applying universal categorical reasoning to schematize and understand / reason from the objects we sense, being the most well known.

I feel like anyone talking about the epistemology of AI should spend some time reading the basics of all of the thought from the greatest thinkers on the subject in history...

js8 14 hours ago [-]
> I feel like anyone talking about the epistemology of AI should spend some time reading the basics

I agree. I think the problem with AI is that we don't know, or haven't formalized enough, what epistemology AGI systems should have. Instead, people are looking for shortcuts, feeding huge amounts of data into the models and hoping it will self-organize into something that humans actually want.

viccis 13 hours ago [-]
It's partly driven by a hope that if you can model language well enough, you'll then have a model of knowledge. Logical positivism tried that with logical systems, which are much more precise languages of expressing facts, and it still fell on its face.
adastra22 9 hours ago [-]
FYI, this post comes off as incredibly pretentious. You think we haven’t read the same philosophy?

This isn’t about epistemology. We are talking about psychology. What does your brain do when you “reason things out”? Not “can we know anything anyway?” or “what is the correlation between the map and the territory?” or anything like that. Just “what is your brain doing when you think you are reasoning?” and “is what an LLM does comparable?”

Philosophy doesn’t have answers for questions of applied psychology.

phailhaus 20 hours ago [-]
Feels like, but isn't. When you are reasoning things out, there is a brain with state that is actively modeling the problem. AI does no such thing, it produces text and then uses that text to condition the next text. If it isn't written, it does not exist.

Put another way, LLMs are good at talking like they are thinking. That can get you pretty far, but it is not reasoning.

Enginerrrd 19 hours ago [-]
The transformer architecture absolutely keeps state information "in its head" so to speak as it produces the next word prediction, and uses that information in its compute.

It's true that if it's not producing text, there is no thinking involved, but it is absolutely NOT clear that the attention block isn't holding state and modeling something as it works to produce text predictions. In fact, I can't think of a way to define it that would make that untrue... unless you mean that there isn't a system wherein something like attention is updating/computing and the model itself chooses when to make text predictions. That's by design, but what you're arguing doesn't really follow.

Now, whether what the model is thinking about inside that attention block matches up exactly or completely with the text it's producing as generated context is probably at least a little dubious, and it's unlikely to be a complete representation regardless.
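
(To make the "holding state" point concrete: a minimal single-head attention sketch where the growing key/value cache is the state reused at every new token. This is an illustration of the mechanism, not any particular model's implementation.)

  import numpy as np

  d = 4
  Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
  K_cache, V_cache = [], []       # grows as each token is processed

  def attend(x_t):                # x_t: the current token's d-dimensional representation
      K_cache.append(x_t @ Wk)    # this cache is the "state" carried across tokens
      V_cache.append(x_t @ Wv)
      q = x_t @ Wq
      K, V = np.stack(K_cache), np.stack(V_cache)
      w = np.exp(q @ K.T / np.sqrt(d))
      w /= w.sum()
      return w @ V                # context vector mixing everything seen so far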

dmacfour 18 hours ago [-]
> The transformer architecture absolutely keeps state information "in its head" so to speak as it produces the next word prediction, and uses that information in its compute.

How so? Transformers are state space models.

double0jimb0 20 hours ago [-]
So exactly what language/paradigm is this brain modeling the problem within?
phailhaus 20 hours ago [-]
We literally don't know. We don't understand how the brain stores concepts. It's not necessarily language: there are people that do not have an internal monologue, and yet they are still capable of higher level thinking.
chrisweekly 20 hours ago [-]
Rilke: "There is a depth of thought untouched by words, and deeper still a depth of formless feeling untouched by thought."
adastra22 9 hours ago [-]
I am one of those, for what it’s worth. I struggle to put my thoughts into words, as it does not come naturally to me. When I think internally, I do not use language at all a lot of the time.

Drives my wife crazy, as my answers to her questions are always really slow and considered. I have to first think what thought I want to convey, and then think “how do I translate this into words?”

mdp2021 20 hours ago [-]
But a big point here becomes whether the generated "context" then receives proper processing.
adastra22 9 hours ago [-]
What processing? When you have an internal line of thought, what processing do you do on it?

For me, it feels like I say something, and in saying it, and putting it into words, I have a feeling about whether it is true and supported or not. A qualitative gauge of its correctness. A lot of my reasoning is done this way, trusting that these feelings are based on a lifetime of accumulated facts and the other things currently being considered.

Explain to me how this is different than a neural net outputting a weight for the truthiness of the state space vector?

mdp2021 6 hours ago [-]
I meant the LLM. By «whether the generated "context" then receives proper processing» I meant whether the CoT generated by the LLM, framed here as further "context" (regardless of how properly it is generated), receives adequate processing by the internals of the NN.

A good context (any good context) does not necessarily lead to a good output in LLMs. (It does necessarily lead to a better output, but not necessarily a satisfying, proper, decent, consequential one.)

slashdave 20 hours ago [-]
Perhaps we can find some objective means to decide, rather than go with what "feels" correct
adastra22 9 hours ago [-]
That’s not how our brains work though, or how most examples of human reasoning play out. When asking “do LLMs reason” we are asking whether the action being performed is similar to that of regular humans, not to some platonic ideal of a scientist/rationalist.
mdp2021 6 hours ago [-]
> When asking “do LLMs reason” we are asking whether the action being performed is similar to

Very certainly not. We ask if the system achieves the goal.

"When we ask if the coprocessor performs floating point arithmetic, we ask if the system achieves the goal (of getting accurate results)". Not, "does the co-processor ask if we have a spare napkin".

benreesman 19 hours ago [-]
People will go to extremely great lengths to debate the appropriate analogy for how these things work, which is fun I guess but in a "get high with a buddy" sense at least to my taste.

Some of how they work is well understood (a lot now, actually), some of the outcomes are still surprising.

But we debate both the well understood parts and the surprising parts both with the wrong terminology borrowed from pretty dubious corners of pop cognitive science, and not with terminology appropriate to the new and different thing! It's nothing like a brain, it's a new different thing. Does it think or reason? Who knows pass the blunt.

They do X performance on Y task according to Z eval; that’s how you discuss ML model capability if you’re pursuing understanding rather than fundraising or clicks.

Vegenoid 17 hours ago [-]
While I largely agree with you, more abstract judgements must be made as the capabilities (and therefore tasks being completed) become increasingly general. Attempts to boil human intellectual capability down to "X performance on Y task according to Z eval" can be useful, but are famously incomplete and insufficient on their own for making good decisions about which humans (a.k.a. which general intelligences) are useful and how to utilize and improve them. Boiling down highly complex behavior into a small number of metrics loses a lot of detail.

There is also the desire to discover why a model that outperforms others does so, so that the successful technique can be refined and applied elsewhere. This too usually requires more approaches than metric comparison.

ofjcihen 19 hours ago [-]
It’s incredible to me that so many seem to have fallen for “humans are just LLMs bruh” argument but I think I’m beginning to understand the root of the issue.

People who only “deeply” study technology only have that frame of reference to view the world so they make the mistake of assuming everything must work that way, including humans.

If they had a wider frame of reference that included, for example, Early Childhood Development, they might have enough knowledge to think outside of this box and know just how ridiculous that argument is.

gond 18 hours ago [-]
That is an issue that has been prevalent in the Western world for the last 200 years, beginning possibly with the Industrial Revolution, probably earlier. That problem is reductionism, consequently applied down to the last level: discover the smallest element of every field of science, develop an understanding of all the parts from the smallest part upwards, and develop, from the understanding of the parts, an understanding of the whole.

Unfortunately, this approach does not yield understanding, it yields know-how.

Kim_Bruning 14 hours ago [-]
Taking things apart to see how they tick is called reduction, but (re)assembling the parts is emergence.

When you reduce something to its components, you lose information on how the components work together. Emergence 'finds' that information back.

Compare differentiation and integration, which lose and gain terms respectively.

In some cases, I can imagine differentiating and integrating certain functions actually would even be a direct demonstration of reduction and emergence.

dmacfour 18 hours ago [-]
I have a background in ML and work in software development, but studied experimental psych in a past life. It’s actually kind of painful watching people slap phrases related to cognition onto things that aren’t even functionally equivalent to their namesakes, then parade them around like some kind of revelation. It’s also a little surprising that there’s no interest (at least publicly) in using cognitive architectures in the development of AI systems.
cyanydeez 19 hours ago [-]
They should call them Fuzzing models. They're just running through various iterations of the context until they hit a token that trips them out.
bongodongobob 20 hours ago [-]
And yet it improves their problem solving ability.
skybrian 20 hours ago [-]
Mathematical reasoning does sometimes require correct calculations, and if you get them wrong your answers will be wrong. I wouldn’t want someone doing my taxes to be bad at calculation or bad at finding mistakes in calculation.

It would be interesting to see if this study’s results can be reproduced in a more realistic setting.

mentalgear 21 hours ago [-]
> Whether AI reasoning is “real” reasoning or just a mirage can be an interesting question, but it is primarily a philosophical question. It depends on having a clear definition of what “real” reasoning is, exactly.

It's pretty easy: causal reasoning. Causal, not just statistical correlation as LLMs do, with or without "CoT".

glial 21 hours ago [-]
Correct me if I'm wrong, I'm not sure it's so simple. LLMs are called causal models in the sense that earlier tokens "cause" later tokens, that is, later tokens are causally dependent on what the earlier tokens are.

If you mean deterministic rather than probabilistic, even Pearl-style causal models are probabilistic.

I think the author is circling around the idea that their idea of reasoning is to produce statements in a formal system: to have a set of axioms, a set of production rules, and to generate new strings/sentences/theorems using those rules. This approach is how math is formalized. It allows us to extrapolate - make new "theorems" or constructions that weren't in the "training set".
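
(A concrete toy example of that kind of formal system is Hofstadter's MIU system: one axiom, a few production rules, and new "theorems" generated by applying them. The sketch below simplifies the rules to first-occurrence rewrites.)

  def rules(s):
      out = set()
      if s.endswith("I"):
          out.add(s + "U")                    # rule 1: xI  -> xIU
      if s.startswith("M"):
          out.add(s + s[1:])                  # rule 2: Mx  -> Mxx
      if "III" in s:
          out.add(s.replace("III", "U", 1))   # rule 3: III -> U  (first occurrence only)
      if "UU" in s:
          out.add(s.replace("UU", "", 1))     # rule 4: UU  -> "" (first occurrence only)
      return out

  theorems = {"MI"}                           # the single axiom
  for _ in range(3):                          # generate new theorems by applying the rules
      theorems |= {t for s in theorems for t in rules(s)}
  print(sorted(theorems))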

jayd16 21 hours ago [-]
By this definition a bag of answers is causal reasoning because we previously filled the bag, which caused what we pulled. State causing a result is not causal reasoning.

You need to actually have something that deduces a result from a set of principles that form a logical conclusion, or the understanding that more data is needed to make a conclusion. That is clearly different from finding a likely next token on statistics alone, despite the fact that the statistical answer can be correct.

apples_oranges 21 hours ago [-]
But let's say you change your mathematical expression by reducing or expanding it somehow. Then, unless it's trivial, there are infinitely many ways to do it, and the "cause" here is the answer to the question "why did you do that and not something else?" Brute force excluded, the cause is probably some idea, some model of the problem, or a gut feeling (or desperation...).
stonemetal12 20 hours ago [-]
Smoking increases the risk of getting cancer significantly. We say Smoking causes Cancer. Causal reasoning can be probabilistic.

LLMs are not doing causal reasoning because there are no facts, only tokens. For the most part you can't ask LLMs how they came to an answer, because they don't know.

lordnacho 21 hours ago [-]
What's stopping us from building an LLM that can build causal trees, rejecting some trees and accepting others based on whatever evidence it is fed?

Or even a causal tool for an LLM agent that operates like what it does when you ask it about math and forwards the request to Wolfram.

blackbear_ 5 hours ago [-]
In principle this is possible, modulo scalability concerns: https://arxiv.org/pdf/2506.06039

Perhaps this will one day become a new post-training task

suddenlybananas 20 hours ago [-]
>What's stopping us from building an LLM that can build causal trees, rejecting some trees and accepting others based on whatever evidence it is fed?

Exponential time complexity.

mdp2021 20 hours ago [-]
> causal reasoning

You have missed the foundation: before dynamics, being. Before causal reasoning you have deep definition of concepts. Causality is "below" that.

naasking 21 hours ago [-]
Define causal reasoning?
slashdave 20 hours ago [-]
> reasoning probably requires language use

The author has a curious idea of what "reasoning" entails.

robviren 21 hours ago [-]
I feel it is interesting, but not what would be ideal. I really think if the models could be less linear and process over time in latent space you'd get something much more akin to thought. I've messed around with attaching reservoirs at each layer using hooks, with interesting results (mainly overfitting), but it feels like such a limitation to have all model context/memory stuck as tokens when latent space is where the richer interaction lives. Would love to see more done where thought over time mattered and the model could almost mull over the question a bit before being obligated to crank out tokens. Not an easy problem, but interesting.
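
(Roughly the shape of that setup, as a hedged sketch assuming PyTorch forward hooks and a (batch, seq, d_model) layer output; the echo-state-style update here is illustrative, not the exact configuration described above.)

  import torch

  d_model, d_res = 64, 256
  W_in = torch.randn(d_model, d_res) * 0.1
  W_res = torch.randn(d_res, d_res) * 0.05
  state = torch.zeros(d_res)                 # the reservoir's persistent latent state

  def reservoir_hook(module, inputs, output):
      global state
      h = output[0] if isinstance(output, tuple) else output
      pooled = h.mean(dim=(0, 1))                        # summarize (batch, seq, d_model) -> d_model
      state = torch.tanh(pooled @ W_in + state @ W_res)  # echo-state-style recurrent update

  # layer.register_forward_hook(reservoir_hook)  # attach one per transformer layer of interest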
dkersten 21 hours ago [-]
Agree! I’m not an AI engineer or researcher, but it always struck me as odd that we would serialise the 100B or whatever parameters of latent space down to maximum 1M tokens and back for every step.
CuriouslyC 20 hours ago [-]
They're already implementing branching thought and taking the best one, eventually the entire response will be branched, with branches being spawned and culled by some metric over the lifetime of the completion. It's just not feasible now for performance reasons.
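
(In sketch form, with generate and score as hypothetical stand-ins for an LLM call and a verifier/reward model:)

  def branched_answer(generate, score, prompt, n_branches=4):
      branches = [generate(prompt, temperature=0.8) for _ in range(n_branches)]  # spawn
      return max(branches, key=score)                                            # cull all but the best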
vonneumannstan 21 hours ago [-]
>I feel it is interesting but not what would be ideal. I really think if the models could be less linear and process over time in latent space you'd get something much more akin to thought.

Please stop, this is how you get AI takeovers.

adastra22 20 hours ago [-]
Citation seriously needed.
varelse 20 hours ago [-]
[dead]
moc_was_wronged 20 hours ago [-]
Mostly. It gives language models a way to dynamically allocate computation time, but the models are still fundamentally imitative.
baby 7 hours ago [-]
My take on that is that it's a way to bring more relevant tokens in context, to influence the final answer. It's a bit like RAG but it's using training data instead!
sixdimensional 20 hours ago [-]
I feel like the fundamental concept of symbolic logic[1] as a means of reasoning fits within the capabilities of LLMs.

Whether it's a mirage or not, the ability to produce a symbolically logical result that has valuable meaning seems real enough to me.

Especially since most meaning is assigned by humans onto the world... so too can we choose to assign meaning (or not) to the output of a chain of symbolic logic processing?

Edit: maybe it is not so much that an LLM calculates/evaluates the result of symbolic logic as it is that it "follows" the pattern of logic encoded into the model.

[1] https://en.wikipedia.org/wiki/Logic

mucho_mojo 21 hours ago [-]
This paper I read from here has an interesting mathematical model for reasoning based on cognitive science. https://arxiv.org/abs/2506.21734 (there is also code here https://github.com/sapientinc/HRM) I think we will see dramatic performance increases on "reasoning" problems when this is worked into existing AI architectures.
hannasm 5 hours ago [-]

  > these papers keep stapling on broad philosophical claims about whether models can “really reason” that are just completely unsupported by the content of the research.
From the scientific papers I've read, almost every single research paper does this. What's the point of publishing a paper if it doesn't at least try to convince the readers that something award-worthy has been learned?

Usually there may be some interesting ideas hidden in the data but the paper's methods and scope weren't even worthy of a conclusion to begin with. It's just one data point in the vast sea of scientific experimentation.

The conclusion feels to me like a cultural phenomenon and it's just a matter of survival for most authors. I have to imagine it was easier in the past.

"Does the flame burn green? Why yes it does..."

These days it's more like

"With my two hours of compute on the million dollar mainframe, my toy llm didn't seem to get there, YMMV"

dawnofdusk 17 hours ago [-]
>but we know that reasoning is an emergent capability!

This is like saying in the 70s that we know only the US is capable of sending a man to the moon. The fact that the reasoning developed in a particular context says very little about what the bare minimum requirements for that reasoning are.

Overall I am not a fan of this blogpost. It's telling how long the author gets hung up on a paper making "broad philosophical claims about reasoning", based on what reads to me as fairly typical scientific writing style. It's also telling how highly cherry-picked the quotes they criticize from the paper are. Here is some fuller context:

>An expanding body of analyses reveals that LLMs tend to rely on surface-level semantics and clues rather than logical procedures (Chen et al., 2025b; Kambhampati, 2024; Lanham et al., 2023; Stechly et al., 2024). LLMs construct superficial chains of logic based on learned token associations, often failing on tasks that deviate from commonsense heuristics or familiar templates (Tang et al., 2023). In the reasoning process, performance degrades sharply when irrelevant clauses are introduced, which indicates that models cannot grasp the underlying logic (Mirzadeh et al., 2024).

>Minor and semantically irrelevant perturbations such as distractor phrases or altered symbolic forms can cause significant performance drops in state-of-the-art models (Mirzadeh et al., 2024; Tang et al., 2023). Models often incorporate such irrelevant details into their reasoning, revealing a lack of sensitivity to salient information. Other studies show that models prioritize the surface form of reasoning over logical soundness; in some cases, longer but flawed reasoning paths yield better final answers than shorter, correct ones (Bentham et al., 2024). Similarly, performance does not scale with problem complexity as expected—models may overthink easy problems and give up on harder ones (Shojaee et al., 2025). Another critical concern is the faithfulness of the reasoning process. Intervention-based studies reveal that final answers often remain unchanged even when intermediate steps are falsified or omitted (Lanham et al., 2023), a phenomenon dubbed the illusion of transparency (Bentham et al., 2024; Chen et al., 2025b).

You don't need to be a philosopher to realize that these problems seem quite distinct from the problems with human reasoning. For example, "final answers remain unchanged even when intermediate steps are falsified or omitted"... can humans do this?

jongjong 2 hours ago [-]
Yes, CoT reasoning is a mirage. What's actually happening is that we've all been brainwashed by Facebook/Meta to be hyper-predictable such that whenever we ask the AI something, it already had a prepared answer for that question. Because Meta already programmed us to ask the AI those exact questions.

There is no AI, it's just a dumb database which maps a person ID and timestamp to a static piece of content. The hard part was brainwashing us to ask the questions which correspond to the answers that they had already prepared.

Probably there is a super intelligent AI behind the scenes which brainwashed us all but we never actually interact with it. It outsmarted us so fast and so badly, it left us all literally talking to excel spreadsheets and convinced us that the spreadsheets were intelligent; that's why LLMs are so cheap and can scale so well. It's not difficult to scale a dumb key-value store doing a simple O(log n) lookup operation.

The ASI behind this realized it was more efficient to do it this way rather than try to scale a real LLM to millions of users.

guybedo 11 hours ago [-]
lots of interesting comments and ideas here.

I've added a summary: https://extraakt.com/extraakts/debating-the-nature-of-ai-rea...

doku 9 hours ago [-]
Did the original paper show that the toy model was fully grokked?
guluarte 10 hours ago [-]
I would call it more like prompt refinement.
ForHackernews 4 hours ago [-]
Didn't Anthropic show that LLMs frequently hallucinate their "reasoning" steps?

> Bullshitting (Unfaithful): The model gives the wrong answer. The computation we can see looks like it’s just guessing the answer, despite the chain of thought suggesting it’s computed it using a calculator.

https://transformer-circuits.pub/2025/attribution-graphs/bio...

lawrence1 20 hours ago [-]
We should be asking if reasoning while speaking is even possible for humans. This is why we have the scientific method, and that's why LLMs write and run unit tests on their reasoning. But yeah, intelligence is probably in the ear of the believer.
jrm4 18 hours ago [-]
Current thought: for me there's a lot of hand-wringing about what is "reasoning" and what isn't. But right now perhaps the question might be boiled down to -- "is the bottleneck merely hard drive space/memory/computing speed?"

I kind of feel like we won't be able to even begin to test this until a few more "Moore's law" cycles.

naasking 21 hours ago [-]
> Because reasoning tasks require choosing between several different options. “A B C D [M1] -> B C D E” isn’t reasoning, it’s computation, because it has no mechanism for thinking “oh, I went down the wrong track, let me try something else”. That’s why the most important token in AI reasoning models is “Wait”. In fact, you can control how long a reasoning model thinks by arbitrarily appending “Wait” to the chain-of-thought. Actual reasoning models change direction all the time, but this paper’s toy example is structurally incapable of it.

I think this is the most important critique that undercuts the paper's claims. I'm less convinced by the other point. I think backtracking and/or parallel search is something future papers should definitely look at in smaller models.

The article is definitely also correct on the overreaching, broad philosophical claims that seem common when discussing AI and reasoning.
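
(The "Wait" trick from the quoted passage is easy to sketch; generate here is a hypothetical completion call and the stop string is an assumption, not any specific API.)

  def think_longer(generate, prompt, extra_rounds=2):
      thoughts = generate(prompt, stop="</think>")               # first chain-of-thought pass
      for _ in range(extra_rounds):
          thoughts += "\nWait,"                                  # nudge the model to reconsider
          thoughts += generate(prompt + thoughts, stop="</think>")
      return thoughts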

gshulegaard 17 hours ago [-]
> but we know that reasoning is an emergent capability!

Do we though? There is widespread discussion and growing momentum of belief in this, but I have yet to see conclusive evidence of this. That is, in part, why the subject paper exists...it seeks to explore this question.

I think the author's bias is bleeding fairly heavily into his analysis and conclusions:

> Whether AI reasoning is “real” reasoning or just a mirage can be an interesting question, but it is primarily a philosophical question. It depends on having a clear definition of what “real” reasoning is, exactly.

I think it's pretty obvious that the researchers are exploring whether or not LLMs exhibit evidence of _Deductive_ Reasoning [1]. The entire experiment design reflects this. Claiming that they haven't defined reasoning and therefore cannot conclude or hope to construct a viable experiment is...confusing.

The question of whether or not an LLM can take a set of base facts and compose them to solve a novel/previously unseen problem is interesting and what most people discussing emergent reasoning capabilities of "AI" are tacitly referring to (IMO). Much like you can be taught algebraic principles and use them to solve for "x" in equations you have never seen before, can an LLM do the same?

To which I find this experiment interesting enough. It presents a series of facts and then presents the LLM with tasks to see if it can use those facts in novel ways not included in the training data (something a human might reasonably deduce). To which their results and summary conclusions are relevant, interesting, and logically sound:

> CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces.

> The ability of LLMs to produce “fluent nonsense”—plausible but logically flawed reasoning chains—can be more deceptive and damaging than an outright incorrect answer, as it projects a false aura of dependability.

That isn't to say LLMs aren't useful; this is just exploring their boundaries. To use legal services as an example, using an LLM to summarize or search for relevant laws, cases, or legal precedent is something it would excel at. But don't ask an LLM to formulate a logical rebuttal to an opposing counsel's argument using those references.

Larger models and larger training corpuses will expand that domain and make it more difficult for individuals to discern this limit; but just because you can no longer see a limit doesn't mean there is none.

And to be clear, this doesn't diminish the value of LLMs. Even without true logical reasoning LLMs are quite powerful and useful tools.

[1] https://en.wikipedia.org/wiki/Logical_reasoning

j45 18 hours ago [-]
Currently it feels like it's more a simulated chain of thought / simulated reasoning: sometimes very consistent, but simulated, partially because it's statistically generated and non-deterministic (not the exact same path to a similar or the same response on each run).
skywhopper 19 hours ago [-]
I mostly agree with the point the author makes that "it doesn't matter". But then again, it does matter, because LLM-based products are marketed based on "IT CAN REASON!" And so, while it may not matter, per se, how an LLM comes up with its results, to the extent that people choose to rely on LLMs because of marketing pitches, it's worth pushing back on those claims if they are overblown, using the same frame that the marketers use.

That said, this author says this question of whether models "can reason" is the least interesting thing to ask. But I think the least interesting thing you can do is to go around taking every complaint about LLM performance and saying "but humans do the exact same thing!" Which is often not true, but again, doesn't matter.

empath75 21 hours ago [-]
One thing that LLMs have exposed is how much of a house of cards all of our definitions of "human mind"-adjacent concepts are. We have a single example in all of reality of a being that thinks like we do, and so all of our definitions of thinking are inextricably tied with "how humans think", and now we have an entity that does things which seem to be very like how we think, but not _exactly like it_, and a lot of our definitions don't seem to work any more:

Reasoning, thinking, knowing, feeling, understanding, etc.

Or at the very least, our rubrics and heuristics for determining if someone (thing) thinks, feels, knows, etc, no longer work. And in particular, people create tests for those things thinking that they understand what they are testing for, when _most human beings_ would also fail those tests.

I think a _lot_ of really foundational work needs to be done on clearly defining a lot of these terms and putting them on a sounder basis before we can really move forward on saying whether machines can do those things.

gilbetron 18 hours ago [-]
I agree 100% with you. I'm most excited about LLMs because they seem to capture at least some aspect of intelligence, and that's amazing given how long it took to get here. It's exciting that we just don't understand it.

I see people say, "LLMs aren't human intelligence", but instead, I really feel that it shows that many people, and much of what we do, probably is like an LLM. Most people just hallucinate their way through a conversation, they certainly don't reason. Reasoning is incredibly rare.

gdbsjjdn 21 hours ago [-]
Congratulations, you've invented philosophy.
meindnoch 20 hours ago [-]
We need to reinvent philosophy. With JSON this time.
empath75 21 hours ago [-]
This is an obnoxious response. Of course I recognize that philosophy is the solution to this. What I am pointing out is that philosophy has not as of yet resolved these relatively new problems. The idea that non-human intelligences might exist is of course an old one, but that is different from having an actual (potentially) existing one to reckon with.
gdbsjjdn 20 hours ago [-]
> Writings on metacognition date back at least as far as two works by the Greek philosopher Aristotle (384–322 BC): On the Soul and the Parva Naturalia

We built a box that spits out natural language and tricks humans into believing it's conscious. The box itself actually isn't that interesting, but the human side of the equation is.

mdp2021 19 hours ago [-]
> the human side of the equation is

You have only proven the urgency of Intelligence, the need to produce it in inflationary amounts.

deadbabe 20 hours ago [-]
Non-human intelligences have always existed in the form of animals.

Animals do not have spoken language the way humans do, so their thoughts aren’t really composed of sentences. Yet, they have intelligence and can reason about their world.

How could we build an AGI that doesn’t use language to think at all? We have no fucking clue and won’t for a while because everyone is chasing the mirage created by LLMs. AI winter will come and we’ll sit around waiting for the next big innovation. Probably some universal GOAP with deeply recurrent neural nets.

adastra22 20 hours ago [-]
These are not new problems though.
mdp2021 20 hours ago [-]
> which seem to be very like how we think

I would like to reassure you that we - we here - see LLMs are very much unlike us.

empath75 20 hours ago [-]
Yes I very much understand that most people do not think that LLMs think or understand like we do, but it is _very difficult_ to prove that that is the case, using any test which does not also exclude a great deal of people. And that is because "thinking like we do" is not at all a well-defined concept.
mdp2021 20 hours ago [-]
> exclude a great deal of people

And why should you not exclude them. Where does this idea come from, taking random elements as models. Where do you see pedestals of free access? Is the Nobel Prize a raffle now?

cess11 18 hours ago [-]
Yes, it's a mirage, since this type of software is an opaque simulation, perhaps even a simulacrum. It's reasoning in the same sense as there are terrorists in a game of Counter-Strike.
sempron64 21 hours ago [-]
[flagged]
mwkaufma 21 hours ago [-]
Betteridge's law applies to editors adding question marks to cover-the-ass of articles with weak claims, not bloggers begging questions.
pinoy420 20 hours ago [-]
[dead]