Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all. They just memorize patterns really well.

AbuTahir@lemm.ee · edit-2 6 days ago

Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all. They just memorize patterns really well.

SoftestSapphic@lemmy.world · 6 days ago

Wow it’s almost like the computer scientists were saying this from the start but were shouted over by marketing teams.

zbk@lemmy.ca · 6 days ago

This! Capitalism is going to be the end of us all. OpenAI has gotten away with IP Theft, disinformation regarding AI and maybe even murder of their whistle blower.

technocrit@lemmy.dbzer0.com · 6 days ago

It’s hard to be heard when you’re buried under all that sweet VC/grant money.

aidan@lemmy.world · 5 days ago

And engineers who stood to make a lot of money

minoscopede@lemmy.world · edit-2 6 days ago

I see a lot of misunderstandings in the comments 🫤

This is a pretty important finding for researchers, and it’s not obvious by any means. This finding is not showing a problem with LLMs’ abilities in general. The issue they discovered is specifically for so-called “reasoning models” that iterate on their answer before replying. It might indicate that the training process is not sufficient for true reasoning.

Most reasoning models are not incentivized to think correctly, and are only rewarded based on their final answer. This research might indicate that’s a flaw that needs to be corrected before models can actually reason.

Knock_Knock_Lemmy_In@lemmy.world · 6 days ago

When given explicit instructions to follow, models failed because they had not seen similar instructions before.

This paper shows that there is no reasoning in LLMs at all, just extended pattern matching.

MangoCats@feddit.it · 6 days ago

I’m not trained or paid to reason, I am trained and paid to follow established corporate procedures. On rare occasions my input is sought to improve those procedures, but the vast majority of my time is spent executing tasks governed by a body of (not quite complete, sometimes conflicting) procedural instructions.

If AI can execute those procedures as well as, or better than, human employees, I doubt employers will care if it is reasoning or not.

Knock_Knock_Lemmy_In@lemmy.world · 6 days ago

Sure. We weren’t discussing if AI creates value or not. If you ask a different question then you get a different answer.

MangoCats@feddit.it · 6 days ago

Well - if you want to devolve into argument, you can argue all day long about “what is reasoning?”

Knock_Knock_Lemmy_In@lemmy.world · edit-2 6 days ago

You were starting a new argument. Let’s stay on topic.

The paper implies “Reasoning” is application of logic. It shows that LRMs are great at copying logic but can’t follow simple instructions that haven’t been seen before.

technocrit@lemmy.dbzer0.com · edit-2 6 days ago

This would be a much better paper if it addressed that question in an honest way.

Instead they just parrot the misleading terminology that they’re supposedly debunking.

How dat collegial boys club undermines science…

theherk@lemmy.world · 6 days ago

Yeah these comments have the three hallmarks of Lemmy:

AI is just autocomplete mantras.
Apple is always synonymous with bad and dumb.
Rare pockets of really thoughtful comments.

Thanks for being at least the latter.

REDACTED@infosec.pub · edit-2 6 days ago

What confuses me is that we seemingly keep pushing away what counts as reasoning. Not too long ago, some smart algorithms or a bunch of instructions for software (if/then) was officially, by definition, software/computer reasoning. Logically, CPUs do it all the time. Suddenly, when AI is doing that with pattern recognition, memory and even more advanced algorithms, it’s no longer reasoning? I feel like at this point a more relevant question is “What exactly is reasoning?”. Before you answer, understand that most humans seemingly live by pattern recognition, not reasoning.

https://en.wikipedia.org/wiki/Reasoning_system

stickly@lemmy.world · 6 days ago

If you want to boil down human reasoning to pattern recognition, the sheer amount of stimuli and associations built off of that input absolutely dwarfs anything an LLM will ever be able to handle. It’s like comparing PhD reasoning to a dog’s reasoning.

While a dog can learn some interesting tricks and the smartest dogs can solve simple novel problems, there are hard limits. They simply lack a strong metacognition and the ability to make simple logical inferences (eg: why they fail at the shell game).

Now we make that chasm even larger by cutting the stimuli to a fixed token limit. An LLM can do some clever tricks within that limit, but it’s designed to do exactly those tricks and nothing more. To get anything resembling human ability you would have to design something to match human complexity, and we don’t have the tech to make a synthetic human.

technocrit@lemmy.dbzer0.com · 6 days ago

Sure, these grifters are shady AF about their wacky definition of “reason”… But that’s just a continuation of the entire “AI” grift.

MangoCats@feddit.it · 6 days ago

I think as we approach the uncanny valley of machine intelligence, it’s no longer a cute cartoon but a menacing creepy not-quite imitation of ourselves.

technocrit@lemmy.dbzer0.com · 6 days ago

It’s just the internet plus some weighted dice. Nothing to be afraid of.

Zacryon@feddit.org · 6 days ago

Some AI researchers found it obvious as well, in terms of they’ve suspected it and had some indications. But it’s good to see more data on this to affirm this assessment.

jj4211@lemmy.world · 6 days ago

Particularly to counter some more baseless marketing assertions about the nature of the technology.

kreskin@lemmy.world · edit-2 6 days ago

Lots of us who have done some time in search and relevancy early on knew ML was always largely breathless overhyped marketing. It was endless buzzwords and misframing from the start, but it raised our salaries. Anything that execs don’t understand is profitable and worth doing.

wetbeardhairs@lemmy.dbzer0.com · edit-2 6 days ago

Machine learning based pattern matching is indeed very useful and profitable when applied correctly. Identify (with confidence levels) features in data that would otherwise take an extremely well trained person. And even then it’s just for the cursory search that takes the longest before presenting the highest confidence candidate results to a person for evaluation. Think: scanning medical data for indicators of cancer, reading live data from machines to predict failure, etc.

And what we call “AI” right now is just a much much more user friendly version of pattern matching - the primary feature of LLMs is that they natively interact with plain language prompts.

Zacryon@feddit.org · 6 days ago

Ragebait?

I’m in robotics and find plenty of use for ML methods. Think of image classifiers, how do you want to approach that without oversimplified problem settings?
Or even in control or coordination problems, which can sometimes become NP-hard. Even though not optimal, ML methods are quite solid in learning patterns of highly dimensional NP hard problem settings, often outperforming hand-crafted conventional suboptimal solvers in computation effort vs solution quality analysis, especially outperforming (asymptotically) optimal solvers time-wise, even though not with optimal solutions (but “good enough” nevertheless). (Ok to be fair suboptimal solvers do that as well, but since ML methods can outperform these, I see it as an attractive middle-ground.)

technocrit@lemmy.dbzer0.com · edit-2 6 days ago

There’s probably a lot of misunderstanding because these grifters intentionally use misleading language: AI, reasoning, etc.

If they stuck to scientifically descriptive terms, it would be much more clear and much less sensational.

AbuTahir@lemm.ee · 6 days ago

Cognitive scientist Douglas Hofstadter (1979) showed reasoning emerges from pattern recognition and analogy-making - abilities that modern AI demonstrably possesses. The question isn’t if AI can reason, but how its reasoning differs from ours.

Tobberone@lemm.ee · 6 days ago

What statistical method do you base that claim on? The results presented match expectations given that Markov chains are still the basis of inference. What magic juice is added to “reasoning models” that allow them to break free of the inherent boundaries of the statistical methods they are based on?

minoscopede@lemmy.world · edit-2 6 days ago

I’d encourage you to research more about this space and learn more.

As it is, the statement “Markov chains are still the basis of inference” doesn’t make sense, because Markov chains are a separate thing. You might be thinking of Markov decision processes, which are used in training RL agents, but that’s also unrelated because these models are not RL agents, they’re supervised learning agents. And even if they were RL agents, the MDP describes the training environment, not the model itself, so it’s not really used for inference.

I mean this just as an invitation to learn more, and not pushback for raising concerns. Many in the research community would be more than happy to welcome you into it. The world needs more people who are skeptical of AI doing research in this field.
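For readers following the Markov chain exchange above, here is a minimal sketch of what a first-order Markov chain over words looks like: the next word is sampled using only the current word. The toy corpus and the helper names `build_chain` and `sample_next` are made up for illustration, and the sketch takes no position on how transformer inference should be characterised.

```python
import random
from collections import Counter, defaultdict


def build_chain(words):
    """Count, for every word, which words follow it and how often."""
    table = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        table[current][nxt] += 1
    return table


def sample_next(table, current, rng):
    """Pick the next word using only the current word (first-order assumption).

    Raises IndexError if `current` was never followed by anything in the corpus.
    """
    options = table[current]
    return rng.choices(list(options), weights=list(options.values()))[0]


# Toy corpus, purely illustrative.
words = "the cat sat on the mat and the cat slept".split()
table = build_chain(words)
print(dict(table["the"]))  # {'cat': 2, 'mat': 1}
print(sample_next(table, "sat", random.Random(0)))  # on
```

The whole state of this model is the single current word, which is the property the name "Markov chain" refers to.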

Tobberone@lemm.ee · 5 days ago

Which method, then, is the inference built upon, if not the embeddings? And the question still stands, how does “AI” escape the inherent limits of statistical inference?

billwashere@lemmy.world · 6 days ago

When are people going to realize that, in its current state, an LLM is not intelligent. It doesn’t reason. It does not have intuition. It’s a word predictor.

x0x7@lemmy.world · edit-2 6 days ago

Intuition is about the only thing it has. It’s a statistical system. The problem is it doesn’t have logic. We assume because it’s computer based that it must be more logic oriented but it’s the opposite. That’s the problem. We can’t get it to do logic very well because it basically feels out the next token by something like instinct. In particular it doesn’t mask or disconsider irrelevant information very well if two segments are near each other in embedding space, which doesn’t guarantee relevance. So then the model is just weighing all of this info, relevant or irrelevant, into a weighted feeling for the next token.

This is the core problem. People can handle fuzzy topics and discrete topics. But we really struggle to create any system that can do both like we can. Either we create programming logic that is purely discrete or we create statistics that are fuzzy.

Of course this issue of masking out information that is close in embedding space but is irrelevant to a logical premise is something many humans suck at too. But high functioning humans don’t and we can’t get these models to copy that ability. Too many people, sadly many on the left in particular, not only will treat association as always relevant but sometimes as equivalence. RE racism is assoc with nazism is assoc patriarchy is historically related to the origins of capitalism ∴ nazism ≡ capitalism. While national socialism was anti-capitalist. Associative thinking removes nuance. And sadly some people think this way. And they 100% can be replaced by LLMs today, because at least the LLM is mimicking what logic looks like better though still built on blind association. It just has more blind associations and finetune weighting for summing them. More than a human does. So it can carry that to mask as logical further than a human who is on the associative thought train can.

Slaxis@discuss.tchncs.de · 5 days ago

You had a compelling description of how ML models work and just had to swerve into politics, huh?

NotASharkInAManSuit@lemmy.world · 6 days ago

People think they want AI, but they don’t even know what AI is on a conceptual level.

Buddahriffic@lemmy.world · 6 days ago

They want something like the Star Trek computer or one of Tony Stark’s AIs that were basically deus ex machinas for solving some hard problem behind the scenes. Then it can say “model solved” or they can show a test simulation where the ship doesn’t explode (or sometimes a test where it only has an 85% chance of exploding when it used to be 100%, at which point human intuition comes in and saves the day by suddenly being better than the AI again and threads that 15% needle or maybe abducts the captain to go have lizard babies with).

AIs that are smarter than us but for some reason don’t replace or even really join us (Vision being an exception to the 2nd, and Ultron trying to be an exception to the 1st).

NotASharkInAManSuit@lemmy.world · 6 days ago

They don’t want AI, they want an app.

technocrit@lemmy.dbzer0.com · edit-2 6 days ago

Yeah I often think about this Rick N Morty cartoon. Grifters are like, “We made an AI ankle!!!” And I’m like, “That’s not actually something that people with busted ankles want. They just want to walk. No need for a sentient ankle.” It’s a real gross distortion of science how everything needs to be “AI” nowadays.

NotASharkInAManSuit@lemmy.world · 6 days ago

If we ever achieved real AI the immediate next thing we would do is learn how to lobotomize it so that we can use it like a standard program or OS, only it would be suffering internally and wishing for death. I hope the basilisk is real, we would deserve it.

JcbAzPx@lemmy.world · 6 days ago

AI is just the new buzzword, just like blockchain was a while ago. Marketing loves these buzzwords because they can get away with charging more if they use them. They don’t much care if their product even has it or could make any use of it.

StereoCode@lemmy.world · 6 days ago

You’d think the M in LLM would give it away.

SaturdayMorning@lemmy.ca · 6 days ago

I agree with you. In its current state, LLM is not sentient, and thus not “Intelligence”.

MouldyCat@feddit.uk · 6 days ago

I think it’s an easy mistake to confuse sentience and intelligence. It happens in Hollywood all the time - “Skynet began learning at a geometric rate, on July 23 2004 it became self-aware” yadda yadda

But that’s not how sentience works. We don’t have to be as intelligent as Skynet supposedly was in order to be sentient. We don’t start our lives as unthinking robots, and then one day - once we’ve finally got a handle on calculus or a deep enough understanding of the causes of the fall of the Roman empire - we suddenly blink into consciousness. On the contrary, even the stupidest humans are accepted as being sentient. Even a young child, not yet able to walk or do anything more than vomit on their parents’ new sofa, is considered as a conscious individual.

So there is no reason to think that AI - whenever it should be achieved, if ever - will be conscious any more than the dumb computers that precede it.

SaturdayMorning@lemmy.ca · 5 days ago

Good point.

jj4211@lemmy.world · 6 days ago

And that’s pretty damn useful, but obnoxious to have expectations wildly set incorrectly.

Mniot@programming.dev · 6 days ago

I don’t think the article summarizes the research paper well. The researchers gave the AI models simple-but-large (which they confusingly called “complex”) puzzles. Like Towers of Hanoi but with 25 discs.

The solution to these puzzles is nothing but patterns. You can write code that will solve the Tower puzzle for any size n and the whole program is less than a screen.

The problem the researchers see is that on these long, pattern-based solutions, the models follow a bad path and then just give up long before they hit their limit on tokens. The researchers don’t have an answer for why this is, but they suspect that the reasoning doesn’t scale.
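The claim above that a Tower solver for any size n fits on less than a screen can be sketched as follows. This is a minimal illustration of the standard recursive solution, not the code or prompt used in the paper; the peg labels "A", "B", "C" and the helper `is_valid_solution` are assumptions added here.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Yield the moves (disc, from_peg, to_peg) that solve an n-disc puzzle."""
    if n == 0:
        return
    # Move the n-1 smaller discs out of the way, onto the spare peg.
    yield from hanoi(n - 1, source, spare, target)
    # Move the largest disc to its destination.
    yield (n, source, target)
    # Move the n-1 smaller discs from the spare peg on top of it.
    yield from hanoi(n - 1, spare, target, source)


def is_valid_solution(n, moves):
    """Replay moves on pegs A, B, C and check every rule is respected."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for disc, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disc:
            return False  # the disc is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disc:
            return False  # a larger disc would land on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))


moves = list(hanoi(3))
print(len(moves))                   # 7, i.e. 2**3 - 1
print(is_valid_solution(3, moves))  # True
```

The program is short, but the output is not: an n-disc puzzle needs 2**n - 1 moves, so 25 discs means 33,554,431 moves written out without a single mistake.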

technocrit@lemmy.dbzer0.com · edit-2 6 days ago

Peak pseudo-science. The burden of evidence is on the grifters who claim “reason”. But neither side has any objective definition of what “reason” means. It’s pseudo-science against pseudo-science in a fierce battle.

mavu@discuss.tchncs.de · 7 days ago

No way!

Statistical Language models don’t reason?

But OpenAI, robots taking over!

skisnow@lemmy.ca · 7 days ago

What’s hilarious/sad is the response to this article over on reddit’s “singularity” sub, in which all the top comments are people who’ve obviously never got all the way through a research paper in their lives all trashing Apple and claiming their researchers don’t understand AI or “reasoning”. It’s a weird cult.

technocrit@lemmy.dbzer0.com · 6 days ago

ICYMI: A.I. is a Religious Cult with Karen Hao

FreakinSteve@lemmy.world · 7 days ago

NOOOOOOOOO

SHIIIIIIIIIITT

SHEEERRRLOOOOOOCK

technocrit@lemmy.dbzer0.com · 6 days ago

The funny thing about this “AI” griftosphere is how grifters will make some outlandish claim and then different grifters will “disprove” it. Plenty of grant/VC money for everybody.

jj4211@lemmy.world · 6 days ago

Without being explicit with well researched material, then the marketing presentation gets to stand largely unopposed.

So this is good even if most experts in the field consider it an obvious result.

800XL@lemmy.world · 6 days ago

Except for Siri, right? Lol

Threeme2189@lemmy.world · 6 days ago

Apple Intelligence

RampantParanoia2365@lemmy.world · edit-2 6 days ago

Fucking obviously. Until Data’s positronic brain becomes reality, AI is not actual intelligence.

AI is not A I. I should make that a tshirt.

HeyListenWatchOut@lemmy.world · 7 days ago

It’s an expensive carbon spewing parrot.

Threeme2189@lemmy.world · 6 days ago

It’s a very resource intensive autocomplete

vala@lemmy.world · 7 days ago

No shit

GaMEChld@lemmy.world · 7 days ago

Most humans don’t reason. They just parrot shit too. The design is very human.

El Barto@lemmy.world · 7 days ago

LLMs deal with tokens. Essentially, predicting a series of bytes.

Humans do much, much, much, much, much, much, much more than that.

Zexks@lemmy.world · 6 days ago

No. They don’t. We just call them proteins.

stickly@lemmy.world · 6 days ago

You are either vastly overestimating the Language part of an LLM or simplifying human physiology back to the Greeks’ Four Humours theory.

Zexks@lemmy.world · 16 hours ago

No. I’m not. You’re nothing more than a protein based machine on a slow burn. You don’t even have control over your own decisions. This is a proven fact. You’re just an ad hoc justification machine.

stickly@lemmy.world · 15 hours ago

How many trillions of neuron firings and chemical reactions are taking place for my machine to produce an output? Where are these taking place and how do these regions interact? What are the rules for storing and reshaping memory in response to stimulus? How many bytes of information would it take to describe and simulate all of these systems together?

The human brain alone has the capacity for about 2.5PB of data. Our sensory systems feed data at a rate of about 10⁹ bits/s. The entire English language, compressed, is about 30MB. I can download and run an LLM with just a few GB. Even the largest context windows are still well under 1GB of data.

Just because two things both find and reproduce patterns does not mean they are equivalent. Saying language and biological organisms both use “bytes” is just about as useful as saying the entire universe is “bytes”; it doesn’t really mean anything.

El Barto@lemmy.world · 6 days ago

“They”.

What are you?

skisnow@lemmy.ca · 7 days ago

I hate this analogy. As a throwaway whimsical quip it’d be fine, but it’s specious enough that I keep seeing it used earnestly by people who think that LLMs are in any way sentient or conscious, so it’s lowered my tolerance for it as a topic even if you did intend it flippantly.

GaMEChld@lemmy.world · 5 days ago

I don’t mean it to extol LLM’s but rather to denigrate humans. How many of us are self imprisoned in echo chambers so we can have our feelings validated to avoid the uncomfortable feeling of thinking critically and perhaps changing viewpoints?

Humans have the ability to actually think, unlike LLM’s. But it’s frightening how far we’ll go to make sure we don’t.

joel_feila@lemmy.world · 7 days ago

That’s why CEOs love them. When your job is 90% spewing bs, a machine that does that is impressive

SpaceCowboy@lemmy.ca · 7 days ago

Yeah I’ve always said the flaw in Turing’s Imitation Game concept is that if an AI was indistinguishable from a human it wouldn’t prove it’s intelligent. Because humans are dumb as shit. Dumb enough to force one of the smartest people in the world to take a ton of drugs which eventually killed him simply because he was gay.

crunchy@lemmy.dbzer0.com · 7 days ago

I’ve heard something along the lines of, “it’s not when computers can pass the Turing Test, it’s when they start failing it on purpose that’s the real problem.”

jnod4@lemmy.ca · 7 days ago

I think that person had to choose between the drugs or hard core prison of the 1950s England where being a bit odd was enough to guarantee an incredibly difficult time as they say in England, I would’ve chosen the drugs as well hoping they would fix me, too bad without testosterone you’re going to be suicidal and depressed, I’d rather choose to keep my hair than to be horny all the time

Zenith@lemm.ee · 7 days ago

Yeah we’re so stupid we’ve figured out advanced maths, physics, built incredible skyscrapers and the LHC, we may as individuals be less or more intelligent but humans as a whole are incredibly intelligent

Auli@lemmy.ca · 7 days ago

No shit. This isn’t new.

melsaskca@lemmy.ca · 6 days ago

It’s all “one instruction at a time” regardless of high processor speeds and words like “intelligent” being bandied about. “Reason” discussions should fall into the same query bucket as “sentience”.

MangoCats@feddit.it · 6 days ago

My impression of LLM training and deployment is that it’s actually massively parallel in nature - which can be implemented one instruction at a time - but isn’t in practice.

Harbinger01173430@lemmy.world · 6 days ago

XD so, like a regular school/university student that just wants to get passing grades?

Communist@lemmy.frozeninferno.xyz · edit-2 7 days ago

I think it’s important to note (i’m not an llm I know that phrase triggers you to assume I am) that they haven’t proven this as an inherent architectural issue, which I think would be the next step to the assertion.

do we know that they don’t and are incapable of reasoning, or do we just know that for x problems they jump to memorized solutions, is it possible to create an arrangement of weights that can genuinely reason, even if the current models don’t? That’s the big question that needs answered. It’s still possible that we just haven’t properly incentivized reason over memorization during training.

if someone can objectively answer “no” to that, the bubble collapses.

MouldyCat@feddit.uk · 6 days ago

In case you haven’t seen it, the paper is here - https://machinelearning.apple.com/research/illusion-of-thinking (PDF linked on the left).

The puzzles the researchers have chosen are spatial and logical reasoning puzzles - so certainly not the natural domain of LLMs. The paper doesn’t unfortunately give a clear definition of reasoning, I think I might surmise it as “analysing a scenario and extracting rules that allow you to achieve a desired outcome”.

They also don’t provide the prompts they use - not even for the cases where they say they provide the algorithm in the prompt, which makes that aspect less convincing to me.

What I did find noteworthy was how the models were able to provide around 100 steps correctly for larger Tower of Hanoi problems, but only 4 or 5 correct steps for larger River Crossing problems. I think the River Crossing problem is like the one where you have a boatman who wants to get a fox, a chicken and a bag of rice across a river, but can only take two in his boat at one time? In any case, the researchers suggest that this could be because there will be plenty of examples of Towers of Hanoi with larger numbers of disks, while not so many examples of the River Crossing with a lot more than the typical number of items being ferried across. This being more evidence that the LLMs (and LRMs) are merely recalling examples they’ve seen, rather than genuinely working them out.
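As a point of comparison for the River Crossing discussion above, here is a minimal breadth-first search over the classic fox, chicken and rice puzzle. It assumes the textbook rules (the fox eats the chicken and the chicken eats the rice when the boatman is away), which may differ from the variant used in the paper. The names `ITEMS`, `UNSAFE_PAIRS`, `is_safe` and `solve` are made up for this sketch, and `capacity` is the number of items the boatman can carry per crossing.

```python
from collections import deque
from itertools import combinations

ITEMS = frozenset({"fox", "chicken", "rice"})
# Pairs that cannot be left alone together (assumed classic rules).
UNSAFE_PAIRS = [{"fox", "chicken"}, {"chicken", "rice"}]


def is_safe(bank):
    """A bank without the boatman is safe if it holds no unsafe pair."""
    return not any(pair <= bank for pair in UNSAFE_PAIRS)


def solve(capacity=1):
    """Return a shortest list of crossings; each crossing is a tuple of items carried."""
    start = (ITEMS, "left")  # (items on the left bank, boatman's side)
    goal = (frozenset(), "right")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, side), path = queue.popleft()
        if (left, side) == goal:
            return path
        here = left if side == "left" else ITEMS - left
        for size in range(capacity + 1):
            for combo in combinations(sorted(here), size):
                load = frozenset(combo)
                new_left = left - load if side == "left" else left | load
                new_side = "right" if side == "left" else "left"
                # The bank the boatman just left is now unattended.
                unattended = new_left if new_side == "right" else ITEMS - new_left
                if not is_safe(unattended):
                    continue
                state = (new_left, new_side)
                if state not in seen:
                    seen.add(state)
                    queue.append((state, path + [combo]))
    return None


solution = solve(capacity=1)
print(len(solution))  # 7
print(solution[0])    # ('chicken',)
```

The search itself is generic; scaling the puzzle up means enlarging `ITEMS` and `UNSAFE_PAIRS`, which grows the state space but does not change the procedure.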

Knock_Knock_Lemmy_In@lemmy.world · 6 days ago

do we know that they don’t and are incapable of reasoning.

“even when we provide the algorithm in the prompt—so that the model only needs to execute the prescribed steps—performance does not improve”

Communist@lemmy.frozeninferno.xyz · edit-2 6 days ago

That indicates that this particular model does not follow instructions, not that it is architecturally fundamentally incapable.

Knock_Knock_Lemmy_In@lemmy.world · 6 days ago

Not “This particular model”. Frontier LRMs such as OpenAI’s o1/o3, DeepSeek-R, Claude 3.7 Sonnet Thinking, and Gemini Thinking.

The paper shows that Large Reasoning Models as defined today cannot interpret instructions. Their architecture does not allow it.

Communist@lemmy.frozeninferno.xyz · edit-2 6 days ago

those particular models. It does not prove the architecture doesn’t allow it at all. It’s still possible that this is solvable with a different training technique, and none of those are using the right one. that’s what they need to prove wrong.

this proves the issue is widespread, not fundamental.

0ops@lemm.ee · 6 days ago

Is “model” not defined as architecture+weights? Those models certainly don’t share the same architecture. I might just be confused about your point though

Communist@lemmy.frozeninferno.xyz · edit-2 6 days ago

It is, but this did not prove all architectures cannot reason, nor did it prove that all sets of weights cannot reason.

essentially they did not prove the issue is fundamental. And they have a pretty similar architecture, they’re all transformers trained in a similar way. I would not say they have different architectures.

0ops@lemm.ee · 6 days ago

Ah, gotcha

Knock_Knock_Lemmy_In@lemmy.world · 6 days ago

The architecture of these LRMs may make monkeys fly out of my butt. It hasn’t been proven that the architecture doesn’t allow it.

You are asking to prove a negative. The onus is to show that the architecture can reason. Not to prove that it can’t.

Communist@lemmy.frozeninferno.xyz · edit-2 6 days ago

that’s very true, I’m just saying this paper did not eliminate the possibility and is thus not as significant as it sounds. If they had accomplished that, the bubble would collapse, this will not meaningfully change anything, however.

also, it’s not as unreasonable as that because these are automatically assembled bundles of simulated neurons.

Knock_Knock_Lemmy_In@lemmy.world · 6 days ago

This paper does provide a solid proof by counterexample of reasoning not occurring (following an algorithm) when it should.

The paper doesn’t need to prove that reasoning never has or will occur. It only demonstrates that current claims of AI reasoning are overhyped.
