ChatGPT bombs test on diagnosing kids’ medical cases with 83% error rate | It was bad at recognizing relationships and needs selective training, researchers say.

L4sBot@lemmy.world · 10 months ago

Darorad@lemmy.world · 10 months ago

Why do people keep expecting a language model to be able to do literally everything? AI works best when it’s a model trained to solve a specific problem. You can’t just throw everything at a chatbot and expect it to have any sort of competence.

xkforce@lemmy.world · 10 months ago

The average person isn’t very smart. All they see is a magical black box that goes brr.

JaymesRS@literature.cafe · edit-2 10 months ago

My wife is a physician and I’ve talked with her about this with regard to healthcare in general. Most people still think of healthcare like visiting a wizard for a potion or somatic incantation.

So throw 2 black box-type problems at each other and I have no doubt that a lot of people would be surprised that the results are crap.

TheGoldenGod@lemmy.world · 10 months ago

Pretty much this.

echo64@lemmy.world · 10 months ago

Because you can talk to it and it’s programmed to make you think it knows a lot and is capable of doing so much more.

People expect it to do more because ChatGPT was trained to make people expect it to do more.

It’s all lies, of course. ChatGPT fails at more than the simplest of tasks and can’t use any new information because the internet is full of AI-generated text now, which is poison to training models. But it’s good at pretending.

FinishingDutch@lemmy.world · edit-2 10 months ago

The thing that really annoys me is the people who are most enamoured with Chat GPT also seem to be the ones least capable of judging its accuracy and actual output quality.

I write for a living; a newspaper. So naturally, some of the people in our company - sales people - wanted to test it. And they were delighted with the stuff it wrote. Which was terrible to read, factually incorrect, repetitive and just not something we’d put in the paper. But they loved it. Because they weren’t writers and don’t know how to write an engaging article with proper sources.

I tested it as well. Wanted to form my own opinion and read up on the limitations, how to write good prompts, etc. So I could give it a fair chance.

I had it write a basic 500 word article about things to see in our city, with information about the tourist info office. That’s something a first year intern can do in his second week with us.

Basically, it ended up ‘inventing’ two museums that don’t exist, it listed info for a museum on the other side of the country, it listed an ‘Olympic stadium’ (we never hosted the Olympics) and it gave a completely wrong address for the tourist info, even though it should have it.

It was factually incorrect in just about every sentence. But it all sounded plausible enough and was written with such confidence that anyone not from this city might assume it to be true.

I don’t want that fucking thing anywhere NEAR my newspaper. The sales people are pretty much monkeys with Chat GPT-typewriters, churning out drivel instead of Shakespeare.

LWD@lemm.ee · edit-2 10 months ago

deleted

📛Maven@lemmy.sdf.org · 10 months ago

“the internet is full of ai generated text now, which is poison to training models. But it’s good at pretending.”

This misconception shows up again and again. It’s wishful thinking from people who want to think AI researchers are idiots and AIs are going to kill themselves.

These models aren’t trained on “the internet”. They don’t just thoughtlessly rip everything that’s ever been posted every time they want to make an updated bot. The vast bulk of training data was scraped years ago, predating the current tide of generative muck, and additions are carefully curated to avoid the exact thing you’re talking about. A scrape of the 2018 internet is plenty, and will remain so for years and years.

stevedidwhat_infosec@infosec.pub · 10 months ago

These articles may be more so about “it’s not for medical uses you fucking morons” and less so “WOAH WHO KNEW MAN”

kromem@lemmy.world · 10 months ago

Because when you use the SotA model and best practices in prompting, it actually can do a lot of things really well, including diagnosing medical cases:

“We assessed the performance of the newly released AI GPT-4 in diagnosing complex medical case challenges and compared the success rate to that of medical-journal readers. GPT-4 correctly diagnosed 57% of cases, outperforming 99.98% of simulated human readers generated from online answers. We highlight the potential for AI to be a powerful supportive tool for diagnosis”

Use of GPT-4 to Diagnose Complex Clinical Cases

The OP study isn’t using GPT-4. It’s using GPT-3.5, which is very dumb. So the finding is less “LLMs can’t diagnose pediatric cases” and more “we don’t know how to do meaningful research on LLMs in medicine.”

Cheers@sh.itjust.works · 10 months ago

Because Google’s Med-PaLM 2 is a medically trained chatbot that performs better than most med students and some med professionals. Further training and refinement using newer techniques like mixture of experts and chain of thought are likely to improve results.

Darorad@lemmy.world · 10 months ago

Exactly, Med-PaLM 2 was specifically trained to be a medical chatbot, not a general-purpose one like ChatGPT.

Hotzilla@sopuli.xyz · 10 months ago

Train on the internet, get results like what is on the internet. Is the medical content on the internet good? No, it is shit, so it will give shit results.

These are great base models, and understanding a larger context is always better for an LLM, but specialization is needed for these kinds of contexts.

ryannathans@aussie.zone · 10 months ago

Especially not now that it has been nerfed to shit.

TheSlad@sh.itjust.works · 10 months ago

“ChatGPT sucks at something it wasn’t trained to do”

🙄

kromem@lemmy.world · edit-2 10 months ago

This is a fucking terrible study.

They compare their results to a general diagnostic evaluation of GPT-4 which scored better and discuss it as relating to the fact it’s a pediatric focus.

While largely glossing over the fact they are using GPT-3.5 instead.

GPT-3.5 sucks for any critical reasoning tasks, and this is a pretty worthless study: it uses neither the SotA model nor best practices in prompting, so it doesn’t actually reflect what a production-grade deployment of an LLM for pediatric diagnostics would be.

And we really need to stop just spamming upvotes for stuff with little actual worth just because it’s a negative headline about AI and that’s all the jazz these days.

Siegfried@lemmy.world · 10 months ago

Why don’t we stop acting like ordering words correctly can 100% replace any professional?

Hotzilla@sopuli.xyz · 10 months ago

Can it be used as a tool for the professionals? Hell yes. Fear of losing jobs is hindering this discussion. These LLMs are tools that can make people more efficient and help them make fewer mistakes.

Municipal0379@lemmy.world · 10 months ago

WebMD over here excited they aren’t the worst at web diagnosis anymore.

LainOfTheWired@lemy.lol · edit-2 10 months ago

You know, as someone who lives in the UK, our NHS (National Health Service, which is basically social health care) already has a website to help you figure out if you need to see a doctor (the 111 site), and it’s kinda useless. There are some things humans are simply better at, and understanding a human’s physical needs is one of them.

I really think trying to replace doctors with AI is an awful idea.

I’m fine with it being used as another tool to help with the process, but that doesn’t seem to be the goal of this.

ForgotAboutDre@lemmy.world · 10 months ago

The NHS website is fantastic. It’s one of the best resources for getting good quality medical advice (if you’re not a medical professional). It ties symptoms to causes very well and provides information on the appropriate service you need if you have certain symptoms.

It’s not a substitute for doctors. It’s a means to get people to go to the correct service depending on their immediate need. I have used it to get family members to go to a doctor where they otherwise wouldn’t. It can help you be informed of any issues you are having, so you can see the possible treatment options. It tells you when a pharmacist can solve the issue rather than taking time off work to go to a doctor’s appointment. It also tells you when to call 999 rather than wait for a GP’s appointment.

I suspect you’re not actually reading through the articles or have some comprehension issue. It’s a fantastic tool that is extremely useful. It’s particularly useful because it’s created by informed humans, not AI. It’s also one of the few medical resources that isn’t trying to sell stuff to you.

LainOfTheWired@lemy.lol · 10 months ago

deleted by creator

kromem@lemmy.world · 10 months ago

It’s not about replacing. It’s about supplementing.

ViscloReader@lemmy.world · 10 months ago

Well of course, it’s fucking ChatGPT. I mean, what did they expect? Are they doing it like Aperture Science, throwing shitty experiments at the wall until something eventually comes out? Look at me, today I’m gonna try to see if my table is good at sending SMS…

AutoTL;DR@lemmings.world · 10 months ago

This is the best summary I could come up with:

While the chatty AI bot has previously underwhelmed with its attempts to diagnose challenging medical cases—with an accuracy rate of 39 percent in an analysis last year—a study out this week in JAMA Pediatrics suggests the fourth version of the large language model is especially bad with kids.

The medical field has generally been an early adopter of AI-powered technologies, resulting in some notable failures, such as creating algorithmic racial bias, as well as successes, such as automating administrative tasks and helping to interpret chest scans and retinal images.

But AI’s potential for problem-solving has raised considerable interest in developing it into a helpful tool for complex diagnostics—no eccentric, prickly, pill-popping medical genius required.

For ChatGPT’s test, the researchers pasted the relevant text of the medical cases into the prompt, and then two qualified physician-researchers scored the AI-generated answers as correct, incorrect, or “did not fully capture the diagnosis.”

Though the chatbot struggled in this test, the researchers suggest it could improve by being specifically and selectively trained on accurate and trustworthy medical literature—not stuff on the Internet, which can include inaccurate information and misinformation.

“This presents an opportunity for researchers to investigate if specific medical data training and tuning can improve the diagnostic accuracy of LLM-based chatbots,” the authors conclude.

The original article contains 721 words, the summary contains 211 words. Saved 71%. I’m a bot and I’m open source!