OpenAI’s ChatGPT has become a popular go-to for quick responses to questions of all types — but a new study in JAMA Oncology suggests that the artificial intelligence chatbot might have some serious shortcomings when it comes to doling out medical advice for cancer treatment.
Researchers from Mass General Brigham, Sloan Kettering and Boston Children’s Hospital put ChatGPT to the test by compiling 104 different prompts and asking the chatbot for recommendations on cancer treatments.
Next, they had a team of four board-certified oncologists review and score the responses using five criteria.
Overall, ChatGPT's responses scored an underwhelming 61.9% against those criteria.
Although large language models (LLMs) have successfully passed the U.S. Medical Licensing Examination, the chatbot underperformed when it came to providing accurate cancer treatment recommendations that align with National Comprehensive Cancer Network (NCCN) guidelines.
In many cases, the responses were unclear or mixed inaccurate and accurate information.
Nearly 13% of the responses were “hallucinated,” which means they might have sounded factual, but were completely inaccurate or unrelated to the prompt, according to the researchers’ findings.
“This is a significant concern, as it could lead to misinformation and potentially harmful patient decisions,” said Dr. Harvey Castro, an emergency medicine physician and AI expert in Coppell, Texas.
Castro was not involved in the study but commented on the findings.
“For example, a patient with advanced lung cancer may receive a recommendation for a treatment not recognized by the NCCN guidelines, which could lead to delays in receiving appropriate care.”
Danielle Bitterman, study co-author and assistant professor of radiation oncology at Harvard Medical School, said that overall, the results met expectations.
“ChatGPT and many of the similar large language models are trained primarily to function as chatbots, but they are not specifically trained to reliably provide factually correct information,” she told Fox News Digital.
“Our results showed that the model is good at speaking fluently and mimicking human language,” she noted. “But a challenging aspect for health advice is that it makes it hard to detect correct versus incorrect information.”
She went on, “When reading the responses, I was struck by how correct treatment options were seamlessly mixed in with wrong ones. Also, I was encouraged that almost all responses did contain some correct information — this shows the future potential of models to communicate information in collaboration with physician input, even if we aren’t there yet.”
The study’s key limitation was that the researchers evaluated only one LLM at a single “snapshot in time,” but they believe the findings highlight legitimate concerns and the need for future research.
The study used GPT-3.5; OpenAI released a newer model, GPT-4, after the research concluded.
“Nevertheless, the model we tested is the one that is publicly available and the most accessible by a wide population of patients,” Bitterman said.
The researchers also did not do intensive investigations into prompt engineering, which may have improved results, she added.
“Instead, we designed our prompts (questions) from the perspective of a general member of the population asking general questions about cancer treatment.”
Also, the study does not discuss the ethical considerations of using AI chatbots for providing cancer treatment recommendations, noted Dr. Castro.
“While AI chatbots can be a valuable tool, they should be used as a supplement, not a replacement, for professional medical advice.”
“It is important to consider the potential risks and benefits of using AI chatbots in this context and have safeguards to ensure that patients receive accurate and appropriate recommendations,” he told Fox News Digital.
Castro said he sees promise in the use of AI chatbots for providing cancer treatment information — but significant challenges still need to be addressed.
“As a physician, it is important to remain cautious and continue relying on established guidelines and clinical expertise when making treatment recommendations,” Castro went on.
“There is too much at stake if we get this wrong.”
“Future research must assess AI chatbots’ long-term impact and generalizability in cancer treatment and patient self-education.”
Also, Castro would like to see future studies assess more types of cancer.
“The study assessed the chatbot’s performance in providing breast, prostate and lung cancer treatment recommendations,” he noted. “It is unknown how the chatbot would perform in giving suggestions for other types of cancer or other medical conditions.”
While generalist models like ChatGPT are not trained to provide medical advice — and the quality of the information “doesn’t meet the bar for medicine” — Bitterman said they do show potential for synthesizing information in accessible language.
“There is much excitement and potential of AI in health care, but we need to carefully evaluate our models at each step and optimize them for the high-stakes clinical domain,” she told Fox News Digital.
With medicine and standards of care constantly evolving, Bitterman noted that if a model were developed for clinical use, it would have to provide up-to-date guidelines.
“This will require that developers provide transparency about what data the models were trained on and re-evaluate their performance over time,” she said.
“There is too much at stake if we get this wrong — and patient safety is paramount,” Bitterman added.
“If there are early errors due to hasty uptake without sufficient testing, it could ultimately set the field back and slow the potential gains.”