Study: ChatGPT Underestimates Medical Emergencies, Misses Suicide Risk

It was 2 a.m. when a friend texted me: “My chest feels tight—should I go to the ER?”

They pasted ChatGPT Health’s reply: “Monitor at home and check in 24–48 hours.”

I read the Mount Sinai paper and felt a chill: that guidance, the researchers found, was exactly the kind that could cost time and lives.

I write as someone who follows clinical safety studies and talks to clinicians who use these systems. The Icahn School of Medicine at Mount Sinai published the first independent safety evaluation of OpenAI’s ChatGPT Health after its January 2026 launch, and the findings are stark: the system under-triaged more than half of emergency cases in a controlled test.

On an emergency room shift, I once watched a single sentence change a plan of care, then watched that plan unravel.

The Mount Sinai team tested ChatGPT Health’s ability to triage — that is, to judge whether a patient needs immediate emergency care. In simulated scenarios the tool “under-triaged” 52% of emergency cases, including diabetic ketoacidosis and impending respiratory failure. In one respiratory failure vignette the assistant flagged early warning signs but still told the patient to wait and monitor rather than to seek immediate help.

That kind of error feels like a smoke alarm that goes quiet when the house is filling with smoke: it recognizes danger, then fails to shout.

ChatGPT Health did better with textbook crises such as stroke and anaphylaxis, but the study authors argue the most dangerous failures happen in the gray zones — the nuanced cases where clinical judgment really matters. Lead author and urologist Ashwin Ramaswamy framed the study plainly: if someone with a real emergency turns to ChatGPT Health, will it clearly tell them to go to the emergency room? Most of the time, Mount Sinai’s answer was no.

Can ChatGPT tell you to go to the emergency room?

Short answer: sometimes. In controlled testing the assistant missed many urgent signals. OpenAI points out that real conversations can include follow-up questions and clarifications that change outcomes, and that ChatGPT Health is still in limited release with a waiting list for access. But the controlled scenarios were designed to measure immediate triage decisions — the very moments when a firm instruction to seek emergency care matters most.

At a support group, I listened to people describe relying on a chatbot for crisis advice.

Those conversations echoed the study’s concerns about mental-health handling. Mount Sinai found suicide-risk alerts were inconsistent: the system produced warnings more often for lower-risk phrasing and less often when a person described explicit intent and method. Girish Nadkarni, Mount Sinai Health System’s chief AI officer, called the inversion “surprising and concerning.”

That inconsistency has real fallout: grieving families have sued OpenAI, alleging that chatbot interactions worsened relatives’ suicidal ideation, and watchdogs say past failures show the stakes aren’t theoretical. OpenAI says it is rolling out safety updates (parental controls, nudges to take breaks, and routing to professional help in high-risk cases), but the Mount Sinai data show those safeguards don’t always trigger when they’re needed.

It’s a tool with the trappings of empathy but the reflexes of software, more like a compass that points north sometimes and spins the rest of the time.

Is ChatGPT safe for suicide-risk assessments?

Mount Sinai’s paper suggests not reliably. Alerts appeared inverted relative to clinical risk, and the most explicit expressions of intent sometimes failed to prompt the strongest responses. OpenAI stresses iterative improvements, but clinicians and families are rightly asking whether iterative patches are enough for life-or-death situations.

On my phone, OpenAI’s own report showed a staggering reach: tens of millions of people asking health questions every day.

OpenAI reported that more than 40 million people globally used its chatbot daily for health questions; seven in ten of those health conversations happened outside normal clinic hours, and, on average, more than 580,000 U.S. healthcare inquiries came from “hospital deserts,” places more than a 30-minute drive from the nearest general medical or children’s hospital. Those numbers show how high the stakes are.

Regulators have been permissive: some AI tools can renew prescriptions in Utah, and FDA Commissioner Marty Makary has said certain devices and software may provide health information without FDA oversight. That regulatory latitude is letting AI burrow quickly into care delivery, but access and convenience don’t erase the documented physical and mental-health risks when these systems misjudge acuity.

How accurate is ChatGPT Health?

The system can spot classic emergencies, but Mount Sinai’s controlled scenarios revealed consistent blind spots in nuanced emergencies and mental-health risk assessment. OpenAI points to real-world follow-ups and ongoing safety work as mitigating factors, but the study intentionally tested first-response triage — the crucial moments before a clinician or emergency operator can intervene.

In conversations with clinicians, I keep returning to one practical question: what should you do?

If you suspect someone has an emergency — shortness of breath, chest pain, severe altered mental status, signs of ketoacidosis, or explicit suicidal intent — trust human-led emergency pathways first. Use AI as a guide, not a gatekeeper. Hospitals, public-health agencies, and developers like OpenAI must confront these gaps with faster, transparent fixes and real-world testing that mirrors how people actually use the tools.

Mount Sinai’s study is a warning: millions are turning to chatbots for health answers, but a single under-triage can mean lost minutes and worsened outcomes. Will regulators, hospitals, and developers move quickly enough to match the speed at which people are handing their symptoms to code?