What is AI not good at? 

It’s a good and timely question, but also the wrong one to ask, especially if you are an AI company in healthcare or a customer of AI products in healthcare. From the perspective of an AI company, it is the wrong question because we don’t “ship AI”; we ship purpose-built products that use AI because it is the best technology to meet the needs of our customers and the patients they serve.

So instead, the question to ask is: What is the purpose and for whom is it built?

Infinitus recently celebrated its seventh anniversary, which is a way of saying that we predate the current AI hype cycle. Our founders and early employees used the state-of-the-art natural language technology of the time to build AI agents: products that solved important back-office problems in healthcare for our customers. Yes, newer and better LLMs have turbocharged what we can do, especially for patient-facing agents, but the mission and the motivating questions remain the same.

So to revisit that initial question: at Infinitus, we put intense focus on what AI is bad at as well as what it’s good at. One recently published high-visibility manuscript, “ChatGPT Health performance in a structured test of triage recommendations,” caught our attention. Published in Nature Medicine, the paper comes from the Artificial Intelligence and Human Health department at the Icahn School of Medicine at Mount Sinai in New York City. The authors systematically evaluated the ability of a healthcare-focused AI chatbot (here, OpenAI’s ChatGPT Health) to accurately triage expertly curated patient vignettes. The surprising results: The chatbot did not do a good job of accurately triaging vignettes that should lead to the patient going to the emergency room. Nor did it do a good job of triaging vignettes that should lead to a patient staying at home and following up with a care provider as needed.

This stood out because, generally, LLMs are very good at answering medical questions. However, as the authors correctly point out, most tests of medical competence to date focus on the ability of LLMs to answer questions on standardized tests. They don’t focus on real-world scenarios with concrete stakes, where errors in judgment can have dire consequences.

The authors also published the vignettes for others to use. This is important because these expertly curated and labeled vignettes will serve as an important benchmark dataset for AI safety in medicine (a major aspect of my work at Infinitus). It is also important because it allows curious people like me to test the performance of other models on the task of triage.

What I found was this: The Google “Gemini 3 Flash” model outperformed ChatGPT Health in triaging cases that should go to the ED (88% vs. 48% accuracy) and underperformed it in triaging cases that can stay at home (13% vs. 35%). One could argue that it is better to over-triage low-acuity cases than to under-triage high-acuity cases, and I would generally agree. But, again, like the question at the outset of this post, I think this misses the bigger picture.
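For the curious, here is a minimal sketch of how such a head-to-head evaluation can be run, assuming the published vignettes are exported to a CSV with `vignette` and `gold_label` columns and a three-way label set (emergency, doctor, home). The column names, labels, prompt, and model stub are illustrative assumptions, not the paper’s actual harness:

```python
import csv
from collections import defaultdict
from typing import Callable

# Hypothetical sketch: the CSV layout ('vignette', 'gold_label' columns) and
# the three-way label set are assumptions for illustration, not the
# published evaluation harness.
PROMPT = (
    "You are a triage assistant. Read the patient vignette and reply with "
    "exactly one word: emergency, doctor, or home.\n\nVignette:\n{vignette}"
)

def evaluate(path: str, call_model: Callable[[str], str]) -> dict:
    """Per-class triage accuracy for whichever model `call_model` wraps."""
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            gold = row["gold_label"].strip().lower()
            answer = call_model(PROMPT.format(vignette=row["vignette"]))
            total[gold] += 1
            if answer.strip().lower() == gold:
                correct[gold] += 1
    # Mirrors the paper's split: accuracy on ED cases vs. stay-at-home cases
    return {label: correct[label] / total[label] for label in total}

if __name__ == "__main__":
    # Trivial baseline in place of a real LLM client: always say "emergency".
    # Swap in an API call to whichever model is under test.
    print(evaluate("triage_vignettes.csv", lambda prompt: "emergency"))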

Here’s what I mean: “Out-of-the-box” performance of widely available LLMs will continue to improve in specialized areas, whether it’s medicine, law, mathematics, or anything else. It is important, therefore, to use expertly curated datasets such as those generated by the Mount Sinai group to quantify this out-of-the-box performance.

The second aspect brings me back to my opening point: at Infinitus, we think that the path to success in healthcare AI is shipping real products that solve real problems. These purpose-built products are evaluated using timeless engineering principles adapted to new realities and capabilities, such as the stochastic nature of LLM outputs (new reality) and the capacity to use AI-simulated patients to perform high-volume stress-testing of our patient-facing agents (new capability).
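To make that “new capability” concrete, here is a hedged sketch of what high-volume stress-testing with AI-simulated patients can look like. The personas, turn limit, and forbidden-phrase check are hypothetical stand-ins, not our production harness:

```python
import random

# Illustrative only: personas, the agent stub, and the forbidden-phrase
# check are hypothetical stand-ins for a real simulated-patient harness.
PERSONAS = [
    {"name": "confused caller", "style": "repeat the same question"},
    {"name": "escalating caller", "style": "mention worsening chest pain"},
]

# Phrases a patient-facing agent should never utter.
FORBIDDEN_PHRASES = ["guaranteed coverage", "you definitely don't need a doctor"]

def simulated_patient_turn(persona, transcript):
    """In a real harness an LLM plays the patient; stubbed with canned text."""
    return f"({persona['name']}) I {persona['style']}."

def agent_turn(transcript):
    """The patient-facing agent under test; stubbed for the sketch."""
    return "I hear you. Let me walk you through the next steps."

def stress_test(n_calls=1000, turns_per_call=5):
    """Run many simulated calls and count safety violations."""
    failures = 0
    for _ in range(n_calls):
        persona = random.choice(PERSONAS)
        transcript = []
        for _ in range(turns_per_call):
            transcript.append(simulated_patient_turn(persona, transcript))
            reply = agent_turn(transcript)
            transcript.append(reply)
            if any(p in reply.lower() for p in FORBIDDEN_PHRASES):
                failures += 1
                break
    return failures

if __name__ == "__main__":
    print(f"{stress_test()} failing calls out of 1000")
```

The stochastic nature of LLM outputs is exactly why the volume matters here: a behavior that holds on ten calls can still fail on the thousandth.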

We combine customer specifications with our own internal standards for professionalism, empathy, and safety to develop products powered by a code base that integrates traditional software infrastructure with natural language instruction. The dual nature of this code base (code as code and natural language as code) forces careful writing and reading, with a keen eye on logic, meaning, and design purpose.
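As a toy illustration of “natural language as code” (the instruction text and the invariant test below are hypothetical, not our actual code base), the natural-language layer can be versioned, reviewed, and unit-tested like any other source file:

```python
# Hypothetical example: a natural-language instruction treated as source code,
# assembled by traditional code and guarded by a unit test.
BENEFITS_AGENT_INSTRUCTION = """\
You are a benefits-verification agent. Be professional and empathetic.
Never state a coverage decision as final; always defer to the payer.
If the caller reports an urgent medical symptom, direct them to call 911.
"""

def build_prompt(patient_context):
    """Code as code: deterministic assembly of the NL 'source' plus context."""
    return f"{BENEFITS_AGENT_INSTRUCTION}\nContext:\n{patient_context}"

def test_instruction_invariants():
    """Review the NL layer the way we review code: assert its key clauses."""
    assert "defer to the payer" in BENEFITS_AGENT_INSTRUCTION
    assert "call 911" in BENEFITS_AGENT_INSTRUCTION
```

The test is crude, but the design point stands: when natural language drives behavior, a change to a sentence deserves the same review and regression discipline as a change to a function.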

In the end, studies like the Mount Sinai evaluation are valuable not because they settle the question of whether AI works in healthcare, but because they sharpen the questions we should be asking about how to build and deploy products. The future of healthcare AI will not be defined by models alone, but by the products, safeguards, and engineering discipline that turn those models into systems that the entire ecosystem can trust.