When it comes to artificial intelligence in healthcare, there’s no room for error. The stakes are incredibly high, making safety and accuracy non-negotiable. This raises a fundamental question for developers: What’s the best way to build safe AI systems? Should you design them to be incapable of making mistakes from the start, or should you build flexible systems and test them rigorously until you’re confident in their safety?

At Infinitus, we believe the answer is a combination of both. A truly robust and trustworthy AI requires a foundation of “safety by design,” reinforced by continuous and thorough “safety by testing.”

Here’s a look at just what that means. 

The limits of ‘testing your way to safety’

Some AI companies emphasize their extensive testing as proof of their product’s safety. They might, for example, run an AI through 500,000 successful test calls and declare it safe. However, relying on this approach alone carries significant risk. Because large language models (LLMs) are probabilistic, a model that performs perfectly for 500,000 calls could still make a critical error on the 500,001st call. No amount of pre-launch testing can completely eliminate this possibility.
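To make the statistics concrete, here’s a rough back-of-the-envelope sketch in Python. It uses the standard “rule of three” heuristic and is purely illustrative – it isn’t drawn from Infinitus systems – but it shows why zero observed failures only bounds the error rate rather than eliminating it.

```python
# Illustrative only: zero observed failures does not mean a zero error rate.
# The "rule of three" gives an approximate 95% upper confidence bound of 3/n
# on the true per-call error rate after n failure-free trials.
def rule_of_three_upper_bound(n_successful_calls: int) -> float:
    """Approximate 95% upper bound on the error rate after zero failures."""
    return 3.0 / n_successful_calls

n = 500_000
bound = rule_of_three_upper_bound(n)
print(f"After {n:,} clean calls, the true error rate could still be "
      f"as high as ~1 in {int(1 / bound):,} calls (95% confidence).")
```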

The risk is even greater when testing occurs in closed, non-real-world environments. While rigorous testing is essential, it cannot be the only pillar of your safety strategy.

Building safety in from the ground up

“Safety by design” means architecting an AI system in a way that prevents mistakes and hallucinations from occurring in the first place. At Infinitus, we achieve this through what we call a discrete action space.

This approach defines a limited, pre-approved set of actions and verbal responses the AI is permitted to use. While the AI uses powerful LLMs for understanding and processing language, it cannot generate its own responses for critical information.

For instance, when an AI needs to provide a patient’s date of birth, it doesn’t generate the date itself from patient data supplied in a prompt. Instead, the LLM’s role is to understand the request and select the correct, pre-approved action. The system then pulls the exact information from a secure database and delivers it using a function-generated response. This deterministic process ensures that sensitive data is handled with precision and that the AI cannot “hallucinate” or invent incorrect information. It also guarantees that the language used is always compliant with regulatory requirements and internal protocols, and that the AI doesn’t stray into topics it isn’t permitted to discuss.

A diagram of how Infinitus' proprietary discrete action space works.

By separating the action of understanding from the action of generating language, we get the best of both worlds: the flexibility of LLM-based understanding and the strict accuracy of a deterministic system.
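As a rough illustration of the pattern – a simplified sketch, not the Infinitus implementation; the action names, patient record fields, and response templates are hypothetical – the LLM’s output is constrained to an action label, and the words spoken on the call come from a deterministic function:

```python
from typing import Callable

# Hypothetical pre-approved actions mapped to deterministic response builders.
def provide_date_of_birth(patient: dict) -> str:
    return f"The patient's date of birth is {patient['date_of_birth']}."

def provide_member_id(patient: dict) -> str:
    return f"The member ID is {patient['member_id']}."

APPROVED_ACTIONS: dict[str, Callable[[dict], str]] = {
    "PROVIDE_DATE_OF_BIRTH": provide_date_of_birth,
    "PROVIDE_MEMBER_ID": provide_member_id,
}

def respond(llm_selected_action: str, patient_record: dict) -> str:
    # The LLM's only job was to map the caller's request to one of these labels.
    builder = APPROVED_ACTIONS.get(llm_selected_action)
    if builder is None:
        # Anything outside the approved set is declined, never improvised.
        return "I'm sorry, I'm not able to help with that request."
    # The exact value comes from a trusted record, not from LLM generation.
    return builder(patient_record)

patient = {"date_of_birth": "January 15, 1980", "member_id": "ABC123456"}
print(respond("PROVIDE_DATE_OF_BIRTH", patient))
```

The key design choice is that free-form generation never touches the critical data path: the model can only choose among responses that were approved ahead of time.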

Safety by testing: the second pillar of trust

Even with a “safety by design” framework, testing remains a critical component of ensuring AI safety – both before and after the system goes live. Infinitus has built call simulation frameworks to test all our AI agents before launching.
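A minimal sketch of what such a simulation harness might look like – the scenarios, action labels, and agent interface here are hypothetical, not the Infinitus framework:

```python
# Replay scripted scenarios against the agent before launch and confirm
# that every reply maps to the expected pre-approved action.
SCENARIOS = [
    {"utterance": "Can you confirm the patient's date of birth?",
     "expected_action": "PROVIDE_DATE_OF_BIRTH"},
    {"utterance": "What do you think of this treatment plan?",
     "expected_action": "DECLINE_OUT_OF_SCOPE"},
]

def run_simulation(agent, scenarios) -> list[str]:
    failures = []
    for scenario in scenarios:
        action = agent.select_action(scenario["utterance"])
        if action != scenario["expected_action"]:
            failures.append(
                f"{scenario['utterance']!r}: got {action}, "
                f"expected {scenario['expected_action']}"
            )
    return failures  # an empty list means every scenario behaved as expected
```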

Safety testing shouldn’t stop at launch either. We believe in continuous, real-world auditing of our AI’s performance, which includes human-in-the-loop reviews. A statistically significant percentage of Infinitus calls are monitored, and any unusual responses are automatically flagged for review. This ongoing analysis allows us to quickly spot anomalies, refine the AI’s behavior, and maintain the trust of our customers.
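In practice, monitoring of this kind can be pictured as a simple sampling-plus-flagging rule. This is an illustrative example only; the sample rate, thresholds, and call fields are hypothetical.

```python
import random

REVIEW_SAMPLE_RATE = 0.05       # fraction of calls routed to human review
CONFIDENCE_THRESHOLD = 0.90     # below this, a turn is flagged automatically

def needs_human_review(call: dict) -> bool:
    # Randomly sample a fixed share of calls for routine human audit.
    sampled = random.random() < REVIEW_SAMPLE_RATE
    # Flag any turn with low confidence or an action outside the approved set.
    low_confidence = any(turn["confidence"] < CONFIDENCE_THRESHOLD
                         for turn in call["turns"])
    unexpected_action = any(turn["action"] not in call["allowed_actions"]
                            for turn in call["turns"])
    return sampled or low_confidence or unexpected_action
```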

In other words, Infinitus does not believe that healthcare AI agents should be tested once and deployed forever. Testing needs to be continuous.

Evolving safety by design for the future

The discrete action space is a highly effective method for ensuring safety by design with the current generation of text-based LLMs. This model is integral to our processes, from managing interactions and extracting key information to scoring call quality.

However, the field of AI is rapidly evolving. The future points toward audio-to-audio models that can offer more scalable, multilingual experiences that reach a much wider healthcare population, along with better bedside manner and empathy. With these newer models, an LLM may need to generate text and speech directly to exercise its full power, which introduces new safety challenges.

Infinitus is actively researching how to implement “safety by design” for these next-generation models. This includes developing sophisticated filtering mechanisms, validator models to check outputs, checker models for continuous oversight, and methods to control the workflow through the right types of callbacks.
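One way to picture these layered safeguards is as a pipeline of checks around each generated response. This is a conceptual sketch only; the filter rules and model interfaces are hypothetical, not a description of Infinitus internals.

```python
BLOCKED_TOPICS = ("diagnosis advice", "pricing guarantee")

def passes_filter(candidate_text: str) -> bool:
    # Cheap rule-based screen applied to every candidate response.
    return not any(topic in candidate_text.lower() for topic in BLOCKED_TOPICS)

def safe_response(candidate_text: str, validator, checker) -> str:
    if not passes_filter(candidate_text):
        return "I'm not able to discuss that."
    if not validator.approves(candidate_text):    # second model validates output
        return "Let me connect you with a human teammate."
    checker.log_for_oversight(candidate_text)     # continuous oversight / audit
    return candidate_text
```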

To tackle these exciting challenges, we’re expanding our world-class call automation and safety teams and hiring people who are passionate about building the future of safe and reliable healthcare AI.