The 'GhostEdit' test: An approach to catching AI hallucinations before they’re heard

Voice AI has a latency problem: users expect instant responses, but hallucination checks can take seconds. In real-time conversation, adding a verification step before speech often makes an AI agent feel slow and unnatural.

GhostEdit – a real-time safety layer we built at Infinitus that checks each spoken AI response against grounded knowledge and triggers an immediate correction if something does not match – takes a different approach. We sought to understand whether GhostEdit could be a solution for our patient- and payor-facing phone calls.

To evaluate it, we ran a small controlled test across simulated calls (GhostEdit is not in deployment for our live conversations). Rather than waiting for organic errors, we deliberately injected known source-of-truth mismatches into otherwise normal call flows.

For example, the approved knowledge base might contain one value, while the test harness inserts a different value into the agent’s spoken response. The goal was to create a clear verification challenge: could the detector compare the spoken sentence against the grounded knowledge base, identify the discrepancy, and trigger a correction quickly enough to preserve the flow of the conversation?

We were looking to learn three things:

1. Accuracy of detection

2. Speed of detection

3. User-facing correction delay

GhostEdit does not block speech while a detector verifies the answer. Instead, the agent speaks immediately while a hallucination detector runs in parallel. This works because text generation is usually faster than spoken audio, which creates a small but valuable window: the system can inspect upcoming sentences before the caller actually hears them.

Each generated sentence follows two paths. One path goes straight to text-to-speech. The other goes to a detector that checks the sentence against the system rules, conversation history, and available knowledge. If the detector finds a problem before the audio plays, GhostEdit silently removes or replaces that part. The caller hears only the corrected version.

If the incorrect audio has already been heard, the agent makes an explicit correction: “Actually, let me correct that.”
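The two paths and the two correction modes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production design: the names speak_turn, toy_detect, toy_play, and KB_FIXES are hypothetical, the detector is assumed to return a corrected sentence (or None when the sentence is fine), and playback is simulated by yielding to the event loop.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical signatures: detect() returns a corrected sentence, or None if the sentence is fine.
Detector = Callable[[str], Awaitable[Optional[str]]]
Player = Callable[[str], Awaitable[None]]


async def speak_turn(sentences: list[str], detect: Detector, play: Player) -> list[str]:
    """Play sentences while verdicts are computed in parallel; return what the caller hears."""
    # Detector path: start checking every generated sentence right away, without blocking speech.
    checks = [asyncio.create_task(detect(s)) for s in sentences]
    heard: list[str] = []
    for sentence, check in zip(sentences, checks):
        if check.done():
            # The verdict arrived before playback: replace the sentence silently.
            fixed = check.result()
            text = fixed if fixed is not None else sentence
            await play(text)
            heard.append(text)
        else:
            # Speech won the race: play now, then correct out loud if the verdict is bad.
            await play(sentence)
            heard.append(sentence)
            fixed = await check
            if fixed is not None:
                correction = f"Actually, let me correct that. {fixed}"
                await play(correction)
                heard.append(correction)
    return heard


# Toy stand-ins (assumptions for illustration only).
KB_FIXES = {"Your copay is zero dollars.": "Your copay is $40."}


async def toy_detect(sentence: str) -> Optional[str]:
    return KB_FIXES.get(sentence)


async def toy_play(sentence: str) -> None:
    await asyncio.sleep(0)  # stand-in for audio playback; lets detector tasks run


if __name__ == "__main__":
    print(asyncio.run(speak_turn(["Hello.", "Your copay is zero dollars."], toy_detect, toy_play)))
```

In this sketch the first sentence of a turn always starts playing before its verdict exists, so an error there gets a spoken correction, while later sentences are verified during earlier playback and can be edited silently.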

This design keeps voice latency low while still reducing harmful mistakes. It also avoids a common trap: storing correction instructions in memory. Correction hints should be temporary, used only to fix the current response, and then discarded. Otherwise, the agent can become overly cautious or start changing its behavior in later turns.
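One way to keep correction hints temporary is to add them to a one-off prompt rather than to the stored conversation. The sketch below assumes a simple list-of-strings history; the Conversation class and its method names are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Hypothetical conversation state: only what was actually spoken is stored."""
    history: list = field(default_factory=list)

    def correction_prompt(self, hint: str) -> list:
        # Build a throwaway prompt that includes the hint; self.history is not modified.
        return [*self.history, f"[one-off correction hint] {hint}"]

    def record_spoken(self, text: str) -> None:
        # Persist spoken turns only, never the hints used to produce them.
        self.history.append(text)
```

Because the hint lives only in the returned list, later turns are generated from a history that never contained it.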

We tested this on a small sample of simulated calls by intentionally injecting incorrect agent responses. For example, the knowledge base might say a customer’s copay is $40, while the agent incorrectly says, “Your copay is zero dollars.” The detector catches the mismatch, labels the sentence as bad, and the agent follows with a correction such as: “Actually, let me correct that, your copay is $40.”
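The copay example can be reproduced with a toy check. This is only a rule-based stand-in to show the shape of the comparison: the real detector is not described at this level of detail, and the names spoken_amount, check_copay, NUMBER_WORDS, and the copay_usd key are assumptions made for illustration.

```python
import re

# Tiny illustrative vocabulary; a real system would need far broader coverage.
NUMBER_WORDS = {"zero": 0, "ten": 10, "twenty": 20, "forty": 40}


def spoken_amount(sentence: str):
    """Extract a dollar amount such as '$40' or 'zero dollars'; return None if absent."""
    match = re.search(r"\$(\d+)", sentence)
    if match:
        return int(match.group(1))
    match = re.search(r"\b([a-z]+) dollars\b", sentence.lower())
    if match and match.group(1) in NUMBER_WORDS:
        return NUMBER_WORDS[match.group(1)]
    return None


def check_copay(sentence: str, kb: dict):
    """Return ('ok', None) or ('bad', correction) by comparing the sentence with the KB."""
    amount = spoken_amount(sentence)
    if amount is None or amount == kb["copay_usd"]:
        return ("ok", None)
    return ("bad", f"Actually, let me correct that, your copay is ${kb['copay_usd']}.")
```

For instance, check_copay("Your copay is zero dollars.", {"copay_usd": 40}) labels the sentence as bad and returns the correction quoted above.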

The early metrics were encouraging. Detector verdicts returned in about 92 ms on average, while bad verdicts took around 200 ms, likely because the detector needed more reasoning time to compare the spoken sentence against the grounded knowledge base.

The user-facing correction delay was also small. The metric ghost_edit_correction_latency_seconds measured the time between the end of the agent’s incorrect audio and the start of the correction. In the sample, corrections started after roughly 224–251 ms of silence, about a quarter of a second. That is the key production metric: once it rises above 500 ms, the pause may start to feel awkward or confusing.
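As a rough sketch of how that measurement and the 500 ms threshold fit together, assuming the two timestamps are available in seconds (the function names and the way timestamps are captured are hypothetical):

```python
AWKWARD_PAUSE_SECONDS = 0.5  # the 500 ms threshold discussed above


def correction_latency_seconds(bad_audio_end: float, correction_start: float) -> float:
    """Silence between the end of the incorrect audio and the start of the correction."""
    return correction_start - bad_audio_end


def pause_feels_awkward(latency_seconds: float) -> bool:
    """True once the pause exceeds the threshold."""
    return latency_seconds > AWKWARD_PAUSE_SECONDS
```

With the sample values, a correction that starts 0.224 to 0.251 seconds after the bad audio ends stays well under the threshold.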

We also tracked ghost_edit_corrections_total, the total number of live corrections. In normal production traffic, this should stay close to zero. A sudden spike could mean the assistant prompt is causing more hallucinations, the detector is over-triggering, or the detector service has a bug. A useful alert would be:

rate(ghost_edit_corrections_total[5m]) > 0.5
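Note that rate() in PromQL returns a per-second average over the window, so a threshold of 0.5 means more than 0.5 corrections per second, or more than 150 corrections in five minutes. The sketch below approximates that calculation from raw counter samples; it ignores the counter-reset handling and extrapolation that Prometheus performs, and the function names are hypothetical.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (unix_seconds, counter_value) pairs, oldest first,
    assumed to contain no counter resets.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)


def should_alert(samples, threshold=0.5):
    """Mirror of the alert condition: fire when the per-second rate exceeds the threshold."""
    return per_second_rate(samples) > threshold
```

Three corrections in five minutes give a rate of 0.01 per second and no alert, so teams expecting near-zero traffic may want a much lower threshold than 0.5.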

Our next step is to move from a small controlled sample to a broader evaluation set. That means testing more call types, more knowledge-base entries, more edge cases, and more realistic pharma-specific scenarios, including eligibility, dosing-language boundaries, adverse event routing, patient support workflows, consent language, and escalation paths.

The takeaway is not that GhostEdit replaces grounded generation – instead, it strengthens it. The test shows the potential for real-time verification to act as a practical safety net for voice AI, catching source-of-truth mismatches quickly, measuring them clearly, and giving teams a path to improve reliability before broader deployment.

The key idea is simple: don’t make safety a gate that slows speech down. Make it a race that runs alongside speech.

After all, in real-time voice AI, the best hallucination correction system is one the user never notices.
