Automating benefit verification phone calls saves time in healthcare and helps patients receive treatment faster. It is critical to collect highly accurate information on these phone calls, because errors can affect a patient’s healthcare journey. However, it is also very difficult to extract all the required information from real-time streaming speech-to-text (STT) models with high accuracy in phone conversation settings, due to latency requirements and various types of automatic speech recognition (ASR) transcript errors.

Although we have a two-stage AI review system that includes a post-call review phase for validating collected information, some ASR mistranscripts can still cause challenging issues that cannot be easily solved, even by state-of-the-art multimodal LLMs or specialized ASR systems (as illustrated in the call example in Figure 1 below). To address this, we developed an advanced transcript processing pipeline that improves downstream AI review performance and outperforms other state-of-the-art modeling approaches.

We’re looking forward to presenting the results of our research at the Annual Meeting of the Association for Computational Linguistics in July, and are excited to share the full paper here, as well as more details below.

Figure 1. Live-call ASR transcripts tend to be noisy; these errors can propagate through our downstream post-call AI review system and contribute to errors in our patient benefit data. Our second-stage AI review model approves correct information and flags potential errors for human review.

Some ASR errors are very hard to catch

Our multi-model, multimodal AI pipeline has very high accuracy for information extraction in general. Still, at the post-call processing stage, we see relatively higher correction rates from human reviewers for information fields that are prone to phonetic errors or take the form of long alphanumeric sequences.

Regardless of how accurate base ASR models are, there are limits to how precisely human speech can be transcribed from audio alone, without surrounding context and domain-specific knowledge. In the example call in Figure 1, the live-call ASR model transcribed the agent’s name incorrectly, using a more common spelling (“Cayden”).

Other tricky ASR error cases include, but are not limited to, the following examples in Figure 2:

Figure 2. Examples of ASR transcript errors and actual values of collected information. 

Given these challenges, a natural question arises: can state-of-the-art multimodal LLMs or fine-tuned ASR models effectively resolve these tricky errors?

Can we capture these ASR errors with state-of-the-art multimodal LLMs or fine-tuned ASR models?

For preliminary experiments, we tried to resolve ASR transcription errors with multimodal LLMs and fine-tuned ASR models.

We sent call audio recordings to Gemini to extract information, but we observed similar ASR errors, such as duplicated ‘0’s in long alphanumeric sequence values. We also fine-tuned the OpenAI Whisper ASR model and found that fine-tuning improves overall accuracy (e.g., lower word error rates and edit distances), but the model still tends to make similar mistakes, such as dropping letters or digits in long sequences, especially with patterns that did not exist in its training set.
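To make the kinds of metrics mentioned above concrete, here is a minimal sketch of how word error rate and character-level edit distance can be compared for a base versus fine-tuned transcript. The jiwer library and the example strings are illustrative assumptions, not the exact evaluation setup used in our experiments.

```python
# Illustrative comparison of base vs. fine-tuned ASR output against a reference.
# The reference and hypotheses below are made-up examples of the error patterns
# described in the post (duplicated '0', dropped digit).
import jiwer

reference = "the member id is A B 0 0 4 7 2 9"            # hypothetical ground truth
base_hypothesis = "the member id is A B 0 0 0 4 7 2 9"    # duplicated '0' error
tuned_hypothesis = "the member id is A B 0 4 7 2 9"        # dropped digit error

for name, hyp in [("base", base_hypothesis), ("fine-tuned", tuned_hypothesis)]:
    wer = jiwer.wer(reference, hyp)  # word error rate
    cer = jiwer.cer(reference, hyp)  # character error rate (edit-distance based)
    print(f"{name}: WER={wer:.3f}, CER={cer:.3f}")
```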

Scalable ASR error detection and error correction models

To address these ASR error issues, we developed a specialized, context-aware transcript processing pipeline. Our pipeline does not require any human annotations, so it is scalable. Furthermore, it has a generalizable model architecture that is applicable to any type of base ASR model.

For training our transcript processing models, we first generate pseudo labels for correct ASR transcripts by sending N ASR alternative transcripts and the actual correct values to Gemini. Gemini then generates the correct transcripts, which serve as pseudo labels. These pseudo labels are then used to train our ASR error correction (AEC) model for correcting ASR mistranscripts and our ASR error detection (AED) model for detecting ASR transcription errors; we used a Mistral LLM, as it outperformed a similar-sized Llama model.
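As a rough sketch of the pseudo-label generation step, the snippet below sends N ASR alternatives together with the verified correct value to Gemini and asks for the corrected utterance. It assumes the public google-generativeai SDK; the model name, prompt wording, and function are illustrative placeholders rather than the exact prompts and setup used in the paper.

```python
# Hedged sketch of pseudo-label generation for the AEC training data.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")            # placeholder credential
model = genai.GenerativeModel("gemini-1.5-pro")    # assumed model choice

def make_pseudo_label(asr_alternatives: list[str], correct_value: str) -> str:
    """Ask the LLM to rewrite the utterance so it matches the known correct value."""
    alts = "\n".join(f"{i + 1}. {alt}" for i, alt in enumerate(asr_alternatives))
    prompt = (
        "You are given several alternative ASR transcripts of the same utterance "
        "and the verified correct value that was spoken.\n"
        f"ASR alternatives:\n{alts}\n"
        f"Correct value: {correct_value}\n"
        "Return only the corrected transcript of the utterance."
    )
    response = model.generate_content(prompt)
    return response.text.strip()  # used as the training target for the AEC model
```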


Figure 3. An overview of the ASR error handling component. N ASR alternatives are used to generate the pseudo labels that are then used for training the AEC model. During inference, the corrected utterances are inserted back into the transcript.

State-of-the-art AI review system with our specialized transcript processing models!

Now, within our AI review system, ASR transcripts for information fields with higher correction rates can be routed to our AEC and AED models. This process generates highly accurate transcripts and robust ASR error-indicator features for downstream field classifiers, as illustrated in Figure 4. In our experiments, this approach obtained the best performance on our AI review tasks compared to state-of-the-art LLMs such as GPT or Gemini.
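The snippet below is a simplified sketch of that routing step. ERROR_PRONE_FIELDS, aec_correct, and aed_detect are hypothetical placeholders standing in for the fine-tuned AEC and AED models and are not the actual interfaces of our system; the point is how corrected text and an error-indicator flag can be packaged as features for a downstream field classifier.

```python
# Hedged sketch: route error-prone fields through AEC/AED and build features.
ERROR_PRONE_FIELDS = {"member_id", "group_number", "agent_name"}  # assumed set

def aec_correct(utterance: str) -> str:
    # placeholder for the fine-tuned ASR error correction model
    return utterance

def aed_detect(utterance: str) -> bool:
    # placeholder for the ASR error detection model (True = likely mistranscript)
    return False

def process_field(field: str, utterance: str) -> dict:
    """Return the text and error-indicator feature passed to the field classifier."""
    if field in ERROR_PRONE_FIELDS:
        return {
            "field": field,
            "text": aec_correct(utterance),
            "asr_error_flag": aed_detect(utterance),
        }
    return {"field": field, "text": utterance, "asr_error_flag": False}
```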


Figure 4. At the post-call review phase, ASR transcripts with error-prone information fields can be sent to our transcript processing pipeline to improve our final AI review accuracy.

We’re happy that this work has been accepted at ACL 2025 and will be presented at the annual conference in Vienna, Austria. We’re also happy to share that Infinitus is hiring – and we’d love to share more about our other research projects in progress. You can check out more about us, and see our open roles, on our Careers page.

Ayesha Qamar, Arushi Raghuvanshi, and Conal Sathi also contributed to this research.