Voice assistants such as Siri, Alexa, Google Assistant, and Cortana have surged in popularity over the last decade. More than a billion devices have a voice assistant enabled today, and some forecasts put that number at a staggering 8 billion in the next few years. [TechCrunch]

Almost all of these popular virtual assistants are task-oriented dialog systems, meaning they are built to understand and complete specific tasks such as playing a song, checking the weather, or setting an alarm. Each of these tasks can usually be completed in a single utterance or a handful of them.

There hasn’t been as much production-level automation of longer-form conversations, including phone calls, but there is a need for it. Many business operations today still happen over the phone. To get a sense of the inefficiencies in these phone calls, think of a time when you called a business and were placed on a long hold. Now imagine that most of your company’s operations involve calling other businesses, and you can see how quickly those inefficiencies add up.

A few companies have started automating tasks over the phone, but the space is still in its early stages. The most widely known example, Google Duplex, can make restaurant reservations. There are also platforms that automate certain parts of a call. Examples include real-time analytics for call centers, tools that surface the right information to human agents, and platforms that automate the initial part of a call or connect people to the right agent. But we’ve seen very few cases where these tools have been used to fully automate an outbound call.

Far fewer voice assistants operate in this space because building a digital assistant that can successfully handle phone calls comes with a unique set of challenges, on top of those that already exist for many of the voice assistants and chatbots in use today. In this blog post we outline some of these challenges.

Length of a call

One of the main differences between a virtual assistant that can complete single requests and one that can handle a phone call is the length of the conversation. The system needs to understand the context of previous interactions, which gets more difficult with length.

Most conversational AI platforms today can store conversational context. This can be the text of the last few user utterances, or more structured information such as the previous intents or extracted slots. As conversations get longer, relevant information may exist not only in the last few interactions, but in the last ten or twenty. Supporting these conversations requires a more complex dialog system, often one that combines NLP model outputs, dialog state, contextual information, stored information from previous interactions, and more.
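To make the idea of stored context concrete, here is a minimal sketch in Python. It is an illustration under our own assumptions, not a description of any particular platform: the `Turn` and `DialogContext` names are hypothetical, and the intent labels and slot values are assumed to come from upstream NLP models that are not shown.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Turn:
    speaker: str                   # "user" or "assistant"
    text: str                      # transcript of the utterance
    intent: Optional[str] = None   # label from a hypothetical intent model
    slots: dict = field(default_factory=dict)  # values extracted from this turn


class DialogContext:
    """Keeps a rolling window of recent turns plus slots gathered over the whole call."""

    def __init__(self, max_turns: int = 20):
        self.turns = deque(maxlen=max_turns)  # oldest turns fall out of the window
        self.slots = {}                       # persists even after a turn is dropped

    def add(self, turn: Turn) -> None:
        self.turns.append(turn)
        self.slots.update(turn.slots)         # newer values overwrite older ones

    def recent_intents(self, n: int = 3) -> list:
        """Return up to n of the most recent intents still inside the window."""
        intents = [t.intent for t in self.turns if t.intent]
        return intents[-n:]
```

Keeping the accumulated slots separate from the rolling window is one way to let a value mentioned twenty turns ago stay available without carrying every raw utterance forward.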

In these longer-form conversations, there are also often points where the topic of conversation switches. At those points it’s important for the system to understand that the context from the previous topic may not be relevant to the current one. Again, this calls for a more complex dialog system, one that can detect and handle context switching.
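One possible way to handle a topic switch is to scope context by topic, so that details from the old topic stop influencing the new one while call-wide facts survive. The sketch below assumes a topic label is supplied per turn by some topic classifier (not shown); the class and parameter names are hypothetical.

```python
class TopicScopedContext:
    """Separates slots tied to the current topic from slots valid for the whole call."""

    def __init__(self):
        self.topic = None
        self.topic_slots = {}   # reset whenever the topic changes
        self.call_slots = {}    # kept for the entire call (e.g. caller identity)
        self.archive = []       # (topic, slots) pairs for finished topics

    def update(self, topic, slots, call_level=()):
        """Record slots for a turn; `call_level` names the keys that outlive a topic."""
        if topic != self.topic:
            if self.topic is not None:
                self.archive.append((self.topic, self.topic_slots))
            self.topic = topic
            self.topic_slots = {}   # old topic's details no longer apply
        for key, value in slots.items():
            target = self.call_slots if key in call_level else self.topic_slots
            target[key] = value
```

Archiving the old topic rather than discarding it leaves room to return to it later in the call.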

User expectation and trust

When talking to a virtual assistant, say Alexa, you are driving the request. Users tend to have more patience with virtual assistants when they are the ones initiating the interaction. And after using a virtual assistant a few times, users start to build trust around what tasks the assistant can take care of successfully.

Over the phone, however, there is an expectation that you will be talking to a human. When people reach a digital assistant on a call for the first time, they often don’t trust that it can truly understand and carry out the conversation. This mistrust may come from bad experiences with automated phone systems that require users to respond with a single word or a button press. When first encountering a digital assistant, people may ask to speak to a live agent, and the assistant needs to respond to that gracefully. If the person on the phone is more open to trying out an interaction, there is usually a bit of back and forth at the beginning of the conversation, with a high bar for accuracy to build trust. Once the person sees that they can talk naturally and be understood by the digital assistant, they start to open up.

As the space matures and more people have had successful conversations with digital assistants over the phone, this problem will start to go away. But since it is still early days, even getting to the point where a digital assistant can start asking questions over the phone takes some evangelization and effort.


Latency

Most devices that support voice assistants have some sort of visual feedback. It might be a light at the top of the device or an icon on a mobile phone. This lets users know that the system has at least heard them and is processing the request. Over the phone, there is no such visual feedback, so the latency needs to be much lower. If the system cannot give a fully formed response right away, it needs to at least produce a filler word or phrase that lets the person know that they have been heard.
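The filler idea can be sketched with a simple timeout: start computing the reply, and if it is not ready within a short window, say a filler phrase first. This is a minimal sketch using Python's `asyncio`; `generate_reply` and `speak` are hypothetical coroutines standing in for the real response pipeline and text-to-speech, and the 0.7 second threshold is an assumed placeholder, not a measured value.

```python
import asyncio

FILLER = "One moment, please."
FILLER_AFTER_SECONDS = 0.7  # assumed threshold; tune it against real calls


async def respond(generate_reply, speak, filler_after=FILLER_AFTER_SECONDS):
    """Speak the reply, preceded by a filler phrase if the reply is slow to arrive."""
    task = asyncio.create_task(generate_reply())
    # Wait up to `filler_after` seconds; the task keeps running if it is not done.
    done, _pending = await asyncio.wait({task}, timeout=filler_after)
    if not done:
        await speak(FILLER)   # let the caller know they were heard
    await speak(await task)   # then deliver the full response
```

Fast replies are spoken directly, so the filler is only heard when the pipeline is actually slow.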

Getting latency to a point where the conversation feels natural over the phone is an engineering challenge. Every part of the system needs to be efficient: the speech recognition, the models used for natural language processing, any database queries or API calls, and the speech generation.

Audio quality

Most devices that support voice assistants have relatively high-quality microphones. A sample rate of 44.1kHz and a bit depth of 16 bits have become pretty standard. Audio quality over the phone, however, is much worse, with a standard sample rate of 8kHz and a bit depth of 8 bits. Some cell networks advertise better-quality audio at times, but the audio is dynamically compressed based on the network signal, which on average leads to worse quality. And most softphone systems match the 8kHz sample rate of landline phones.
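To put those numbers side by side, here is a small back-of-the-envelope calculation. It assumes uncompressed, single-channel audio, which is a simplification; the function names are ours.

```python
def pcm_bitrate(sample_rate_hz, bit_depth, channels=1):
    """Raw bits per second for uncompressed audio."""
    return sample_rate_hz * bit_depth * channels


def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


device = pcm_bitrate(44_100, 16)   # 705,600 bits per second
phone = pcm_bitrate(8_000, 8)      # 64,000 bits per second

print(f"device: {device} bps, up to {nyquist_hz(44_100):.0f} Hz")
print(f"phone:  {phone} bps, up to {nyquist_hz(8_000):.0f} Hz")
print(f"ratio:  {device / phone:.1f}x")
```

Under these assumptions the phone channel carries roughly one eleventh of the raw data and nothing above 4kHz, a range that includes some of the energy that helps distinguish sounds like "s" and "f".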

With lower quality audio, the speech-to-text transcripts are often noisier. With noisy transcripts, it becomes more challenging for the following natural language processing models to understand and respond to the user.

There are many possible approaches to solving the problem of noisy transcripts. One is training an in-house speech recognition model on audio and language specific to the application domain. Another is to use settings such as keyword hints to provide additional context to third-party speech-to-text APIs. The set of n-best alternative transcripts can be re-ranked, taking a custom language model into consideration. Downstream models can be trained on noisy data or take phonetic signals as input features. In practice, a combination of these and other approaches can be used.
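As an illustration of the re-ranking approach, the sketch below combines each hypothesis's speech recognition score with a domain score. To keep it short, a simple keyword bonus stands in for a real custom language model; the term weights, function names, and example scores are all hypothetical.

```python
# Hypothetical bonus weights for domain vocabulary (a stand-in for a language model).
DOMAIN_TERMS = {"copay": 2.0, "deductible": 2.0, "coinsurance": 2.0}


def domain_score(text):
    """Sum the bonuses of the domain terms that appear in the transcript."""
    return sum(DOMAIN_TERMS.get(word, 0.0) for word in text.lower().split())


def rerank(nbest, weight=1.0):
    """Sort (transcript, asr_log_prob) pairs by ASR score plus weighted domain score."""
    return sorted(
        nbest,
        key=lambda hyp: hyp[1] + weight * domain_score(hyp[0]),
        reverse=True,
    )


nbest = [
    ("the co pay is twenty dollars", -1.0),   # top ASR hypothesis
    ("the copay is twenty dollars", -1.4),    # slightly lower ASR score
]
best_transcript = rerank(nbest)[0][0]
```

Here the second hypothesis wins after re-ranking because the domain bonus outweighs its slightly lower recognition score. The `weight` parameter controls how much the domain signal is allowed to override the recognizer.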

Navigating IVRs 

When calling a business, the first step is often navigating an Interactive Voice Response (IVR) system. These are the automated systems that say something like ‘If you’re calling about benefits press 1, if you’re calling about claims press 2, …’ They often rely on DTMF tones from phone keypad clicks and can be quite brittle. When automating calls, you not only have to build models that can talk to a human, but also build ones that can navigate through what seems like an unbounded set of IVR systems.
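For a sense of what the simplest form of IVR navigation might look like, the sketch below pulls menu options out of a transcribed prompt and picks the digit matching a goal. It assumes the prompt has already been transcribed and follows the "about X press N" or "for X press N" pattern. Real IVRs vary far more than one regular expression can cover, which is part of why this is hard.

```python
import re

# Matches phrases such as "about benefits press 1" or "for claims, press 2".
OPTION = re.compile(r"\b(?:about|for)\s+([a-z ]+?),?\s+press\s+(\d)", re.IGNORECASE)


def parse_menu(transcript):
    """Map each spoken menu label to the keypad digit it asks for."""
    return {label.strip().lower(): digit for label, digit in OPTION.findall(transcript)}


def choose_digit(transcript, goal):
    """Return the digit for the first menu label containing `goal`, or None."""
    for label, digit in parse_menu(transcript).items():
        if goal.lower() in label:
            return digit
    return None


prompt = "If you're calling about benefits press 1, if you're calling about claims press 2"
digit_to_send = choose_digit(prompt, "claims")
```

The chosen digit would then be sent as a DTMF tone through whatever telephony layer is in use. Returning `None` when nothing matches gives the caller of this function a chance to wait for the menu to repeat or fall back to another strategy.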

Detecting human speech

Most voice assistants are activated by a wakeword such as ‘Hey Siri’, ‘Ok Google’, or ‘Alexa’. Once the wakeword is detected, there is a clear point at which the assistant starts listening for the request it should respond to.

Over the phone, however, there is no clear delineation of when the assistant is being spoken to. We need to have additional models that can distinguish between a person’s voice, hold music, an automated voice, and background noise in order for the digital assistant to know when to listen and respond. 
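One way to turn such a classifier into a listen-or-wait decision is to require several consecutive "human" frames before the assistant engages, so a single misclassified frame of hold music does not trigger a response. The sketch below assumes per-frame labels arrive from an audio classifier that is not shown; the label names, class name, and three-frame threshold are hypothetical.

```python
from collections import deque

HUMAN, HOLD_MUSIC, AUTOMATED, NOISE = "human", "hold_music", "automated_voice", "noise"


class SpeakerGate:
    """Opens only after several consecutive frames are labeled as a live human."""

    def __init__(self, frames_required=3):
        self.recent = deque(maxlen=frames_required)  # most recent frame labels

    def update(self, label):
        """Record one frame label; return True when the assistant should engage."""
        self.recent.append(label)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(item == HUMAN for item in self.recent)
```

A larger `frames_required` makes the gate more robust to classifier errors but adds delay before the assistant responds, which trades off against latency.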

Automating calls for business

This final category is not unique to phone calls; it applies to any application in the enterprise sector. When using AI in an enterprise setting, the accuracy requirements are much higher than in consumer applications. Not only does the conversation need to be smooth, the data extraction needs to be perfect. For example, in the healthcare space, the outcome of a benefits verification phone call can determine if and when a patient can receive treatment, which is potentially life-saving information. For these types of applications, model accuracy is even more crucial.


If problems like these excite you, we’d love to chat with you and discuss how we are using our platform to help improve efficiencies in healthcare.