At Infinitus, we are building an AI solution that helps our customers automate their outbound phone calls. When Ankit and I first chatted about solving this problem, it seemed technically daunting: we had to build a platform that can hold a conversation lasting over 30 minutes with minimal errors, if any. While most conversational AI platforms can handle conversation state across 8-10 turns (one utterance from each participant in the conversation), our system has been built to handle state across over 100 turns. I’m happy to share that just a year and a half later, our system has successfully completed over ten thousand such phone calls. In this post, I outline some of the breakthroughs we have made and preview some of the exciting technical challenges that lie ahead.

The Infinitus Tech Stack

Every phone call starts with our VoIP layer. Unlike traditional short-lived REST requests, the data for a phone call flows over longer-lived connections like WebSockets and WebRTC, which have unique networking requirements. We have designed a backend architecture that lets us cooperatively work on all of the dataflows an Infinitus phone call needs – transcribing and recording audio, Natural Language Processing (NLP), speech synthesis, and so on – all within the context of a single long-lived connection. In the future, we will be exploring alternate VoIP architectures as well as building better traceability to harden our system against dropped connections and packets.
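To make that concrete, here is a minimal sketch (not our actual implementation) of how a single long-lived audio connection can feed several cooperating pipeline stages at once. It uses Python’s asyncio, and the `fake_audio_source` stand-in is purely hypothetical – in production the frames would arrive over a WebSocket or a WebRTC track.

```python
import asyncio

async def fan_out(source, queues):
    """Read frames from one long-lived connection and copy each
    frame to every downstream consumer's queue."""
    async for frame in source:
        for q in queues:
            await q.put(frame)
    for q in queues:
        await q.put(None)  # signal end of call

async def consumer(name, queue):
    """Stand-in for a pipeline stage such as transcription or recording."""
    while (frame := await queue.get()) is not None:
        print(f"{name} handling {len(frame)} bytes")

async def fake_audio_source(n_frames=3):
    """Simulated audio stream; a real one would read from the network."""
    for _ in range(n_frames):
        await asyncio.sleep(0.02)       # pretend to wait for audio
        yield b"\x00" * 320             # 20 ms of 8 kHz, 16-bit mono silence

async def main():
    stt_q, rec_q = asyncio.Queue(), asyncio.Queue()
    await asyncio.gather(
        fan_out(fake_audio_source(), [stt_q, rec_q]),
        consumer("stt", stt_q),
        consumer("recorder", rec_q),
    )

asyncio.run(main())
```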

After our VoIP layer, the first step of our AI pipeline is Speech to Text (STT). Because it is the input layer for a multi-stage NLP pipeline, errors from STT can cascade through the rest of our system. We have picked a third-party STT system that provides a model customized for the type of audio we receive. We have incorporated a domain-specific vocabulary into our STT model and built a custom evaluation model that ranks the alternative transcriptions the STT system generates. To further improve our STT layer, we plan to explore custom acoustic and language models built on our audio libraries.
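As an illustration of the reranking idea, here is a sketch that scores each alternative transcription by combining the provider’s confidence with a small bonus for domain vocabulary hits. The terms, weight, and `rerank` helper are hypothetical examples, not our production evaluation model.

```python
# Hypothetical reranker: weights and vocabulary are illustrative only.
DOMAIN_TERMS = {"prior authorization", "deductible", "copay", "npi"}

def rerank(alternatives):
    """alternatives: list of (transcript, provider_confidence) pairs.
    Returns them best-first under a simple combined score."""
    def score(alt):
        text, confidence = alt
        lowered = text.lower()
        hits = sum(term in lowered for term in DOMAIN_TERMS)
        return confidence + 0.1 * hits  # small boost per domain term
    return sorted(alternatives, key=score, reverse=True)

best, *_ = rerank([
    ("the co-pays twenty dollars", 0.80),
    ("the copay is twenty dollars", 0.78),
])
print(best)  # the domain bonus promotes the second alternative
```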

Our NLP system processes the output of STT and, given the context of the call thus far, decides how to respond (if at all). Our conversational engine takes into account the most recent utterance from the other party as well as the dialog state (inputs to the call, outputs collected, and the last N utterances). The engine is a combination of multiple models designed for the different stages of most B2B healthcare conversations. This gives our customers the flexibility to tailor the engine to their needs, while allowing us to support a wide variety of customers with a robust, scalable platform. As a result, our NLP system faces engineering challenges like tracking long-running conversation state and context, alongside traditional scientific challenges like training and evaluating different types of models for each type of context, industry, and customer. As we make many more calls through our system, we are excited about the opportunity to evaluate how recent advances in ML infrastructure and NLP, such as transformers, can be applied to our domain.
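For a rough picture of what dialog state means here, the sketch below models it as the call’s inputs, the outputs collected so far, and a bounded window of recent utterances. The `DialogState` class and its field names are illustrative assumptions, not our actual engine.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class DialogState:
    """Illustrative container for long-running conversation state:
    task inputs, outputs collected so far, and the last N utterances."""
    inputs: dict
    outputs: dict = field(default_factory=dict)
    history: deque = field(default_factory=lambda: deque(maxlen=10))

    def observe(self, speaker: str, utterance: str):
        self.history.append((speaker, utterance))

    def record(self, key: str, value: str):
        self.outputs[key] = value

state = DialogState(inputs={"member_id": "ABC123"})
state.observe("agent", "What is the patient's annual deductible?")
state.observe("callee", "It's five hundred dollars.")
state.record("deductible", "$500")
print(state.outputs, list(state.history))
```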

One of the areas we have had to invest in deeply is our data labeling platform. Collecting structured data with appropriate labels is vital for building ML models, and when we surveyed the available options, we found that most data labeling services work great for images and, with a bit of custom work, work just OK for labeling a single utterance. However, most tools really struggle to label full multi-turn conversations, so we have built a sophisticated set of frontend UX tools of our own. Beyond labeling data for model training, these tools have been extended to let us debug the calls our AI system has completed in the past, which has allowed us to iterate quickly on all of our systems.
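To give a sense of why multi-turn labeling is harder than labeling single utterances, here is a hypothetical shape for a labeled conversation record: every turn carries its own labels, and turns must stay ordered so labels can lean on earlier context. The schema and field names are invented for illustration, not our production format.

```python
# Hypothetical labeled-conversation record; all names are illustrative.
labeled_call = {
    "call_id": "call-0001",
    "turns": [
        {
            "speaker": "callee",
            "transcript": "The plan renews in January.",
            "labels": {"intent": "provide_plan_renewal",
                       "entities": {"renewal_month": "January"}},
        },
        {
            "speaker": "agent",
            "transcript": "Thank you. Is prior authorization required?",
            "labels": {"intent": "ask_prior_auth"},
        },
    ],
}

# A training pipeline can then walk the turns in order.
print([t["labels"]["intent"] for t in labeled_call["turns"]])
```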

To tie all of this together for our customers, we have designed and deployed an easy-to-use set of APIs, along with customer portals, both to create tasks for the Infinitus AI to perform and to process the results of completed tasks. We have taken a security-first mindset: we implemented a role-based access control system for viewing data through our customer portal, isolated each customer’s data, and engaged with security vendors to analyze and fix potential vulnerabilities. Another aspect of delivering value to our customers is teaching our AI systems the business rules and business intelligence our customers have accumulated over the years about the data these phone calls collect. While our AI can and will eventually learn these rules on its own, teaching it directly lets us incorporate this intelligence from day zero. To achieve this, we will be collaborating with our customers to design rule systems and domain-specific languages to validate all of the data that the Infinitus AI collects.
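As a sketch of what such a rule system might look like, the example below expresses customer business rules as named predicates over the collected outputs. The rule names, fields, and `validate` helper are hypothetical, meant only to show the shape of the idea.

```python
# Sketch of a declarative rule system for validating collected outputs.
# Rule names and fields are hypothetical examples of customer business logic.
RULES = [
    ("deductible_nonnegative", lambda o: o.get("deductible", 0) >= 0),
    ("oop_max_at_least_deductible",
     lambda o: o.get("oop_max", 0) >= o.get("deductible", 0)),
]

def validate(outputs):
    """Return the names of rules the collected data violates."""
    return [name for name, check in RULES if not check(outputs)]

print(validate({"deductible": 500, "oop_max": 400}))
# ['oop_max_at_least_deductible'] -- flags the data for human review
```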

While it is humbling to see how much we have built with a ‘one pizza engineering team’ in the last 16 months, I am uncomfortably excited about all that we have left to do. If you would like to join our amazing team on this exciting journey, we have a number of open engineering roles across frontend, backend infrastructure and services, and NLP.