The Replicant Labs series pulls back the curtain on the tech, tools, and people behind the Thinking Machine. From double-clicks into the latest technical breakthroughs like Large Language Models to first-hand stories from our subject matter experts, Replicant Labs provides a deeper look into the work and people that make our customers better every day.
Tom Sherman, Manager, Engineering
Humans are really good at having conversations. In the first few years of our lives, we learn all the tricks and nuances, and the result is a natural, intuitive, fluid process for exchanging information. We don’t think about “how” to have a conversation.
For a Thinking Machine to have a fluid conversation with a person, we need to understand and mimic the conversational capabilities of a human being. We break down the seemingly simple process of conversation into all of its tiny, constituent parts.
At each step of the way, little things can go wrong that ruin the fluidity of the conversation.
What does a single interaction’s lifecycle look like?
Let’s assume that a Thinking Machine answers the phone and asks the caller how it can help.
- The caller says: “I moved and I haven’t been getting my bill.”
- And the Thinking Machine responds: “Okay, I can help you with that.”
Let’s walk through the steps that are necessary to resolve this seemingly simple back-and-forth and prevent it from going off the rails:
“Hearing” the Caller’s Voice. First, we have to “hear” the audio. (For now, we’ll gloss over how telephone calls are made over the internet and how a Thinking Machine actually answers the phone.) We receive the audio packets streamed to us from the telephony provider and in turn stream those to speech-to-text software, also called an automatic speech recognition (ASR) system.
The ASR streams back transcripts of the words uttered, including partial, in-progress results. Given more time and more words, we can be more confident of what was said.
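To make this concrete, here is a minimal sketch of that streaming loop, assuming a hypothetical `telephony_stream` and `asr_client`; real telephony and ASR APIs differ in the details.

```python
# Minimal sketch of the audio -> ASR loop described above. `telephony_stream`
# and `asr_client` are hypothetical stand-ins for the telephony provider's
# media stream and a streaming speech-to-text client.

async def transcribe_call(telephony_stream, asr_client):
    """Forward the caller's audio to the ASR and collect its transcripts."""
    session = await asr_client.start_stream(sample_rate_hz=8000)  # phone-quality audio

    async for packet in telephony_stream:            # raw audio packets from the caller
        await session.send_audio(packet)

        # The ASR pushes back hypotheses as it hears more audio.
        async for result in session.poll_results():
            if result.is_final:
                print("final:", result.text, f"(confidence={result.confidence:.2f})")
            else:
                print("partial:", result.text)        # may still be revised

    await session.close()

# asyncio.run(transcribe_call(telephony_stream, asr_client)) would drive the loop.
```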
We also need to know when the user is done speaking. Humans make this decision easily and unconsciously, keying in on pauses, tone, facial expressions, and body language.
On the phone, software has only the words that have been uttered and the silences in the audio to go on. We can decide that an utterance is complete using dedicated software, or we can cede this decision to the ASR system.
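One simple way to frame that decision, sketched below with illustrative (not tuned) thresholds, is to treat a long enough stretch of trailing silence after a non-empty transcript as the end of the caller’s turn.

```python
# Illustrative end-of-utterance decision: the caller is "done" once we have
# heard some words and then a long enough pause. Thresholds are assumptions
# for the sketch, not production-tuned values.

END_OF_SPEECH_SILENCE_SEC = 0.8   # pause long enough to call the turn over
MIN_WORDS_BEFORE_ENDPOINT = 1     # never endpoint on pure silence

def is_end_of_utterance(transcript: str, trailing_silence_sec: float) -> bool:
    """Return True when we believe the caller has finished speaking."""
    has_content = len(transcript.split()) >= MIN_WORDS_BEFORE_ENDPOINT
    long_enough_pause = trailing_silence_sec >= END_OF_SPEECH_SILENCE_SEC
    return has_content and long_enough_pause
```

Tightening the silence threshold makes the machine feel snappier but more likely to interrupt; loosening it does the opposite.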
What happens if we make a bad decision? If it moves too fast, a machine might cut the caller off, missing part of what was said or interrupting them mid-sentence. But if we wait too long, the conversation feels stilted and strange, akin to a newscaster interviewing a foreign correspondent via satellite.
Let’s assume we get it right and properly judge when the human is done speaking. We have a transcript of the utterance from the ASR. But is it correct? What if the caller has the TV on in the background or is talking to their spouse during the call? Did we even transcribe the right voice?
What if we mishear the caller? We might be very confident that the caller said “I moved and I haven’t been getting my bill,” or we might not be so sure. When do we decide to re-prompt the caller and ask them to repeat what they said? Everyone’s had the experience of automated phone systems asking them to repeat themselves, and it quickly gets annoying!
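One hedged way to balance those concerns is a confidence threshold plus a cap on how often we re-prompt; the numbers below are assumptions for illustration.

```python
# Illustrative transcript-handling policy: accept a confident transcript,
# re-prompt on a shaky one, but never ask the caller to repeat themselves
# more than once. Threshold and cap are assumed values, not tuned ones.

ACCEPT_CONFIDENCE = 0.75
MAX_REPROMPTS = 1

def handle_transcript(confidence: float, reprompts_so_far: int) -> str:
    if confidence >= ACCEPT_CONFIDENCE:
        return "proceed"              # trust the transcript and move on
    if reprompts_so_far < MAX_REPROMPTS:
        return "reprompt"             # "Sorry, could you say that again?"
    return "proceed_with_best_guess"  # stop asking; work with what we heard
```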
Understanding the Caller’s Words. This time, the transcript is correct. The caller indeed said “I moved and I haven’t been getting my bill.” What does that text mean, in this context? Can the Thinking Machine help in this situation?
For that, we need to detect the intent of the utterance. To make this inference, we invoke a machine learning model that takes in arbitrary text (“I moved and I haven’t been getting my bill”) and returns a set of probabilities, each associated with a task or subject area.
For example, the model might respond that with a confidence of 79%, this utterance relates to the topic of mailing addresses. 79% sounds pretty confident, but what about 70%? 60%? 45%? When do we ask the caller to repeat, clarify, or rephrase?
What if there are two relatively high-probability options: do we ask the caller to disambiguate, i.e., choose between them? This is the art and science of conversation design for humans speaking to machines.
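As a sketch of that routing logic, assume a hypothetical intent model that returns a mapping of intents to probabilities; the thresholds below are illustrative only, and in practice they are tuned per use case.

```python
# Decide whether to act on the top intent, ask the caller to choose between
# two close contenders, or ask them to rephrase. Assumes at least two
# candidate intents; thresholds are illustrative assumptions.

CONFIDENT = 0.75       # act on the top intent without asking
AMBIGUOUS_GAP = 0.15   # two intents this close together -> disambiguate

def route_intent(scores: dict[str, float]) -> tuple[str, list[str]]:
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top_intent, top_p), (second_intent, second_p) = ranked[0], ranked[1]

    if top_p >= CONFIDENT and top_p - second_p > AMBIGUOUS_GAP:
        return "handle", [top_intent]                        # e.g. mailing_address
    if top_p - second_p <= AMBIGUOUS_GAP:
        return "disambiguate", [top_intent, second_intent]   # "Did you mean X or Y?"
    return "reprompt", []                                    # ask the caller to rephrase

# route_intent({"mailing_address": 0.79, "billing": 0.12, "other": 0.09})
# -> ("handle", ["mailing_address"])
```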
Speaking back to the Caller. The caller does seem to be inquiring about their mailing address, and the Thinking Machine is capable of helping with that task. So it needs to “say” as much to the caller, but typical software programs can’t speak. Instead, we use a text-to-speech (TTS) system to synthesize the text.
TTS systems take in a string of text (“Okay, I can help you with that.”) and return an audio file or stream of data. But just having the audio is not enough; we have to play it!
To do that, we send a command to the telephony provider to play the file we just made. The provider takes our audio file and streams the sound to the caller.
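Put together, the speak-back step might look like the sketch below, with hypothetical `tts_client` and `telephony_client` objects standing in for a real TTS system and telephony provider API.

```python
# Sketch of the "speak back" step. `tts_client` and `telephony_client` are
# hypothetical stand-ins; real TTS and telephony APIs differ in the details.

def say_to_caller(call_id: str, text: str, tts_client, telephony_client) -> None:
    """Synthesize `text` and ask the telephony provider to play it on the call."""
    audio = tts_client.synthesize(text, voice="en-US", sample_rate_hz=8000)

    # The provider streams the synthesized audio to the caller on our behalf.
    telephony_client.play_audio(call_id=call_id, audio=audio)

# say_to_caller(call_id, "Okay, I can help you with that.", tts_client, telephony_client)
```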
What if it takes a long time for the TTS system to generate the audio, or the telephony provider is slow to play the sound? The conversation becomes laggy and awkward. In fact, latency at any step ruins the fluidity of the conversation. Our systems can’t just work. They need to work in real time.
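Because the whole back-and-forth has to fit inside a conversational pause, one practical habit is to time every step against a budget; the stage names and budgets below are illustrative assumptions, not production-tuned figures.

```python
import time
from contextlib import contextmanager

# Track how long each stage of a turn takes so slow steps are easy to spot.
# The per-stage budgets are illustrative assumptions.

STEP_BUDGET_SEC = {"asr": 0.3, "intent": 0.1, "tts": 0.3, "playback": 0.2}

@contextmanager
def timed(step: str):
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        status = "ok" if elapsed <= STEP_BUDGET_SEC.get(step, 0.25) else "SLOW"
        print(f"{step}: {elapsed * 1000:.0f} ms ({status})")

# with timed("tts"):
#     audio = tts_client.synthesize("Okay, I can help you with that.")
```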
Ensuring Conversational Fluidity. As we’ve learned, many little things can go wrong when a person talks to a machine. This overview offers just a peek into the infinite nuances of building a Contact Center Automation platform (remember, the above example looks at just a single interaction in a conversation that can have dozens more).
That’s why we build complex systems to monitor the health and quality of the conversations on our platform. We’re always analyzing and improving, ensuring software is making the right decisions, and doing so quickly, so that conversations with Thinking Machines are always more fluid and natural than anything customers have experienced before.