VOICE22, which took place in Arlington from October 10th to the 12th, was a memorable event that provided valuable insights into the voice industry. One of its more eye-opening moments came with Openstream CEO Raj Tumuluri's presentation on the state of conversational AI (CAI) and the lessons learned on the frontier of its development. And it merits a short recap here.
Raj’s talk was far-reaching. This post highlights some of the elements that stood out for us. If you’d like the full story, you can view his presentation as well as all of the VOICE22 sessions On Demand here.
Raj opens by painting a portrait of conversational AI as it stands today. While tremendous inroads have been made in advancing CAI's sophistication and capabilities, many challenges remain. As Raj details the current state of CAI, he highlights some of those challenges and offers high-level solutions.
In his presentation, Raj examines knowledge bases and semantic parsing, explaining that both have evolved considerably in recent years. However, even with today's more capable semantic parsers, similar syntactic patterns often carry different semantic interpretations. He illustrates this with the classic pair: "I broke the glass" vs. "The hammer broke the glass." The first subject is a person (an agent); the second is an instrument. Knowing the difference between the two makes all the difference in the world for CAI.
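To make that concrete, here is a minimal, purely illustrative Python sketch (it assumes spaCy and its small English model are installed, and it is not Openstream's tooling). It shows that a syntactic parser assigns the same subject label to both sentences, so telling the person from the instrument requires semantics beyond surface syntax.

```python
# Illustrative sketch only: assumes spaCy and the "en_core_web_sm" model are
# installed (pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

for sentence in ["I broke the glass.", "The hammer broke the glass."]:
    doc = nlp(sentence)
    subject = next(tok for tok in doc if tok.dep_ == "nsubj")
    print(f"{sentence!r}: subject={subject.text!r}, dependency={subject.dep_}")

# Both subjects come back with the same 'nsubj' label. Deciding that "I" is
# an agent while "the hammer" is an instrument takes semantic role labeling
# or world knowledge - exactly the gap Raj describes.
```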
Raj makes it clear that CAI systems need to mimic the fundamental "brain behaviors" of human intelligence: perception, attention, thought, memory, knowledge formation, judgment and evaluation, problem-solving, and decision-making. It seems like a big ask, right? Sure. But Raj shows these capabilities in action with a demo of an AI avatar handling a call from a person whose phone has been stolen, and the results are very impressive.
The avatar demonstrates its ability to handle the caller's multi-intent speech. It breaks human speech into actionable chunks (sentence-splitting) in a way that stays faithful to the speaker's intent. It also keeps the caller's state in memory, both to regulate its tone and to avoid asking them to repeat themselves - two behaviors critical to fostering trust with the speaker. The system also knows how to behave in a multi-turn conversation, pausing when relevant and providing explanations on cue.
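To ground those ideas, here is a hypothetical, heavily simplified Python sketch of two of the behaviors described above: splitting one utterance into separate intents and remembering what the caller has already said so the system never asks for it again. The slot names and the splitting rule are assumptions made for illustration, not Openstream's implementation.

```python
# Hypothetical sketch: naive multi-intent splitting plus a dialogue state
# that prevents the system from re-asking for information it already has.
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    slots: dict = field(default_factory=dict)    # facts the caller has given us
    intents: list = field(default_factory=list)  # pending caller goals


def split_intents(state: DialogueState, utterance: str) -> list[str]:
    """Stand-in for real sentence-splitting: break an utterance on 'and'
    and queue each chunk as its own intent."""
    chunks = [c.strip() for c in utterance.split(" and ") if c.strip()]
    state.intents.extend(chunks)
    return chunks


def ask(state: DialogueState, slot: str, question: str) -> str:
    """Only ask for a slot the caller has not already supplied."""
    if slot in state.slots:
        return f"(no need to ask - {slot} is already {state.slots[slot]!r})"
    return question


state = DialogueState(slots={"phone_status": "stolen"})
print(split_intents(state, "my phone was stolen and I need to block my SIM"))
print(ask(state, "phone_status", "What happened to your phone?"))
```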
Raj asserts that CAI needs to be weaned off its dependence on atomic statements. Human beings don't typically speak that way, so the AI needs to adapt to human speech - not the other way around. And this kind of speech interpretation requires those "brain behaviors" we mentioned above (perception, attention, thought, memory, knowledge formation, judgment and evaluation, problem-solving, and decision-making).
As Raj points out, current deep learning techniques, which tend to rely on big data and the vector space model, work well for atomic input, but they generally fail as soon as you introduce constraints into the input. He gives the example of someone wanting to book a flight. Say a human stated their request in the following manner:
"I want to fly to Boston from Washington, on Wednesday, at 5 PM."
That's easily reducible to its atomic elements: destination (Boston), origin (Washington), date (Wednesday), and departure time (5 PM).
And AI will successfully provide the user with flights meeting those conditions.
But what if the user expressed themselves this way:
"I want to fly to Boston after Wednesday but before the weekend - and ideally on a flight that doesn't take off too late in the day."
The AI will need to take a step back and interpret that correctly so it can ask the user a minimal number of pertinent questions and produce a successful outcome. On top of that, the AI shouldn't wait to be asked before providing an explanation - it should be proactive, offering a rationale for its utterances as a human would. And it should also inform the customer about any preconditions relevant to the situation. For example:
"Before I can display some flight options, I need to know from where you will be flying to Boston. I also need to know if 3 PM is too late for you."
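As a rough illustration of the gap between the two phrasings, here is a hypothetical Python sketch: the first request drops neatly into fixed slots, while the second is a bundle of constraints and unknowns the system must reason over (and ask about) before it can query any flights. The slot names, the 3 PM cutoff, and the weekday mapping are assumptions made for the example.

```python
# Hypothetical sketch contrasting a fully specified "atomic" request with a
# constrained one. None of this reflects Openstream's actual data model.
from datetime import time

# "I want to fly to Boston from Washington, on Wednesday, at 5 PM."
atomic_request = {
    "origin": "Washington",
    "destination": "Boston",
    "date": "Wednesday",
    "departure": time(17, 0),
}

# "I want to fly to Boston after Wednesday but before the weekend - and
#  ideally on a flight that doesn't take off too late in the day."
constrained_request = {
    "destination": "Boston",
    "origin": None,                               # unknown -> must be asked
    "date": lambda d: d.weekday() in (3, 4),      # Thursday or Friday
    "departure": lambda t: t <= time(15, 0),      # soft preference: not "too late"
}

def missing_slots(request: dict) -> list[str]:
    """Return the slots the system still needs to ask the user about."""
    return [name for name, value in request.items() if value is None]

print(missing_slots(atomic_request))       # -> []
print(missing_slots(constrained_request))  # -> ['origin']
```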
The way forward, according to Raj, is for AI to come closer to "thinking" like a human - to make artificial intelligence more, well… intelligent.
The presentation's goal was two-fold: first, to take the pulse of the current state of conversational AI; second, and more importantly, to sketch a way forward for CAI to become more "aware," with its own sense of understanding. That heightened "understanding" will enable the tech not just to do more but to "think," understand, and relate more, while simultaneously fostering deeper trust with the user.
We look forward to seeing what innovative tech Openstream has in store for us moving forward.
The future is multimodal.
If you’d like access to all of the talks and events from VOICE22, they’re available On Demand here.
Modev was founded in 2008 on the simple belief that human connection is vital in the era of digital transformation. Modev believes markets are made. From mobile to voice, Modev has helped develop ecosystems for new waves of technology. Today, Modev produces market-leading events such as VOICE Global, presented by Google Assistant; VOICE Summit, the most important voice-tech conference globally; and the Webby award-winning VOICE Talks internet talk show. Modev staff, better known as "Modevators," include community building and transformation experts worldwide. To learn more about Modev and the breadth of events and ecosystem services offered live, virtually, locally, and nationally, visit modev.com.