“Thank you for calling GloBank. Para español, oprima el número dos. What is this call regarding? For instance, you can say ‘Get my account balance.’”
“Open a new account.”
“OK, loans and payment amounts. Enter or speak your account number please.”
“No. I’d like to –”
“Sorry, I couldn’t hear that number. Please tell me your account number.”
“I didn’t understand that command. Thank you for calling…”
“AGENT! REPRESENTATIVE! AGENT!”
A sad sort of human-to-computer stalemate has played out over countless fruitless interactions like this one. Companies adopted IVR (interactive voice response) systems over the last few decades, possibly in an attempt to reduce hiring and training costs for human customer service reps, a role with notoriously high turnover.
Or, perhaps some forward-thinking executives thought a robot-voiced CSR would make a company appear more ‘advanced’ in comparison to its competitors.
Whatever the reason, our earliest conversations with IVR menus and chatbots left most of us humans feeling let down, like we weren’t having a conversation at all. Even though voice recognition and computer speech have improved dramatically in speed and sophistication, it’s hard for some of us to shake the feeling that nobody is on the other end of the line to help.
Conversational AI encompasses a broad range of technologies that seek to make humans and computer systems work better together by training software to understand and communicate with people, using natural language conversation as the interface.
Recognizing the conversational chasm
Early chatbot and voice systems operated like audio versions of text menus and decision trees. The system ‘agent’ asks questions of the customer, who provides responses from a limited set of options that the agent can recognize, either to fill out the data fields in a form or to advance the customer to the next menu of options.
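That menu-and-decision-tree pattern can be sketched in a few lines. The menu names, prompts, and keywords below are purely illustrative, but the structure — keyword matching that either advances the caller or dead-ends — is the whole trick:

```python
# A minimal sketch of an early IVR/chatbot "agent": a decision tree of
# menus, where each recognized response advances the caller to the next
# node, and anything unrecognized hits a dead end. All names illustrative.

MENU_TREE = {
    "root": {
        "prompt": "What is this call regarding?",
        "options": {
            "balance": "account_menu",
            "loans": "loan_menu",
        },
    },
    "account_menu": {
        "prompt": "Enter or speak your account number.",
        "options": {},  # terminal node: would capture a form field
    },
    "loan_menu": {
        "prompt": "OK, loans and payment amounts.",
        "options": {},
    },
}

def respond(state: str, utterance: str) -> tuple[str, str]:
    """Advance through the menu tree by keyword; else fail in place."""
    node = MENU_TREE[state]
    for keyword, next_state in node["options"].items():
        if keyword in utterance.lower():
            return next_state, MENU_TREE[next_state]["prompt"]
    # The classic failure mode: no option matches, so the caller is stuck.
    return state, "I didn't understand that command."

state, reply = respond("root", "Get my account balance")
print(reply)  # matched "balance" -> account_menu prompt
state, reply = respond("root", "Open a new account")
print(reply)  # no keyword match -> the familiar dead end
```

Note that “Open a new account” fails exactly the way the opening transcript does: the tree has no branch for it, so the system can only loop or give up.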
While sophistication has improved, until very recently most of our interactions with chatbots and translators weren’t powered by AI at all. Instead, process mining and workflow tools, or dedicated interaction design and development teams, constructed complex sequences of user preferences and behaviors layered atop other software to enhance the customer experience.
Just a decade ago, around 2012, the recurrent neural network (or RNN) method — long studied in AI research — arrived on the commercial scene. An RNN feeds its own previous state back in as input, creating short feedback loops for the model to learn from, and chat routines could use it to algorithmically improve responses to any query, including text and audio cues.
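The feedback loop at the heart of an RNN is compact enough to sketch. In this illustrative NumPy snippet the weights are random placeholders (a real model would learn them from conversation data); the key point is that the hidden state `h` carries a summary of everything seen so far and is fed back in at every step:

```python
import numpy as np

# A sketch of the RNN feedback loop: the hidden state h summarizes the
# sequence so far and is mixed with each new input. Weights are random
# placeholders here; training would tune them on real dialogue data.

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # input -> hidden
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden -> hidden (the loop)
b_h = np.zeros(hidden_size)

def rnn_step(x, h):
    """One step of a vanilla RNN: new state mixes input with prior state."""
    return np.tanh(W_xh @ x + W_hh @ h + b_h)

# Process a short "utterance" of 5 token vectors; state persists across steps.
h = np.zeros(hidden_size)
for x in rng.standard_normal((5, input_size)):
    h = rnn_step(x, h)

print(h.shape)  # one fixed-size state vector, no matter the sequence length
```

This recurrence is also the architecture’s weakness: context has to squeeze through that one state vector step by step, which is part of what transformers later improved on.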
Text recognition blossomed into better natural language understanding (or NLU), which provided a deeper semantic grasp of written and spoken words – their definition, declension and tense within the language’s grammar – the same foundation on which language develops in humans.
AI was finally starting to cross the conversational chasm between people and systems, but something was still missing.
From happy paths to contextual transformations
Game players and customers alike generally appreciate being guided through a dialogue along happy paths, making them more likely to achieve worthwhile outcomes.
Think about a typical computer game and the sequential dialogue included within, leading to a limited set of phrases for the game player and character responses. While someone could program in thousands of unique dialogue options to cover all possible scenarios, the resulting gameplay would still feel scripted, and rather boring.
Unfortunately, today’s complex, highly interconnected world has raised conversational expectations far beyond what any finite number of predefined dialogue options can encompass. Customers talk to systems, which talk to other systems and organizations – then they seek out conversational help when self-service options fail, and when problems require expertise or experience in navigating these multimodal conversation chains.
To pick up where happy paths leave off, the state of the art in conversational AI today lies in transformers. Starting with the transformer architecture in 2017, models like BERT (2018) and GPT-3 (2020) offered a step-change in scalability over previous automation methods and RNNs.
Taking advantage of the parallel processing power of elastic cloud resources, researchers have trained transformers on terabytes of text to produce very plausible writing and unscripted dialogue that almost seems human. One open GPT-3 writing demo can literally finish your sentences, and generate more paragraphs for you in the same vein. The DALL-E demo showed how AI can even interpret descriptive phrases into generated imagery, in any artistic style you like.
Of course, unless your organization has a supercomputer research facility at its disposal, these models will require massive investments of time and money to train on high volumes of data for your own custom business purposes.
Still, there’s a gap opening up in conversational AI that even bigger budgets can’t close. Once you grow a transformer’s training set to billions of words and phrases (what we call large language models, or LLMs), interpretation accuracy can actually drop, because the mistakes common in public dialogues and publications get learned along with everything else.
Though it’s an earlier example from 2016, Microsoft’s Tay chatbot incident showed how some no-good pranksters could influence a conversational model toward very undesirable results.
Attention and planning
The foundation and essence of a good human-to-human conversation isn’t what you say; it’s listening. AI transformers can cross the chasm of understanding through attention.
In a real conversation, whether or not the grammar is correct, the order of the words over time matters as much as the words themselves, especially when speakers group or omit subjects and objects within a sentence or paragraph.
A plan-based dialogue goes beyond natural language understanding of the word and phrase data, using attention to fill in the intent and state of the conversation.
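The attention mechanism that powers transformers can be sketched in a few lines of NumPy. The shapes below are illustrative: each “word” (a query) scores its relevance against every other word (the keys), so the model weighs word order and groupings across the whole utterance at once, rather than one step at a time:

```python
import numpy as np

# A sketch of scaled dot-product attention, the core of the transformer:
# softmax(Q K^T / sqrt(d)) V. Each row of the weight matrix says how much
# one word "listens to" every other word. Shapes are illustrative.

def attention(Q, K, V):
    """Scaled dot-product attention over a sequence of word vectors."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d = 4, 8                                   # 4 "words", 8-dim embeddings
Q = K = V = rng.standard_normal((seq_len, d))

out, weights = attention(Q, K, V)
print(out.shape)                                    # one context-blended vector per word
print(weights.sum(axis=-1))                         # each row of weights sums to ~1.0
```

Because every word attends to every other word simultaneously, nothing has to squeeze through a single recurrent state — which is what lets attention capture the groupings and omissions described above.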
The Openstream.ai platform generates a high-performance conversation graph rather than a data store, allowing its Eva conversational AI to observe and freely associate millions or trillions of different potential word orders for contextual dialogue inputs and outputs without bogging down response times or losing understanding over time.
Author: Jason English (email@example.com)
Principal Analyst & CMO, Intellyx