When someone says ‘Conversational AI’ to you, the challenge is figuring out what that actually means, especially since today, when people hear ‘AI’, they leap straight to Generative AI thanks to the proliferation of tools like ChatGPT. AI has been around for a long time and will continue to build momentum as a tour de force for decades to come.
Within the realm of AI, Conversational AI (CAI) can range from something as simple as a plug-and-play chatbot on a website to a sophisticated virtual agent suitable for large enterprises. Enterprise-class CAI platforms aren’t as conspicuous as the common chat widget on a website, and they aren’t limited to basic text interactions. CAI enables a human to have a conversation with a digital entity (e.g., one representing an enterprise) in a very human-like manner, using natural interaction modes. Technically this is called multimodality – a term that refers to the multi-sensory abilities we all have.
We all want to have Star Trek-like experiences interacting with these digital entities. That’s one of the reasons we all went out and bought devices that reply to simple commands like ‘Hey Google’ or ‘Alexa’, after all. The problem is that we’ve been conditioned by scripted chatbots and the IVR decision trees of call centers to adapt to their interfaces. We have been trained to ‘push 1 for support’ or to resort to typing phrases that the bot hopefully understands, trying to get the answer we need.
This conditioning continues with nearly all Generative AI interfaces and permeates our expectations of how to interact with an AI system. We are typing into a 1980s-style terminal interface to get answers. Typing! Further, to get the most from these Generative AI platforms, humans are forced to craft prompts in a manner the system understands. The system – not you.
Multimodality is Key to AI and Human Dialogue
You, you darn human, shouldn’t need to adapt to any channel, mode, or language just to have a conversation with an enterprise’s digital entity. Set your expectations higher! Let’s put the burden of understanding, empathizing, engaging, and reacting to your needs on the shoulders of the CAI – and, while we’re at it, insist that it carries that burden in a highly ethical and transparent manner.
Multimodality isn’t so much a ‘thing’ as a very sophisticated, highly tuned set of capabilities and AI disciplines that work together to make a CAI more ‘human’. It endows the CAI with the ability to communicate with, and understand, the humans it converses with – in whatever mode they choose – across any channel or language. As human beings, we don’t think about ‘modes’ when we’re talking with another person. We simply have a conversation, without thinking about how we’re relaying information to one another throughout that discussion. It just ‘is’ – and that’s what multimodality facilitates in digital entity-to-human conversations.
Specifically, a multimodal CAI can understand the human user regardless of the mode or channel being used at any given moment in a conversation. The CAI deciphers your intention, what you’re saying, how you’re saying it, and more. It devises a plan to help you achieve your goal based on what it thinks you want to do next, and it tailors its responses accordingly. The more naturally it can converse with you, the easier it is for you to convey your intentions to the CAI, and vice versa. After all, dialogue is a two-way street.
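To make that pattern concrete, here is a minimal, hypothetical sketch in Python of the loop described above: inputs from any mode or channel are normalized into one common representation before a single understand-plan-respond cycle runs. Every name here (NormalizedMessage, detect_intent, plan_response) is an illustrative assumption, not any particular vendor’s API, and the intent logic is a toy stand-in for a real NLU model.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: every channel/mode (voice, web chat, SMS, etc.)
# is normalized into one shared representation, so the understanding
# and planning logic never needs to know which mode the human chose.

@dataclass
class NormalizedMessage:
    user_id: str
    channel: str                  # e.g. "voice", "web_chat", "sms"
    text: str                     # transcribed or typed utterance
    sentiment: str = "neutral"    # derived upstream from tone/wording
    attachments: list = field(default_factory=list)

def detect_intent(msg: NormalizedMessage) -> str:
    """Toy stand-in for a real NLU model: decipher what the user wants."""
    lowered = msg.text.lower()
    if "meet" in lowered or "schedule" in lowered:
        return "schedule_meeting"
    if "help" in lowered:
        return "support_request"
    return "unknown"

def plan_response(intent: str, msg: NormalizedMessage) -> str:
    """Tailor the reply to the inferred goal, not to the channel."""
    if intent == "schedule_meeting":
        return "Happy to help set that up. What day works best for you?"
    if intent == "support_request":
        return "I can help with that. Can you tell me a bit more?"
    # Dialogue is a two-way street: ask rather than guess.
    return "I want to make sure I understand. Could you rephrase that?"

# The same loop serves a spoken utterance or a typed chat message alike.
incoming = NormalizedMessage(user_id="u42", channel="voice",
                             text="Can we schedule a meeting next week?")
print(plan_response(detect_intent(incoming), incoming))
```

The design point of this sketch is that intent detection and response planning sit behind the normalization layer, so adding a new channel or mode never touches the conversation logic itself.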
Consider the last time you had a virtual meeting with a colleague or a friend – say, the part where you were planning something together. You spoke. They listened. And not only did they hear you – they understood your language and reacted to your tone, emotion, and more. They parsed what you were saying on the fly and understood what you meant. And if they didn’t understand you, they asked questions to dig deeper and gain that understanding.
Another dimension that helped both parties interpret the conversation was the visual cues. They watched you. You watched them. And subconsciously you each responded to and interpreted what the other was doing while speaking. Each could tell if the other was paying attention, assess moods, and observe the gestures used to underscore a moment or point at something.
If you chose to Slack, SMS, or WhatsApp that person later in the day, you’d likely expect them to remember what you both discussed and to continue the conversation. And, because the channel allowed it and it helped you communicate, you might attach photos, audio files, links, and so on – all in service of coordinating a future face-to-face meeting. Perhaps the dialogue went something like this: