
Multimodality Enables More Effective Collaboration with AI

When someone says ‘Conversational AI’ to you, the challenge is figuring out what the heck that means, especially since today, when people hear ‘AI’, they leap straight to Generative AI thanks to the proliferation of tools like ChatGPT. AI has been around for a long time and will continue to build momentum as a driving force for decades to come.

Within the realm of AI, Conversational AI (CAI) can range from something as simple as a plug-and-play chatbot on a website to a sophisticated virtual agent suitable for large enterprises. Enterprise-class CAI platforms aren’t as visible as the familiar website chat widget, and they are not limited to basic text interactions. CAI enables a human to have a conversation with a digital entity (i.e., an enterprise’s digital representative) in a very human-like manner, using natural interaction modes. Technically this is called multimodality, which refers to the multi-sensory abilities we all have.

We all want to have Star Trek-like experiences interacting with these digital entities. That’s one of the reasons we all went out and bought devices that reply to simple commands like ‘Hey Google’ or ‘Alexa’, after all. The problem is that we’ve all been conditioned by scripted chatbots and the IVR decision trees of call centers to adapt to their interfaces. We have been trained to ‘push 1 for support’ or to type phrases that the bot hopefully understands, just to get the answer we need.

This conditioning continues with nearly all Generative AI interfaces and permeates our expectations of how to interact with an AI system. We are typing into a 1980s-style terminal interface to get answers. Typing! Further, to get the most from these Generative AI platforms, humans are forced to craft prompts in a manner the system understands. The system – not you.

Multimodality is Key to AI and Human Dialogue

You, you darn human, shouldn’t need to adapt to any channel, mode, or language just to have a conversation with an enterprise’s digital human entity. Set your expectations higher! Let’s put the burden of understanding, empathizing, engaging, and reacting to your needs on the shoulders of the CAI – and, while we’re at it, insist that it do so in a highly ethical and transparent manner.

Multimodality isn’t so much a ‘thing’ as a very sophisticated, highly tuned set of capabilities and AI disciplines that work together to make a CAI more ‘human’. It endows the CAI with the ability to communicate with, and to understand, the humans it’s conversing with – in the manner they choose – across any channel or language. As human beings, we don’t think about ‘modes’ when we’re talking with another person. We just have a conversation, without thinking about how we’re relaying information to one another. It just ‘is’ – and that’s what multimodality facilitates in digital entity-to-human conversations.

Specifically, a multimodal CAI can understand the human user regardless of the mode or channel being used at any given time during a conversation. The CAI deciphers your intention, what you’re saying, how you’re saying it, and more. It devises a plan to help you achieve your goal based on what it thinks you want to do next and tailors its responses accordingly. The more naturally it can have a conversation with you, the easier it is for you to convey your intentions to the CAI, and vice versa. After all, dialogue is a two-way street.
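The channel-agnostic idea above can be sketched in a few lines. To be clear, this is a toy, hypothetical illustration – the names and keyword rules here are mine, not any real CAI platform’s API – and a production system would use trained language, speech, and vision models rather than keyword matching:

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    """One user input, whatever channel it arrived on."""
    channel: str   # e.g. "text", "voice" (as a transcript), "image" (as a caption)
    content: str   # the normalized text form of the input

def to_intent(utterance: Utterance) -> dict:
    # Toy keyword rules standing in for real intent models (illustrative only).
    text = utterance.content.lower()
    if "damaged" in text or "claim" in text:
        return {"intent": "file_claim", "channel": utterance.channel}
    if "meet" in text:
        return {"intent": "schedule_meeting", "channel": utterance.channel}
    return {"intent": "unknown", "channel": utterance.channel}

# The same intent emerges regardless of the channel it arrived on:
print(to_intent(Utterance("voice", "See this, it's damaged")))
print(to_intent(Utterance("text", "Hey, let's meet next week")))
```

The point of the sketch is its shape: every mode funnels into one shared intent representation, so the response-planning step downstream never needs to care which channel the human happened to choose.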

Consider the last time you had a virtual meeting with a colleague or a friend in which you were planning something together. You spoke. They listened. And not only did they hear you – they understood your language and reacted to your tone, emotion, and more. They parsed what you were saying on the fly and understood what you meant. And if they didn’t understand you, they asked questions to dig deeper and gain that understanding.

Another dimension that helped both parties interpret this conversation was the visual cues. They watched you. You watched them. And subconsciously you each responded to and interpreted what the other was doing while speaking. Each could tell if the other was paying attention, assess moods, and observe gestures used to underscore a moment or point at something.

If you chose to Slack, SMS, or WhatsApp that person later in the day, you’d likely expect them to remember what you both discussed and to continue your conversation. And, because the channel allowed it and it helped you communicate, you might attach some photos, audio files, links, etc., all in the course of coordinating a future face-to-face meeting. Perhaps the dialogue went like this:

  • “Hey, let’s meet next week. I’d love to see you.”
  • “Great, what works best for you?”
  • “Oh, you know, later in the week. How’s Thursday or Friday?”
  • “Sure, I can do that. Maybe Friday sometime?”
  • “Yeah ok, mornings are bad. How’s 1?”
  • “Great. See you at the coffee shop?”
  • “Sure.”

Throughout this interchange, you reasoned with one another, and each of you could explain what was said and why. Even coordinating the rendezvous required a complex, subconsciously interpreted conveyance of instructions, rules, and parameters – simple for a human mind to decipher, yet oh-so complex to implement within a CAI platform.
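To make the ‘simple for a human, complex for a CAI’ point concrete, here is a deliberately naive, hypothetical sketch of extracting the scheduling constraints from that exchange. The rules and names are illustrative only; a real platform would rely on trained models, dialogue state tracking, and clarifying questions rather than regexes:

```python
import re

def extract_constraints(turns):
    """Accumulate scheduling constraints across dialogue turns (toy rules)."""
    state = {"days": set(), "excluded_periods": set(), "time": None, "place": None}
    for turn in turns:
        t = turn.lower()
        for day in ("monday", "tuesday", "wednesday", "thursday", "friday"):
            if day in t:
                state["days"].add(day.capitalize())
        if "mornings are bad" in t:
            state["excluded_periods"].add("morning")
        m = re.search(r"how's (\d{1,2})\b", t)
        if m:
            # Crude disambiguation: if mornings were ruled out, read "1" as 1 PM.
            suffix = " PM" if "morning" in state["excluded_periods"] else ""
            state["time"] = f"{m.group(1)}:00{suffix}"
        if "coffee shop" in t:
            state["place"] = "coffee shop"
    return state

turns = [
    "Hey, let's meet next week. I'd love to see you.",
    "Oh, you know, later in the week. How's Thursday or Friday?",
    "Yeah ok, mornings are bad. How's 1?",
    "Great. See you at the coffee shop?",
]
print(extract_constraints(turns))
```

Notice that even this toy version has to infer that ‘1’ means 1:00 PM from the fact that mornings were ruled out – exactly the kind of cross-turn, common-sense reasoning a human performs without noticing.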

Consider what’s needed when filing a claim with an insurance company representative. As a human, you may just want to take some pictures, or walk around your basement and point to things: ‘See this, it’s damaged.’ Or ‘Look, here’s where the pipe broke.’ You’d expect a human agent to know what you were pointing at while you had the conversation. Further, you’d expect the human to be fully trained, to know about your policy and your coverage, and to carefully guide you through the claims process during your time of need. And you should expect no less from a virtual representative.

This is why multimodality – the ability of an embodied virtual assistant to collaborate and converse with a human as naturally as possible – is important. You don’t have to stretch your imagination too far to picture this scenario without CAI (‘Please enter your account number.’ ‘Are you trying to pay your bill?’ ‘Push 4 to go back to the main menu.’). We’ve all had bad experiences. But they don’t have to be that way with virtual assistants.

Imagine better experiences: simply having a natural-feeling conversation with the many companies you interact with regularly, both as a customer and as an employee. With multimodal CAI, enterprise digital entities and their personas can understand and engage with you anywhere, anytime, without forcing you to adapt to them. Rather, you get to be you – a human.

Note: This article was originally published on Spiceworks and has been updated by the author.