
What is Multimodal Conversational AI?

Entrusting your brand’s reputation to virtual agents that are brought to life through AI can be a leap of faith for many organizations. As with every technology before it, your personal experiences as an end user, what you’ve heard from trusted colleagues, and what you’ve read in publications set the bar. And until the recent proliferation of ChatGPT, which hails from a class of AI called Generative AI, not enough businesses had considered just how powerful engaging in an AI-powered conversation can feel.

As we covered in our recent blog, ChatGPT isn’t ready to be put on the front lines with your customers. But enterprise-class Conversational AI is, although not all Conversational AI platforms are created equal. To truly deliver on the requirements of enterprise-class Conversational AI, it is essential to seek out platforms with multimodality in their DNA rather than bolted on. But what is multimodality, and why does it matter?

The spectrum of Conversational AI spans from simple ‘plug and play’ scripted chatbots to sophisticated enterprise-class Conversational AI virtual agents. The former you’ve likely encountered on any number of B2C and B2B websites (think ‘What brings you here today?’ prompts). The latter you might not have noticed or engaged with personally, as truly enterprise-class Conversational AI platforms aren’t as ubiquitous or as easily identified.


Multimodality, in the context of enterprise-class Conversational AI, is the ability of the platform to seamlessly engage with its users through any number of modes of communication. The modes that need to be accommodated, understood, and ultimately acted upon form a continually shifting mix of interactions, woven together and as varied as human conversation itself. Instrumenting a Conversational AI to deliver ‘human-like’ experiences for users is both complex and critical-path for organizations that aspire to have their Conversational AI be the embodiment of their brand throughout user journeys across omnichannel experiences.

Embracing Multimodality in Real Life

Consider something as simple as ordering a coffee from your favorite barista. Beyond the basic mode of engagement (how you place the order: going through the drive-through, using an app on your phone, or walking inside), there are many dimensions that affect the quality of your experience and the brand’s ability to engage with you and help you achieve your goal, while earning a bit of profit and doing so as efficiently as possible. Let’s consider the scenario of walking inside.

You like this shop and the barista inside because whenever you come in they acknowledge you quickly, ask about your day and the weather outside, and remember that the last time you came in you had just started a new job, asking ‘How’s the new job treating you?’ with a smile. You counter with some small talk, gesturing to the window at the rain falling outside.

When you make eye contact, the barista knows you are ready to order. Today you are in a bad mood, and the barista can tell from the tone of your voice, the answers you just gave, and the expressions on your face as you speak. Of course, because today’s crazy, you’d like a snack, so you point to something in the display case and ask ‘How much is that sandwich?’. There are three different sandwiches to choose from, yet the barista can tell which one you’re pointing at and tells you the cost. You pay with a credit card to rack up more loyalty points, leave a cash tip because you plan to be back tomorrow, and then walk out.

Adapting Multimodality to Conversational AI

In this scenario, the very natural human interactions between customer and barista go well beyond simple inputs of voice, text, and images and their outputs. The interactions are seamless, driven by underlying cues that nobody really thinks about but that are, in fact, quite human. They just happen.

You wanted a coffee. The store planned to sell you that coffee and offer additional items in a pleasant environment. They do so to earn your loyalty, trust, and money. Done well, you’ve enjoyed a good coffee and snack and had an experience that brings you back over and over again. You don’t think about how you pointed at that snack or why you frowned and made eye contact as components of your engagement. The magic of what humans take for granted in day-to-day conversations is still intact. Done poorly, the man behind the curtain is exposed.

Endowing an enterprise-class Conversational AI like Eva™ with multimodality in its DNA brings to bear, for the organizations that leverage it, an innate understanding of the user and of how they are engaging with the business at a particular moment. The method or channel that a user chooses to engage through affects what is possible and when. Enterprise-class Conversational AI refines the dimensions of communication that are possible and interprets the modes of communication being used, while relaying sensible, empathetic, and natural responses to help the user achieve their goals.

Why is Multimodality Important to Enterprise-Class Conversational AI?

Combining modalities reduces the time needed to convey intent (think about the act of pointing, smiling, and saying out loud ‘I want that’ within one moment). Further, harnessing the traits of multimodality increases mutual disambiguation across the modalities: each mode helps resolve what is ambiguous in the others. Simply said, the whole ends up being greater than the sum of its parts. This allows the virtual agent to infer what you are trying to accomplish from everything it can observe in the conversation. Properly harnessing multimodality makes conversations smarter while feeling more natural, and provides more insights back to the organizations that embrace it.
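
To make mutual disambiguation a little more concrete, here is a minimal sketch in Python based on the sandwich moment from the coffee shop. It is an illustration only, not how Eva™ or any particular platform works. The item names, the scores, and the `fuse` function are all hypothetical, and multiplying per-item scores is just one simple way to combine two modes.

```python
def fuse(speech_scores, gesture_scores):
    """Combine per-item scores from two modes by multiplying and renormalizing.

    Both arguments map an item name to a score between 0 and 1.
    This is an assumed, simplified fusion rule for illustration.
    """
    items = speech_scores.keys() & gesture_scores.keys()
    joint = {item: speech_scores[item] * gesture_scores[item] for item in items}
    total = sum(joint.values())
    if total == 0:
        return {}  # the two modes share no plausible candidate
    return {item: score / total for item, score in joint.items()}


# Hypothetical scores. "How much is that sandwich?" rules out the muffin
# but cannot tell the three sandwiches apart.
speech = {
    "ham sandwich": 1 / 3,
    "turkey sandwich": 1 / 3,
    "veggie sandwich": 1 / 3,
    "blueberry muffin": 0.0,
}

# The pointing gesture lands between the turkey sandwich and the muffin,
# so on its own it cannot decide between those two.
gesture = {
    "ham sandwich": 0.05,
    "turkey sandwich": 0.45,
    "veggie sandwich": 0.05,
    "blueberry muffin": 0.45,
}

fused = fuse(speech, gesture)
best = max(fused, key=fused.get)
print(best)
```

Neither mode can pick a single item alone: speech leaves a three-way tie and the gesture leaves a two-way tie. Combined, only the turkey sandwich scores well on both, which is the sense in which the modes disambiguate each other.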

Enterprise-class Conversational AI does not dictate what ‘slot’ a user has to fill in to get from point A to point B in their dialogue. The user is in control of how, what, where, when, and even why they are engaging the enterprise-class Conversational AI agent, and of the combination of modalities they choose to use (consciously or subconsciously) to communicate.

Always-on multimodality can’t just be a bolt-on to support human-like interactions. It needs to be woven into the fabric of enterprise-class Conversational AI. Users (people) want to engage with a company in the manner that’s most comfortable and convenient for them. People notice when they are forced to adapt their approach to communication, and they can get frustrated by it. Anyone who’s ever used an IVR (Interactive Voice Response) platform and whammed on ‘0’ to speak to an operator knows this plight firsthand.

Multimodality is the catalyst to provide more personalized, contextually aware, and frictionless human-like responses. Delivering natural, seamless, and consistent interactions across all touchpoints improves overall customer satisfaction and loyalty for users. Businesses benefit by improving productivity and yield while deriving actionable intelligence at scales previously unfathomable.