Understanding Multimodal Interactions

Jan 22, 2019

Voicify was built from the beginning to support multimodal interactions. We understand the power of delivering content and managing an experience that engages more than one human sense. Not only does Voicify support multimodality, it offers automatic cascading device management as well as custom modal controls for each interaction within the system, giving you both freedom and efficiency to choose from.

But let’s back up and understand what multimodality is.  To do that we must understand modality in general.

Wikipedia tells us a modality is the classification of a single independent channel of sensory input/output between a computer and a human.

We also know humans have five ‘traditional’ senses: sight, hearing, taste, smell, and touch. Mapping those against modern technology, three of the five can be engaged by delivered content and experiences: sight, hearing, and touch.

Applying this to the evolution of voice, these modalities are delivered consistently no matter the assistant: visual, audio, and tactile output; audio and tactile input.

Breaking that down further: a device is capable of speaking, showing visuals, and issuing haptic output (think Apple Watch vibrations), thus supporting all three deliverable senses.

A user is capable of supplying input through audio (speaking) and touch (usually hands and fingers on a screen). Visual input (facial response, body motion) seems a natural progression, but is not currently in widespread use.

So multimodality is the idea of marrying two or more senses across input and output experiences. It’s important to note that not all ‘smart devices’ are multimodal, or need to be.

The Amazon Echo and Google Home Mini are both unimodal, offering only audio input/output. Though Alexa and Google Assistant are capable of delivering content that supports other senses, the devices themselves are not.



The Amazon Echo Show, Google Home Hub, and most modern car interfaces are capable of multimodal experiences, but not across all inputs and outputs. They each support audio input/output, visual output, and touch input.




Most phones and watches are capable of an (introducing a new term) “omnimodal” experience with virtual assistants. This means that, based on currently accepted modalities, these devices support them all. They can use audio, visual, and haptic inputs and outputs (speakers, screens, keyboards, vibrations). We will see more and more cameras used for visual input, further widening the combinations of ways these devices understand us.
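The unimodal/multimodal/omnimodal taxonomy above can be sketched as a small data model. This is a minimal illustration in Python; the device entries and sense names are hypothetical choices of mine, not part of any real assistant SDK or the Voicify platform:

```python
# A rough sketch of the device/modality taxonomy described above.
# Device entries and sense names are illustrative, not an official API.

SENSES = {"sight", "hearing", "touch"}

# Which senses each device can accept input from and deliver output to.
DEVICE_MODALITIES = {
    "Amazon Echo":     {"in": {"hearing"},                   "out": {"hearing"}},
    "Google Home Hub": {"in": {"hearing", "touch"},          "out": {"hearing", "sight"}},
    "smartphone":      {"in": {"hearing", "sight", "touch"}, "out": {"hearing", "sight", "touch"}},
}

def classify(device: str) -> str:
    """Label a device unimodal, multimodal, or omnimodal."""
    caps = DEVICE_MODALITIES[device]
    # Omnimodal: every accepted sense is covered in both directions.
    if caps["in"] >= SENSES and caps["out"] >= SENSES:
        return "omnimodal"
    # Otherwise, count the distinct senses touched by input or output.
    senses_used = caps["in"] | caps["out"]
    return "multimodal" if len(senses_used) > 1 else "unimodal"
```

Under this model, `classify("Amazon Echo")` returns `"unimodal"`, the Google Home Hub comes out `"multimodal"`, and the smartphone `"omnimodal"`, matching the breakdown above.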


The landscape of devices and supported modalities can be a challenge for anyone creating content and engagement for voice-first devices. A key driver of the Voicify Voice Engagement Platform is to reduce that complexity for brands and management teams. As new modalities evolve, so too will the platform.

We love this stuff!


