The Multimodal Approach: Explained
“Our intuition tells us that our senses are separate streams of information. We see with our eyes, hear with our ears, feel with our skin, smell with our nose, taste with our tongue. In actuality, though, the brain uses the imperfect information from each sense to generate a virtual reality that we call consciousness. It’s our brain’s best guess as to what’s out there in the world. But that best guess isn’t always right.” – Dr. David Ludden Ph.D.
The quote above comes from Dr. David Ludden, Ph.D., professor of psychology at Georgia Gwinnett College. In this excerpt, Dr. Ludden emphasizes that human perception is, in fact, subjective.
Furthermore, Ludden explains how the brain uses a multimodal approach (multiple senses) to better perceive external scenarios and draw conclusions.
What is Multimodality?
Multimodality is a term that is slowly but surely entering our everyday lexicon. But what does it actually mean, and where does it come from?
Derived from the Latin words ‘multus’, meaning many, and ‘modalis’, meaning mode, multimodality, in the context of human perception, is simply that: the ability to utilise multiple sensory modalities to encode and decode external surroundings. When combined, these modalities create a consolidated, singular view of the world (source).
This is not a new idea. In fact, it dates back to our earliest ancestors, hundreds of thousands of years ago. According to comparative psychologists, language itself has multimodal origins in the primate world. They describe communication as spanning three modalities: vocalizations, gestures, and facial expressions.
Consequently, it could be said that the multimodal approach to perceiving the outside world is, in fact, human nature. So, in the context of technology, how are machines replicating this innate, human-associated ability?
What is Multimodal AI?
Multimodal perception is not limited to humans; it extends into the world of technology. When applied to Artificial Intelligence specifically, combining multiple data sources into one model is known as Multimodal Learning (source).
That said, the multimodal approach to human perception has evolved over time, resulting in a more complex (and powerful) understanding of the world around us.
At the same time, this approach can be applied to technology: as AI algorithms evolve, they can learn to recognise scenarios in the digital sphere using multimodal AI recognition.
Machines Delivering the Truth Via Elimination of False Positives
Going back to the idea of understanding the truth via human multimodal analysis, the same can be said for machines. There is an interesting parallel between human “Truth” and its opposite in AI, the “False Positive”: a result that a computer presents as true when it is not. More specifically, every Cognitive Algorithm provides a “confidence score”. The higher the score, the more the result can be trusted, and the more closely it correlates to the truth.
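To make the idea of a confidence score concrete, here is a minimal Python sketch. It assumes a hypothetical recognition model that returns a label and a score between 0 and 1 for each detection; the `Detection` class, the `filter_by_confidence` function and the 0.8 threshold are illustrative choices, not part of any particular product or library.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str         # what the (hypothetical) model claims to have recognised
    confidence: float  # the model's score in [0, 1]; higher means more trustworthy


def filter_by_confidence(detections, threshold=0.8):
    """Keep only detections whose confidence meets the (assumed) threshold."""
    return [d for d in detections if d.confidence >= threshold]


# Made-up example output from a single-mode image recogniser.
detections = [
    Detection("person", 0.93),
    Detection("microphone", 0.55),
    Detection("crowd", 0.81),
]

trusted = filter_by_confidence(detections)
print([d.label for d in trusted])  # ['person', 'crowd']
```

A threshold like this trades missed detections against false positives: raising it discards more wrong answers, but also more correct ones.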
Today, machines are closer than ever to replicating human perception of the external world. The catch? Mainstream machine learning, or machine perception, is more closely related to a human dream. Many machines are programmed to recognise only one mode and miss “the big (multimodal/multi-sensory) picture”. The result is often a set of floating elements, with no clear idea of what is going on.
Without cross-variance multimodal perception, which eliminates false positives and adds context, the pursuit of truth is lost in translation.
The big idea here is that false positives must be accounted for (for example, a machine must understand that a picture of a person is not necessarily the real-life person). This is the idea of advanced, multimodal machine learning.
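One simple way to picture this is cross-modal corroboration: a label is accepted only when more than one modality supports it. The Python sketch below is a rough illustration under stated assumptions, not a description of how any specific system works. The modality names, the `corroborate` function, the 0.6 threshold and the two-modality rule are all hypothetical.

```python
from collections import defaultdict


def corroborate(observations, threshold=0.6, min_modalities=2):
    """Return labels backed by at least `min_modalities` distinct modalities.

    `observations` is a list of (modality, label, confidence) tuples.
    Observations below `threshold` are ignored.
    """
    supporting = defaultdict(set)
    for modality, label, confidence in observations:
        if confidence >= threshold:
            supporting[label].add(modality)
    return sorted(
        label for label, mods in supporting.items() if len(mods) >= min_modalities
    )


# Made-up observations from three hypothetical recognisers on one video clip.
observations = [
    ("face", "Person A", 0.91),        # strong face match, but it may only be a poster
    ("face", "Person B", 0.88),
    ("voice", "Person B", 0.79),       # the voice agrees with the face match
    ("transcript", "Person B", 0.70),  # the spoken text mentions the same person
]

print(corroborate(observations))  # ['Person B']
```

Here "Person A" is dropped despite a high face score, because no other modality confirms it. This is the kind of false positive that a picture of a person can produce.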
Examples: Multimodal Technologies
According to the European Language Resources Association, multimodal technologies refer to “all technologies combining features extracted from different modalities (text, audio, image, etc.).” Sound familiar?
Some such examples would be:
- Automatic Speech Recognition (Audiovisual).
- Person Identification (Audiovisual).
- Event Detection (Audiovisual).
- Object or Person Tracking (Audiovisual).
- Head Pose Estimation.
- Gesture Recognition.
Just as we have established that human perception is subjective, the same can be said for machines. At a time when machine learning is changing the way humans live and work, AI using the multimodal approach is able to perceive and recognise external scenarios. At the same time, this approach replicates the human approach to perception, flaws included.
The benefit is that machines can replicate this human approach to perceiving external scenarios, and do so quickly: certain AI technology can perceive information up to 150x faster than a human (in parallel with a human gatekeeper).
With this new development, we are getting closer to mimicking human perception, and the possibilities are endless.
Limitations: Interpretation is Flexible not Fixed
While it is well known that humans possess this extraordinary ability to encode and decode complex real-life scenarios via a multimodal approach, there are still limitations.
Despite our innate and advanced ability to recognise a variety of objects, situations and people, the world around us is not always how it appears (or feels, sounds, etc.).
As psychologist D.R. Proffitt puts it:
“Perception is not fixed: it is flexible.” – Source
As humans, we naturally create subjective perception. Evolutionarily, our brains were not programmed to experience senses individually. Therefore, as previously stated, we naturally combine multiple senses (e.g. sight and hearing) to mold a more accurate picture of the world.
Human Psychology: Multimodal Factors of Perception
In truth, what humans perceive in the external world is a direct reflection of our psychological state. That is to say, humans can perceive external scenarios in extremely different ways depending on a multitude of factors. Such factors include, but are not limited to: memories, past experiences, culture, gender, age, interests, education, etc.
Just as these cognitive factors influence human perception, the same can be said for machine learning and its associated “learned” cognitive applications.
If we want to understand and, in turn, communicate the truth, or what we deduce to be the truth based on a global, multimodal view, we have to take multiple factors into account. This is something our brain does automatically.
For the sake of simplicity, let’s look at the multimodal factors of perception and communication that we use most often as humans:
3. knowledge/learning (context)
Things are not always as they appear: Example, Greta Thunberg
When missing one mode of perception, things can get tricky. Take the following example, for instance:
To the average eye, Greta Thunberg looks and sounds like an average 16-year-old, and if you didn’t know her or her story, you might think she lived an average life.
In fact, if you watch the news, you probably do recognize this Swedish teenager, who became one of the most prominent environmental activists of her time after initiating several protests outside the Swedish parliament in 2018. Her message was clear and concise: a call for more aggressive action against climate change at the international level.
Here is a video in which Thunberg warns viewers about the impact of climate change.
Seemingly overnight, Thunberg spearheaded the “School Strike for Climate”, which garnered global attention. Students from all around the world took part in Fridays for Future, in which they skipped school to protest for a better world with a smaller collective carbon footprint.
Once we learn about her initiatives through international news coverage, the face of this Swedish teenager takes on an entirely new role. Thunberg is now a symbol: the face of revolution and of combating climate change on the largest scale ever known to mankind.
This example shows that sometimes:
- What we see can change what we hear
- What we hear can change what we see
- What we know contextually can change what we see and hear
Stay tuned for our next article which will dive deeper into Multimodal AI Application via Confidence Levels.
Newsbridge is a cloud-based solution offering video indexing tools based on Multimodal AI contribution for publishers and broadcasters.
Taking into account facial, object and scene recognition with audio transcription and semantic context, Newsbridge provides unprecedented access to content. Whether it be derushing, archiving or investigative research, the solution allows for smart media asset management.
Today our platform is used by journalists, editors, TV Channels, documentarists, production houses and sports federations in contribution and post-production workflows.