Omnimodal AI: Why tomorrow’s AI will have all five senses
In the near future, we'll see AI that can not only see and hear, but also feel, smell, and taste. In fact, we already have the technology to do it.
Dec 14, 2023 • 6 Minute Read
Just last week, Google Gemini was released, and the news exploded: here was an AI that could “see” and “hear” us! There are some semantic arguments there — hence the air quotes — but the gist is sound: we now have a generative multimodal AI that can understand videos, images, text, and audio.
Right now, we talk about two kinds of generative AI: unimodal and multimodal. Unimodal AI can only take one kind of input (like text), whereas multimodal AI can take more than one type. Everyone’s trying to slap the multimodal label on their generative AI product if it can handle more than just text: it’s the new must-have feature to advertise.
However, there is a third kind of generative AI that nobody is talking about, one I believe will become a reality sooner than we think: omnimodal AI.
What is an omnimodal AI?
An omnimodal AI is a generative AI model that can take all five classic senses as input: visual, auditory, tactile, olfactory, and gustatory. It can process and interpret data from each of these modalities, allowing it to understand and interact with the world in a way that mimics human perception.
We also already have the individual building blocks for an omnimodal AI, just as we had the pieces for multimodal AI years before those features were incorporated into Google Gemini and GPT-4; it’s just a matter of refining how we apply them.
Why would an AI need to smell, taste, or touch?
As the saying goes, “if it looks, swims, and quacks like a duck, it's probably a duck”: we use multiple senses to identify objects. AI benefits from multiple senses in the same way, and adding smell, taste, and touch would help it better understand and interact with its environment.
The power of smell
Currently, an AI that can take visual input will scan what it sees, check against its training data, and give a confidence score (e.g., “I am 68.22% confident this is a pile of sugar”). However, what if we gave it the ability to smell, like a sniffer dog? With that additional input, the AI might suddenly realize that it is not sugar at all, but an illicit substance.
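As a rough illustration of the visual half of that example, here is a minimal sketch of how an image classifier produces a confidence score. It assumes a pretrained torchvision ResNet-18 and a hypothetical sample.jpg; the “sugar” scenario above would of course need a model trained on the right categories.

```python
# Minimal sketch: turning an image into a label plus a confidence score.
# Assumes torchvision's pretrained ResNet-18; "sample.jpg" is hypothetical.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights).eval()
preprocess = weights.transforms()          # resize, crop, normalize

image = Image.open("sample.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)     # shape: (1, 3, 224, 224)

with torch.no_grad():
    logits = model(batch)
    probs = torch.softmax(logits, dim=1)[0]

conf, idx = probs.max(dim=0)
label = weights.meta["categories"][idx]
print(f"I am {conf.item():.2%} confident this is a {label}")
```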
The smelling half is a real technology we have today, in the form of an electronic nose or e-nose — a tool used in laboratories and in process and production departments. Currently, we use it for:
Detecting contamination, spoilage, and adulteration
Monitoring, managing, and comparing materials (e.g. detecting mass, polymers, gas)
Sensory profiling of formulations and recipes
Benchmarking of competitive products
However, if we combined an e-nose with AI, we could take things a step further. In medicine, you can detect harmful bacterial and viral conditions via specific scents; lung cancer, for instance, gives off unique organic compounds that can be detected. For environmental monitoring, it could detect volatile organic compounds in air, water, and soil samples.
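As a sketch of what that pairing might look like, the snippet below trains a classifier on hypothetical readings from an invented 8-channel gas-sensor array; real e-nose systems use calibrated sensors and much larger, labeled datasets.

```python
# Minimal sketch: classifying substances from hypothetical e-nose readings.
# The sensor channels, labels, and data below are invented for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(seed=0)

# Pretend each sample is one "sniff": 8 gas-sensor channel readings.
N_CHANNELS = 8
sugar_sniffs = rng.normal(loc=0.2, scale=0.05, size=(100, N_CHANNELS))
illicit_sniffs = rng.normal(loc=0.6, scale=0.05, size=(100, N_CHANNELS))

X = np.vstack([sugar_sniffs, illicit_sniffs])
y = np.array(["sugar"] * 100 + ["illicit substance"] * 100)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# A new, unknown sniff.
unknown = rng.normal(loc=0.58, scale=0.05, size=(1, N_CHANNELS))
for label, p in zip(model.classes_, model.predict_proba(unknown)[0]):
    print(f"{label}: {p:.1%}")
```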
Imagine an omnimodal AI acting as an oncological assistant that could take one look at you and not only compare every mole on your skin against photos of cancers such as melanoma, but literally smell whether you had signs of brain cancer. This is simply beyond the capacity of a regular doctor (no matter how good their nose is).
For a less life-or-death example, an AI with olfactory senses could smell when a room is unpleasant, and provide pleasant floral scents to mask it.
The power of touch
Touch is one of the primary ways we understand the world, and the same benefits would apply to an AI. We can identify an object simply by running our hands around it: is it hard or soft, cold or hot, round or square?
Again, this is a technology that already exists. Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have already come up with a predictive AI that can learn to see by touching, and learn to feel by seeing.
“By looking at the scene, our model can imagine the feeling of touching a flat surface or a sharp edge,” says Yunzhu Li, CSAIL PhD student and lead author on a new paper about the system.
“By blindly touching around, our model can predict the interaction with the environment purely from tactile feelings. Bringing these two senses together could empower the robot and reduce the data we might need for tasks involving manipulating and grasping objects.”
Another form of this technology is e-skin, which is an electronic skin stretched over a machine to detect proximity, heat, moisture, and touch interactions.
The power of taste
As humans, we love taste. Take a look at how many restaurants, bars, and fast food outlets are near your home. How many products do we have on the supermarket shelves that are nutritionally similar, but taste completely different? It’s a big market, and an AI that could taste could revolutionize it.
By comparing taste data against consumer buying patterns, an omnimodal AI could create new recipes or flavor combinations. It could also analyze flavor aging in beverages (such as wine), make better-tasting medicines, and generally monitor biological and biochemical processes.
And yes, you guessed it: this technology exists, and it’s called an electronic tongue.
The power of sound
We currently have computers that can take audio input, like voice assistants, but what happens when we combine this with generative AI and take the technology to the next level? AI that can analyze subtle nuances in voice patterns becomes possible, assisting in customer service or detecting health issues.
For instance, let’s take someone who has a slight respiratory problem — this is something such an AI might be able to diagnose. It could also process distress signals, background noises, or other audio cues during emergency calls to assist in quicker and more accurate responses. There are a lot of sounds that we as humans can’t hear, or are too distracted to detect.
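For a sense of what “analyzing voice patterns” could mean in code, here is a minimal sketch that extracts a few acoustic features from a recording. The file name call.wav and the choice of features are illustrative assumptions; an actual diagnostic system would need clinically validated models on top of features like these.

```python
# Minimal sketch: pulling acoustic features out of a voice recording that a
# downstream model might use. "call.wav" and the feature choices are
# illustrative assumptions, not a validated diagnostic pipeline.
import numpy as np
import librosa

audio, sr = librosa.load("call.wav", sr=16000)

# Mel-frequency cepstral coefficients: a compact summary of vocal timbre.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)

# Rough proxies for breathiness or strain: spectral centroid and RMS energy.
centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)
rms = librosa.feature.rms(y=audio)

features = np.concatenate([
    mfcc.mean(axis=1),      # average timbre
    mfcc.std(axis=1),       # how much the timbre varies
    [centroid.mean(), rms.mean()],
])
print(f"Feature vector of length {len(features)} ready for a classifier")
```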
The power of sight
In medical imaging, an AI with strong visual capabilities could spot anomalies easily missed by the human eye. In security, it could monitor video feeds with a level of detail and persistence a human couldn’t match, and respond appropriately. In research, it could identify microscopic patterns and analyze them in real time — this would also be beneficial for detecting defects in manufactured products.
By our powers combined…
The real benefit of an omnimodal AI isn’t just the individual senses, but the combination of them to paint a more detailed picture of the world — perhaps one that we, as humans, couldn’t achieve with our limited senses. That’s not to say an omnimodal AI would necessarily replace us, as we have cognitive reasoning skills and other capabilities that make us unique, but it would certainly be assistive.
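As a toy illustration of that combination, the sketch below performs late fusion over made-up probability estimates from three senses: each modality reports its own belief about an object, and a simple product-of-experts step merges them so that strong evidence from one sense (like smell) can override an ambiguous visual reading. Production multimodal models fuse learned representations inside the network rather than final scores, but the principle is the same.

```python
# Minimal sketch of late fusion: each sense produces its own belief about what
# an object is, and we combine them into a single, sharper estimate. The
# modalities, labels, and numbers are hypothetical.
import numpy as np

LABELS = ["sugar", "salt", "illicit substance"]

# Per-modality probability estimates (each sums to 1).
beliefs = {
    "vision": np.array([0.68, 0.22, 0.10]),   # "probably sugar"
    "smell":  np.array([0.05, 0.10, 0.85]),   # the nose disagrees
    "taste":  np.array([0.30, 0.40, 0.30]),   # inconclusive
}

# Product-of-experts fusion: multiply the beliefs and renormalize.
fused = np.ones(len(LABELS))
for probs in beliefs.values():
    fused *= probs
fused /= fused.sum()

for label, p in zip(LABELS, fused):
    print(f"{label}: {p:.1%}")   # the smell evidence tips the final answer
```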
Conclusion: It’s not a matter of if, but when
Things are moving at the speed of light right now in the field of AI. I wouldn’t be surprised if someone were to announce an omnimodal AI in the next five years. We may see a low-latency omnimodal AI that acts as a digital assistant, responding in real time via a speaking avatar that is not pre-animated in any way. We might also see an omnimodal AI put into a small chassis, like a version of ElliQ with arms and wheels. Really, it’s the stuff we’ve been promised since Asimov started writing about it in the 1940s.
All of it sounds like science fiction, but so would ChatGPT have if you’d explained it to someone in 2021. We’ve got the technology; all we need to do is work on it. How long before someone puts it all together and makes it commercially available?