Google launches Gemini, an AI that takes video input

Just as 2023 was winding down, Google released a new natively multimodal AI that can understand videos, images, text, and audio.

By Adam Ipsen

Apr 15, 2024 • 4 Minute Read


If you cover AI news, November was a crazy month. Not only was there all the madness of OpenAI’s DevDay and a slew of new ChatGPT announcements, but Amazon also decided to up its AI ante with Amazon Q. I love innovation, but as December started, I was hoping for a quiet breather.

And then Google went and dropped Gemini.

What is Google Gemini?

Gemini is a multimodal AI model from Google that can understand not only text, but video, images, and audio. It can also understand and generate code, and generate text and images combined. It comes in three versions depending on your processing requirements: Ultra, Pro, and Nano.

Another cool feature of Gemini is it can understand languages visually. For instance, if you show it a camera feed of a music score with Italian notation, it’s able to understand what this means and explain it back to you.

Which one is better? Gemini vs GPT vs Claude

Google claims Gemini Ultra narrowly outpaces GPT-4 in most categories, such as math, code, and multimodal tasks. For instance, Google reports that it is better than GPT-4 at math by 2%. However, Google’s comparison does not include OpenAI’s superior GPT-4 Turbo, and there are currently no comparison studies with Anthropic’s Claude 2.1.

Google states Gemini is the first model to outperform human experts on MMLU (Massive Multitask Language Understanding), which is a test that asks questions in 57 subjects such as STEM, humanities, and others. In this area, it got a score of 90% vs GPT-4 at 86.4%.

However, anecdotal reports from users have been lukewarm to say the least, citing frequent hallucinations and translation errors (as well as some questions about the demo videos). A clearer picture of Gemini’s capabilities will emerge once independent researchers have had time to test it.

Gemini is more multimodal than GPT and Claude

In terms of being multimodal — being able to understand multiple types of input — Gemini is currently ahead of the pack. It can natively take video, images, text, and audio as input. In comparison, GPT-4 with Vision (GPT-4V) accepts image and text, and Claude 2.1 only takes text input. Gemini can create images, and with access to DALL-E 3, GPT-4V can as well.

Gemini has a smaller memory, produces significantly less output

Gemini’s token window is significantly smaller than those of both Claude 2.1 and GPT-4 Turbo: Gemini has a 32k token window, GPT-4 Turbo has a 128k token window, and Claude 2.1 has a massive 200k token window, the equivalent of about 150k words, or 500 pages of text. The size of the token window is generally an indicator of how much information a model can remember and produce in one go.
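To put those windows side by side, here is a rough back-of-the-envelope sketch in Python. It assumes the ratios implied by the 200k figure above (about 0.75 words per token and 300 words per page) hold for every model, which is only an approximation since each model tokenizes text differently. The names `CONTEXT_WINDOWS` and `estimate_capacity` are made up for this illustration.

```python
# Ratios implied by the article's figures: 200k tokens ~ 150k words ~ 500 pages.
# Assumption: the same ratios apply to every model (real tokenizers differ).
WORDS_PER_TOKEN = 150_000 / 200_000  # 0.75 words per token
WORDS_PER_PAGE = 150_000 / 500       # 300 words per page

# Token windows quoted in the article.
CONTEXT_WINDOWS = {
    "Gemini": 32_000,
    "GPT-4 Turbo": 128_000,
    "Claude 2.1": 200_000,
}


def estimate_capacity(tokens):
    """Return a rough (words, pages) estimate for a given token window."""
    words = tokens * WORDS_PER_TOKEN
    pages = words / WORDS_PER_PAGE
    return round(words), round(pages)


for name, tokens in CONTEXT_WINDOWS.items():
    words, pages = estimate_capacity(tokens)
    print(f"{name}: {tokens:,} tokens ~ {words:,} words ~ {pages} pages")
```

By this estimate, Gemini’s 32k window works out to roughly 24,000 words, or about 80 pages, against about 320 pages for GPT-4 Turbo.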

The latency of Gemini is still unknown

One big factor for AI models with shiny new features is latency: when GPT-4 came out, it provided much better outputs than GPT-3.5, but at the cost of speed. Google is clearly offering three different versions of Gemini to provide lower-latency options at the expense of capabilities, but how these stack up against other models remains to be seen. Again, it will only be a matter of time before this research appears.

How do I use Google Gemini AI?

Google Bard now uses a fine-tuned version of Gemini Pro behind the scenes, and it is also available on Pixel. Google plans to bring it to Search, Ads, Chrome, and Duet AI in the next few months. For developers, Gemini Pro will be available from December 13 via the Gemini API in Google AI Studio or Google Cloud Vertex AI.
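For developers curious what a call to the Gemini API might look like, here is a minimal sketch using only Python’s standard library. It assumes the `v1beta` REST endpoint, the `gemini-pro` model name, and the request and response shapes as I understand them from the launch documentation; any of these may have changed, so check the current docs before relying on it. The helpers `build_request` and `generate` are names invented for this example, and you would need your own API key from Google AI Studio.

```python
import json
import urllib.request

# Assumed endpoint and model name (v1beta REST API at launch); verify before use.
API_URL = (
    "https://generativelanguage.googleapis.com/v1beta/"
    "models/gemini-pro:generateContent"
)


def build_request(prompt, api_key):
    """Build (but do not send) a text-only generateContent request."""
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return urllib.request.Request(
        f"{API_URL}?key={api_key}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, api_key):
    """Send the request and return the text of the first candidate reply."""
    with urllib.request.urlopen(build_request(prompt, api_key)) as resp:
        reply = json.load(resp)
    # Assumed response shape: candidates -> content -> parts -> text.
    return reply["candidates"][0]["content"]["parts"][0]["text"]
```

Calling `generate("Explain what a multimodal model is.", "YOUR_API_KEY")` would send the prompt over the network; `build_request` on its own is handy for inspecting the payload without spending any quota.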

Google has said Android developers will soon have access to Gemini Nano via AICore, a new system capability available in Android 14. Gemini Ultra is still being fine-tuned and tested for safety, with release expected in early 2024.

Conclusion: A big step in multimodal AI input

While Gemini’s on-paper capabilities don’t blow GPT-4’s out of the water (a single-digit percentage difference isn’t really going to mean much to someone using ChatGPT), the multimodal inputs are really something else. I expect OpenAI and Anthropic will be rushing to add native video and audio input to their feature pipelines, if it’s not there already. It will be interesting to see how much latency these functions add to the process.

Adam I.

Adam is a Lead Content Strategist at Pluralsight, with over 13 years of experience writing about technology. An award-winning game developer, Adam has also designed software for controlling airfield lighting at major airports. He has a keen interest in AI and cybersecurity, and is passionate about making technical content and subjects accessible to everyone. In his spare time, Adam enjoys writing science fiction that explores future tech advancements.
