Google Gemini Embedding 2 Launches With Native Multimodal Support

By alex2404

Google has released Gemini Embedding 2 into public preview, a new embeddings model that maps text, images, video, audio, and documents into a single numerical space — without requiring any media to be transcribed into text first.

Most competing models still treat text as the primary input: to search a video library, they transcribe the footage and embed the transcript. Gemini Embedding 2 ingests audio and video directly, cutting out the conversion step and the transcription errors that come with it.

According to the announcement, the model reduces latency by as much as 70% for some customers. It also lowers total cost for enterprises running AI systems that query their own internal data.

What Embeddings Actually Do

An embedding model converts data — a sentence, an image, a podcast clip — into a list of numbers called a vector. Those numbers act as coordinates on a high-dimensional map. Items that are semantically similar end up positioned close together, regardless of whether they share the same media format.
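The notion of "close together" can be made concrete with cosine similarity, the standard way to compare embedding vectors. The vectors below are made-up four-dimensional toys purely for illustration; real models produce vectors with thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (illustrative numbers, not real model output).
dog_photo   = [0.9, 0.1, 0.0, 0.2]   # hypothetical image embedding
dog_caption = [0.8, 0.2, 0.1, 0.3]   # hypothetical text embedding
stock_chart = [0.0, 0.9, 0.8, 0.1]   # unrelated content

print(cosine_similarity(dog_photo, dog_caption))  # high, ~0.98
print(cosine_similarity(dog_photo, stock_chart))  # low, ~0.10
```

A photo of a dog and a caption describing one end up near each other; a stock chart lands far away, even though the photo and the caption are different media formats.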

The applications are widespread. Search engines use embeddings to return results based on meaning rather than exact keywords. Streaming platforms use them to power recommendations. Enterprises use them for Retrieval-Augmented Generation, where an AI assistant looks up internal documents to answer employee questions accurately.
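The retrieval step behind such systems can be sketched in a few lines: embed the documents once, embed the query, and rank by similarity. Everything here is a toy with invented names and vectors; in practice the vectors would come from an embedding model and live in a vector database:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Hypothetical document store mapping ids to precomputed embeddings.
documents = {
    "vacation-policy.pdf":      [0.9, 0.1, 0.1],
    "expense-report-guide.pdf": [0.2, 0.9, 0.1],
    "onboarding-checklist.pdf": [0.1, 0.2, 0.9],
}

def retrieve(query_vec, store, top_k=1):
    """Return the top_k document ids ranked by cosine similarity."""
    ranked = sorted(store, key=lambda doc_id: cosine(query_vec, store[doc_id]),
                    reverse=True)
    return ranked[:top_k]

# Toy embedding of a query like "how many vacation days do I get?"
query = [0.85, 0.15, 0.05]
print(retrieve(query, documents))  # ['vacation-policy.pdf']
```

In a RAG pipeline, the retrieved documents are then passed to a language model as context for generating the answer.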

The underlying concept traces back to linguist John Rupert Firth in the 1950s. The modern industry standard was set by Word2Vec, released in 2013 by a Google team led by Tomas Mikolov.

A Single Space for All Media

The model operates in a 3,072-dimensional space. That single unified map means developers no longer need separate retrieval systems for images and text — a text query can surface a specific moment in a video, or return an image that semantically matches the request.
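What a unified space buys you is that one index can hold vectors from every modality, and a single text query is compared against all of them the same way. The sketch below uses invented three-dimensional vectors and file names for illustration; the real model's vectors are 3,072-dimensional:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# One index, mixed modalities: (label, modality, embedding).
# Vectors and labels are made up for illustration.
index = [
    ("goal-highlight.mp4 @ 01:32", "video", [0.9, 0.2, 0.1]),
    ("team-photo.jpg",             "image", [0.3, 0.8, 0.2]),
    ("match-report.txt",           "text",  [0.7, 0.3, 0.4]),
]

# Hypothetical embedding of the text query "show me the winning goal".
query = [0.95, 0.15, 0.1]

best = max(index, key=lambda item: cosine(query, item[2]))
print(best[0], best[1])  # goal-highlight.mp4 @ 01:32 video
```

No per-modality retrieval systems, no transcription: the text query lands closest to the video moment because they mean the same thing.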

Logan Kilpatrick of Google DeepMind described the capability on X, saying the model allows developers to “bring text, images, video, audio, and docs into the same embedding space.”

Sam Witteveen, co-founder of AI and ML training company Red Dragon AI, received early access and published a video review. He noted the model’s ability to handle cross-modal retrieval without additional preprocessing pipelines.

The main competitors in the embeddings market include OpenAI, with its widely used text-embedding-3 series, and Cohere and Anthropic, which offer models focused on enterprise search and developer workflows. All three remain largely text-first architectures.

Gemini Embedding 2 is available now in public preview.


This article is a curated summary based on third-party sources.
