The world’s largest open-source multimodal dataset delivers 17x training efficiency, unlocking enterprise AI that combines documents, audio and video

AI models are only as good as the data they are trained on. This data typically must be labeled and organized before models can learn from it effectively.

One of the most significant missing links in the AI ecosystem has been the availability of a large, high-quality, open-source multimodal dataset. That changes today with the debut of the EMM-1 dataset, which consists of 1 billion data pairs and 100 million data groups across five modalities: text, image, video, audio and 3D point clouds. Multimodal datasets combine different kinds of data that AI systems can process together, reflecting the way people perceive the world through multiple senses at once. They enable AI systems to draw richer conclusions by understanding the relationships between data types, rather than processing each modality in isolation.

EMM-1 was developed by data labeling platform provider Encord. The company’s platform enables teams to curate, tag and manage training data at scale using both automated and human-in-the-loop workflows. Alongside the dataset, Encord has developed the EBind training methodology, which prioritizes data quality over raw computational scale. This approach enabled the creation of a compact 1.8 billion-parameter model that matched the performance of models up to 17 times larger, while reducing training time from days to hours on a single GPU rather than clusters of GPUs.


“The biggest trick for us was to focus on the data and make it very, very high quality,” Encord co-founder and CEO Eric Landau told VentureBeat in an exclusive interview. “We were able to achieve the same level of performance as models 20 times larger, not because we were very smart about the architecture, but because we trained it overall on really good data.”

The data quality advantage

According to Landau, the Encord dataset is 100 times larger than the next comparable multimodal dataset. It operates at petabyte scale with terabytes of raw data and over 1 million human annotations.

However, scale alone does not explain the increase in efficiency. Technical innovations are focused on solving what Landau calls an “underappreciated” problem in AI training: data leakage between training and evaluation sets.

“We spent a lot of time on the leakage problem,” Landau explained. “In a lot of datasets, there’s a kind of leakage between different subsets of the data. Leakage actually improves the results. It makes your scores look better. But that’s one thing we’ve been working hard on.”

Data leakage occurs when information from test data inadvertently appears in training data, artificially inflating model performance metrics. Many benchmark datasets suffer from this contamination. Encord implemented hierarchical clustering techniques to ensure clean separation while maintaining a representative distribution across data types. The company also used clustering to reduce bias and ensure diverse representation.
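A cluster-aware train/test split of this kind can be sketched in a few lines: cluster items by embedding similarity, then assign whole clusters to either train or test so that near-duplicates never straddle the boundary. This is a hedged illustration of the general technique, not Encord’s actual pipeline; the cluster count, test fraction and random embeddings are placeholders.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_aware_split(embeddings, test_fraction=0.1, n_clusters=50, seed=0):
    """Assign whole clusters to train or test so near-duplicate items
    never end up on both sides of the split (a common leakage source)."""
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(embeddings)
    rng = np.random.default_rng(seed)
    cluster_order = rng.permutation(n_clusters)

    n_test_target = int(test_fraction * len(embeddings))
    test_clusters, count = set(), 0
    for c in cluster_order:
        if count >= n_test_target:
            break
        test_clusters.add(c)
        count += int(np.sum(labels == c))

    test_mask = np.isin(labels, list(test_clusters))
    return ~test_mask, test_mask

# Tiny demo with random stand-in embeddings.
emb = np.random.default_rng(1).normal(size=(200, 16))
train_mask, test_mask = cluster_aware_split(emb)
```

A naive random split would instead scatter members of the same near-duplicate cluster across both sets, which is exactly the contamination the article describes.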

How EBind increases efficiency

The data quality improvements work together with an architectural approach designed for efficiency.

Encord’s EBind approach extends CLIP (Contrastive Language-Image Pre-training), originally developed by OpenAI, from two to five modalities. CLIP learns to associate images and text in a common representation space, enabling tasks such as searching for images using text descriptions.

Where CLIP learns to associate images and text in a shared latent space, EBind does the same with images, text, audio, 3D point clouds and video.

Parameter efficiency was a priority when choosing the architecture. Instead of implementing separate specialized models for each pair of modalities, EBind uses a single base model with one encoder per modality.

“Other methodologies use a lot of different models and choose the best model to embed these pairs, so the number of parameters tends to explode,” Landau said. “We found that we could use one base model and just train one encoder for each modality, which makes the whole thing very simple and very parameter efficient if we give the whole architecture really good data.”
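The single-base-model idea can be sketched as one small encoder per modality projecting into a shared embedding space, trained with a CLIP-style symmetric contrastive loss. This is an illustrative sketch only, not Encord’s actual EBind implementation; the encoder shapes, modality names and loss details are assumptions.

```python
import numpy as np

def normalize(z):
    # Project rows onto the unit sphere so dot products are cosine similarities.
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

class ModalityEncoder:
    """Stand-in encoder: a random linear projection into the shared space.
    A real system would train a network per modality on top of a shared base."""
    def __init__(self, in_dim, embed_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=(in_dim, embed_dim)) / np.sqrt(in_dim)

    def __call__(self, x):
        return normalize(x @ self.w)

def clip_style_loss(za, zb, temperature=0.07):
    """Symmetric InfoNCE: matched pairs sit on the diagonal of the
    similarity matrix and are treated as the positives."""
    logits = za @ zb.T / temperature
    log_probs_a = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_probs_b = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -(np.mean(np.diag(log_probs_a)) + np.mean(np.diag(log_probs_b))) / 2

# One encoder per modality, all mapping into the same 64-dim shared space.
rng = np.random.default_rng(0)
image_enc = ModalityEncoder(512, seed=1)
audio_enc = ModalityEncoder(128, seed=2)
z_img = image_enc(rng.normal(size=(8, 512)))
z_aud = audio_enc(rng.normal(size=(8, 128)))
loss = clip_style_loss(z_img, z_aud)
```

Because every modality lands in the same space, adding a new modality only adds one encoder rather than a new model per modality pair, which is why the parameter count stays flat.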

The resulting model rivals OmniBind, a much larger competitor in the multimodal space, but requires significantly fewer computational resources for both training and inference. This allows EBind to be deployed in resource-constrained environments, including edge devices for robotics and autonomous systems.

The enterprise value of a multimodal dataset

Multimodal models enable enterprise use cases involving various kinds of data.

Most organizations store different kinds of data in separate systems: documents in content management platforms, audio recordings in communication tools, training videos in learning management systems, and structured data in databases. Multimodal models can search and retrieve information across all of these systems simultaneously.

“Enterprises have all kinds of data. They don’t just have documents. They have audio recordings, training videos and CSV files,” Landau said. “Let’s say you’re a lawyer and you have a case file containing video evidence, as well as documents and recordings, all scattered across multiple data silos. You can use EBind to select all the relevant data and bring it together to search and surface the right data much faster than before.”

The same principle applies across industries. Healthcare professionals can combine patient imaging data with clinical notes and diagnostic audio. Financial services firms can combine transaction records with recordings of compliance calls and customer communications. Manufacturing operations can link equipment sensor data with maintenance video records and inspection reports.
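Conceptually, cross-silo retrieval of this kind reduces to nearest-neighbor search over one shared index: every asset, whatever its modality, is embedded once into the shared space, and a query embedding is compared against all of them. The sketch below assumes pre-computed unit-norm embeddings and invented file names; it is not a specific Encord API.

```python
import numpy as np

def search(index_embeddings, index_labels, query_embedding, top_k=3):
    """Return the top_k assets by cosine similarity to the query.
    Rows of index_embeddings and the query are assumed unit-norm."""
    sims = index_embeddings @ query_embedding
    order = np.argsort(sims)[::-1][:top_k]
    return [(index_labels[i], float(sims[i])) for i in order]

# Toy index of mixed-modality assets (hypothetical names), embedded into
# one shared 32-dim space by their respective encoders.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 32))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
labels = ["contract.pdf", "deposition.wav", "site_video.mp4",
          "email.txt", "damage_scan.pdf"]

# A query that is a slightly perturbed copy of one asset's embedding,
# standing in for "a text query semantically close to that audio clip".
query = emb[1] + 0.05 * rng.normal(size=32)
query /= np.linalg.norm(query)

results = search(emb, labels, query)
```

In the lawyer example from the article, `contract.pdf`, `deposition.wav` and `site_video.mp4` would live in different silos, yet a single query ranks them in one list.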

Outside of office environments, physical AI is the next frontier. Landau pointed to autonomous vehicles that use both visual perception and auditory signals, such as emergency sirens. In manufacturing and warehousing, robots that combine visual recognition with audio feedback and spatial awareness can operate more safely and efficiently than systems using vision alone.

Enterprise use case: Extending computer vision to a multimodal context

Captur AI, an Encord client, illustrates how companies plan to use the dataset for specific business applications. The startup provides on-device image verification for mobile apps, checking photos in real time for authenticity, consistency and quality before upload. The company works with shared mobility providers like Lime and delivery firms to capture billions of photos of packages.

Captur AI processes over 100 million images on device and specializes in distilling models down to 6-10 megabytes so they can run on smartphones without cloud connectivity. However, CEO Charlotte Bax believes that multimodal capabilities are key to expanding into higher-value applications.

“For us, the market is huge. You upload photos for returns and retail sales. You submit photos to insurance companies for claims. You upload photos when you list something on eBay,” Bax told VentureBeat in an exclusive interview. “Some of these use cases are very high risk or high value, if something goes wrong, like insurance, the image only captures part of the context, and the audio can be an important signal.”

Bax cited digital vehicle inspections as a prime example. When customers take photos of auto damage as part of an insurance claim, they often verbally describe what happened while taking the photos. That audio context can significantly improve claim accuracy and reduce fraud.

“When you do this, often the client will actually describe what happened,” Bax said. “Several of our prospective InsurTech customers have asked us if we can actually record the audio as well, as it adds additional context for the user making the claim.”

The challenge is maintaining Captur AI’s core advantage: running models efficiently on the device rather than processing in the cloud. The company plans to use the Encord dataset to train compact multimodal models that retain real-time offline capabilities while adding audio context and sequential imagery.

“The most important thing you can do is try to get as much context as you can,” Bax said. “Can LLMs be made small enough to run on a device in the next three years, or can multimodal models be run on a device? An interesting area is checking data quality before uploading an image.”

What this means for businesses

The Encord results challenge basic assumptions about AI development and suggest that the next competitive battleground may be data operations, not infrastructure scale.

Multimodal datasets open up new possibilities. The ability to train models that understand relationships between data types enables use cases that unimodal systems cannot handle.

Data operations deserve the same investment as computing infrastructure. A 17-fold gain in parameter efficiency through better data curation translates into orders-of-magnitude savings. Organizations that pour resources into GPU clusters while treating data quality as an afterthought may be optimizing for the wrong variable.

For firms building multimodal AI systems, Landau’s assessment reflects a strategic shift.

“We were able to achieve the same level of performance as much larger models, not because we were very clever with the architecture, but because we trained it on really good data,” he said.
