Cohere's new vision model runs on two GPUs, beats top-tier VLMs on visual tasks

The rise of deep research features and other AI-powered analysis has led to more models and services that aim to simplify that process and read more of the documents businesses actually use.

Canadian AI company Cohere is banking on its models, including a newly released visual model, to make the case that deep research features should also be optimized for enterprise use cases.

The company has released Command A Vision, a visual model specifically targeting enterprise use cases, built on the back of its Command A model. According to the company, the 112-billion-parameter model can unlock valuable insights from visual data and support highly accurate, data-driven decisions through optical character recognition (OCR) and image analysis.

In a blog post, the company said Command A Vision excels at the most demanding enterprise vision challenges, whether that means interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes for risk detection.

This means that Command A Vision can read and analyze the most common types of images enterprises need: graphs, charts, diagrams, scanned documents and PDFs.

@cohere just dropped Command A Vision on @huggingface
Designed for enterprise multimodal use cases: interpreting product manuals, analyzing photos, asking about charts …
112B dense vision-language model with SOTA performance, check out the benchmark metrics in … pic.twitter.com/ormfm5f8cf
– Jeff Boudier (@JeffBoudier) July 31, 2025

Because it is built on Command A's architecture, Command A Vision requires two or fewer GPUs, just like the text model. The vision model also retains Command A's text capabilities, so it can read words in images, and it understands at least 23 languages. Cohere said that, unlike other models, Command A Vision reduces the total cost of ownership for enterprises and is fully optimized for business use.
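
As a rough sanity check on the two-GPU claim, the sketch below estimates weight memory for a 112-billion-parameter model at a few precisions. The 80 GB per-GPU capacity and the choice of precisions are assumptions made for illustration, not figures from Cohere, and the estimate ignores activations, the KV cache and runtime overhead.

```python
PARAMS = 112e9  # parameter count reported in the article

# Bytes per parameter at a few common precisions (illustrative choices).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def fits(params: float, precision: str, gpus: int = 2, gpu_gb: float = 80.0) -> bool:
    """True if the weights alone fit in the combined GPU memory.

    gpu_gb is an assumed per-card capacity, not a number from the article.
    """
    return weight_memory_gb(params, BYTES_PER_PARAM[precision]) <= gpus * gpu_gb


for precision, nbytes in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(PARAMS, nbytes)
    print(f"{precision}: {gb:.0f} GB of weights, fits on 2 x 80 GB: {fits(PARAMS, precision)}")
```

Under these assumptions, 16-bit weights alone would exceed two 80 GB cards, so a two-GPU deployment would imply reduced-precision weights or cards with more memory.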

How Cohere is architecting Command A

Cohere said it followed a LLaVA architecture to build its Command A models, including the visual model. This architecture turns visual features into soft vision tokens, which can be divided into different tiles.

These tiles are passed into the Command A text tower, which the company described as a dense, 111-billion-parameter text LLM. In this way, the company said, a single image consumes up to 3,328 tokens.

Cohere said it trained the visual model in three stages: vision-language alignment, supervised fine-tuning (SFT) and post-training reinforcement learning with human feedback (RLHF).

According to the company, the alignment stage maps the image encoder's features to the language model's embedding space. In the SFT stage, by contrast, it trained the vision encoder, the vision adapter and the language model simultaneously on a diverse set of multimodal tasks.
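
The 3,328-token ceiling is easier to reason about with a little arithmetic. The sketch below assumes, purely for illustration, that each tile contributes a fixed 256 soft tokens; that per-tile figure is not stated in this article. The 128,000-token context window and the 2,000-token text reserve in the usage lines are likewise hypothetical.

```python
MAX_TOKENS_PER_IMAGE = 3328  # per-image ceiling quoted by Cohere
TOKENS_PER_TILE = 256        # assumption for illustration, not from the article


def max_tiles(max_tokens: int = MAX_TOKENS_PER_IMAGE,
              per_tile: int = TOKENS_PER_TILE) -> int:
    """Number of whole tiles that fit under the per-image token ceiling."""
    return max_tokens // per_tile


def images_that_fit(context_window: int,
                    reserved_for_text: int = 0,
                    per_image: int = MAX_TOKENS_PER_IMAGE) -> int:
    """Worst-case count of images that fit once text tokens are set aside."""
    budget = max(context_window - reserved_for_text, 0)
    return budget // per_image


print(max_tiles())                      # tiles per image under the assumption
print(images_that_fit(128_000, 2_000))  # hypothetical window and text reserve
```

Budgeting against the ceiling rather than an average keeps the estimate conservative, since smaller images would consume fewer tokens.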

Visualizing enterprise AI

Benchmark tests showed Command A Vision outperforming other models with similar visual capabilities.

Cohere pitted Command A Vision against OpenAI's GPT 4.1, Meta's Llama 4 Maverick, and Mistral's Pixtral Large and Mistral Medium 3 in nine benchmark tests. The company did not mention whether it tested the model against Mistral's OCR-focused API, Mistral OCR.

It enables agents to securely see inside your organization's visual data, unlocking the automation of tedious tasks involving slides, diagrams, PDFs and photos. pic.twitter.com/ihznuwekrk
– Cohere (@cohere) July 31, 2025

Command A Vision outscored the other models on tests such as ChartQA, OCRBench, AI2D and TextVQA. Overall, Command A Vision had an average score of 83.1%, compared with 78.6% for GPT 4.1, 80.5% for Llama 4 Maverick and 78.3% for Mistral Medium 3.
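
To put the averages side by side, the short sketch below computes Command A Vision's lead over each competitor in percentage points. It uses only the four averages reported above; the function name and data layout are illustrative.

```python
# Average benchmark scores (percent) as reported in the article.
AVERAGE_SCORES = {
    "Command A Vision": 83.1,
    "GPT 4.1": 78.6,
    "Llama 4 Maverick": 80.5,
    "Mistral Medium 3": 78.3,
}


def margins(scores: dict, baseline: str = "Command A Vision") -> dict:
    """Lead of the baseline over every other model, in percentage points."""
    base = scores[baseline]
    return {
        name: round(base - score, 1)
        for name, score in scores.items()
        if name != baseline
    }


for name, lead in margins(AVERAGE_SCORES).items():
    print(f"Command A Vision leads {name} by {lead} points")
```

These are differences between averages only; the article does not give per-benchmark scores, so the sketch says nothing about how the lead varies from test to test.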

Most large language models (LLMs) today are multimodal, meaning they can generate or understand visual media such as photos or videos. However, enterprises generally work with more graphical documents, such as charts and PDFs, so extracting information from these unstructured data sources often proves difficult.

With deep research on the rise, the importance of models capable of reading, analyzing and even downloading unstructured data has grown.

Cohere also said it is offering Command A Vision as an open-weights model, in the hope that enterprises looking to move away from closed or proprietary models will start using its products. So far, developers have shown interest.

Very impressed by its accuracy at extracting handwritten notes from an image!
– Adam Sardo (@sardo_adam) July 31, 2025

Finally, an AI that won't judge my terrible doodles.
– Martha Wisener (@MartWisener) August 1, 2025
