Cohere's new vision model runs on two GPUs, outperforms top-tier VLMs on visual tasks



The rise of deep research features and other AI-powered analysis has led to more models and services that aim to simplify that process and read more of the documents businesses actually use.

Canadian AI company Cohere is banking on its models, including a newly released visual model, to make the case that deep research features should also be optimized for enterprise use cases.


The company released Command A Vision, a visual model built on the back of its Command A model and aimed specifically at enterprise use cases. According to the company, the 112-billion-parameter model can unlock valuable insights from visual data and support highly accurate, data-driven decisions through optical character recognition (OCR) and image analysis.

Whether the task is interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes for risk detection, Command A Vision excels at tackling the most demanding enterprise vision challenges, the company said in a blog post.




This means Command A Vision can read and analyze the most common types of images enterprises need: graphs, charts, diagrams, scanned documents and PDFs.

Because it is built on Command A's architecture, Command A Vision requires two or fewer GPUs, just like the text model. The vision model also retains Command A's text capabilities, so it can read words in images, and it understands at least 23 languages. Cohere said that, unlike other models, Command A Vision reduces the total cost of ownership for enterprises and is fully optimized for enterprise use.

How Cohere architected Command A Vision

Cohere said it followed a LLaVA architecture to build its Command A models, including the visual model. This architecture converts visual features into soft vision tokens, which can be divided into different tiles.

These tiles are passed into the Command A text tower, which the company described as a dense text LLM with 111 billion parameters. In this way, a single image consumes up to 3,328 tokens, the company said.
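The token arithmetic can be sketched in a few lines of Python. This is only an illustration: the article gives the upper bound of 3,328 tokens per image and nothing more, so the per-tile token count (256) and the maximum tile count (13) below are assumptions chosen because their product matches that figure. The function names are hypothetical.

```python
# Stated in the article: a single image consumes up to 3,328 tokens.
MAX_IMAGE_TOKENS = 3328

# ASSUMPTIONS (not from the article): chosen so that 13 * 256 == 3328.
TOKENS_PER_TILE = 256
MAX_TILES = 13


def image_token_cost(num_tiles: int) -> int:
    """Return the soft-token cost of an image split into num_tiles tiles.

    Tile counts above MAX_TILES are clamped, so the result never exceeds
    MAX_IMAGE_TOKENS.
    """
    if num_tiles < 1:
        raise ValueError("an image needs at least one tile")
    return min(num_tiles, MAX_TILES) * TOKENS_PER_TILE


def images_that_fit(context_budget: int, num_tiles: int = MAX_TILES) -> int:
    """Return how many images of the given tile count fit in a token budget."""
    return context_budget // image_token_cost(num_tiles)
```

Under these assumptions, a prompt made up of full-size images spends its context budget in steps of 3,328 tokens, which is worth keeping in mind when batching many scanned pages into one request.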

Cohere said it trained the visual model in three stages: vision-language alignment, supervised fine-tuning (SFT) and post-training reinforcement learning with human feedback (RLHF).

The first stage makes it possible to map the image encoder's features into the language model's space, the company said. During the SFT stage, by contrast, Cohere simultaneously trained the vision encoder, the vision adapter and the language model on a variety of multimodal tasks.
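The three stages can be summarized as a small schedule of which components get updated. This is a rough sketch, not Cohere's training code. Only the SFT row comes from the company's statement. The alignment row assumes the common LLaVA-style choice of training just the adapter, and the RLHF row is left unspecified because the article does not say. All names are hypothetical.

```python
from typing import Optional

COMPONENTS = ("vision_encoder", "vision_adapter", "language_model")

# Stage -> components updated in that stage, or None where the article is silent.
TRAINING_STAGES: dict[str, Optional[frozenset[str]]] = {
    # ASSUMPTION: adapter-only training, as is common in LLaVA-style alignment.
    "alignment": frozenset({"vision_adapter"}),
    # Stated by the company: all three components are trained together.
    "sft": frozenset(COMPONENTS),
    # Not specified in the article.
    "rlhf": None,
}


def frozen_components(stage: str) -> Optional[frozenset[str]]:
    """Return the components kept frozen in a stage, or None if unknown."""
    trained = TRAINING_STAGES[stage]
    if trained is None:
        return None
    return frozenset(COMPONENTS) - trained
```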

Visualizing enterprise AI

Benchmark tests showed Command A Vision outperforming other models with similar visual capabilities.

Cohere pitted Command A Vision against OpenAI's GPT 4.1, Meta's Llama 4 Maverick, Mistral's Pixtral Large and Mistral Medium 3 in nine benchmark tests. The company did not mention whether it tested the model against Mistral's OCR-focused API, Mistral OCR.

Command A Vision outscored the other models in tests such as ChartQA, OCRBench, AI2D and TextVQA. Overall, Command A Vision had an average score of 83.1%, compared with 78.6% for GPT 4.1, 80.5% for Llama 4 Maverick and 78.3% for Mistral Medium 3.
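For readers who want the gaps spelled out, the short Python sketch below computes Command A Vision's lead in percentage points from the averages quoted above. Pixtral Large is left out because the article gives no average for it. The helper name is hypothetical.

```python
# Average benchmark scores (percent) as reported in the article.
AVERAGE_SCORES = {
    "Command A Vision": 83.1,
    "GPT 4.1": 78.6,
    "Llama 4 Maverick": 80.5,
    "Mistral Medium 3": 78.3,
}


def margins_over(baseline: str, scores: dict[str, float]) -> dict[str, float]:
    """Return the baseline's lead over every other model, in percentage points."""
    base = scores[baseline]
    # Round to one decimal to hide floating-point noise in the subtraction.
    return {
        name: round(base - score, 1)
        for name, score in scores.items()
        if name != baseline
    }


margins = margins_over("Command A Vision", AVERAGE_SCORES)
for name, margin in sorted(margins.items(), key=lambda item: item[1]):
    print(f"Lead over {name}: {margin} points")
```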

Most large language models (LLMs) these days are multimodal, meaning they can generate or understand visual media such as photos or videos. However, enterprises generally use more graphical documents, such as charts and PDFs, so extracting information from these unstructured data sources often proves difficult.

With deep research on the rise, the importance of bringing in models capable of reading, analyzing and even downloading unstructured data has grown.

Cohere also said it is offering Command A Vision in an open-weights system, in the hope that enterprises looking to move away from closed or proprietary models will start using its products. So far, there has been some interest from developers.
