From hallucinations to hardware: lessons from a real computer vision project gone sideways



Computer vision projects rarely go exactly as planned, and this one was no exception. The idea was simple: build a model that could look at a photo of a laptop and detect any physical damage – cracked screens, missing keys, broken hinges. It seemed like a straightforward use case for vision models and large language models (LLMs), but it quickly turned into something more complicated.

Along the way, we ran into hallucinations, unreliable outputs, and images that weren’t even laptops. To solve these problems, we ended up using agentic frameworks in an unusual way – not to automate tasks, but to improve the model’s performance.


In this post, we’ll walk through what we tried, what didn’t work, and how a combination of approaches finally helped us build something reliable.

Where we began: the monolithic prompt

Our initial approach was fairly standard for a multimodal model. We used a single, large prompt: we passed the image to an image-capable LLM and asked it to identify any visible damage. This monolithic prompt strategy is simple to implement and works decently on clean, well-defined tasks. But real-world data is rarely that cooperative.
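
To make the setup concrete, here is a minimal sketch of what a monolithic-prompt request looks like. The prompt wording and the OpenAI-style message format are illustrative assumptions, not the exact prompt used in the project:

```python
import base64

# Illustrative single prompt asking for all damage in one shot.
DAMAGE_PROMPT = (
    "You are inspecting a photo of a laptop. List every piece of visible "
    "physical damage (cracked screen, missing keys, broken hinge, dents). "
    "If no damage is visible, say so. Respond as a JSON list."
)

def build_damage_request(image_bytes: bytes, model: str = "gpt-4o") -> dict:
    """Build one chat request that sends the prompt and the image together."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": DAMAGE_PROMPT},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
    }
```

The appeal is obvious: one request, one response, no orchestration. The weakness, as we found, is that everything rides on a single model call.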

We quickly ran into three major problems:

  • Hallucinations: The model sometimes invented damage that didn’t exist or misinterpreted what it saw.
  • Junk image detection: There was no reliable way to flag images that weren’t laptops at all – photos of desks, walls, or people occasionally slipped through and produced nonsensical damage reports.
  • Inconsistent accuracy: The combination of these problems made the model too unreliable for operational use.

That was the point where it became clear we would need to iterate.

The first fix: mixing image resolutions

One of the first things we noticed was how much image quality affected the model’s output. Users submitted all kinds of images, from sharp and high-resolution to blurry. This led us to benchmark studies highlighting how image resolution affects deep learning models.

We trained and tested the model with a mix of high- and low-resolution images. The idea was to make the model more robust to the wide range of image quality it would encounter in practice. This improved consistency, but the core problems of hallucination and junk handling remained.
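
The augmentation idea can be sketched in a few lines. Here images are plain 2D lists of grayscale values to keep the example self-contained; a real pipeline would use PIL or torchvision, but the mixing logic is the same:

```python
import random

def box_downscale(img, factor):
    """Downscale by averaging non-overlapping factor x factor pixel blocks."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(0, h - factor + 1, factor):
        row = []
        for x in range(0, w - factor + 1, factor):
            block = [img[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def mix_resolutions(images, low_res_fraction=0.5, factor=2, seed=0):
    """Return a training set where a fraction of images are degraded copies,
    so the model sees both sharp and low-quality inputs."""
    rng = random.Random(seed)
    return [
        box_downscale(img, factor) if rng.random() < low_res_fraction else img
        for img in images
    ]
```

The exact degradation (downscaling, blur, JPEG artifacts) is a tunable choice; the point is that the training distribution should resemble what users actually upload.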

A multimodal detour: text-only LLMs

Encouraged by recent experiments combining image captions with text-only LLMs – a technique in which captions are generated from the images and then interpreted by a language model – we decided to give it a try.

Here’s how it works:

  • The LLM starts by generating several candidate captions for the image.
  • A second model, a multimodal embedding model, scores how well each caption matches the image. In our case, we used SigLIP to compute image–text similarity.
  • The system keeps the top-scoring captions based on these scores.
  • The LLM uses those top captions to write new ones, trying to get closer to what the image actually shows.
  • The process repeats until the captions stop improving or an iteration limit is reached.

Although elegant in theory, this approach introduced new problems for our use case:

  • Persistent hallucinations: The captions themselves sometimes included imagined damage, which the LLM then confidently repeated.
  • Incomplete coverage: Even with multiple captions, some issues were missed entirely.
  • More complexity, little gain: The added steps made the system more complicated without reliably outperforming the previous setup.

It was an interesting experiment, but ultimately not a solution.

A creative use of agentic frameworks

This was the turning point. While agentic frameworks are typically used to orchestrate tasks (think agents coordinating calendar invites or customer support), we wondered whether breaking the image-interpretation problem into smaller, specialized agents would help.

We built an agentic structure like this:

  • Orchestrator agent: Inspected the image and identified which laptop components were visible (screen, keyboard, chassis, ports).
  • Component agents: Dedicated agents checked each component for specific damage types – for example, one for cracked screens, another for missing keys.
  • Junk detection agent: A separate agent flagged whether the image was a laptop at all.

This modular, task-based approach produced far more precise and explainable results. Hallucinations dropped dramatically, junk images were reliably flagged, and each agent’s task was simple and focused enough to keep quality under control.
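
The structure can be sketched as follows. The agents are plain functions operating on a dict here; in the real system each one wraps its own narrowly scoped LLM call. All names and the dict fields are illustrative:

```python
def junk_agent(image):
    """Flag images that do not show a laptop at all."""
    return image.get("subject") != "laptop"

def orchestrator_agent(image):
    """Identify which laptop components are visible in the image."""
    return image.get("visible_components", [])

# One dedicated checker per component, each looking for specific damage types.
COMPONENT_AGENTS = {
    "screen": lambda img: (
        ["cracked screen"] if "crack" in img.get("defects", []) else []),
    "keyboard": lambda img: (
        ["missing key"] if "missing_key" in img.get("defects", []) else []),
}

def inspect(image):
    """Run junk detection, then dispatch the relevant component agents."""
    if junk_agent(image):
        return {"junk": True, "damage": []}
    findings = []
    for component in orchestrator_agent(image):
        agent = COMPONENT_AGENTS.get(component)
        if agent:
            findings.extend(agent(image))
    return {"junk": False, "damage": findings}
```

Because each agent only answers one narrow question, its prompt is short, its output is easy to validate, and a hallucination in one agent cannot contaminate the others.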

Blind spots: the trade-offs of the agentic approach

Effective as it was, it wasn’t perfect. Two major limitations emerged:

  • Increased latency: Running multiple sequential agents added to the total processing time.
  • Coverage gaps: The agents could only detect issues they were explicitly designed for. If an image showed something unexpected that no agent was tasked with identifying, it could go unnoticed.

We needed a way to balance precision with coverage.

The hybrid solution: combining agentic and monolithic approaches

To fill the gaps, we created a hybrid system:

  1. The agentic framework ran first, handling precise detection of known damage types and junk images. We limited the number of agents to the most important ones to keep latency down.
  2. Then a monolithic image LLM scanned the picture for anything the agents might have missed.
  3. Finally, we fine-tuned the model on a curated set of images for high-priority use cases, such as frequently reported damage scenarios, to further improve accuracy and reliability.
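
Steps 1 and 2 compose naturally into a single pipeline. In this sketch, `agentic_pass` and `monolithic_pass` are placeholders for the two model stages described above:

```python
def hybrid_inspect(image, agentic_pass, monolithic_pass):
    """Agentic pass first (precise, known damage types and junk filtering),
    then a monolithic sweep for anything outside the agents' scope."""
    agent_result = agentic_pass(image)
    if agent_result.get("junk"):
        # No point running the broad sweep on a non-laptop image.
        return {"junk": True, "damage": []}
    # Keep only sweep findings the agents did not already report.
    known = set(agent_result["damage"])
    extras = [d for d in monolithic_pass(image) if d not in known]
    return {"junk": False, "damage": agent_result["damage"] + extras}
```

Running the junk check and the precise agents before the broad sweep also means the more expensive monolithic call is skipped entirely for invalid images.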

This combination gave us the precision and explainability of the agentic setup, the broad coverage of the monolithic prompt, and the added confidence of targeted fine-tuning.

What we learned

By the time we finished this project, a few things had become clear:

  • Agentic frameworks are more versatile than they get credit for: Although they are usually associated with workflow management, we found they can significantly boost model performance when applied in a structured, modular way.
  • Mixing approaches beats betting on one: Combining precise agent-based detection with the broad coverage of an LLM, plus a bit of fine-tuning where it mattered, gave us far more reliable results than any single method alone.
  • Vision models are prone to hallucination: Even sophisticated setups can infer or “see” things that aren’t there. Containing those errors requires thoughtful system design.
  • Image quality variety makes a difference: Training and testing with both crisp, high-resolution images and everyday, lower-quality ones helped make the model resilient to unpredictable real-world photos.
  • You need a way to catch junk images: A dedicated check for junk or unrelated photos was one of the simplest changes we made, and it had an outsized impact on the overall reliability of the system.

Final thoughts

What began as a simple idea – using an LLM prompt to detect physical damage in laptop images – quickly turned into a much deeper experiment in combining different AI techniques to solve unpredictable problems. Along the way, we realized that some of the most useful tools weren’t originally designed for this kind of work.

Agentic frameworks, often seen as workflow tools, proved surprisingly effective when repurposed for tasks like structured damage detection and image filtering. With a little creativity, they helped us build a system that was not only more accurate, but also easier to understand and maintain in practice.
