As LLM solutions continue to improve, there is an ongoing industry debate about whether stand-alone data labeling tools are still needed, since LLM tools can increasingly work with all sorts of data. HumanSignal, the commercial company behind the open-source Label Studio software, takes a different view. Instead of seeing less need for data labeling, the company sees more.
Earlier this month, HumanSignal acquired Erud AI and launched physical Frontier Data Labs to collect frontier data. But creating data is only half the challenge. Now the company is focused on what comes next: proving that the AI systems trained on that data actually work. New multimodal agent evaluation capabilities enable enterprises to validate complex AI agents that generate applications, images, code and video.
“If you focus on enterprise segments, all the AI solutions they create still require evaluation, which is just another term for labeling data by humans, not to mention experts,” HumanSignal co-founder and CEO Michael Malyuk told VentureBeat in an exclusive interview.
The intersection of data labeling and AI agent evaluation
Having the right data is great, but it is not the ultimate goal for a business. Evaluation is where modern data labeling is heading.
This marks a fundamental change in what enterprises need to validate: not whether a model correctly classified an image, but whether an AI agent made good decisions across a complex, multi-step task involving reasoning, tool use and code generation.
If evaluation is essentially labeling AI output, then the move from models to agents represents a step change in what needs to be labeled. Where traditional data labeling might involve tagging images or categorizing text, agent evaluation requires assessing multi-step reasoning chains, tool selection decisions and multimodal outputs – all within a single interaction.
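To make that shift concrete, here is a minimal sketch of what a single agent trace might look like as a reviewable object. The schema and field names are illustrative assumptions for this article, not HumanSignal's actual data model.

```python
# Illustrative sketch only: a minimal agent-trace schema (hypothetical
# field names), showing the units a human evaluator would score.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolCall:
    tool_name: str                 # e.g. "sql_query" or "image_gen"
    arguments: dict[str, Any]      # the inputs the agent chose
    result_summary: str            # what came back, summarized for review

@dataclass
class AgentStep:
    reasoning: str                 # the agent's stated rationale
    tool_calls: list[ToolCall] = field(default_factory=list)
    output_modality: str = "text"  # "text", "image", "code", "video"

@dataclass
class AgentTrace:
    task: str
    steps: list[AgentStep]
    final_output: str

    def review_items(self) -> list[str]:
        """Flatten the trace into the units a reviewer scores:
        each reasoning step, each tool choice, and the final output."""
        items = []
        for i, step in enumerate(self.steps):
            items.append(f"step {i}: reasoning -> {step.reasoning}")
            for call in step.tool_calls:
                items.append(f"step {i}: tool {call.tool_name} -> {call.result_summary}")
        items.append(f"final: {self.final_output}")
        return items
```

Even this toy structure shows why the work differs from classic labeling: a single interaction yields many distinct judgments rather than one label.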
“There is a very strong need for not just a human in the loop, but an expert in the loop,” Malyuk said. He pointed to high-stakes applications such as healthcare and legal advice, where the cost of errors remains high.
The connection between data labeling and AI evaluation goes deeper than semantics. Both activities require the same basic skills:
- Structured interfaces for human judgment: Whether reviewers are labeling images for training data or assessing whether an agent properly orchestrated multiple tools, they need purpose-built interfaces to record their evaluations systematically.
- Consensus across multiple reviewers: High-quality training datasets require multiple labelers and a way to reconcile disagreements. High-quality evaluation requires the same: multiple experts assessing outputs and resolving differences in judgment (a consensus sketch follows this list).
- Domain knowledge at scale: Training modern AI systems requires subject matter experts, not just generalist annotators clicking buttons. Assessing AI performance in production demands the same depth of expertise.
- Feedback that flows back into AI systems: Labeled training data is the raw material of model development. Evaluation data enables continuous improvement, tuning and comparison.
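As a simple illustration of the consensus mechanic shared by labeling and evaluation, the following sketch applies majority voting across reviewers and flags low-agreement items for expert adjudication. The function and threshold are hypothetical, not any vendor's API.

```python
# Illustrative sketch: majority-vote consensus over multiple reviewers,
# flagging items whose agreement falls below a threshold so an expert
# can break the tie.
from collections import Counter

def consensus(labels_per_item: dict[str, list[str]], min_agreement: float = 0.75):
    """labels_per_item maps an item id to the labels from each reviewer.
    Returns (resolved, disputed): items with a clear majority label,
    and items that need expert adjudication."""
    resolved, disputed = {}, {}
    for item_id, labels in labels_per_item.items():
        top_label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            resolved[item_id] = top_label
        else:
            disputed[item_id] = labels
    return resolved, disputed

# Example: three reviewers judge whether an agent's tool choice was correct.
reviews = {
    "trace-001": ["correct", "correct", "correct"],
    "trace-002": ["correct", "incorrect", "incorrect"],
}
resolved, disputed = consensus(reviews)
print(resolved)  # {'trace-001': 'correct'}
print(disputed)  # {'trace-002': ['correct', 'incorrect', 'incorrect']}
```

The same loop works whether the items are training images or production agent traces, which is precisely the overlap the company is betting on.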
Evaluating full agent traces
The challenge in evaluating agents is not just the volume of data but the complexity of what must be assessed. Agents do not produce simple text outputs; they generate reasoning chains, select tools and create artifacts across multiple modalities.
New capabilities in Label Studio Enterprise address agent validation requirements:
- Multimodal trace inspection: The platform provides a unified interface for viewing an agent’s complete execution trace – reasoning steps, tool calls and outputs across modalities. This solves the common problem of teams having to piece together separate log streams.
- Interactive multi-turn evaluation: Reviewers assess conversational flows in which agents maintain state over multiple turns, verifying context tracking and intent interpretation across the interaction sequence (a config sketch follows this list).
- Agent Arena: A comparative evaluation framework for testing different agent configurations (base models, prompt templates, guardrail implementations) under equivalent conditions (an Elo-style scoring sketch also follows the list).
- Flexible assessment rubrics: Teams define domain-specific evaluation criteria programmatically rather than relying on pre-defined metrics, supporting requirements such as accuracy of understanding, adequacy of responses or quality of results for specific use cases.
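To illustrate what a programmatic, domain-specific rubric might look like, here is a sketch of a Label Studio-style XML labeling config held in a Python string, rendering a multi-turn dialogue next to rubric fields. The tags shown (Paragraphs, Choices, Rating, TextArea) are standard Label Studio config tags, but the specific criteria are hypothetical examples, not anything HumanSignal is confirmed to ship.

```python
# Sketch of a Label Studio-style labeling config for multi-turn agent
# evaluation. The rubric criteria here are hypothetical; real teams
# would define their own domain-specific fields.
RUBRIC_CONFIG = """
<View>
  <Paragraphs name="dialog" value="$conversation"
              layout="dialogue" nameKey="role" textKey="content"/>

  <Header value="Did the agent track context across turns?"/>
  <Choices name="context_tracking" toName="dialog" choice="single" required="true">
    <Choice value="Maintained"/>
    <Choice value="Partially maintained"/>
    <Choice value="Lost"/>
  </Choices>

  <Header value="Overall response adequacy"/>
  <Rating name="adequacy" toName="dialog" maxRating="5"/>

  <Header value="Notes for adjudication"/>
  <TextArea name="notes" toName="dialog" placeholder="Explain low scores..."/>
</View>
"""
```

The point of config-as-code rubrics is that the evaluation criteria live alongside the rest of the team's tooling and can change as the agent's failure modes change.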
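As a sketch of how arena-style, head-to-head comparison can be scored, here is an Elo-like rating update driven by reviewer preferences. Elo is a common technique for comparative evaluation of model configurations; the article does not specify HumanSignal's actual scoring method, so treat this as one plausible approach.

```python
# Illustrative sketch of arena-style pairwise scoring: an Elo-like
# update after expert reviewers pick the better of two agent
# configurations on the same task.
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head judgment."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Two agent configs start at 1000; reviewers prefer config A twice, B once.
a, b = 1000.0, 1000.0
for a_won in (True, True, False):
    a, b = elo_update(a, b, a_won)
print(round(a), round(b))  # A ends above B after going 2-1 in its favor
```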
Agent evaluation is the new battleground for data labeling providers
HumanSignal is not alone in recognizing that agent evaluation is the next phase of the data labeling market. Competitors are making similar moves as the industry responds to technological change and market disruption.
Labelbox launched its Evaluation Lab in August 2025, focusing on rubric-based evaluations. Like HumanSignal, the company is moving beyond traditional data labeling and into validating production AI.
The broader competitive landscape for data labeling shifted dramatically in June, when Meta invested $14.3 billion for a 49% stake in Scale AI, the former market leader. The deal triggered an exodus of some of Scale’s largest customers. HumanSignal capitalized on the situation, and Malyuk said his company won a number of competitive deals last quarter. He cites platform maturity, configuration flexibility and customer support as differentiators, although competitors make similar claims.
What does this mean for AI builders?
For enterprises building production AI systems, the convergence of data labeling and evaluation infrastructure has several strategic implications:
Start with ground truth. Investing in high-quality labeled datasets, with multiple expert reviewers to resolve disputes, pays off across the entire AI development lifecycle, from initial training to continuous production improvement.
Observability is necessary but not sufficient. While monitoring the performance of AI systems remains essential, observability tools measure activity, not quality. Enterprises need dedicated evaluation infrastructure to assess outputs and drive improvements. These are separate problems requiring different capabilities.
The training data infrastructure also serves as an evaluation infrastructure. Organizations that have invested in data labeling platforms for model development can extend the same infrastructure for production evaluation. These are not separate problems requiring separate tools – they are the same basic workflow applied at different stages of the lifecycle.
For enterprises deploying AI at scale, the bottleneck has shifted from building models to verifying them. Organizations that recognize this shift early will gain an advantage in shipping production AI systems.
The key question for enterprises has evolved: not whether AI systems are advanced enough, but whether organizations can systematically demonstrate that they meet quality requirements in specific, high-stakes domains.
