The intelligence of AI models is not what holds back enterprise implementations. It is primarily the inability to define and measure quality.
This is where AI judges now play an increasingly vital role. In AI evaluation, a “judge” is an AI system that evaluates the output of another AI system.
Judge Builder is Databricks’ judge-creation framework, first rolled out as part of the company’s Agent Bricks technology earlier this year. It has evolved significantly since launch in response to direct user feedback and real-world deployments.
Early releases focused on technical implementation, but customer feedback revealed that the real bottleneck was organizational alignment. Databricks now offers a structured workshop process that guides teams through three core challenges: getting stakeholders to agree on quality criteria, capturing domain expertise from a limited pool of experts, and deploying evaluation systems at scale.
“Model intelligence is usually not the bottleneck, models are truly intelligent,” Jonathan Frankle, chief artificial intelligence scientist at Databricks, told VentureBeat during an exclusive briefing. “Instead, it’s really about the question of how do we get the models to do what we want them to do, and how do we know if they did what we wanted them to do?”
AI evaluation’s “Ouroboros problem”
Judge Builder addresses what Pallavi Koppol, a Databricks scientist who led development, calls the “Ouroboros problem.” Ouroboros is an ancient symbol depicting a snake eating its own tail.
Using AI systems to judge other AI systems creates a circular validation challenge.
“You want the judge to check if your system is good, if your AI system is good, but then your judge is also an AI system,” Koppol explained. “And now you say, well, how do I know this judge is good?”
The solution is to measure “distance to expert truth” as the primary scoring function. By minimizing the difference between how an AI judge evaluates results and how domain experts would evaluate them, organizations can trust these judges as scalable proxies for human evaluation.
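As an illustration of that idea, here is a minimal sketch of a “distance to expert truth” score: the mean absolute gap between a judge’s ratings and expert ratings on the same outputs. The function and variable names are hypothetical, not Databricks’ actual API.

```python
def distance_to_expert_truth(judge_scores, expert_scores):
    """Mean absolute difference between judge and expert ratings.

    Lower is better: a judge with a distance near zero can stand in
    for human review at scale.
    """
    assert len(judge_scores) == len(expert_scores)
    total = sum(abs(j - e) for j, e in zip(judge_scores, expert_scores))
    return total / len(judge_scores)

# Example: expert ratings on a 1-5 scale vs. a candidate judge's ratings.
experts = [5, 4, 1, 3, 5]
judge = [5, 3, 1, 3, 4]
print(distance_to_expert_truth(judge, experts))  # 0.4
```

A team would compute this over a held-out set of expert-labeled examples and iterate on the judge’s prompt until the distance is acceptably small.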
This approach differs fundamentally from traditional one-size-fits-all guardrail systems or single-metric evaluations. Instead of asking whether an AI output passes some generic quality bar, Judge Builder creates highly specific evaluation criteria tailored to an organization’s expertise and business requirements.
The framework is also distinguished by its technical integration. Judge Builder works with MLflow and prompt-optimization tools and can use any base model. Teams can version-control their judges, track performance over time, and deploy multiple judges concurrently across different quality dimensions.
Lessons learned: building judges that actually work
Databricks’ work with enterprise clients has revealed three key takeaways that apply to anyone evaluating AI.
Lesson one: Your experts don’t agree as much as you think. When quality is subjective, organizations find that even their experts disagree on what constitutes an acceptable result. A customer support response may be factual but delivered in an inappropriate tone. A financial summary may be comprehensive but too technical for its audience.
“One of the biggest lessons from this whole process is that all problems become people problems,” Frankle said. “The hardest thing is to translate an idea from the human brain into something unambiguous. And the hardest thing is that companies are not one brain, but many brains.”
The fix is to use aggregated annotations with inter-rater reliability checks. Teams annotate examples in small batches, then measure agreement before continuing, which surfaces misalignment early. In one case, three experts gave the same output ratings of 1, 5 and neutral before discussion revealed that they were interpreting the rating criteria differently.
Companies using this approach achieve inter-rater reliability scores around 0.6, compared with the typical 0.3 from third-party annotation services. Greater consistency translates directly into better judge performance, because the training data contains less noise.
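Inter-rater reliability of the kind cited above is commonly measured with Cohen’s kappa, which corrects raw agreement for chance. A self-contained sketch for two raters over categorical labels (the data and labels are illustrative):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: 1.0 is perfect agreement, 0.0 is chance level."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled the same.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently at their own rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["pass", "pass", "fail", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(a, b), 2))  # 0.67
```

A kappa near 0.6 or above is generally treated as substantial agreement; scores near 0.3 indicate the raters are not yet working from a shared definition of quality.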
Lesson two: Split unclear criteria into specific judges. Instead of one judge assessing whether a response is “relevant, factual and concise,” create three separate judges, each focused on a single aspect of quality. This granularity matters because a poor “overall quality” score reveals that something is wrong but not what needs to be fixed.
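One way to structure such a split, as a hedged sketch: `call_llm` is a hypothetical stand-in for whatever model API a team uses, and the judge prompts are illustrative, not Databricks’ actual criteria.

```python
# Three narrow judges instead of one vague "overall quality" judge.
JUDGE_PROMPTS = {
    "relevance": "Does the response address the user's question? Answer PASS or FAIL.",
    "factuality": "Is every claim supported by the provided context? Answer PASS or FAIL.",
    "conciseness": "Is the response free of filler and repetition? Answer PASS or FAIL.",
}

def run_judges(question, context, response, call_llm):
    """Run each judge independently and collect per-dimension verdicts."""
    verdicts = {}
    for name, criterion in JUDGE_PROMPTS.items():
        prompt = (
            f"{criterion}\n\nQuestion: {question}\n"
            f"Context: {context}\nResponse: {response}"
        )
        verdicts[name] = call_llm(prompt).strip().upper()
    return verdicts
```

Because each verdict is independent, a failure points directly at the dimension that needs fixing rather than at an opaque overall score.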
The best results come from combining top-down requirements, such as regulatory constraints and stakeholder priorities, with bottom-up discovery of observed failure patterns. One customer built a top-down accuracy judge, but through data analysis discovered that correct answers almost always came from the top two search results. That insight became a new, production-friendly judge that could assess accuracy without requiring ground-truth labels.
Lesson three: You need fewer examples than you think. Teams can build solid judges from just 20-30 well-chosen examples. The key is to pick edge cases that reveal disagreement, rather than obvious examples where everyone agrees.
“For some teams, we can do this process in as little as three hours, so it doesn’t take that long to build a good judge,” Koppol said.
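Given a pool of candidate examples that several experts have each rated, high-disagreement edge cases can be surfaced mechanically by ranking on rating variance. A minimal sketch; the data shape and names are illustrative:

```python
from statistics import variance

def pick_edge_cases(examples, k=20):
    """Keep the k examples whose expert ratings disagree the most.

    examples: list of (item, [expert_ratings]) pairs.
    """
    ranked = sorted(examples, key=lambda ex: variance(ex[1]), reverse=True)
    return ranked[:k]

data = [
    ("obvious pass", [5, 5, 5]),   # everyone agrees: low calibration value
    ("tone dispute", [1, 5, 3]),   # strong disagreement: high value
    ("mild dispute", [4, 3, 4]),
]
for item, ratings in pick_edge_cases(data, k=2):
    print(item)  # prints "tone dispute", then "mild dispute"
```

Annotating the disputed examples first forces the criteria discussions that make a judge reliable, which is why 20-30 well-chosen cases can go further than hundreds of easy ones.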
Production results: from pilot projects to seven-figure implementations
Frankle shared three metrics that Databricks uses to measure Judge Builder’s success: whether customers want to use it again, whether they are increasing their AI spending, and whether they are progressing further on their AI journey.
On the first metric, one customer created a dozen judges after an initial workshop. “This customer created over a dozen judges after we first walked them through a rigorous process using this platform,” Frankle said. “They really went to town on the judges and now they measure everything.”
For the second metric, the business impact is clear. “Many customers attended this workshop and spent seven-figure sums on GenAI at Databricks in a way they hadn’t done before,” Frankle said.
The third metric shows Judge Builder’s strategic value. Customers who were previously hesitant to use advanced techniques such as reinforcement learning can now adopt them with confidence, because they can measure whether improvement has actually occurred.
“There are customers who have done very advanced things that they previously weren’t willing to do,” Frankle said. “They went from doing a bit of prompt engineering to reinforcement learning with us. Why spend money on reinforcement learning, and why waste energy on it, if you don’t know whether it actually made a difference?”
What corporations should do now
Teams that successfully take AI from pilot to production treat judges not as disposable artifacts, but as evolving assets that grow with their systems.
Databricks recommends three practical steps. First, focus on high-impact judges by identifying one critical regulatory requirement and one observed failure mode. These become your initial judge portfolio.
Second, run lightweight workflows with subject matter experts. A few hours reviewing 20-30 edge cases provides sufficient calibration for most judges. Use aggregated annotations and inter-rater reliability checks to denoise your data.
Third, schedule regular judge reviews using production data. As the system evolves, new failure modes will emerge. Your judge portfolio should evolve with them.
“Evaluation is a way to evaluate the model, it’s also a way to create guardrails, it’s also a way to get a metric against which you can do rapid optimization, and it’s also a way to get a metric against which you can do reinforcement learning,” Frankle said. “Once you have a judge that you know represents your human taste in an empirical form that you can ask about as often as you want, you can use it in 10,000 different ways to measure or improve your agents.”
