Will updating your AI agents help or hurt their performance? Raindrop's new Experiments tool aims to tell you

In the two years since ChatGPT launched, new large language models (LLMs) have been released almost weekly, from competing labs and from OpenAI itself. Enterprises are struggling to keep up with the enormous pace of change, let alone understand how best to adapt to it – which of these new models, if any, should they adopt to enhance their workflows and the custom AI agents they build to carry them out?

Help has arrived: Raindrop, the AI application observability startup, has launched Experiments, a new analytics feature the company describes as the first A/B testing suite designed specifically for enterprise AI agents. It lets companies see and compare how updating agents to new base models, or changing their instructions and tool access, affects their performance for real end users.

The release extends Raindrop’s existing observability tools, giving developers and teams the ability to see how their agents behave and evolve under real-world conditions.


Through Experiments, teams can track how changes, such as a new tool, prompt, model update, or full pipeline refactor, affect AI performance across tens of millions of user interactions. The new feature is available now to users on the Raindrop Pro subscription plan ($350/month) at raindrop.ai.
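To picture what that kind of variant-versus-variant comparison involves, here is a minimal sketch in Python. Every name in it (Interaction, failure_rate, the variant labels) is a hypothetical illustration, not Raindrop's actual API:

```python
# Minimal sketch of the kind of A/B comparison Experiments automates at scale.
# All names here are hypothetical illustrations, not Raindrop's actual API.
from dataclasses import dataclass

@dataclass
class Interaction:
    variant: str          # e.g. which model or prompt version served this run
    task_failed: bool     # a negative signal mined from the conversation

def failure_rate(events: list[Interaction], variant: str) -> float:
    """Share of a variant's interactions that ended in task failure."""
    subset = [e for e in events if e.variant == variant]
    return sum(e.task_failed for e in subset) / max(len(subset), 1)

events = [
    Interaction("baseline", False), Interaction("baseline", True),
    Interaction("candidate", False), Interaction("candidate", False),
]
print(failure_rate(events, "baseline"), failure_rate(events, "candidate"))
```

The value of a product like Experiments is doing this across many signals at once, over live traffic, rather than over a handful of hand-labeled runs.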

A data-driven view of agent development

In the product announcement video (above), Raindrop co-founder and chief technology officer Ben Hylak notes that Experiments helps teams see “how literally anything has changed,” including tool usage, user intent, and issue counts, and lets them examine differences across demographic factors such as language. The goal is to make model iteration more transparent and measurable.

The Experiments interface presents results visually, showing when an experiment is performing better or worse than its baseline. A rise in negative signals may indicate increased task failure or partial code output, while improvements in positive signals may reflect more complete responses or a better user experience.

By making this data easier to interpret, Raindrop encourages AI teams to approach agent iteration with the same rigor as modern software deployment – tracking results, sharing insights, and catching regressions before merging them.

From AI observability to experimentation

Raindrop’s launch of Experiments builds on the company’s foundation as one of the first AI-native observability platforms, designed to help enterprises monitor and understand how their generative AI systems behave in production.

As VentureBeat reported earlier this year, the company, originally known as Dawn AI, emerged to tackle what Hylak, a former Apple human interface designer, called the “black box problem” of AI performance, helping teams catch failures “as they happen and explain to enterprises what went wrong and why.”

At the time, Hylak described how “AI products fail all the time – in ways that are both hilarious and terrifying,” noting that unlike traditional software, which throws explicit exceptions, “AI products fail silently.” The original Raindrop platform focused on detecting these silent failures by analyzing signals such as user feedback, task failures, refusals, and other conversational anomalies across tens of millions of daily events.

The company’s co-founders – Hylak, Alexis Gauba, and Zubin Singh Koticha – built Raindrop after experiencing the difficulty of debugging AI systems in production.

“We started out building AI products, not infrastructure,” Hylak told VentureBeat. “But we saw pretty quickly that to grow something serious we needed tools to understand AI behavior, and those tools didn’t exist.”

With Experiments, Raindrop extends that mission from detecting failures to measuring improvements. The new tool turns observability data into actionable comparisons, enabling enterprises to check whether changes to their models, prompts, or pipelines actually improve their AI agents – or simply change them.

Solving the “evals pass, agents fail” problem

Traditional evaluation frameworks, while useful for benchmarking, rarely capture the unpredictable behavior of AI agents operating in dynamic environments.

As Raindrop co-founder Alexis Gauba explained in a post on LinkedIn: “Traditional evaluations don’t really answer this question. They’re great unit tests, but you can’t predict user actions and your agent runs for hours calling hundreds of tools.”

Gauba said there’s a constant refrain of frustration from teams: “Evals pass, agents fail.”

Experiments aims to fill this gap by showing what actually changes when developers push updates to their systems.

The tool enables direct comparison of models, tools, intents or properties, revealing measurable differences in behavior and performance.

Designed with real-world AI behavior in mind

In the announcement video, Raindrop described Experiments as a way to “compare anything and measure how an agent’s behavior actually changed in production based on millions of real-world interactions.”

The platform helps users detect problems such as a rise in task failures, forgetfulness, or new tools causing unexpected errors.

It can also be used in reverse: starting with a known problem, such as “agent stuck in a loop,” and tracking which model, tool, or flag is causing it.

From there, developers can drill down into the detailed paths to find the root cause and quickly ship a fix.

Each experiment provides a visual summary of metrics such as tool use frequency, error rate, call duration, and response length.

Users can click on any comparison to access the underlying event data, giving them a clear picture of how agent behavior has changed over time. Shared links make it easier to collaborate with team members or report findings.
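For illustration, here is a minimal sketch of how per-variant summaries like these could be computed from raw event logs. The field names (variant, tool_calls, error, duration_s, response_chars) are hypothetical and not Raindrop's schema:

```python
# Hypothetical event records; field names are illustrative, not Raindrop's schema.
from statistics import mean

events = [
    {"variant": "baseline",  "tool_calls": 3, "error": False, "duration_s": 4.2, "response_chars": 812},
    {"variant": "baseline",  "tool_calls": 1, "error": True,  "duration_s": 9.7, "response_chars": 120},
    {"variant": "candidate", "tool_calls": 2, "error": False, "duration_s": 3.1, "response_chars": 944},
]

def summarize(variant: str) -> dict:
    """Aggregate one experiment arm into the headline metrics."""
    rows = [e for e in events if e["variant"] == variant]
    return {
        "tool_calls_per_run": mean(e["tool_calls"] for e in rows),
        "error_rate": sum(e["error"] for e in rows) / len(rows),
        "avg_duration_s": mean(e["duration_s"] for e in rows),
        "avg_response_chars": mean(e["response_chars"] for e in rows),
    }

for v in ("baseline", "candidate"):
    print(v, summarize(v))
```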

Integration, scalability and accuracy

According to Hylak, Experiments integrates directly with “feature flag platforms that companies know and love (like Statsig!)” and is designed to work seamlessly with existing telemetry and analytics pipelines.

For companies without such integrations, it can still compare performance across time windows, such as yesterday versus today, without additional configuration.

Hylak said teams typically need about 2,000 users per day to get statistically significant results.

To ensure accurate comparisons, Experiments monitors sample size and alerts users if a test lacks enough data to draw valid conclusions.
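Raindrop has not disclosed its statistical method, but a standard two-proportion z-test is one common way such a check could work. The sketch below, purely illustrative, compares failure rates between a baseline and a candidate at roughly the sample sizes Hylak mentions:

```python
# A standard two-proportion z-test: one common way to check whether a gap
# between a baseline and a variant is statistically meaningful. Raindrop has
# not disclosed its exact method; this is an illustrative sketch only.
from math import erf, sqrt

def two_proportion_p(fail_a: int, n_a: int, fail_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in failure rates."""
    p_a, p_b = fail_a / n_a, fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * (1 - Phi(|z|))

# e.g. 120 failures in 2,000 baseline runs vs. 90 in 2,000 candidate runs
print(two_proportion_p(120, 2000, 90, 2000))  # ~0.03: likely a real difference
```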

“We’re obsessed with making sure metrics like task failures and user frustration are indicators that you could wake up an engineer on call,” Hylak explained. He added that teams can analyze the specific conversations or events driving these metrics, providing transparency behind every aggregate number.

Security and data protection

Raindrop operates as a cloud-hosted platform, but also offers on-premises personally identifiable information (PII) redaction for businesses that need additional control.

Hylak noted that the company complies with SOC 2 requirements and has launched PII Guardian, a feature that uses artificial intelligence to automatically remove sensitive information from stored data. “We take the protection of customer data very seriously,” he emphasized.
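PII Guardian is described as AI-driven; purely to illustrate the general idea of scrubbing stored transcripts, here is a far simpler regex-based sketch (the patterns and placeholder labels are hypothetical, not Raindrop's):

```python
# PII Guardian reportedly uses AI for redaction; this much simpler regex-based
# sketch only illustrates the concept of scrubbing stored transcripts.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace matched PII spans with typed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach me at jane.doe@example.com or +1 (555) 010-9999."))
# -> "Reach me at [EMAIL] or [PHONE]."
```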

Pricing and plans

Experiments is part of Raindrop’s Pro plan, which costs $350 per month or $0.0007 per interaction. The Pro tier also includes deep research tools, topic grouping, custom issue tracking, and semantic search capabilities.

Raindrop’s Starter plan, at $65 per month or $0.001 per interaction, provides basic analytics including issue detection, user feedback signals, Slack alerts, and user tracking. Both plans include a 14-day free trial.
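Assuming the flat monthly fee and the per-interaction rate are alternative ways to bill the same usage, the two meet at $350 ÷ $0.0007 = 500,000 interactions per month on the Pro plan and $65 ÷ $0.001 = 65,000 on the Starter plan – a rough guide to which billing mode is cheaper at a given volume.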

Larger organizations can opt for the Enterprise plan, with custom pricing and advanced features like SSO, custom alerts, integrations, redaction at the edge, and priority support.

Continuous improvement of AI systems

With Experiments, Raindrop positions itself at the intersection of AI analytics and software observability. The focus on “measuring truth,” as stated in the product video, reflects a broader industry push for accountability and transparency in AI operations.

Instead of relying solely on offline benchmarking, Raindrop’s approach emphasizes real user data and contextual understanding. The company hopes this will enable AI developers to move faster, identify root causes earlier, and ship better-performing models with confidence.
