Matt Shumer, co-founder and CEO of OthersideAI (also known by the name of its flagship AI-assisted writing product, HyperWrite), broke nearly two days of silence after being accused of fraud when independent researchers failed to replicate the supposedly best-in-class performance of a new large language model (LLM) he released on Thursday, September 5.
On his X account, Shumer apologized, saying he had gotten ahead of himself, and added: “I know many of you are excited about the potential of this technology, but are skeptical right now.”
However, his new statement doesn’t fully explain why his Reflection 70B model, which he claimed was a variant of Meta’s Llama 3.1 trained using Glaive AI’s synthetic data generation platform, failed to perform as well as he originally claimed in subsequent independent tests. Shumer also didn’t explain exactly what went wrong. Here’s the timeline:
Thursday, September 5, 2024: Initial, exaggerated claims of Reflection 70B’s superior performance in benchmarks
If you’re just catching up: last week, Shumer released Reflection 70B on the open-source AI community Hugging Face, calling it “the world’s best open-source model” in a post on X and sharing a chart of what he said were state-of-the-art results on third-party benchmarks.
Shumer attributed the impressive results to a technique called “Reflection Tuning,” which allows the model to evaluate and refine its answers for correctness before delivering them to users.
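For readers unfamiliar with the idea, the sketch below illustrates the general “reflection” pattern of having a model critique and revise its own draft before answering. It is a minimal illustration only, not Shumer’s published training pipeline; the `generate()` function is a hypothetical stand-in for any LLM completion call.

```python
# Minimal illustrative sketch of a reflection-style answer loop.
# NOTE: generate() is a hypothetical placeholder for an LLM call;
# this is not Reflection 70B's actual training or inference code.

def generate(prompt: str) -> str:
    """Placeholder for a call to an LLM completion endpoint."""
    raise NotImplementedError("Wire this up to your model of choice.")

def reflective_answer(question: str, max_rounds: int = 2) -> str:
    # First pass: draft an answer, reasoning step by step.
    draft = generate(
        f"Think step by step, then answer.\nQuestion: {question}"
    )
    for _ in range(max_rounds):
        # Reflection pass: ask the model to critique its own draft.
        critique = generate(
            f"Question: {question}\nDraft answer: {draft}\n"
            "Check the draft for errors. Reply 'OK' if it is correct; "
            "otherwise explain the mistake."
        )
        if critique.strip().upper().startswith("OK"):
            break
        # Revision pass: regenerate the answer using the critique.
        draft = generate(
            f"Question: {question}\nPrevious draft: {draft}\n"
            f"Critique: {critique}\nProduce a corrected final answer."
        )
    return draft
```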
VentureBeat interviewed Shumer and took his benchmarks as presented, attributing them to him because we don’t have the time or resources to run our own independent benchmarks — and most of the model vendors we’ve covered so far have been honest.
Friday, September 6 – Monday, September 9: Third-party evaluations fail to reproduce Reflection 70B’s impressive results – Shumer accused of fraud
However, within days of the debut and over the weekend, independent evaluators and members of the open-source AI community on Reddit and Hacker News began to question the model’s performance and were unable to reproduce it on their own. Some even found responses and data suggesting the model was connected to, and perhaps just a thin “wrapper” around, Anthropic’s Claude 3.5 Sonnet model.
Criticism grew after Artificial Analysis, an independent AI evaluation organization, posted on X that its tests of Reflection 70B produced significantly lower results than HyperWrite initially claimed.
It also emerged that Shumer was an investor in Glaive, the AI startup whose synthetic data he used to train his model, a fact he did not disclose when he released Reflection 70B.
Shumer attributed the discrepancies to problems with the process of uploading the model to Hugging Face and promised last week to correct the model weights, but has not yet done so.
One X user, Shin Megami Boson, openly accused Shumer of “fraud in the AI research community” on Sunday, September 8. Shumer did not respond directly to the accusation.
After posting and reposting various X messages related to Reflection 70B, Shumer went silent on Sunday evening and did not reply to VentureBeat’s request for comment, nor did he post publicly on X, until this evening, Tuesday, September 10.
Additionally, AI researchers such as Nvidia’s Jim Fan noted that it is relatively easy to train even less capable models (with fewer parameters or less complexity) to score well on third-party benchmarks.
Tuesday, September 10: Shumer responds and apologizes — but doesn’t explain discrepancies
Shumer finally issued a statement on X tonight at 5:30 pm ET, apologizing and stating, in part:
His statement also pointed to a separate X post by Sahil Chaudhary, founder of Glaive AI, the platform that, according to Shumer’s earlier claims, was used to generate the synthetic data used to train Reflection 70B.
Interestingly, in his post Chaudhary stated that some of Reflection 70B’s responses, in which the model claims to be a variant of Anthropic’s Claude, remain a mystery to him. He also admitted that “the benchmark results I shared with Matt have not been repeatable so far.” Read his full post below:
However, Shumer’s and Chaudhary’s responses were not enough to appease skeptics and critics, including Yuchen Jin, co-founder and chief technology officer (CTO) of Hyperbolic Labs, a provider of open-access AI cloud services.
Jin wrote a long post on X detailing how much work he had put into hosting Reflection 70B on his company’s platform and fixing the alleged bugs, noting that the episode had taken an emotional toll given the time and energy his team invested over the weekend.
He also responded to Shumer’s statement in a reply on X, writing: “Hi Matt, we’ve spent a lot of time, energy, and GPUs hosting your model, and I’m sad you haven’t responded in the last 30 hours. I think you could be more transparent about what happened (especially why your private API has so much better performance).”
Shin Megami Boson, like many others, was unconvinced tonight by the way Shumer and Chaudhary described the events, framing the saga as a series of mysterious, still unexplained errors born of enthusiasm.
“As far as I can tell, either you’re lying, or Matt Shumer is, or both,” he wrote on X, then posed a series of questions. Similarly, members of the LocalLLaMA subreddit aren’t buying Shumer’s claims:
It remains to be seen whether Shumer and Chaudhary will be able to respond satisfactorily to their critics and skeptics, who include a growing number of members of the online generative AI community.