LMSYS Launches 'Multimodal Arena': GPT-4 Tops Leaderboard, But AI Still Can't Outsmart Humans

The LMSYS organization launched its “Multimodal Arena” today, a new leaderboard comparing the performance of AI models on vision tasks. The arena collected over 17,000 user preference votes in over 60 languages in just two weeks, offering insight into the current state of AI visual processing capabilities.

Great news – we are pleased to announce the Vision Leaderboard rankings of the Chatbot Arena competition!
Over the last 2 weeks, we have collected over 17,000 votes for different use cases.
Highlights:
– GPT-4o is in the lead, followed by Claude 3.5 Sonnet at #2 and Gemini 1.5 Pro at #3
– Open model… https://t.co/lDu0QpJ5yh pic.twitter.com/G2D7oJjNhF
— lmsys.org (@lmsysorg) June 28, 2024

OpenAI’s GPT-4o model secured the top spot in the Multimodal Arena, closely followed by Anthropic’s Claude 3.5 Sonnet and Google’s Gemini 1.5 Pro. This ranking reflects the fierce competition between technology giants for dominance in the rapidly evolving field of multimodal artificial intelligence.

It is worth noting that the open-source model LLaVA-v1.6-34B achieved results comparable to some proprietary models such as Claude 3 Haiku. This development signals a potential democratization of advanced AI capabilities, potentially leveling the playing field for researchers and smaller firms that lack the resources of large technology companies.

The leaderboard covers a range of tasks, from captioning images and solving math problems to understanding documents and interpreting memes. The goal of this broad scope is to provide a holistic view of each model’s visual processing capabilities, reflecting the complex requirements of real-world applications.

Reality check: AI still struggles with complex visual reasoning

While the Multimodal Arena offers helpful insights, it primarily measures user preferences, not objective accuracy. A more sobering picture emerges from the recently introduced CharXiv benchmark, developed by researchers at Princeton University to gauge the performance of AI in understanding charts from scientific papers.

CharXiv’s results reveal significant limitations in current AI capabilities. The best performing model, GPT-4o, achieved only 47.1% accuracy, while the best open-source model achieved only 29.2%. These results pale in comparison to human performance of 80.5%, highlighting the significant gap that remains in AI’s ability to interpret complex visual data.
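To make the size of that gap concrete, the short sketch below works out the differences in percentage points. It uses only the three accuracy figures quoted above; the variable and function names are illustrative and do not come from the CharXiv authors.

```python
# Accuracy figures (in percent) exactly as quoted in the article for CharXiv.
HUMAN = 80.5
GPT_4O = 47.1
BEST_OPEN_SOURCE = 29.2


def gap_in_points(higher: float, lower: float) -> float:
    """Return the difference in percentage points, rounded to one decimal."""
    return round(higher - lower, 1)


# Hypothetical labels, chosen here only for readability.
gaps = {
    "human vs. GPT-4o": gap_in_points(HUMAN, GPT_4O),
    "human vs. best open-source": gap_in_points(HUMAN, BEST_OPEN_SOURCE),
    "GPT-4o vs. best open-source": gap_in_points(GPT_4O, BEST_OPEN_SOURCE),
}

for label, points in gaps.items():
    print(f"{label}: {points} percentage points")
```

By this arithmetic, the best model trails humans by 33.4 points, and the best open-source model trails by 51.3.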

Are multimodal large language models really as good at chart understanding as existing benchmarks like ChartQA suggest?
Our CharXiv benchmark suggests NO!
Humans achieve 80+% correctness.
Sonnet 3.5 outperforms GPT-4o by 10+ points,… pic.twitter.com/C9YXefYfSz
— Zirui “Colin” Wang (@zwcolin) June 27, 2024

This discrepancy highlights a key challenge in AI development: While models have made impressive progress on tasks like object recognition and basic image captioning, they still struggle with the nuanced reasoning and contextual understanding that humans effortlessly apply to visual information.

Bridging the Gaps: The Next Frontier in AI Vision

The launch of the Multimodal Arena and the insights from benchmarks such as CharXiv come at a pivotal moment for the AI industry. As companies seek to integrate multimodal AI capabilities into products ranging from virtual assistants to autonomous vehicles, understanding the true limitations of these systems is becoming increasingly vital.

These benchmarks serve as a reality check, tempering the often hyperbolic claims about AI’s capabilities. They also provide a roadmap for researchers, highlighting specific areas where improvements are needed to achieve human-level visual understanding.

The gap between AI and human performance in complex visual tasks presents both a challenge and an opportunity. It suggests that significant breakthroughs in AI architecture or training methods may be necessary to achieve truly robust visual intelligence. At the same time, it opens up exciting opportunities for innovation in fields such as computer vision, natural language processing and cognitive science.

As the AI community digests these findings, we can expect to see a renewed focus on developing models that not only see, but also truly understand the visual world. The race is on to create artificial intelligence systems that match, and perhaps one day exceed, human understanding of even the most complex visual reasoning tasks.
