OpenAI’s GPT-4 stood as the pinnacle of AI endeavours for most of 2023, establishing standards for the capabilities of generative AI in comprehending and producing text resembling human writing. Despite efforts from various subsequent Large Language Models (LLMs), such as Google’s Gemini series, Anthropic’s Claude 2, Meta’s Llama series and Mistral AI’s Mistral Large, none succeeded in surpassing the dominance established by GPT-4.
But that changed when the Claude 3 models were announced earlier in March. Anthropic’s Claude 3 Opus and Sonnet have been making huge waves over the last two weeks with their impressive performance, surprisingly outclassing OpenAI’s GPT-4 to become the best Large Language Model (LLM) in existence today. The consensus on this observation has been coming from entrepreneurs, leading AI experts and even academia across the world.
“When it comes to contextual searching, the advent of GPT-4 brought its own challenges, including hallucinations. However, with Claude 3, there’s been a notable improvement in inference speed and response quality,” CodeMate Founder Ayush Singhal told BW Businessworld.
He noted that GPT-4 suffered from sluggish performance despite decent-quality responses, leading developers to prefer GPT-3.5 for its faster speed.
“Claude 3 addresses this issue by delivering rapid responses without compromising on quality. In fact, in some cases, its responses surpass the quality of GPT-4,” Singhal added.
According to Anthropic, its flagship model, Opus, has demonstrated superior performance compared to its counterparts across various standard evaluation benchmarks for AI systems. These benchmarks cover a wide range of tasks, including undergraduate-level expert knowledge (MMLU), graduate-level expert reasoning (GPQA), basic mathematics (GSM8K) and others. The model also shows exceptional comprehension and fluency in tackling intricate tasks, nearly reaching human-like levels of proficiency, thus positioning itself at the forefront of advancing general intelligence.
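For a sense of how scores on a benchmark like GSM8K are typically produced, here is a minimal sketch of an exact-match evaluation loop in Python. The extract_final_number() helper and the ask_model callable are illustrative assumptions, not Anthropic’s actual evaluation harness.

```python
import re

def extract_final_number(model_output: str) -> str:
    """Treat the last number in the model's reply as its final answer,
    a common (if rough) convention for GSM8K-style grading."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    return numbers[-1] if numbers else ""

def gsm8k_accuracy(items, ask_model) -> float:
    """items: (question, gold_answer) pairs from the benchmark;
    ask_model: any callable that sends a prompt to the LLM under
    test and returns its text reply."""
    correct = sum(
        1 for question, gold in items
        if extract_final_number(ask_model(question)) == gold
    )
    return correct / len(items)
```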
“After 100s of queries myself as a user, with Claude 3 (Opus and Sonnet) as the default model on Perplexity, I’m yet to see a hallucination. I couldn’t have said this for GPT 4,” Aravind Srinivas, co-founder and CEO at Perplexity AI, tweeted recently.
While the Claude 3 Opus and Sonnet models are being widely praised for beating GPT-4 across various AI benchmarks, the newly released Claude 3 Haiku model (the most affordable and fastest of the three, though less accurate) is also said to be better than GPT-4.
According to Wyze Research Scientist Mohammad Mahdi Kamani, the Haiku model beats all models from OpenAI in his company’s RAG benchmark. “This model is almost half the price of GPT 3.5 turbo!” Kamani tweeted on 14 March.
Earlier in the month, Kamani evaluated the performance of Claude models against GPT-4 using Wyze’s internal benchmarks and knowledge base. “Scoring is from 0 to 4, with GPT-4 as the judge! Claude models' responses seem more natural,” tweeted Kamani.
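The “GPT-4 as judge” setup Kamani mentions can be sketched roughly as below using OpenAI’s Python client; the prompt wording and score parsing here are assumptions for illustration, not Wyze’s actual benchmark code.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Score the candidate from 0 (unusable) to 4 (fully correct and natural).
Reply with the integer score only."""

def judge_answer(question: str, reference: str, candidate: str) -> int:
    """Ask GPT-4 to grade a candidate answer on the 0-4 scale."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep grading as deterministic as possible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate
            ),
        }],
    )
    return int(response.choices[0].message.content.strip())
```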
With the introduction of Claude 3, the landscape of Large Language Models (LLMs) appears to be experiencing heightened competition for supremacy. There has been a notable upsurge in both founders and users considering a transition away from GPT-4, a shift OpenAI will no doubt be cognizant of.
“OpenAI’s GPT-4 once epitomized the zenith of these efforts, setting benchmarks for what generative AI could achieve in terms of understanding and generating human-like text. Many subsequent LLMs, including Google’s Gemini series, Anthropic’s Claude 2, Meta’s Llama series and Mistral AI’s Mistral Large, continued to challenge the dominance of GPT-4, yet failed,” wrote Wei Sun, Counterpoint Research analyst, on her blog. “However, the ascendancy of Anthropic’s Claude 3 signifies a paradigm shift to a new era. Now the battlefield has become multi-polarized.”
As per Sun’s suggestion, Claude 3’s emergence has possibly ushered in the second phase of the LLM race. However, we will have to wait and watch whether Anthropic can sustain its newfound lead as OpenAI gears up to release GPT-4.5 in June (if reports of a leaked OpenAI blog post are to be believed).
NOTE: RAG benchmarking is used by researchers, developers and stakeholders to systematically evaluate LLMs’ performance, identify areas for improvement and track progress over time. It provides a structured framework for assessing model behaviour and guiding efforts to enhance model quality, fairness and ethical considerations.
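As a rough illustration of the idea, a RAG benchmark run might look like the sketch below. The retrieve() and generate() functions are placeholders for whatever retriever and model API a team actually uses, and judge() could be a grading call like the GPT-4-as-judge sketch earlier.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    reference_answer: str  # ground truth drawn from the knowledge base

def retrieve(question: str, top_k: int = 3) -> list[str]:
    """Placeholder retriever: return the top_k passages relevant to a question."""
    raise NotImplementedError

def generate(model: str, question: str, passages: list[str]) -> str:
    """Placeholder LLM call: answer the question grounded in the passages."""
    raise NotImplementedError

def judge(question: str, reference: str, candidate: str) -> int:
    """Placeholder grader returning a 0-4 score, e.g. an LLM-as-judge call."""
    raise NotImplementedError

def run_rag_benchmark(model: str, cases: list[EvalCase]) -> float:
    """Average judge score for one model across all test cases, so
    different LLMs (GPT-4, Claude 3, etc.) can be compared directly."""
    scores = [
        judge(case.question, case.reference_answer,
              generate(model, case.question, retrieve(case.question)))
        for case in cases
    ]
    return sum(scores) / len(scores)
```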