
• Top-scoring AI boosted accuracy by 183% in under two weeks.
• ChatGPT o3-mini now scores 13% accuracy on its high setting.
• OpenAI Deep Research leads the pack with 26.6% accuracy.
The notoriously difficult AI evaluation known as Humanity's Last Exam was introduced just under two weeks ago, and we've already seen an impressive jump in accuracy. ChatGPT o3-mini and OpenAI's Deep Research are currently leading the pack. This AI benchmark, developed by experts around the world, features some of the most difficult reasoning challenges and questions imaginable.
It's so challenging that when I previously discussed Humanity's Last Exam in the related article, I found myself unable to understand one of the questions, let alone provide an answer. At the time of that article, the impressive DeepSeek R1 sat at the top of the leaderboard with a 9.4% accuracy score based on text-only evaluation (not multi-modal). Fast forward to this week, and OpenAI's o3-mini has achieved 10.5% accuracy on the standard o3-mini setting and an even better 13% on the o3-mini-high setting, which is more capable but takes longer to generate responses. Even more noteworthy is the performance of OpenAI's new AI agent, Deep Research, which scored an impressive 26.6% on the benchmark.
It looks like the latest OpenAI model is doing very well across many topics.
My guess is that Deep Research particularly helps with subjects including medicine, classics, and law. pic.twitter.com/x8Ilmq1aQS— Dan Hendrycks (@DanHendrycks) February 3, 2025
This represents an astonishing 183% improvement in accuracy in less than ten days. It's worth noting that Deep Research has search capabilities, which gives it an edge over AI models that lack this feature. The ability to access the web is especially valuable for a test like Humanity's Last Exam, since it includes questions that require specialized knowledge.
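For anyone curious where that 183% figure comes from, here is a minimal back-of-the-envelope sketch, assuming the baseline is DeepSeek R1's 9.4% text-only score and the new top mark is Deep Research's 26.6% (both figures taken from above):

```python
# Relative improvement from the previous top score to the new one.
baseline = 9.4   # DeepSeek R1 accuracy (%), text-only evaluation
new_top = 26.6   # OpenAI Deep Research accuracy (%)

relative_gain = (new_top - baseline) / baseline * 100
print(f"Relative improvement: {relative_gain:.0f}%")  # roughly 183%
```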
The results from models taking Humanity's Last Exam are showing steady improvement, raising the question of how long it will be before an AI model can come close to acing the benchmark. While it's unlikely that AI will reach that level any time soon, I wouldn't rule it out.
Better, but 26.6% never would have gotten me through the SATs
OpenAI Deep Research is a remarkable tool, and I was genuinely impressed by the demonstrations shown during the AI agent's announcement. Deep Research acts as your personal analyst, taking the time to thoroughly research a topic and generating reports and answers that would normally take humans a significant amount of time to complete.
Although a score of 26.6% on Humanity's Last Exam is quite remarkable, especially given the rapid progress seen on the benchmark's leaderboard in just a few weeks, it is still objectively low. In a real-world scenario, nobody would consider anything below 50% a passing score.
Humanity's Last Exam serves as an excellent benchmark that will remain important as AI models continue to develop, allowing us to measure their progress. How long will it take before a model crosses the 50% threshold? And which model will reach that milestone first?