Benchmark Content on InfoQ
-
Google Releases LMEval, an Open-Source Cross-Provider LLM Evaluation Tool
LMEval aims to help AI researchers and developers compare the performance of different large language models. Designed to be accurate, multimodal, and easy to use, LMEval has already been used to evaluate major models in terms of safety and security.
-
Mistral Unveils Medium 3: Enterprise-Ready Language Model
Mistral AI has unveiled Mistral Medium 3, a mid-sized language model aimed at enterprises seeking a balance between cost-efficiency, strong performance, and flexible deployment options. The model is now available through Mistral’s platform and Amazon SageMaker, with further releases planned for IBM WatsonX, Azure AI Foundry, Google Cloud Vertex AI, and NVIDIA NIM.
-
OpenAI Introduces GPT‑4.1 Family with Enhanced Performance and Long-Context Support
OpenAI has released a new family of language models—GPT‑4.1, GPT‑4.1 mini, and GPT‑4.1 nano—available via its API. The models improve on GPT‑4o and GPT‑4.5 across several technical benchmarks and introduce support for up to 1 million tokens of context.
-
OpenAI Launches BrowseComp to Benchmark AI Agents' Web Search and Deep Research Skills
OpenAI has released BrowseComp, a new benchmark designed to test AI agents' ability to locate difficult-to-find information on the web. The benchmark contains 1,266 challenging problems that require agents to persistently navigate through multiple websites to retrieve entangled information.
-
Google DeepMind Introduces QuestBench to Evaluate LLMs in Solving Logic and Math Problems
Google DeepMind’s QuestBench benchmark evaluates whether LLMs can pinpoint the single, crucial question needed to solve logic, planning, or math problems. The DeepMind team recently published an article on QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question.
-
Radical AI Releases TorchSim: a PyTorch-Native Engine for Next-Generation Atomistic Simulations
Radical AI has announced the release of TorchSim, a next-generation atomistic simulation engine built natively in PyTorch and designed for the MLIP (machine-learned interatomic potentials) era.
-
Meta AI Releases Llama 4: Early Impressions and Community Feedback
Meta has officially released the first models in its new Llama 4 family—Scout and Maverick—marking a step forward in its open-weight large language model ecosystem. Designed with a native multimodal architecture and a mixture-of-experts (MoE) framework, these models aim to support a broader range of applications, from image understanding to long-context reasoning.
-
Google Introduces Gemini 2.5 Pro with Improved Reasoning and Coding Capabilities
Google has released Gemini 2.5 Pro, an updated AI model focused on enhanced reasoning, code generation, and multimodal processing. The model is ranked first on LMArena, a benchmark for human preference in AI responses, and achieves strong results in math, science, and logic-based tasks. It also features a 1 million token context window, with plans to expand to 2 million.
-
Google DeepMind Enhances AMIE for Long-Term Disease Management
Google DeepMind has extended the capabilities of its Articulate Medical Intelligence Explorer (AMIE) beyond diagnosis to support longitudinal disease management. The system is now designed to assist clinicians in monitoring disease progression, adjusting treatments, and adhering to clinical guidelines across multiple patient visits.
-
Mistral AI Introduces Saba: Regional Language Model for Arabic and South Indian Languages
Mistral AI has introduced Mistral Saba, a 24-billion-parameter language model designed to improve AI performance in Arabic and several Indian-origin languages, particularly South Indian languages like Tamil.
-
Perplexity Unveils Deep Research: AI-Powered Tool for Advanced Analysis
Perplexity has introduced Deep Research, an AI-powered tool designed for conducting in-depth analysis across various fields, including finance, marketing, and technology. The system automates the research process by performing multiple searches, analyzing extensive sources, and synthesizing findings into structured reports within minutes.
-
OmniHuman-1: Advancing AI-Generated Human Animation
OmniHuman-1, an AI-driven human video generation model, has been introduced as a step forward in multimodal animation technology. It creates highly lifelike human videos from minimal input, such as a single image and motion cues like audio or video.
-
Microsoft Introduces CoRAG: Enhancing AI Retrieval with Iterative Reasoning
Microsoft AI has introduced Chain-of-Retrieval Augmented Generation (CoRAG), a new AI framework designed to enhance Retrieval-Augmented Generation (RAG) models. Unlike traditional RAG systems, which rely on a single retrieval step, CoRAG enables iterative search and reasoning, allowing AI models to refine their retrievals dynamically before generating answers.
-
Microsoft Research Unveils rStar-Math: Advancing Mathematical Reasoning in Small Language Models
Microsoft Research has unveiled rStar-Math, a framework showing that small language models (SLMs) can achieve mathematical reasoning capabilities comparable to, and in some cases exceeding, those of larger models like OpenAI's o1-mini. It does so without relying on more advanced models, offering a new approach to enhancing the inference capabilities of AI.
-
HuatuoGPT-o1: Advancing Complex Medical Reasoning with AI
Researchers from The Chinese University of Hong Kong, Shenzhen, and the Shenzhen Research Institute of Big Data have introduced HuatuoGPT-o1, a medical large language model (LLM) designed to improve reasoning in complex healthcare scenarios.