
Hugging Face Upgrades Open LLM Leaderboard v2 for Enhanced AI Model Comparison

Hugging Face has recently released Open LLM Leaderboard v2, an upgraded version of its popular benchmarking platform for large language models.

Hugging Face created the Open LLM Leaderboard to provide a standardized evaluation setup for reference models, ensuring reproducible and comparable results.

The leaderboard serves multiple purposes for the AI community. It helps researchers and practitioners identify state-of-the-art open-source releases by providing reproducible scores that separate marketing claims from actual progress. It allows teams to evaluate their work, whether in pre-training or fine-tuning, by comparing methods openly against the best existing models. Additionally, it provides a platform for earning public recognition for advancements in LLM development.

The Open LLM Leaderboard has become a widely used resource in the machine learning community since its inception a year ago. According to Hugging Face, the Open LLM Leaderboard has been visited by over 2 million unique users in the past 10 months, with around 300,000 community members actively collaborating on it monthly.

Open LLM Leaderboard v2 addresses limitations in the original version and keeps pace with rapid advancements in the open-source LLM field.

InfoQ spoke to Alina Lozovskaia, one of the Leaderboard maintainers at Hugging Face, to learn more about the motivation behind this update and its implications for the AI community.

InfoQ: You've changed the model ranking to use normalized scores, where random performance scores 0 points and the maximum score is 100 points, before averaging. How does this normalization method impact the relative weighting of each benchmark in the final score compared to just averaging raw scores?

Alina Lozovskaia: By normalizing each benchmark's scores to a scale where random performance is 0 and perfect performance is 100 before averaging, the relative weighting of each benchmark in the final score is adjusted based on how much a model's performance exceeds random chance. This method gives more weight to benchmarks where models perform close to random (harder benchmarks), highlighting small improvements over chance.

Conversely, benchmarks where models already score high in raw terms contribute proportionally less after normalization. As a result, the normalized averaging ensures that each benchmark influences the final score according to how much a model's performance surpasses mere guessing, leading to a fairer and more balanced overall ranking compared to simply averaging raw scores.
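The arithmetic behind this rescaling is straightforward. The following minimal Python sketch (not the leaderboard's actual code; the scores and baselines are made up for illustration) shows how normalization changes each benchmark's contribution relative to a raw average:

```python
# Minimal sketch of the normalization described above: rescale each benchmark
# score so that the random-guessing baseline maps to 0 and a perfect score to
# 100, then average the normalized scores.

def normalize(raw_score: float, random_baseline: float, max_score: float = 100.0) -> float:
    """Map a raw benchmark score onto a 0-100 scale where 0 is random chance."""
    normalized = (raw_score - random_baseline) / (max_score - random_baseline) * 100.0
    return max(normalized, 0.0)  # scores below random chance are floored at 0

# Hypothetical example: two benchmarks with different random baselines.
scores = [
    {"raw": 30.0, "baseline": 25.0},  # 4-choice multiple choice: random ~= 25
    {"raw": 85.0, "baseline": 0.0},   # generative task: random baseline ~= 0
]

normalized = [normalize(s["raw"], s["baseline"]) for s in scores]
final = sum(normalized) / len(normalized)
print(normalized, final)  # normalized average ~45.8 vs a raw average of 57.5
```

In this toy example, a score barely above chance on the multiple-choice benchmark contributes almost nothing after normalization, while the same raw averaging would have credited it heavily, which is exactly the rebalancing effect described above.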

InfoQ: Benchmark data contamination has been an issue, with some models accidentally trained on data from TruthfulQA or GSM8K. What technical approaches are you taking to mitigate this for the new benchmarks? For example, are there ways to algorithmically detect potential contamination in model outputs?

Lozovskaia: In general, contamination detection is an active but very recent research area: for example, the first workshop dedicated to this topic took place only this year, at ACL 2024 (the CONDA workshop, which we sponsored). Since the field is very new, no algorithmic method is well established yet. We are therefore exploring emerging techniques (such as analyzing the likelihood of model outputs against uncontaminated references), though no method is without strong limitations at the moment.

We are also internally testing hypotheses for detecting contamination specific to our Leaderboard and hope to share progress soon. We are very thankful to our community as well, as we have benefited a lot from their vigilance: users are always very quick to flag models with suspicious performance or likely contamination.
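The likelihood-based idea Lozovskaia alludes to can be illustrated with a rough sketch. The snippet below is a hypothetical heuristic, not the leaderboard's method: it compares a model's average per-token log-likelihood on a verbatim benchmark item against a paraphrased reference, since memorized text tends to receive an unusually high likelihood.

```python
# Hedged illustration of one contamination heuristic (not the leaderboard's
# actual approach): a model that has memorized a benchmark item typically
# assigns the verbatim text a markedly higher likelihood than a paraphrase.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM under test would be used here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def avg_log_likelihood(text: str) -> float:
    """Average per-token log-likelihood of `text` under the model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return -out.loss.item()  # loss is the mean negative log-likelihood

verbatim = "A benchmark question copied word for word."         # suspected contaminated item
paraphrase = "The same question rephrased in different words."  # uncontaminated reference

gap = avg_log_likelihood(verbatim) - avg_log_likelihood(paraphrase)
print(f"log-likelihood gap: {gap:.3f}")  # a large positive gap hints at memorization
```

In practice such signals are noisy, which is consistent with the interviewee's point that no current method is free of strong limitations.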

InfoQ: The MuSR benchmark seems to favor models with context window sizes of 10k tokens or higher. Do you anticipate a significant shift in LLM development towards this type of task?

Lozovskaia: There has been a recent trend towards extending the context length that LLMs can parse accurately, and improvements in this area are going to become more and more important for many business applications (extracting content from multi-page documents, summarizing, accurately answering in long discussions with users, etc.).

We therefore have seen, and expect to see, more and more models with this long-context capability. However, general LLM development will likely balance this with other priorities like efficiency, task versatility, and performance on shorter-context tasks. One of the advantages of open-source models is that they allow everybody to get high performance on the specific use cases they need.

For those interested in exploring the world of large language models and their applications further, InfoQ offers "Large Language Models for Code," presented by Loubna Ben Allal at QCon London. Additionally, our AI, ML, and Data Engineering Trends Report for 2024 provides a comprehensive overview of the latest developments in the field.
