
LMSYS Org Releases Chatbot Arena and LLM Evaluation Datasets

Large Model Systems Organization (LMSYS Org) recently released Chatbot Arena, a comparison platform for large language models (LLMs), where users pick the better response from a pair of chatbots. LMSYS also released a dataset containing conversations from the Arena, as well as a dataset of human preference annotations for LLM responses on the MT-Bench benchmark.

LMSYS Org created Chatbot Arena earlier this year to "crowdsource" an evaluation of several different open- and closed-source LLMs, including GPT-4 and LLaMA. The Arena produced a leaderboard of models, ranking them according to their Elo rating. Because this method was time-consuming, the LMSYS team developed an additional benchmark, MT-Bench, which consists of 80 multi-turn questions to ask a chatbot, with the chatbot's responses graded by GPT-4. According to LMSYS Org: 

[We] have shown that MT-Bench effectively differentiates between chatbots of varying capabilities. It's scalable, offers valuable insights with category breakdowns, and provides explainability for human judges to verify. However, LLM judges should be used carefully. It can still make errors, especially when grading math/reasoning questions.

The rise of LLMs has led to a need for new benchmarks to measure their abilities, as the models have achieved superhuman performance on traditional ones like GLUE. The Massive Multitask Language Understanding (MMLU) benchmark can measure an LLM's knowledge capabilities, but it does not measure how well the LLM can produce output that is aligned with human preference, a key goal of newer models such as ChatGPT.

Earlier this year, LMSYS Org released their Vicuna LLM, a fine-tuned version of Meta's LLaMA model. To evaluate Vicuna, the researchers used GPT-4 as a judge of its output, and claimed that Vicuna achieved "more than 90% quality" of ChatGPT and Bard. Within a few months, LMSYS Org announced Chatbot Arena, an attempt to crowdsource the evaluation of models. Users interact with two different models at once and choose which response they prefer; the outcomes of these "battles" are aggregated into Elo ratings for the models. In this latest move, LMSYS Org is releasing a dataset of 33K Arena chatbot conversations with humans.
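The Elo system converts pairwise "battle" outcomes into per-model ratings. The following is a minimal sketch of the standard Elo update rule; the model names and K-factor here are illustrative assumptions, not LMSYS Org's exact leaderboard implementation:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that a model rated r_a beats one rated r_b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings toward the observed outcome (winner beat loser)."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

# Hypothetical models starting at equal ratings: one win moves 16 points at k=32.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
update_elo(ratings, winner="model_a", loser="model_b")
print(ratings)  # {'model_a': 1016.0, 'model_b': 984.0}
```

Because each update depends only on the two models involved, new models can be added to the leaderboard without re-running every prior comparison.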

After running the Arena for several months, the researchers identified 8 categories of user prompts, including math, reasoning, and STEM knowledge. They created 10 multi-turn questions for each category, producing MT-Bench, a "quality-controlled complement" to the Arena. They again used GPT-4 to grade a chatbot's responses to the benchmark prompts, and found that the GPT-4 judge agreed with human judges more than 80% of the time, which was similar to how often two different human judges agreed. GPT-4's explanations for its choice could even persuade human judges to change their picks 34% of the time. LMSYS Org has now released a dataset of 3.3k "expert-level pairwise human preferences" for responses generated by six different models.
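The agreement figures above are simple match rates over pairwise verdicts. The sketch below shows how such a statistic could be computed; the vote encoding ("A", "B", "tie") is an assumed format for illustration, not the released dataset's schema:

```python
def agreement_rate(judge_a: list, judge_b: list) -> float:
    """Fraction of battles where two judges picked the same winner (or both chose a tie)."""
    assert len(judge_a) == len(judge_b), "judges must rate the same battles"
    matches = sum(a == b for a, b in zip(judge_a, judge_b))
    return matches / len(judge_a)

# Hypothetical verdicts from a GPT-4 judge and a human judge over five battles.
gpt4_votes = ["A", "B", "A", "A", "tie"]
human_votes = ["A", "B", "B", "A", "tie"]
print(agreement_rate(gpt4_votes, human_votes))  # 0.8
```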

ML researcher Nathan Lambert discussed the work on Twitter, pointing out that the MT-Bench score "seems like the clearest benchmark to optimize" for researchers trying to produce models that match leaders like GPT-4. MT-Bench co-author Wei-Lin Chiang also answered several user questions on Twitter. In response to a question about correctly using models when evaluating them, Chiang replied:

That's a great point. We try our best to find the official template if it exists...But lack of standard and LLM's sensitivity to the template is definitely an issue.

The Chatbot Arena and MT-Bench evaluation code are available on GitHub. The Arena conversation dataset and MT-Bench response dataset are available on Hugging Face, as is the current LLM Leaderboard.
