Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage News Microsoft's Orca 2 LLM Outperforms Models That Are 10x Larger

Microsoft's Orca 2 LLM Outperforms Models That Are 10x Larger

This item in japanese

Microsoft Research released its Orca 2 LLM, a fine-tuned version of Llama 2 that performs as well as or better than models that contain 10x the number of parameters. Orca 2 uses a synthetic training dataset and a new technique called Prompt Erasure to achieve this performance.

Orca 2 models are trained using a teacher-student scheme, where a larger, more powerful LLM acts as a teacher for a smaller student LLM, with the goal of improving the performance of the student to be comparable with that of a larger model. Microsoft's training technique teaches the smaller model multiple reasoning techniques and also how to choose the most effective technique for a given task. To do this, the teacher is given sophisticated prompts to trigger a certain reasoning behavior. However, in a scheme called Prompt Erasure, the student is given only the task requirements and desired response, but not the teacher's prompt. When evaluated on benchmarks, a 13B parameter Orca 2 model outperformed a baseline 13B parameter Llama 2 by 47.54%. The 7B parameter Orca 2 was "better or comparable" to a 70B parameter Llama 2 on reasoning tasks.

Although LLMs like ChatGPT can often perform well on a wide range of tasks with few-shot prompting, hosting the models is challenging due to their memory and compute requirements. Smaller models can also perform well when fine-tuned, and many researchers have investigated training them with synthetic datasets generated by larger LLMs. InfoQ recently covered Google's Distilling Step-by-Step method which prompts a teacher LLM to automatically generate a small fine-tuning dataset that contains both an input with an output label, as well as a "rationale" for why the output label was chosen. InfoQ also covered Stability AI's Stable Beluga model which is trained using Microsoft's original Orca 1 scheme, which uses Explanation Tuning, where the teacher LLM is prompted to "generate detailed answers."

Like Orca 1, the Orca 2 training dataset is generated by a teacher LLM which is given a detailed prompt. However, the new approach, which Microsoft dubs Cautious Reasoning, pairs training tasks with prompts which elicit the teacher to use a specific problem solving strategy, such as "step-by-step" or "explain your answer." Then during training of the student, the teacher's prompt is erased, which pushes the student to learn to pick the correct strategy.

To evaluate the methodology, Microsoft compared Orca 2 model performance to several baseline models, including Llama 2, ChatGPT (GPT-3.5) and GPT-4. The benchmark tasks included reasoning, language understanding, text completion, and summarization. On the reasoning benchmarks, the 13B parameter Orca 2 model outperformed all baselines except ChatGPT and GPT-4. They also found that giving Orca 2 a "cautious" system prompt ("You are a cautious assistant. You carefully follow instructions.") gave it a small performance boost compared to an empty system prompt.

Several users posted about Orca 2 on X. One noted that "[Y]ou do not need to prompt it with tricks like 'explain step by step.' It just knows." AI researcher Rudi Ranck wrote:

Many brilliant ideas are so simple...Like "Prompt Erasure" in Orca 2: Instead of presenting the entire prompt, only the task and the answer are shown to the model (it filters the full prompt used to generate those answers). It helps the model to strategize at a higher level. Such a nice paper. I highly recommend reading it all the way through.

The 7B and 13B parameter Orca 2 models are available on Huggingface.

About the Author

Rate this Article