AI Researchers Improve LLM-Based Reasoning by Mimicking Learning from Mistakes

Researchers from Microsoft, Peking University, and Xi’an Jiaotong University claim to have developed a technique that improves the ability of large language models (LLMs) to solve math problems by replicating how humans learn from their own mistakes.

According to the researchers, although LLMs have been shown to solve problems step by step, this does not mean they possess genuine reasoning capabilities.

They may merely emulate the superficial behavior of human reasoning without genuinely comprehending the underlying logic and rules necessary for precise reasoning. This incomprehension results in mistakes during the reasoning process and necessitates the assistance of a "world model" that possesses a consciousness prior about the logic and rules governing the real world.

Dubbed LeMa (Learning from Mistakes), the approach they propose consists of using GPT-4 as a kind of "corrector" for inaccurate reasoning generated by various LLMs. For example, LeMa is able to provide a correct solution to a problem like the following:

James creates a media empire. He creates a movie for $2000. Each DVD cost $6 to make. He sells it for 2.5 times that much. He sells 500 movies a day for 5 days a week. How much profit does he make in 20 weeks?
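As a sanity check on what a correct solution should arrive at, the arithmetic in the problem above can be worked through in a few lines of Python. This is only an illustration of the expected calculation, not part of LeMa, and the variable names are chosen here for readability.

```python
# Worked arithmetic for the DVD problem quoted above (illustrative only).
movie_cost = 2000                 # one-time cost of creating the movie, in dollars
unit_cost = 6                     # cost to make each DVD
unit_price = 2.5 * unit_cost      # each DVD sells for 2.5 times its cost
dvds_per_day = 500
days_per_week = 5
weeks = 20

dvds_sold = dvds_per_day * days_per_week * weeks           # total DVDs sold
profit = dvds_sold * (unit_price - unit_cost) - movie_cost  # margin minus fixed cost
print(profit)  # 448000.0
```

The fixed cost of the movie is subtracted once, while the per-DVD margin applies to every unit sold.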

In the first step, GPT-4 identifies the mistake. In the second step, it explains what caused the mistake. Finally, it corrects the mistake and generates a new answer.
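The three steps above could be requested from a corrector model with a prompt along the following lines. This is a minimal sketch: the prompt wording, the function names, and the stand-in `echo_corrector` are all assumptions made for illustration, not the prompt or code used by the researchers. The corrector is passed in as a plain callable so that no particular model API is implied.

```python
from typing import Callable


def build_correction_prompt(question: str, flawed_solution: str) -> str:
    """Assemble a prompt asking for the three steps: identify, explain, correct.

    The wording is an illustrative assumption, not the researchers' prompt.
    """
    return (
        "The solution below contains a mistake.\n"
        f"Question: {question}\n"
        f"Solution: {flawed_solution}\n"
        "1. Identify the incorrect step.\n"
        "2. Explain what caused the mistake.\n"
        "3. Correct the solution and state the final answer.\n"
    )


def request_correction(question: str, flawed_solution: str,
                       corrector: Callable[[str], str]) -> str:
    """Send the prompt to any corrector, passed in as a plain callable."""
    return corrector(build_correction_prompt(question, flawed_solution))


def echo_corrector(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; it only reports the prompt size."""
    return f"received a prompt of {len(prompt.splitlines())} lines"


print(request_correction("What is 2 + 3?", "2 + 3 = 6", echo_corrector))
```

In practice, `echo_corrector` would be replaced by a function that calls the chosen corrector model and returns its text response.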

LeMa may fail at any of the above steps, so the researchers classify corrections into three groups based on their quality: excellent, good, or poor. They found that, out of 50 generated corrections, 35 were of excellent quality, 11 were good, and 4 were poor.

All corrections that turn out to be correct are then used to fine-tune the LLMs that produced the original answers.
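One way to picture this filtering step is sketched below. It assumes that a correction counts as correct when its final answer matches a known reference answer, and it uses a made-up record layout (`input` and `target`) for the fine-tuning data. Both are assumptions for illustration and are not taken from the LeMa code.

```python
from dataclasses import dataclass


@dataclass
class Correction:
    question: str
    flawed_solution: str
    corrected_solution: str
    final_answer: str


def build_finetuning_records(corrections, reference_answers):
    """Keep only corrections whose final answer matches the reference answer."""
    records = []
    for c in corrections:
        expected = reference_answers.get(c.question)
        if expected is not None and c.final_answer.strip() == expected.strip():
            records.append({
                # Hypothetical layout: question plus flawed solution as input.
                "input": f"{c.question}\n{c.flawed_solution}",
                "target": c.corrected_solution,
            })
    return records


corrections = [
    Correction("What is 2 + 3?", "2 + 3 = 6", "2 + 3 = 5", "5"),
    Correction("What is 4 * 2?", "4 * 2 = 6", "4 * 2 = 10", "10"),
]
references = {"What is 2 + 3?": "5", "What is 4 * 2?": "8"}
print(len(build_finetuning_records(corrections, references)))  # 1
```

Only the first correction survives, because the second one still ends in a wrong answer.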

The team tested their approach on two math reasoning tasks, GSM8K and MATH, and found that it brings improvements in comparison with previous approaches. LeMa also improved the performance of specialized LLMs like WizardMath and MetaMath, achieving 85.4% pass@1 accuracy on GSM8K and 27.1% on MATH.
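For readers unfamiliar with the metric: when a model produces a single answer per problem, pass@1 accuracy is simply the fraction of problems answered correctly. The sketch below computes it using exact string matching on final answers, which is a simplifying assumption; the function name and the sample data are invented for illustration and are not drawn from the benchmarks.

```python
def pass_at_1(predictions, references):
    """Fraction of problems whose single predicted answer matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one problem is required")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up answers for four problems; three of the four match.
print(pass_at_1(["5", "8", "12", "7"], ["5", "8", "10", "7"]))  # 0.75
```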

Among other findings, GPT-3.5-Turbo proved not powerful enough to replace GPT-4 as a corrector. Likewise, while GPT-4 performed well on problems at the two lowest levels of difficulty, its correctness decreased as the difficulty increased, which shows there is still room for improvement.

As a final remark, the team has made its code, data, and models available in a GitHub repository.

About the Author

Sergio De Simone
