Google DeepMind recently announced PaLM 2, a large language model (LLM) powering Bard and over 25 other product features. PaLM 2 significantly outperforms the previous version of PaLM on a wide range of benchmarks, while being smaller and cheaper to run.
Google CEO Sundar Pichai announced the model at Google I/O '23. PaLM 2 performs well on a variety of tasks including code generation, reasoning, and multilingual processing, and it is available in four different model sizes, including a lightweight version called Gecko that is intended for use on mobile devices. When evaluated on NLP benchmarks, PaLM 2 showed performance improvements over PaLM, and achieved new state-of-the-art levels in many tasks, especially on the BIG-bench benchmark. Besides powering Bard, the new model is also a foundation for many other products, including Med-PaLM 2, an LLM fine-tuned for the medical domain, and Sec-PaLM, a model for cybersecurity. According to Google,
PaLM 2 shows us the impact of highly capable models of various sizes and speeds---and that versatile AI models reap real benefits for everyone. Yet just as we’re committed to releasing the most helpful and responsible AI tools today, we’re also working to create the best foundation models yet for Google.
In 2022, InfoQ covered the original release of Pathways Language Model (PaLM), a 540-billion-parameter large language model (LLM). PaLM achieved state-of-the-art performance on several reasoning benchmarks and also exhibited capabilities on two novel reasoning tasks: logical inference and explaining a joke.
For PaLM 2, Google implemented several changes to improve model performance. First, they studied model scaling laws to determine the optimal combination of training compute, model size, and data size. They found that, for a given compute budget, data and model size should be scaled "roughly 1:1," whereas previous researchers had scaled model size 3x the data size.
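The practical consequence of 1:1 scaling can be sketched with a little arithmetic. The snippet below uses the common heuristic that training compute C is roughly 6·N·D FLOPs for N parameters and D tokens; this approximation is a standard rule of thumb from the scaling-law literature, not a formula stated in the source, and the budget figure is purely illustrative:

```python
import math

def optimal_scale(compute_budget_flops: float) -> tuple[float, float]:
    """Illustrate 1:1 scaling of parameter count (N) and training tokens (D)
    under the common approximation C ~= 6 * N * D for training FLOPs.
    (The 6*N*D rule is a widely used heuristic, not from the PaLM 2 report.)"""
    # With N and D scaled 1:1, the budget splits as N = D = sqrt(C / 6).
    n = d = math.sqrt(compute_budget_flops / 6)
    return n, d

# Doubling N and D together (keeping the 1:1 ratio) quadruples the compute bill:
n1, d1 = optimal_scale(6e23)      # hypothetical budget
n2, d2 = optimal_scale(4 * 6e23)  # 4x the budget -> 2x params, 2x tokens
```

Under this rule, quadrupling the compute budget buys twice the parameters and twice the data, rather than pouring most of the extra budget into model size as earlier 3:1 scaling recipes did.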
The team improved PaLM 2's multilingual capabilities by including more languages in the training dataset and updating the model training objective. The original dataset was "dominated" by English; the new dataset pulls from a more diverse set of languages and domains. Instead of using only a language modeling objective, PaLM 2 was trained using a "tuned mixture" of several objectives.
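Google describes the "tuned mixture" only at a high level. One common way such mixtures are realized in practice is to sample a training objective per batch according to tuned weights; the sketch below illustrates that pattern only, and the objective names and weights are hypothetical, not taken from the PaLM 2 report:

```python
import random

# Hypothetical objective mixture: each training batch draws one objective
# according to tuned weights. Names and weights are illustrative only.
OBJECTIVE_WEIGHTS = {
    "causal_lm": 0.5,        # standard next-token prediction
    "span_corruption": 0.3,  # predict masked-out spans of the input
    "prefix_lm": 0.2,        # condition on a prefix, predict the remainder
}

def sample_objective(rng: random.Random) -> str:
    """Pick the training objective to apply to the next batch."""
    names = list(OBJECTIVE_WEIGHTS)
    weights = list(OBJECTIVE_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]

# Over many batches, the empirical objective frequencies track the weights.
rng = random.Random(0)
counts = {name: 0 for name in OBJECTIVE_WEIGHTS}
for _ in range(10_000):
    counts[sample_objective(rng)] += 1
```

The design intent of such a mixture is that no single objective dominates training, which is one way a model can pick up both generation and infilling behavior from the same run.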
Google evaluated PaLM 2 on six broad classes of NLP benchmarks: reasoning, coding, translation, question answering, classification, and natural language generation. The focus of the evaluation was to compare its performance to the original PaLM. On BIG-bench, PaLM 2 showed "large improvements," and on classification and question answering even the smallest PaLM 2 model achieved performance "competitive" with the much larger PaLM model. On reasoning tasks, PaLM 2 was also "competitive" with GPT-4; it outperformed GPT-4 on the GSM8K mathematical reasoning benchmark.
In a Reddit discussion about the model, several users commented that although its output wasn't as good as that from GPT-4, PaLM 2 was noticeably faster. One user said:
They probably want it to be scalable so they can implement it for free/low cost with their products. Also so it can accompany search results without taking forever (I use GPT 4 all the time and love it, but it is pretty slow.)...I just used the new Bard (which is based on PaLM 2) and it's a good amount faster than even GPT 3.5 turbo.
The PaLM 2 tech report page on Papers with Code lists the model's performance on several NLP benchmarks.