Google's Gated Multi-Layer Perceptron Outperforms Transformers Using Fewer Parameters

Researchers at Google Brain have announced Gated Multi-Layer Perceptron (gMLP), a deep-learning model that contains only basic multi-layer perceptrons. Using fewer parameters, gMLP outperforms Transformer models on natural-language processing (NLP) tasks and achieves comparable accuracy on computer vision (CV) tasks.

The model and experiments were described in a paper published on arXiv. To investigate the necessity of the Transformer's self-attention mechanism, the team designed gMLP using only basic MLP layers combined with gating, then compared its performance on vision and language tasks to previous Transformer implementations. On the ImageNet image classification task, gMLP achieves an accuracy of 81.6%, comparable to Vision Transformers (ViT) at 81.8%, while using fewer parameters and FLOPs. For NLP tasks, gMLP achieves a better pre-training perplexity than BERT and a higher F1 score on the SQuAD benchmark: 85.4 compared to BERT's 81.8, while using fewer parameters.

First described by Google researchers in 2017, the Transformer architecture has become the leading design for NLP deep-learning models, with OpenAI's GPT-3 being one of the most famous. Transformers have also begun achieving good results on vision tasks; in particular, Google's ViT recently achieved state-of-the-art results on the ImageNet benchmark. The key feature of the Transformer is the multi-head self-attention mechanism, which learns spatial interactions between elements of a sequence. Now researchers are questioning whether this mechanism is necessary for the Transformer's high performance.

MLPs are considered the "classic" form of a neural network, consisting simply of a series of fully-connected layers or perceptrons, which were first developed in 1958. The main innovation in gMLP is a Spatial Gating Unit (SGU), which captures the interactions across sequence elements; it plays the same role as attention in a Transformer, but without requiring encodings for element positions. Instead, the SGU performs element-wise multiplication of its input with a linear projection of that same input, where the projection is taken across the sequence (spatial) dimension. The researchers found that for training stability, it was necessary to initialize the gate so that its output is close to 1 (in the paper, projection weights near zero and biases of 1), which in effect makes the gate a simple pass-through at the start of training; in the course of training, the weights are updated to learn the spatial interactions between sequence elements.
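To make the gating idea concrete, here is a minimal NumPy sketch of a spatial gate, written for illustration only and not taken from Google's code. The function names (`init_gate`, `spatial_gating_unit`), the tensor shapes, and the initialization scale are assumptions. The sketch also assumes that the pass-through start is obtained with near-zero projection weights and a bias of 1. It omits details the paper adds around the gate, such as normalization and splitting the channels into two halves.

```python
import numpy as np


def init_gate(seq_len, rng, scale=1e-3):
    """Hypothetical initializer: near-zero weights, bias of one.

    With this choice the gate value is roughly 1 everywhere, so the
    unit starts out as an approximate pass-through.
    """
    W = rng.normal(0.0, scale, size=(seq_len, seq_len))
    b = np.ones((seq_len, 1))
    return W, b


def spatial_gating_unit(z, W, b):
    """Gate z (seq_len x d_model) with a projection across positions.

    W has shape (seq_len, seq_len), so row i of the gate is a weighted
    sum over all sequence positions: this is where elements interact.
    """
    gate = W @ z + b   # linear projection along the sequence axis
    return z * gate    # element-wise multiplication with the input


rng = np.random.default_rng(0)
seq_len, d_model = 4, 6
z = rng.normal(size=(seq_len, d_model))
W, b = init_gate(seq_len, rng)
out = spatial_gating_unit(z, W, b)  # same shape as z, close to z at init
```

Because `W` is indexed by position rather than computed from the content of the sequence, the mixing pattern is fixed after training, which is one way to read the contrast with attention, where the mixing weights depend on the input.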

In a set of experiments, the team trained several gMLP models of various sizes and compared their performance to similarly-sized Transformer-based models on CV and NLP benchmarks. On the ImageNet benchmark, they compared performance against ViT as well as other recently-published MLP-based CV models. Although neither ViT nor MLP models perform as well as state-of-the-art convolutional neural network (CNN) models, gMLP is comparable to Transformers and outperforms all MLP models. For NLP performance, the team compared their model to BERT and another MLP-based language model. They found that the base gMLP model could achieve performance on par with BERT when scaled sufficiently; however, with the addition of a "tiny bit of self-attention" gMLP could outperform BERT on all tasks, with fewer parameters.

Researchers from several organizations have recently investigated MLP-based models for vision and language. Earlier this year, a separate team at Google Brain open-sourced MLP-Mixer, a CV model that uses MLP layers to "mix" features of image patches to learn spatial relationships. Facebook AI Research developed the Residual Multi-Layer Perceptrons (ResMLP) architecture, another vision model inspired by ViT that replaces self-attention with a simple MLP layer. In the academic world, a researcher at Oxford University open-sourced a similar model, and a team from Tsinghua University open-sourced RepMLP, a model that re-parameterizes CNN layers into MLP layers for faster inference.

In a discussion on Reddit, one user commented on Google's efforts to improve gMLP's training stability:

"The paper also makes a number of remarks about stability and the methods they used to increase it, in particular, during the start of training. Maybe there's a utility in including the stability/robustness of a model as a figure of merit?"

Although Google has not released the gMLP code, several readers of the paper have open-sourced their own implementations.
