Google's Gated Multi-Layer Perceptron Outperforms Transformers Using Fewer Parameters

Researchers at Google Brain have announced Gated Multi-Layer Perceptron (gMLP), a deep-learning model that contains only basic multi-layer perceptrons. Using fewer parameters, gMLP outperforms Transformer models on natural-language processing (NLP) tasks and achieves comparable accuracy on computer vision (CV) tasks.

The model and experiments were described in a paper published on arXiv. To investigate whether the Transformer's self-attention mechanism is necessary, the team designed gMLP using only basic MLP layers combined with gating, then compared its performance on vision and language tasks to previous Transformer implementations. On the ImageNet image-classification task, gMLP achieves an accuracy of 81.6%, comparable to the Vision Transformer (ViT) at 81.8%, while using fewer parameters and FLOPs. For NLP tasks, gMLP achieves lower pre-training perplexity than BERT and a higher F1 score on the SQuAD benchmark, 85.4 compared to BERT's 81.8, while using fewer parameters.

First described by Google researchers in 2017, the Transformer architecture has become the leading design for NLP deep-learning models, with OpenAI's GPT-3 being one of the most famous. Transformers have also begun achieving good results on vision tasks; in particular, Google's ViT recently achieved state-of-the-art results on the ImageNet benchmark. The key feature of the Transformer is the multi-head self-attention mechanism, which learns spatial interactions between elements of a sequence. Now researchers are questioning whether this mechanism is necessary for the Transformer's high performance.
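For readers unfamiliar with the mechanism, the sketch below shows a minimal single-head scaled dot-product self-attention layer in PyTorch. It is purely illustrative; the class and parameter names are not taken from the paper, but it is the component whose necessity gMLP calls into question.

```python
# Minimal sketch of single-head scaled dot-product self-attention,
# included only to illustrate the mechanism gMLP aims to replace.
# Class and parameter names are illustrative, not from the paper.
import math
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)  # query/key/value projections
        self.out = nn.Linear(d_model, d_model)
        self.d_model = d_model

    def forward(self, x):                           # x: (batch, seq_len, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)
        weights = scores.softmax(dim=-1)            # each position attends to all others
        return self.out(weights @ v)
```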

MLPs are considered the "classic" form of a neural network: a simple stack of fully-connected layers, built from the perceptron first developed in 1958. The main innovation in gMLP is a Spatial Gating Unit (SGU), which captures interactions across sequence elements; this performs the same role as attention in a Transformer, but without requiring positional encodings. Instead, the SGU performs element-wise multiplication of its input with a linear projection, taken across the sequence dimension, of that same input. The researchers found that for training stability it was necessary to initialize the gate's projection so that its output is close to 1, which in effect makes the gate a simple pass-through at the start of training; as training proceeds, the weights are updated to learn the spatial interactions between sequence elements.
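The sketch below is one plausible reading of the SGU described in the paper, written in PyTorch: the input channels are split in half, one half is linearly projected across the sequence dimension, and the result gates the other half element-wise. The names and shapes are illustrative; Google has not released the official code.

```python
# One plausible sketch of the Spatial Gating Unit (SGU) described in the paper.
# The input channels are split: one half is projected across the sequence
# (spatial) dimension and used as an element-wise gate on the other half.
# Names and shapes are illustrative; Google has not released the code.
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    def __init__(self, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        # Linear projection across the sequence dimension (seq_len x seq_len).
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        # Initialize so the gate output starts near 1, i.e. a pass-through,
        # which the paper reports is needed for training stability.
        nn.init.zeros_(self.spatial_proj.weight)
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, x):              # x: (batch, seq_len, d_ffn)
        u, v = x.chunk(2, dim=-1)      # split channels: u passes through, v gates
        v = self.norm(v)
        v = self.spatial_proj(v.transpose(-1, -2)).transpose(-1, -2)
        return u * v                   # element-wise gating, no positional encodings
```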

In a set of experiments, the team trained several gMLP models of various sizes and compared their performance to similarly-sized Transformer-based models on CV and NLP benchmarks. On the ImageNet benchmark, they compared performance against ViT as well as other recently-published MLP-based CV models. Although neither ViT nor the MLP models perform as well as state-of-the-art convolutional neural network (CNN) models, gMLP is comparable to Transformers and outperforms the other MLP-based models. For NLP performance, the team compared their model to BERT and another MLP-based language model. They found that the base gMLP model could achieve performance on par with BERT when scaled sufficiently; however, with the addition of a "tiny bit of self-attention," gMLP could outperform BERT on all tasks with fewer parameters.
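Putting the pieces together, a single gMLP block as described in the paper places the gating unit between two channel projections, wrapped in a residual connection. The sketch below reuses the SpatialGatingUnit class from the previous snippet and is, again, an illustrative approximation rather than Google's implementation.

```python
# Rough sketch of one gMLP block: norm, channel projection with GELU,
# spatial gating, and a second channel projection, with a residual
# connection around the whole block. Reuses the SpatialGatingUnit sketch
# above; an illustrative approximation, not Google's implementation.
import torch.nn as nn

class GMLPBlock(nn.Module):
    def __init__(self, d_model: int, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, d_ffn)
        self.act = nn.GELU()
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)    # defined in the sketch above
        self.proj_out = nn.Linear(d_ffn // 2, d_model)  # gating halves the channels

    def forward(self, x):              # x: (batch, seq_len, d_model)
        shortcut = x
        x = self.act(self.proj_in(self.norm(x)))
        x = self.sgu(x)
        return shortcut + self.proj_out(x)
```

A full model stacks a number of these blocks on top of token or patch embeddings with a task-specific head. The "tiny bit of self-attention" in the paper's aMLP variant is attached to the gating path and is omitted from this sketch.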

Researchers from several organizations have recently investigated MLP-based models for vision and language. Earlier this year, a separate team at Google Brain open-sourced MLP-Mixer, a CV model that uses MLP layers to "mix" features of image patches to learn spatial relationships. Facebook AI Research developed the Residual Multi-Layer Perceptrons (ResMLP) architecture, another vision model inspired by ViT that replaces self-attention with a simple MLP layer. In the academic world, a researcher at Oxford University open-sourced a similar model, and a team from Tsinghua University open-sourced RepMLP, a model that re-parameterizes CNN layers into MLP layers for faster inference.

In a discussion on Reddit, one user commented on Google's efforts to improve gMLP's training stability:

The paper also makes a number of remarks about stability and the methods they used to increase it, in particular, during the start of training. Maybe there's a utility in including the stability/robustness of a model as a figure of merit?

Although Google has not released the gMLP code, several readers of the paper have open-sourced their own implementations.
