Perceiver: One Neural-Network Model for Multiple Input Data Types


Google's DeepMind has recently released Perceiver, a state-of-the-art deep-learning model that receives and processes multiple types of input data, ranging from audio to images, similar to how the human brain perceives multimodal data.

Perceiver can receive and classify multiple input data types, namely point clouds, audio and images. To do so, the model is based on transformers, an attention-based architecture that makes few assumptions about the input data type.

The usual bottleneck of transformers is that self-attention requires a number of operations that grows quadratically with the input size. For instance, an image measuring 224 pixels by 224 pixels has 224^2 = 50,176 pixels, so self-attention over the raw pixels would have to compare about 2.5 billion pixel pairs in each layer, which is a huge computational overhead. To solve this problem, DeepMind researchers replaced self-attention over the inputs with a cross-attention layer in which a small, fixed-size latent array attends to the input, so the cost grows linearly with the input size.
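To make the scaling concrete, the short sketch below counts query-key pairs for one attention layer. It is only an illustration: the helper names are hypothetical, and the latent count of 256 is an assumption borrowed from the `num_latents` value in the community snippet later in this article, not a figure quoted from the paper.

```python
def self_attention_pairs(num_inputs: int) -> int:
    """Query-key pairs when every input element attends to every input element."""
    return num_inputs * num_inputs


def cross_attention_pairs(num_inputs: int, num_latents: int) -> int:
    """Query-key pairs when a fixed set of latents attends to the inputs."""
    return num_inputs * num_latents


pixels = 224 * 224          # one 224 x 224 image, one element per pixel
num_latents = 256           # assumed latent count (hypothetical choice)

print(pixels)                                      # 50176
print(self_attention_pairs(pixels))                # 2517630976
print(cross_attention_pairs(pixels, num_latents))  # 12845056
```

Doubling the number of pixels quadruples the first count but only doubles the second, which is what linear complexity in the input size means here.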

Source: Perceiver: General Perception with Iterative Attention

In addition, the input used to compute cross-attention is flattened into a generic array of elements, which the paper calls a byte array, so the model is agnostic to the data type.
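The idea can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not DeepMind's implementation: it omits the learned query, key and value projections, multiple heads and positional encodings, it uses random values in place of learned latents, and it assumes the latents share the channel dimension of the input. The function names are hypothetical.

```python
import math
import random


def flatten_to_byte_array(image):
    """Flatten an H x W x C nested list into an (H*W) x C list of elements."""
    return [pixel for row in image for pixel in row]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latents, byte_array):
    """Each of the N latents attends over all M input elements: N * M scores."""
    dim = len(latents[0])
    output = []
    for query in latents:
        scores = [
            sum(q * k for q, k in zip(query, element)) / math.sqrt(dim)
            for element in byte_array
        ]
        weights = softmax(scores)
        # Weighted average of the input elements (used here as both keys and values).
        output.append(
            [sum(w * element[j] for w, element in zip(weights, byte_array))
             for j in range(dim)]
        )
    return output


random.seed(0)
H, W, C = 8, 8, 3                      # a tiny stand-in for an image
image = [[[random.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
latents = [[random.gauss(0, 1) for _ in range(C)] for _ in range(4)]

byte_array = flatten_to_byte_array(image)
output = cross_attention(latents, byte_array)
print(len(byte_array), len(output), len(output[0]))  # 64 4 3
```

Nothing in `cross_attention` depends on the input having been an image: any data that can be flattened into a list of equally sized elements would pass through the same code, and the output size is set by the number of latents rather than by the input length.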

The main breakthrough of this model is that it makes no assumptions about the input data type, whereas, for instance, existing convolutional neural networks are designed for grid-structured data such as images.

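For image classification, this model achieves 76.4% accuracy on ImageNet, competitive with established image models. In the paper's permuted-ImageNet experiment, where the pixels are shuffled so that the spatial structure is hidden from the model, Perceiver's accuracy is unchanged while ResNet-50 achieves 39.4%.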

Perceiver has attracted attention on social media, with thousands of views on YouTube, a discussion thread on Reddit and an ongoing discussion on Twitter. One comment on the Reddit thread shows the relevance of this new model:

The basic idea, as I understand it, is to achieve cross-domain generality by recreating the MLP with transformers, where

  • "neurons" and activations are vectors not scalars, and
  • interlayer weights are dynamic, not fixed

You can also reduce input dimensionality by applying cross-attention to a fixed set of learned vectors. Pretty cool.

In addition, a researcher shared this insight in a Twitter thread:

This is really great work. There is a community implementation too.
Definitely going to be playing around with this. Thanks for the paper.


Finally, members of the deep-learning community have published an open-source implementation in PyTorch. The following snippet shows how to use it:

import torch
from perceiver_pytorch import Perceiver

model = Perceiver(
    input_channels = 3,          # number of channels for each token of the input
    input_axis = 2,              # number of axis for input data (2 for images, 3 for video)
    num_freq_bands = 6,          # number of freq bands, with original value (2 * K + 1)
    max_freq = 10.,              # maximum frequency, hyperparameter depending on how fine the data is
    depth = 6,                   # depth of net
    num_latents = 256,           # number of latents, or induced set points, or centroids. different papers giving it different names
    latent_dim = 512,            # latent dimension
    cross_heads = 1,             # number of heads for cross attention. paper said 1
    latent_heads = 8,            # number of heads for latent self attention, 8
    cross_dim_head = 64,
    latent_dim_head = 64,
    num_classes = 1000,          # output number of classes
    attn_dropout = 0.,
    ff_dropout = 0.,
    weight_tie_layers = False    # whether to weight tie layers (optional, as indicated in the diagram)
)

img = torch.randn(1, 224, 224, 3) # 1 imagenet image, pixelized

model(img) # (1, 1000)

