Facebook Develops New AI Model That Can Anticipate Future Actions

Facebook has unveiled its latest machine-learning model, the Anticipative Video Transformer (AVT), which predicts future actions from visual input. AVT is an end-to-end attention-based model for action anticipation in videos.

The new model builds on recent breakthroughs in transformer architectures, particularly in natural language processing and image modeling, and targets applications ranging from self-driving cars to augmented reality.

AVT analyzes an ongoing activity to anticipate its likely outcome, which is especially relevant for AR and the metaverse. Facebook plans for its metaverse apps to work across other platforms and hardware, through APIs that allow programs to talk to each other.

Anticipating future activities is a difficult problem for AI because it requires both predicting the multimodal distribution of future activities and modeling the course of previous actions.

AVT is attention-based, so it can process a full sequence in parallel, whereas approaches based on recurrent neural networks must process sequences one step at a time and often forget the past. AVT also features loss functions that encourage the model to capture the sequential nature of video, which would otherwise be lost by attention-based architectures such as nonlocal networks.

AVT consists of two parts: an attention-based backbone (AVT-b) that operates on frames of video and an attention-based head architecture (AVT-h) that operates on features extracted by the backbone.

The AVT-b backbone is based on the vision transformer (ViT) architecture. It splits frames into non-overlapping patches, embeds them with a feedforward network, appends a special classification token, and applies multiple layers of multihead self-attention. The head architecture takes the per-frame features and applies another transformer architecture with causal attention, meaning that it attends only to features from the current and preceding frames. This in turn allows the model to rely solely on past features when generating a representation of any individual frame.
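
The two mechanisms described above, non-overlapping patch splitting in the backbone and causal attention in the head, can be illustrated with a minimal pure-Python sketch. This is not Facebook's implementation: the function names `split_into_patches` and `causal_attention` are hypothetical, the attention is single-head, and the learned query, key, and value projections of a real transformer are omitted (the features themselves play all three roles), purely to keep the example short.

```python
import math


def split_into_patches(frame, patch):
    """Split a 2D frame (a list of rows) into non-overlapping patch x patch tiles.

    Each tile is flattened in row-major order. Assumes, for simplicity,
    that both frame dimensions are divisible by `patch`.
    """
    height, width = len(frame), len(frame[0])
    if height % patch or width % patch:
        raise ValueError("frame dimensions must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([frame[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def causal_attention(features):
    """Single-head scaled dot-product self-attention with a causal mask.

    `features` is a list of per-frame feature vectors. The output for
    frame t is a weighted average of frames 0..t only; later frames are
    never visible. Simplification: no learned projections are applied.
    """
    dim = len(features[0])
    outputs = []
    for t, query in enumerate(features):
        visible = features[: t + 1]  # causal mask: current and preceding frames
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in visible]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([sum(w * v[i] for w, v in zip(weights, visible)) / total
                        for i in range(dim)])
    return outputs
```

Because each output row is computed only from frames up to that position, changing a later frame leaves the representations of all earlier frames untouched, which is the property that lets the head predict what comes next from past features alone.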

AVT could be used as an AR action coach or as an artificial-intelligence assistant that warns people before they make mistakes. In addition, AVT could be helpful for tasks beyond anticipation, such as self-supervised learning, the discovery of action schemas and boundaries, and even general action recognition in tasks that require modeling the chronological sequence of actions.
