Google AI Research published a paper describing their work on depth perception from two-dimensional images. Using a training dataset created from YouTube videos of the Mannequin Challenge, researchers trained a neural network that can reconstruct depth information from videos of moving people, taken by moving cameras.
A common problem in computer vision is reconstructing three-dimensional information from a two-dimensional image. The output of this process is a "depth map": an array, aligned with the original 2D image's RGB pixels, in which each value represents the distance from the camera to the surface that produced the corresponding pixel. This has many real-world applications, including augmented reality (AR) and robot navigation.
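To make the depth-map representation concrete, here is a minimal NumPy sketch with made-up dimensions and values (not taken from the paper):

```python
import numpy as np

# A hypothetical 480x640 RGB frame: height x width x 3 color channels.
rgb = np.zeros((480, 640, 3), dtype=np.uint8)

# The corresponding depth map has one value per pixel: the distance (here
# in meters) from the camera to the surface that produced that pixel.
depth = np.full((480, 640), 2.5, dtype=np.float32)

# An RGB-D sensor such as the Kinect returns both arrays together; depth
# prediction tries to recover the second array from the first one alone.
print(rgb.shape, depth.shape)  # (480, 640, 3) (480, 640)
```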
A class of sensors called RGB-D sensors, such as the Kinect, can output depth data directly along with a 2D RGB image. Constructing the depth map from RGB image data alone is often done by triangulation, either with multiple cameras (similar to natural vision systems based on multiple eyes) or with a single moving camera. The single-camera approach uses the parallax between successive frames, but it works poorly when objects in the scene are also moving. Accurate depth reconstruction from a single camera is necessary for many of these applications, especially AR on mobile phones. In particular, Google's researchers were interested in depth reconstruction from scenes containing many moving objects, including people. Such scenes are even more challenging: human bodies not only move through the scene, their limbs also move relative to each other, effectively changing both the shape of the person in the camera images and the relative depth of each body part. To tackle this problem with machine learning, researchers need a large dataset of videos containing people, filmed with a moving camera. A team from the University of Washington used data extracted from video games to convert 2D videos of soccer games into 3D, but that approach limits their system to soccer footage.
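The parallax-based triangulation mentioned above reduces to the standard two-view relation depth = focal length × baseline / disparity. The following Python sketch uses purely illustrative numbers to show both the idea and why it fails for moving objects:

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Classic two-view triangulation: the larger the apparent shift
    (parallax) of a point between two views, the closer it is."""
    return focal_length_px * baseline_m / disparity_px

# Illustrative values: a 700-pixel focal length and a 10 cm baseline,
# e.g. two positions of a single moving camera one frame apart.
disparity = np.array([70.0, 35.0, 7.0])               # pixel shift between frames
print(depth_from_disparity(disparity, 700.0, 0.10))   # -> [ 1.  2. 10.] meters

# If an object itself moves between the frames, its apparent shift no
# longer reflects camera motion alone and this estimate breaks down --
# the problem the Google researchers set out to address.
```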
Enter the Mannequin Challenge (MC), an internet meme in which groups of people impersonate mannequins by holding a fixed pose while a videographer moves around the scene taking a video. Because the camera is moving and the scene itself is static, parallax methods can easily reconstruct accurate depth maps of human figures in a variety of poses. The researchers processed around 2,000 YouTube MC videos to produce a dataset of "4,690 sequences with a total of more than 170K valid image-depth pairs."
Given this dataset, the team further processed it to create input for a deep neural network (DNN). For a given frame, a preceding frame was compared for parallax to produce an initial depth map. The input frame was also segmented by a vision system that detects humans, producing a human mask that is used to clear the initial depth map in the areas where humans are found. The training target is the known depth map for the input image, computed from the MC videos. The DNN learned to take the input image, the initial depth map, and the human mask, and output a "refined" depth map with the depth values of humans filled in.
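A rough PyTorch sketch of how those three inputs might be combined and fed to such a network is shown below. The class name DepthRefinementNet, the layer sizes, and the masking step are illustrative assumptions, not the authors' actual architecture, which is a much deeper encoder-decoder:

```python
import torch
import torch.nn as nn

class DepthRefinementNet(nn.Module):
    """Toy stand-in for the paper's model: it consumes the RGB frame,
    the parallax-based initial depth, and the human mask, and predicts
    a full depth map (hypothetical architecture)."""
    def __init__(self):
        super().__init__()
        # 3 RGB channels + 1 masked initial-depth channel + 1 human-mask channel
        self.net = nn.Sequential(
            nn.Conv2d(5, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, rgb, initial_depth, human_mask):
        # Zero out the parallax depth wherever a person was detected,
        # since it is unreliable there.
        masked_depth = initial_depth * (1.0 - human_mask)
        x = torch.cat([rgb, masked_depth, human_mask], dim=1)
        return self.net(x)

model = DepthRefinementNet()
rgb = torch.rand(1, 3, 240, 320)            # input video frame
initial_depth = torch.rand(1, 1, 240, 320)  # depth from two-frame parallax
human_mask = torch.zeros(1, 1, 240, 320)    # 1 where a person is detected
refined = model(rgb, initial_depth, human_mask)
print(refined.shape)  # torch.Size([1, 1, 240, 320])
```

During training, the output would be compared against the full depth map recovered from the MC videos, so the network learns to fill in the human regions that parallax alone cannot handle.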
Google suggested this technology could have several applications, including "3D-aware video effects [such as] synthetic defocus". A commenter on Reddit suggested a mobile phone application that translates depth into sound "to help blind people navigate."
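As a loose illustration of one such 3D-aware effect, synthetic defocus can be approximated by blending in more blur the farther a pixel's depth lies from a chosen focal plane. This NumPy/SciPy sketch is a simplification for illustration, not Google's implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def synthetic_defocus(rgb, depth, focus_depth, max_sigma=5.0):
    """Blend a sharp and a blurred copy of the image per pixel, weighting
    the blur by how far each pixel's depth is from the focal plane."""
    blurred = np.stack([gaussian_filter(rgb[..., c], sigma=max_sigma)
                        for c in range(3)], axis=-1)
    weight = np.clip(np.abs(depth - focus_depth) / depth.max(), 0.0, 1.0)
    return (1 - weight[..., None]) * rgb + weight[..., None] * blurred

rgb = np.random.rand(240, 320, 3)      # placeholder frame
depth = np.random.rand(240, 320) * 10  # placeholder depth map in meters
out = synthetic_defocus(rgb, depth, focus_depth=2.0)
print(out.shape)  # (240, 320, 3)
```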
On Twitter, AR researcher Ross Brown said,
"What is also interesting is that I use ZedCams for depth map generation in project Proteus, now I only need [a digital SLR]. This really opens it up. Time to read up on TensorFlow..."
Interestingly, the DNN code is based on PyTorch rather than Google's own TensorFlow framework. The inference code and pretrained models are available on GitHub, and the project's page says the dataset is "coming soon."