NVIDIA Announces AI Training Dataset Generator DatasetGAN

Researchers at NVIDIA have created DatasetGAN, a system for generating synthetic images with annotations to create datasets for training AI vision models. DatasetGAN can be trained with as few as 16 human-annotated images and performs as well as fully-supervised systems requiring 100x more annotated images.

The system and experiments are described in a paper to be presented at the upcoming Conference on Computer Vision and Pattern Recognition (CVPR 2021). DatasetGAN uses NVIDIA's StyleGAN technology to generate photorealistic images. A human annotator creates detailed labels for the object parts in a handful of generated images, and an interpreter model is then trained on this data to produce the same labels directly from StyleGAN's latent features. The result is a system that can generate an effectively unlimited number of images along with annotations, which can then be used as a training dataset for any computer vision (CV) system.

A generative adversarial network (GAN) is a system composed of two deep-learning models: a generator, which learns to create realistic data, and a discriminator, which learns to distinguish real data from the generator's output. After training, the generator is often used alone to produce data. NVIDIA has used GANs for several applications, including its Maxine platform for reducing video-conference bandwidth. In 2019, NVIDIA developed a GAN called StyleGAN that can produce photorealistic images of human faces and powers the popular website This Person Does Not Exist. Last year, NVIDIA developed a variation of StyleGAN that takes as input the desired camera, texture, background, and other data to produce customizable renderings of an image.
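To make the adversarial setup concrete, the following is a minimal PyTorch sketch of a single GAN training step. The architecture, dimensions, and hyperparameters are illustrative placeholders only and bear no resemblance to StyleGAN's actual design; images are flattened to vectors for simplicity.

```python
import torch
import torch.nn as nn

latent_dim, img_dim = 128, 64 * 64 * 3  # illustrative sizes

# Generator: latent vector -> image; Discriminator: image -> real/fake logit.
generator = nn.Sequential(
    nn.Linear(latent_dim, 1024), nn.ReLU(),
    nn.Linear(1024, img_dim), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(img_dim, 1024), nn.LeakyReLU(0.2),
    nn.Linear(1024, 1),
)

loss_fn = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_images):
    batch = real_images.size(0)
    z = torch.randn(batch, latent_dim)
    fake_images = generator(z)

    # Discriminator learns to score real images as 1 and generated ones as 0.
    d_loss = (loss_fn(discriminator(real_images), torch.ones(batch, 1)) +
              loss_fn(discriminator(fake_images.detach()), torch.zeros(batch, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator learns to make the discriminator score its output as real.
    g_loss = loss_fn(discriminator(fake_images), torch.ones(batch, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```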

Although GANs can produce an infinite number of unique high-quality images, most CV training algorithms also require that images be annotated with information about the objects they contain. ImageNet, one of the most popular CV datasets, famously employed tens of thousands of workers to label images using Amazon's Mechanical Turk. Although those workers could annotate images at a rate of almost five per minute, the images are simple pictures of a single object. More complex vision tasks, such as those needed for autonomous vehicles, require images of complex scenes with semantic segmentation, where each pixel is labelled as belonging to an object. According to the NVIDIA researchers, "labeling a complex scene with 50 objects can take anywhere between 30 to 90 minutes."

NVIDIA's insight with DatasetGAN is that the latent space used as input to the generator must contain semantic information about the generated image and can therefore be used to create annotation maps for that image. The team created a training dataset for their system by first generating several images and saving the latent vectors associated with them. The generated images were annotated by human workers, and the latent vectors were paired with these annotations for training. This dataset was then used to train an ensemble of multi-layer perceptron (MLP) classifiers that acts as a style interpreter. The classifier input is the feature vector the GAN produced when generating each pixel, and the output is a label for that pixel; for example, when the GAN generates an image of a human face, the interpreter outputs labels indicating parts of the face, such as cheek, nose, or ear.
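As a rough illustration of the style-interpreter idea, here is a hedged PyTorch sketch of an MLP ensemble that classifies each pixel from the per-pixel feature vector the GAN produced. The feature dimension, label count, ensemble size, and voting scheme are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: feature_dim is the length of the per-pixel
# feature vector collected from the GAN's layers; num_labels is the
# number of object-part classes (e.g. cheek, nose, ear for faces).
feature_dim, num_labels, ensemble_size = 5056, 34, 10

def make_mlp():
    # One small classifier mapping a pixel's GAN features to part logits.
    return nn.Sequential(
        nn.Linear(feature_dim, 128), nn.ReLU(),
        nn.Linear(128, 32), nn.ReLU(),
        nn.Linear(32, num_labels),
    )

ensemble = [make_mlp() for _ in range(ensemble_size)]

def label_pixels(pixel_features: torch.Tensor) -> torch.Tensor:
    """Assign a part label to each pixel by majority vote of the ensemble.

    pixel_features: (num_pixels, feature_dim) features from the GAN.
    Returns: (num_pixels,) tensor of predicted label indices.
    """
    with torch.no_grad():
        votes = torch.stack(
            [mlp(pixel_features).argmax(dim=-1) for mlp in ensemble]
        )  # shape: (ensemble_size, num_pixels)
        return votes.mode(dim=0).values
```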

The researchers trained the interpreter on generated images that were labelled by an experienced human annotator. The images were of bedrooms, cars, faces, birds, and cats, with between 16 and 40 examples of each class. They then used the full DatasetGAN system to generate image datasets, which were in turn used to train standard CV models. The team used several common CV benchmarks, such as Celeb-A and Stanford Cars, to compare the performance of models trained on the generated datasets against baseline models trained using current state-of-the-art transfer-learning and semi-supervised techniques. Given the same number of annotated images, the NVIDIA models "significantly" outperformed the baselines on all benchmarks.
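The generation step itself can be pictured with a short sketch: sample a latent vector, render an image while capturing its per-pixel features, and label those pixels with the trained interpreter. The `stylegan.synthesize` call below is a hypothetical stand-in, since the actual DatasetGAN code had not been released.

```python
def generate_dataset(stylegan, num_images, latent_dim=512):
    """Produce (image, label_map) pairs for downstream CV training.

    `stylegan.synthesize` is a hypothetical API assumed to return the
    rendered image plus the per-pixel feature vectors used to render it.
    """
    dataset = []
    for _ in range(num_images):
        z = torch.randn(1, latent_dim)
        image, pixel_features = stylegan.synthesize(z)
        # pixel_features: (H * W, feature_dim); label every pixel at once.
        labels = label_pixels(pixel_features)
        height, width = image.shape[-2:]
        dataset.append((image, labels.reshape(height, width)))
    return dataset
```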

The use of synthetic data for training AI is an active research topic, since it reduces the cost and labor associated with dataset creation. One common technique for mobile robot and autonomous vehicle training is to use virtual environments or even video games as a source of data. In 2015, researchers at the University of Massachusetts Lowell used crowdsourced CAD models to train image classifiers. In 2017, Apple developed a system to use a GAN for improving the quality of synthetic images for CV training, but this technique did not produce pixel-level semantic labels.

Although NVIDIA has open-sourced StyleGAN, the code for DatasetGAN has not been released. In a Twitter discussion about the work, co-author Huan Ling noted that the team is working on a release and hopes to meet the deadline for this year's NeurIPS conference.
