Meta AI Research open-sourced DINOv2, a foundation model for computer vision (CV) tasks. DINOv2 is pretrained on a curated dataset of 142M images and can be used as a backbone for several tasks, including image classification, video action recognition, semantic segmentation, and depth estimation.
Meta based the model on the Vision Transformer (ViT) architecture, with modifications for self-supervised learning objectives. To train the model, the team built an automated pipeline that curates a dataset of images scraped from the web. A major contribution of the work is an improved training process that is twice as fast and uses one-third the memory of previous approaches. When evaluated on CV benchmarks, DINOv2 outperformed other self-supervised learning (SSL) models and performed comparably to or better than weakly-supervised models. According to Meta:
Going forward, the team plans to integrate this model, which can function as a building block, in a larger, more complex AI system that could interact with large language models. A visual backbone providing rich information on images will allow complex AI systems to reason on images in a deeper way than describing them with a single text sentence. Models trained with text supervision are ultimately limited by the image captions. With DINOv2, there is no such built-in limitation.
Deep learning models for CV tasks have typically relied on large datasets of images with human annotations, such as ImageNet. In 2021, OpenAI released CLIP, a foundation model for CV that was trained using a form of weak supervision, where the annotations were automatically derived by scraping HTML tags and other web-based metadata associated with source images. That same year, Google published the ViT model, and Meta published their work on the original version of DINO, which combines ViT with self-supervised knowledge distillation, resulting in smaller models with comparable performance.
For DINOv2, Meta focused on gathering more training data and scaling up the training process. For the training data, Meta collected 1.2B unique images from the internet, then clustered them according to their similarity to the images in the ImageNet dataset for a final set of 142M images. To scale up training, Meta implemented a custom version of FlashAttention and used Fully-Sharded Data Parallel (FSDP) training with PyTorch. Overall, the project consumed about 200k GPU-days of compute.
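For readers unfamiliar with FSDP, the sketch below shows roughly what sharded ViT training looks like in PyTorch. The backbone, optimizer settings, and supervised placeholder loss are illustrative assumptions, not Meta's actual recipe, which also relies on a custom FlashAttention kernel and a self-supervised objective.

```python
# Minimal sketch of Fully-Sharded Data Parallel (FSDP) training in PyTorch.
# The backbone, optimizer settings, and supervised placeholder loss are
# illustrative assumptions, not DINOv2's actual self-supervised recipe.
# Launch with: torchrun --nproc_per_node=<num_gpus> train_fsdp.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torchvision.models import vit_l_16


def main():
    dist.init_process_group("nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Shard parameters, gradients, and optimizer state across GPUs;
    # bf16 mixed precision reduces memory use and communication volume.
    model = FSDP(
        vit_l_16(weights=None).cuda(),
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
        ),
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Stand-in batch; a real run would iterate over the curated image dataset.
    images = torch.randn(8, 3, 224, 224, device="cuda")
    labels = torch.randint(0, 1000, (8,), device="cuda")

    loss = torch.nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```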
To evaluate DINOv2's performance as a foundation model, the team tested it on a variety of CV tasks and compared it to several baseline SSL models as well as weakly-supervised models such as CLIP. On the ImageNet-1k classification task, DINOv2 showed a "very significant improvement" compared to other SSL models and also outperformed the weakly-supervised ones. It also set a new SSL state-of-the-art record on three video action recognition benchmarks and outperformed baselines on instance-level recognition benchmarks and on three monocular depth estimation benchmarks.
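Benchmarks like these are typically run with a linear-evaluation protocol: the pretrained backbone is frozen and only a lightweight classifier is trained on its features. The sketch below illustrates that protocol with scikit-learn on stand-in feature arrays; the array shapes and the 384-dimensional embedding size are assumptions for illustration and do not reproduce the paper's benchmark setup.

```python
# Sketch of a linear-evaluation protocol for a frozen SSL backbone:
# precompute image embeddings once, then fit a linear classifier on top.
# The random arrays below are stand-ins for features of a labeled dataset
# such as ImageNet-1k; shapes and dimensions are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
num_classes, feat_dim = 10, 384          # e.g. 384-d embeddings from a small ViT

# Stand-ins for features extracted with a frozen backbone (no gradients).
train_feats = rng.normal(size=(5000, feat_dim)).astype(np.float32)
train_labels = rng.integers(0, num_classes, size=5000)
val_feats = rng.normal(size=(1000, feat_dim)).astype(np.float32)
val_labels = rng.integers(0, num_classes, size=1000)

# The backbone stays frozen; only this linear classifier is trained.
clf = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
print("linear-probe accuracy:", accuracy_score(val_labels, clf.predict(val_feats)))
```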
In a Hacker News discussion about the work, several users praised Meta's recent work in computer vision as well as past contributions such as PyTorch. One did note a shift in Meta's communications around their work:
As a grad student in this field, Meta has always had great contributions to the open source machine learning effort, through no small effort of Yann LeCun's internal advocacy. What has changed recently is their PR strategy: [OpenAI] has basically shown everybody that it doesn't matter if you have the best models if your publicity sucks.
The DINOv2 code and models are available on GitHub. The project site hosts an interactive demo of several computer vision tasks using DINOv2.
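As a rough sketch of getting started, the released backbones can be loaded through torch.hub and used to extract image embeddings; the `dinov2_vits14` entry name and the preprocessing values below follow the project README and common ImageNet conventions, and should be verified against the repository.

```python
# Sketch: load a DINOv2 backbone via torch.hub and extract a global image
# embedding. The hub entry name and preprocessing follow the project README
# and common ImageNet conventions; verify them against the repository.
import torch
from torchvision import transforms
from PIL import Image

# Smallest published variant; larger ones (vitb14, vitl14, vitg14) load the same way.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),               # 224 is divisible by the 14-pixel patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # any local image
with torch.no_grad():
    features = model(image)                   # class-token embedding, shape (1, 384)
print(features.shape)
```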