Google Releases Objectron Dataset for 3D Object Recognition AI

Google Research announced the release of Objectron Dataset, a machine-learning dataset for 3D object recognition. The dataset contains 15k video segments and 4M images with ground-truth annotations, along with tools for using the data to train AI models.

Software engineers Adel Ahmadyan and Liangkai Zhang gave a high-level description of the dataset in a blog post. The dataset consists of 15,000 short video clips from a moving camera focused on a common household object. Each clip is annotated with 3D bounding boxes for the object, along with augmented reality (AR) metadata, including the pose of the camera and information about planar surfaces in the video. The dataset also contains 4M annotated single-frame images. Along with the dataset, Google has also released a new MediaPipe object-detection solution based on a subset of the data. According to Ahmadyan and Zhang,

By releasing this Objectron dataset, we hope to enable the research community to push the limits of 3D object geometry understanding. We also hope to foster new research and applications, such as view synthesis, improved 3D representation, and unsupervised learning.

The success of deep-learning models for 2D object recognition was driven in part by the availability of large-scale, high-quality datasets such as ImageNet and COCO. However, these datasets often require expensive and painstaking manual effort to produce ground-truth annotations for supervised learning. 2D annotations typically include a class label (what the object is) and a bounding box (where it is). Annotating objects in three dimensions or in video streams is even more difficult and expensive, but the resulting models can provide more information, including an object's orientation (or pose) and motion, making them useful in applications such as robotics and AR. While there are some publicly available datasets for training these models, many are intended for autonomous-vehicle applications, so the annotated objects are restricted to classes relevant to driving, such as pedestrians, cyclists, and vehicles.
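The extra information a 3D annotation carries can be made concrete. Where a 2D bounding box is four numbers, an oriented 3D box is commonly parameterized by a center, a per-axis scale, and a rotation (the parameterization below is an illustrative convention, not any dataset's exact schema), from which the box's eight corner vertices can be recovered:

```python
import numpy as np

def box_corners(center, scale, rotation):
    """Return the 8 corners of an oriented 3D bounding box.

    center:   (3,) box center in world coordinates
    scale:    (3,) box dimensions along its local axes
    rotation: (3, 3) rotation matrix from the box-local frame to the world frame
    """
    # Unit-cube corners in the box's local frame: every combination of +/- 0.5.
    local = np.array([[x, y, z]
                      for x in (-0.5, 0.5)
                      for y in (-0.5, 0.5)
                      for z in (-0.5, 0.5)])
    # Scale to the box dimensions, rotate into the world frame, then translate.
    return (rotation @ (local * scale).T).T + center

# An axis-aligned 2x2x2 box at the origin has corners at +/- 1 on every axis.
corners = box_corners(np.zeros(3), np.array([2.0, 2.0, 2.0]), np.eye(3))
print(corners.shape)  # (8, 3)
```

The rotation component is exactly what a 2D annotation cannot express; it is what makes pose estimation possible for downstream robotics and AR applications.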

Earlier this year, Google released MediaPipe Objectron, a 3D object-detection solution for MediaPipe, Google's open-source framework for ML applications using streaming media. The Objectron solution was based on the MobilePose deep-learning model for detecting object pose using a mobile phone camera. The model is small and fast enough to run in real time on a resource-limited device, but can only recognize two object classes: shoes and chairs. The new solution uses an updated model architecture and can recognize four object classes: shoes, chairs, mugs, and cameras.

The full Objectron dataset has annotations for nine object classes: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes. The total dataset size is around 4.4TB; the data is available in tf.record format for training with TensorFlow or PyTorch. To label the data, including object position and pose, Google built a smartphone-based tool that lets annotators quickly draw bounding boxes around objects. Google also created synthetic data by using AR techniques to render virtual objects into real images; this synthetic data increased the Objectron model's accuracy "by about 10%." To help other researchers develop their own models based on the dataset, Google has also released the source code for its accuracy-metric algorithm.
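The article does not spell out the metric, but the standard accuracy measure for 3D detection is 3D intersection-over-union (IoU) between predicted and ground-truth boxes. As a simplified sketch, assuming axis-aligned boxes (a full metric like the one Google released must also handle oriented boxes), 3D IoU can be computed as:

```python
import numpy as np

def iou_3d_axis_aligned(box_a, box_b):
    """3D IoU for axis-aligned boxes, each given as a (min_xyz, max_xyz) pair.

    Simplified sketch: real 3D-detection metrics also handle rotated boxes,
    which requires intersecting arbitrary convex polyhedra.
    """
    min_a, max_a = box_a
    min_b, max_b = box_b
    # Overlap along each axis (zero when the boxes are disjoint on that axis).
    overlap = np.maximum(0.0, np.minimum(max_a, max_b) - np.maximum(min_a, min_b))
    intersection = overlap.prod()
    vol_a = (max_a - min_a).prod()
    vol_b = (max_b - min_b).prod()
    return intersection / (vol_a + vol_b - intersection)

a = (np.zeros(3), np.ones(3))            # unit cube at the origin
b = (np.full(3, 0.5), np.full(3, 1.5))   # unit cube shifted 0.5 on each axis
print(iou_3d_axis_aligned(a, b))         # 0.125 / (1 + 1 - 0.125) ~= 0.0667
```

Intuition carries over from the 2D case, but volumes replace areas, which is why even a modest shift in three dimensions produces a much lower IoU than the same shift would in 2D.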

The Objectron dataset is available to download from Google Cloud storage, while supporting scripts and tutorials are available on GitHub. Instructions for using the MediaPipe solution are available on the MediaPipe website.