3D Point Cloud Object from Text Prompts Using Diffusion Models

OpenAI recently released Point-E, an alternative method for generating 3D objects from text prompts. It takes less than two minutes on a single GPU, whereas other methods can take a few GPU-hours. The new model is based on diffusion models, a family of generative models that includes GLIDE and Stable Diffusion.

The model pipeline starts by generating a synthetic view conditioned on a text prompt. Next, it generates a coarse 3D point cloud (1,024 points) conditioned on the synthetic view. Finally, it produces a fine 3D point cloud (4,096 points) conditioned on the low-resolution point cloud and the synthetic view (see the figure below).

Source: Point·E: A System for Generating 3D Point Clouds from Complex Prompts

First, a diffusion model called GLIDE generates an image from the text prompt. To build the training data for the image-to-3D stage, the authors used Blender, an open-source 3D computer graphics tool, to render each object from 20 camera angles as images with a depth channel. To match the 3D point clouds to the images, a 3D point is computed for each pixel in each depth image. Finally, some point-cloud post-processing is applied to the data for better performance.
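The pixel-to-point step described above can be illustrated with a small back-projection sketch. It assumes a simple pinhole camera; the function name `depth_to_points` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and are not taken from the Point-E code or paper.

```python
import numpy as np


def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image into camera-space 3D points.

    Assumes a pinhole camera: a pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx and y = (v - cy) * z / fy.
    Pixels with non-positive depth are treated as background and dropped.
    """
    height, width = depth.shape
    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


# Toy 4x4 depth map: a 2x2 "object" one unit away from the camera.
depth = np.zeros((4, 4))
depth[1:3, 1:3] = 1.0
points = depth_to_points(depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
print(points.shape)  # (4, 3): one 3D point per foreground pixel
```

Repeating this for every rendered view, and transforming each set of points by that view's camera pose, would give a dense point cloud for the whole object.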

The next step in the pipeline connects the point cloud to the text-conditioned image model mentioned earlier. The deep-learning model used is a transformer that generates a colored 3D point cloud probabilistically (the figure below shows the full model pipeline).
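To give a feel for what "probabilistic" means in a diffusion setting, here is a minimal sketch of the standard forward (noising) process applied to a colored point cloud. The linear beta schedule, the step count, and the helper names `make_alpha_bar` and `add_noise` are assumptions for illustration and are not taken from Point-E. The learned transformer would handle the reverse direction, which is not shown here.

```python
import numpy as np


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for a linear beta schedule (an assumption)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t from N(sqrt(a_t) * x0, (1 - a_t) * I), with a_t = alpha_bar[t]."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise


rng = np.random.default_rng(0)
# A stand-in "point cloud": 1024 points with (x, y, z, R, G, B) channels.
x0 = rng.uniform(-0.5, 0.5, size=(1024, 6))
alpha_bar = make_alpha_bar(num_steps=1000)

slightly_noisy = add_noise(x0, t=10, alpha_bar=alpha_bar, rng=rng)
almost_pure_noise = add_noise(x0, t=999, alpha_bar=alpha_bar, rng=rng)
```

Sampling runs this process in reverse: it starts from random noise and denoises step by step, so different random seeds yield different point clouds for the same prompt.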

Source: Point·E: A System for Generating 3D Point Clouds from Complex Prompts

For point-cloud upsampling, a transformer takes the lower-resolution point cloud as input and increases the resolution of the final 3D point cloud.

Once they have a higher-resolution 3D point cloud, the authors convert it into textured meshes and render these meshes in Blender. The process uses a regression model to predict the signed distance field (SDF) of an object given its point cloud, and then applies marching cubes to the resulting SDF to extract a mesh. The color assignment uses the “nearest neighbor” method to match each vertex to the nearest point from the original point cloud.
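The nearest-neighbor color assignment can be sketched in a few lines of NumPy. This is a brute-force illustration with a hypothetical helper name, `color_vertices`, and made-up toy data. It is not the code used by the authors.

```python
import numpy as np


def color_vertices(vertices, points, colors):
    """Give each mesh vertex the color of its nearest point-cloud point."""
    # Pairwise squared distances, shape (num_vertices, num_points).
    diff = vertices[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    nearest = np.argmin(sq_dist, axis=1)
    return colors[nearest]


# Two colored points (red and blue) and three mesh vertices.
points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
vertices = np.array([[0.1, 0.0, 0.0], [0.9, 0.1, 0.0], [0.4, 0.0, 0.0]])

vertex_colors = color_vertices(vertices, points, colors)
print(vertex_colors)  # rows: red, blue, red
```

The brute-force distance matrix grows with the number of vertices times the number of points, so for large meshes a spatial index such as a k-d tree would likely be a better fit.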

Source: Point·E: A System for Generating 3D Point Clouds from Complex Prompts

Earlier this year, Google released DreamFusion, an expanded version of Dream Fields, a generative 3D system that the company unveiled in 2021. Comparing DreamFusion with Point-E on a semantic metric called R-Precision, the table above shows that DreamFusion performs better in that regard, i.e., it understands text prompts better and its generated objects have better resolution. However, Point-E is much faster at outputting a 3D point-cloud object.
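As a rough illustration of a retrieval-style metric such as R-Precision, the sketch below checks how often an object's own prompt ranks among its top-R most similar prompts. The function name `r_precision` and the 2D embeddings are made up for the example. In practice the embeddings would come from a joint image-text model applied to renderings and prompts, which is not shown here.

```python
import numpy as np


def r_precision(object_embeddings, text_embeddings, r=1):
    """Fraction of objects whose own prompt ranks in the top r by cosine similarity.

    Row i of both arrays is assumed to describe the same (object, prompt) pair.
    """
    obj = object_embeddings / np.linalg.norm(object_embeddings, axis=1, keepdims=True)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    similarity = obj @ txt.T  # shape (num_objects, num_prompts)
    # Indices of the r most similar prompts for each object.
    top_r = np.argsort(-similarity, axis=1)[:, :r]
    targets = np.arange(len(obj))[:, None]
    return float(np.mean(np.any(top_r == targets, axis=1)))


# Made-up embeddings: objects 0 and 1 match their prompts, object 2 does not.
text_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
object_embeddings = np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 1.0]])

score = r_precision(object_embeddings, text_embeddings, r=1)
```

Here two of the three objects retrieve their own prompt first, so the score is 2/3. A higher score suggests that the generated objects reflect their prompts more faithfully.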

The limitations of Point-E are the limited texture detail and low resolution of its 3D point-cloud objects. It also relies on synthetic renderings, which could be replaced by conditioning on real-world images. Finally, its semantic understanding of text prompts is not as good as that of other state-of-the-art 3D generation models.

OpenAI released an open-source implementation of Point-E in PyTorch. For instance, to generate a 3D object from a text prompt using Point-E, the following script can be helpful:

import torch
from tqdm.auto import tqdm

from point_e.diffusion.configs import DIFFUSION_CONFIGS, diffusion_from_config
from point_e.diffusion.sampler import PointCloudSampler
from point_e.models.download import load_checkpoint
from point_e.models.configs import MODEL_CONFIGS, model_from_config
from point_e.util.plotting import plot_point_cloud

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print('creating base model...')
base_name = 'base40M-textvec'
base_model = model_from_config(MODEL_CONFIGS[base_name], device)
base_model.eval()
base_diffusion = diffusion_from_config(DIFFUSION_CONFIGS[base_name])

print('creating upsample model...')
upsampler_model = model_from_config(MODEL_CONFIGS['upsample'], device)
upsampler_model.eval()
upsampler_diffusion = diffusion_from_config(DIFFUSION_CONFIGS['upsample'])

print('downloading base checkpoint...')
base_model.load_state_dict(load_checkpoint(base_name, device))

print('downloading upsampler checkpoint...')
upsampler_model.load_state_dict(load_checkpoint('upsample', device))

sampler = PointCloudSampler(
    device=device,
    models=[base_model, upsampler_model],
    diffusions=[base_diffusion, upsampler_diffusion],
    num_points=[1024, 4096 - 1024],
    aux_channels=['R', 'G', 'B'],
    guidance_scale=[3.0, 0.0],
    model_kwargs_key_filter=('texts', ''), # Do not condition the upsampler at all
)

# Set a prompt to condition on.
prompt = 'a red motorcycle'

# Produce a sample from the model.
samples = None
for x in tqdm(sampler.sample_batch_progressive(batch_size=1, model_kwargs=dict(texts=[prompt]))):
    samples = x

pc = sampler.output_to_point_clouds(samples)[0]
fig = plot_point_cloud(pc, grid_size=3, fixed_bounds=((-0.75, -0.75, -0.75),(0.75, 0.75, 0.75)))

Source: OpenAI

The release has created a buzz on Twitter, where people seem enthusiastic about the model's speed.

On Reddit, people seem equally enthusiastic about the fast generation of 3D point clouds from text prompts.

If you want to try a demo, one is available on Hugging Face.

About the Author

Bruno Santos
