Generative AI for 3D content in XR and beyond

Raju Kandaswamy and Kuldeep Singh

Published: August 08, 2023

Conversations around generative AI are dominated by large language models (LLMs), which aim to revolutionize text-based communication. However, advancements in the text-to-image space, powered by latent diffusion models (LDMs) such as DALL-E from OpenAI and Stable Diffusion from Stability AI, open up great possibilities in hitherto untapped areas.

In simple terms, LDMs can convert natural language inputs into images that range from photorealistic photographs to ultra-realistic art. An LDM's building blocks are training (or learning) and inference.

In the learning phase, noise drawn from a mathematical distribution such as a Gaussian is added, step by step, to training images that are paired with latent text descriptions. This is known as a forward pass. A neural network is trained on these noisy images to predict the noise that was added. In this process, the network models what a latent space description may look like in the form of a very noisy image representation.


Figure 1: Forward pass → Photo blur to noise
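The forward pass can be sketched in a few lines of Python. This is a minimal illustration rather than the training code of any particular model. It assumes the closed-form noising rule used by DDPM-style diffusion models, a linear noise schedule and a toy four-value "image". The helper names `make_alpha_bars` and `add_noise` are hypothetical.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, alpha_bar, rng):
    """Blend an image with Gaussian noise: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [
        math.sqrt(alpha_bar) * p + math.sqrt(1.0 - alpha_bar) * n
        for p, n in zip(pixels, noise)
    ]
    return noisy, noise


rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
image = [0.9, -0.2, 0.4, 0.1]  # toy "image" with values scaled to [-1, 1]

# Early step: the image is almost untouched. Last step: almost pure noise.
slightly_noisy, _ = add_noise(image, alpha_bars[0], rng)
almost_pure_noise, _ = add_noise(image, alpha_bars[-1], rng)
```

During training, the network would see the noisy values together with the step index and the text description, and learn to predict the `noise` list that `add_noise` returns.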

In the inference phase, the learning is applied through a reverse pass. The reverse pass starts from random noise, which is then denoised step by step using a sampling technique such as Euler or PLMS. In each step, the model tries to bring the latent space description provided in the prompt into the image by denoising. Typically, within 10-15 steps the image will have most of the features described in the prompt. Every additional step brings more clarity and detail into the image.

Figure 2: Reverse pass → Noise to photo for the prompt “a red rose”
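The step-by-step denoising loop can be sketched as follows. This is a toy example under strong assumptions: the trained network is replaced by a hypothetical `oracle_denoiser` that already knows the clean target, the "image" is a single three-value pixel, and the noise levels in `sigmas` are an arbitrary linear schedule. Only the Euler update itself mirrors what a real sampler does.

```python
import random


def oracle_denoiser(noisy, sigma, target):
    """Stand-in for the trained network: it simply returns the clean target.

    A real model would predict this from `noisy`, `sigma` and the prompt.
    """
    return list(target)


def euler_sample(target, sigmas, rng):
    """Run Euler denoising steps from pure noise down to the last noise level."""
    # Start from Gaussian noise scaled to the largest noise level.
    x = [rng.gauss(0.0, 1.0) * sigmas[0] for _ in target]
    history = [list(x)]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = oracle_denoiser(x, sigma, target)
        # Estimated rate of change of the sample with respect to the noise level.
        slope = [(xi - di) / sigma for xi, di in zip(x, denoised)]
        # Euler step: follow that slope as the noise level drops.
        x = [xi + si * (sigma_next - sigma) for xi, si in zip(x, slope)]
        history.append(list(x))
    return x, history


target = [0.8, 0.1, 0.1]  # toy reddish pixel standing in for the prompt's content
sigmas = [10.0 * (1 - i / 15) for i in range(16)]  # 15 steps, ending at 0
result, history = euler_sample(target, sigmas, random.Random(0))
```

Each entry in `history` is one intermediate sample, so printing it shows the values moving from noise towards the target as the steps progress.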

Knowing that, let’s see how we can generate 3D models with text inputs.

Creating 3D content with generative AI

Figure 3: Workflow for creating 3D content with LDMs

Creating 3D models using LDMs involves four fundamental steps. Here, we discuss how you can optimize each step to generate the output you need.

#1 Prompt Engineering for Photography

Creating anything with generative AI begins with the text input, called a prompt. Writing clear and specific prompts will generate better outputs. To do that in LDMs, it’s important to bring best practices from photography to prompt engineering nomenclature.

In LDMs such as Stable Diffusion, the latent space description is formed by tokens. A token can be a simple English word, a name or any number of technical parameters. Here are some tokens you can use to produce good-looking images.

Photography tokens

Here is an indicative list of tokens with examples:

<subject/object description with pose> - An “Indian girl jumping”

<subject alignment> - An Indian girl jumping, “centered”

<detailing> - An Indian girl jumping in the middle “of a flower garden”

<lighting> - An Indian girl jumping in the middle of a flower garden “daylight, sun rays”

<resolution> - An Indian …, “Ultra high definition”

<camera angle> - An Indian …, “Aerial Shot”

<camera type> - An Indian …, “DSLR photo”

<lens parameters> - An Indian …, “f/1.4”

<composition> - An Indian … “at the edge of a lake”

<fashion> - An Indian … “South Indian clothing”

<subject size> - An Indian … “close-up shot”

<studio setting> - An Indian … “Studio lighting”

<background> - An Indian …, “background rocky mountain range”
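One way to keep such tokens organized is to assemble the prompt programmatically. The sketch below is only an illustration: `build_prompt` is a hypothetical helper, and joining tokens with commas is a common convention rather than a requirement of any model.

```python
def build_prompt(subject, *tokens):
    """Join a subject description and optional tokens into one prompt string."""
    parts = [subject.strip()]
    # Drop empty tokens so optional slots can simply be left blank.
    parts += [t.strip() for t in tokens if t and t.strip()]
    return ", ".join(parts)


prompt = build_prompt(
    "An Indian girl jumping in the middle of a flower garden",
    "centered",               # subject alignment
    "daylight, sun rays",     # lighting
    "ultra high definition",  # resolution
    "DSLR photo",             # camera type
    "f/1.4",                  # lens parameters
)
print(prompt)
```

Keeping each token in its own argument makes it easy to swap a single aspect, such as the lighting, while leaving the rest of the prompt unchanged.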

Art tokens

When used with the photography tokens, the following tokens can shape the artistic direction of the output:

#2 Photo generation

In addition to tokens and nomenclature, the generative AI model’s inference (of input) is influenced by the following four parameters:

The text prompt
The sampling method used for denoising (Euler, PLMS or DDIM)
The number of steps (typically ranges from 20 to 200)
The CFG (classifier-free guidance) scale. This sets the extent to which the AI will adhere to a given text prompt. The lower the value, the more “freedom” the AI has to bring in elements further from the prompt
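The effect of the CFG scale can be illustrated with the standard classifier-free guidance formula, which mixes the model's noise prediction made without the prompt and the one made with it. The sketch below assumes that formula and uses made-up numbers in place of real model outputs. `apply_cfg` is a hypothetical helper name.

```python
def apply_cfg(noise_uncond, noise_cond, cfg_scale):
    """Combine unconditional and prompt-conditioned noise predictions."""
    return [
        u + cfg_scale * (c - u)
        for u, c in zip(noise_uncond, noise_cond)
    ]


noise_uncond = [0.10, -0.30, 0.50]  # made-up prediction without the prompt
noise_cond = [0.20, -0.10, 0.40]    # made-up prediction with the prompt

ignores_prompt = apply_cfg(noise_uncond, noise_cond, cfg_scale=0.0)
plain_conditional = apply_cfg(noise_uncond, noise_cond, cfg_scale=1.0)
strongly_guided = apply_cfg(noise_uncond, noise_cond, cfg_scale=7.5)
```

A scale of 0 ignores the prompt entirely, 1 uses the prompt-conditioned prediction as it is, and larger values push each denoising step further in the direction the prompt suggests.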

Here’s what the AI produced for the following prompt:

Prompt: “3d tiny isometric of a modern western living room in a (((cutaway box))), minimalist style, centered, warm colors, concept art, black background, 3d rendering, high resolution, sunlight, contrast, cinematic 8k, architectural rendering, trending on ArtStation, trending on CGSociety”

Photo generated by DreamShaper (Stable Diffusion 1.5)

#3 3D synthesis

Once we have the photograph, the next step is to create continuous volumetric scenes from sparse sets of photographs/views. We used Google’s DreamFusion to initialize a Neural Radiance Fields (NeRF) model with a single photograph.

DreamFusion takes NeRF as a building block and uses a mathematical process called probability density distillation to refine the initial 3D model formed from a single photograph. It then uses gradient descent to adjust the 3D model until it fits the 2D image as closely as possible when viewed from random angles. Probability density distillation leverages the knowledge learned by the 2D image model to improve the creation of the 3D model. The output from this step is a point cloud.

Another approach is to construct 3D models using monocular depth estimation models. The first step in this approach is to estimate depth from the single (monocular) photo. The next step is to use a 3D point cloud encoder to predict and correct the depth shift to construct a realistic 3D scene shape. This approach is faster and can be executed even on low-end hardware, but the constructed 3D model may be incomplete.
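To make the depth-based approach more concrete, the sketch below back-projects a depth map into a point cloud. It assumes a simple pinhole camera whose intrinsics (`fx`, `fy`, `cx`, `cy`) are known, and it uses a tiny hand-written depth map instead of the output of a depth estimation model. The depth-shift correction described above is not included, and `depth_to_point_cloud` is a hypothetical helper name.

```python
def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (rows of depths) into a list of (X, Y, Z) points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip pixels with no valid depth
                continue
            # Pinhole model: pixel offset from the principal point, scaled by depth.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


depth_map = [
    [2.0, 2.0, 0.0],  # 0.0 marks a pixel with no depth estimate
    [2.0, 1.0, 2.0],
]
cloud = depth_to_point_cloud(depth_map, fx=1.0, fy=1.0, cx=1.0, cy=0.5)
```

Because only surfaces visible in the photo receive a depth value, the resulting cloud covers one side of the scene, which is why a model built this way may be incomplete.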

#4 3D mesh model

The NeRF model constructed in the previous step provides a point cloud of the scene. This point cloud is meshed using a technique such as Marching Cubes or Poisson surface reconstruction to produce the mesh model. The texture for the mesh model is generated from the RGB color values of the point cloud.

3D mesh model based on the photograph generated above

In a noise-to-photo generation model like Stable Diffusion, users have little control over the generation process. Even with identical parameters, a given prompt generates a new variant of the image on each run unless the random seed is also fixed. To enhance an existing photograph without losing its shape and geometry details, ControlNet models are helpful.

ControlNet uses the outputs of typical image processing techniques, such as Canny edges, Hough lines, HED boundaries and depth maps, to preserve shape and geometry information during the generation process. It can boost the productivity of content designers as they iterate over combinations of styles, materials, lighting, etc.

Variants produced using a ControlNet model preserving the shape, geometry and pose

Latent diffusion models and generative AI can not only transform XR applications but also prove useful in segments such as media and entertainment, and in building simulation environments for autonomous vehicles.

In conclusion, the realm of 3D creation is undergoing a remarkable transformation. This paradigm shift opens up new opportunities for individual expression while streamlining the creative process and easing constraints of time, budget and manual labor.

We expect generative AI-led 3D modeling to forge ahead and explore innovative functionalities. 3D authoring could become an effortless experience in which people, irrespective of budget and industry, can unleash their creative potential.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
