We present 3DiM, a diffusion model for 3D novel view synthesis, which can translate a single input view into consistent and sharp completions across many views. The core component of 3DiM is a pose-conditional image-to-image diffusion model, which is trained to take a source view and its pose as inputs and generate a novel view for a target pose as output. 3DiM can then generate multiple views that are approximately 3D consistent using a novel technique called stochastic conditioning. At inference time, the output views are generated autoregressively. When generating each novel view, one selects a random conditioning view from the set of previously generated views at each denoising step. We demonstrate that stochastic conditioning significantly improves 3D consistency compared to a naive sampler for an image-to-image diffusion model, which conditions on a single fixed view. We compare 3DiM to prior work on the SRN ShapeNet dataset, demonstrating that 3DiM's generated completions from a single view achieve much higher fidelity while being approximately 3D consistent. We also introduce a new evaluation methodology, 3D consistency scoring, which quantifies the 3D consistency of a generated object by training a neural field on the model's output views. 3DiM is geometry-free, does not rely on hyper-networks or test-time optimization for novel view synthesis, and allows a single model to easily scale to a large number of scenes.
Anonymous Authors
We include samples from a single 3DiM trained on all of ShapeNet. Unlike the models studied in the paper, this 3DiM only takes the relative pose as input, rather than source and target absolute poses. This allows us to test the model on arbitrary images, as we only ask for relative changes in pose.
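As a rough illustration of why relative poses make the model applicable to arbitrary images, the sketch below computes a relative pose from two absolute poses. It assumes poses are 4x4 camera-to-world matrices, which is our assumption for illustration and not necessarily the parameterization used by 3DiM; the helper names `relative_pose` and `rotation_about_y` are hypothetical.

```python
import numpy as np


def relative_pose(source_c2w, target_c2w):
    """Pose of the target camera expressed in the source camera's frame.

    Assumes both inputs are 4x4 camera-to-world matrices (an assumption
    made for this sketch, not taken from the 3DiM description).
    """
    return np.linalg.inv(source_c2w) @ target_c2w


def rotation_about_y(angle_rad, translation=(0.0, 0.0, 0.0)):
    """Hypothetical helper: a rigid transform rotating about the y axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    pose = np.eye(4)
    pose[:3, :3] = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    pose[:3, 3] = translation
    return pose


source = rotation_about_y(0.3, (1.0, 0.0, 2.0))
target = rotation_about_y(1.1, (0.0, 0.5, 2.0))

# Moving the whole scene (and both cameras) by any rigid transform
# leaves the relative pose unchanged.
world_shift = rotation_about_y(-0.7, (3.0, -1.0, 0.5))
same = np.allclose(
    relative_pose(source, target),
    relative_pose(world_shift @ source, world_shift @ target),
)
```

Because the relative pose does not depend on the world coordinate frame, under this assumption the absolute pose of an internet image never needs to be known; only the desired change in viewpoint is specified.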
To increase robustness, the training view pairs are rendered with objects at random orientations and varying scales. We rendered 128 pairs (256 views) for each training object using Kubric. We also add random hue augmentation during training. The model was trained for 420K steps at batch size 256 and has 471M parameters. We use 256 denoising steps at generation time.
We include some results on images we found on the internet that have white backgrounds and correspond to some of the ShapeNet classes. We also include failure modes for images that are too out-of-distribution for the model to handle well.
We show samples from a single 3DiM trained on all of ShapeNet. We rendered 250 views for each asset and trained a 471M parameter 3DiM. Videos are sampled from a single input image, with 256 denoising steps.
We compare against prior state-of-the-art methods on novel view synthesis from few images on the SRN ShapeNet benchmark. The methods whose outputs we could acquire all guarantee 3D consistency, due to the use of volume rendering (unlike 3DiM). We render the same trajectories given the same conditioning image.
Prior methods directly regress outputs, often leading to severe blurriness. We show that 3DiM overcomes this problem: it is a generative model by design, and diffusion models have a natural inductive bias towards generating much sharper samples. Below we show more samples from the 3DiMs we trained for prior work comparisons: a 471M-parameter 3DiM for cars and a 1.3B-parameter 3DiM for chairs.
Generation with 3DiM -- We propose stochastic conditioning, a new sampling strategy in which we generate views autoregressively with an image-to-image diffusion model. At each denoising step, we condition on a randomly chosen previous view, so that, given enough denoising steps, the denoising process is guided toward 3D consistency with all previous frames.
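The sampling loop can be sketched roughly as follows. This is a minimal illustration of the control flow only, not the actual 3DiM sampler: `toy_denoise_step` is a hypothetical stand-in for one reverse-diffusion step of a trained pose-conditional model, the function signatures are our own, and we assume the input view is part of the conditioning pool alongside previously generated views.

```python
import numpy as np


def toy_denoise_step(noisy, cond_view, cond_pose, target_pose, step, num_steps):
    """Hypothetical stand-in for one denoising step of a trained model.

    It ignores the poses and simply nudges the sample toward the
    conditioning view; a real model would predict and remove noise.
    """
    weight = 1.0 / (num_steps - step)
    return (1.0 - weight) * noisy + weight * cond_view


def sample_views(input_view, input_pose, target_poses, denoise_step,
                 num_steps=256, seed=0):
    """Autoregressive generation with stochastic conditioning (sketch)."""
    rng = np.random.default_rng(seed)
    # Conditioning pool: the input view plus every view generated so far
    # (including the input view here is an assumption of this sketch).
    views, poses = [input_view], [input_pose]
    for target_pose in target_poses:
        sample = rng.standard_normal(input_view.shape)  # start from noise
        for step in range(num_steps):
            # Pick a different random conditioning view at each step.
            k = rng.integers(len(views))
            sample = denoise_step(sample, views[k], poses[k],
                                  target_pose, step, num_steps)
        views.append(sample)
        poses.append(target_pose)
    return views[1:]


input_view = np.full((8, 8, 3), 0.5)
generated = sample_views(input_view, np.eye(4), [np.eye(4)] * 3,
                         toy_denoise_step, num_steps=16)
```

The key point is that the conditioning view is resampled inside the inner loop, at every denoising step, while a naive sampler would fix a single conditioning view for the whole trajectory.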
X-UNet -- Our proposed changes to the image-to-image UNet, which we show are critical for achieving high-quality results.