TV-3DG: Mastering Text-to-3D Customized Generation with Visual Prompt

1Harbin Institute of Technology. 2Space AI, Li Auto Inc. 3School of Software, Tsinghua University. 4University of Science and Technology of China. 5Hefei University of Technology, China. 6Department of Automation, Tsinghua University.

TV-3DG is a novel customized generation framework that leverages text descriptions and single image guidance to produce high-quality and intricately stylized 3D assets.

Abstract

In recent years, advancements in generative models have significantly expanded the capabilities of text-to-3D generation. Many approaches rely on Score Distillation Sampling (SDS). However, SDS struggles to accommodate multi-condition inputs, such as text and visual prompts, in customized generation tasks. To understand why, we decompose SDS into a difference term and a classifier-free guidance term. Our analysis traces the problem to the difference term and to the random noise added during optimization, both of which cause deviations from the target mode during distillation. To address this, we propose a novel algorithm, Classifier Score Matching (CSM), which removes the difference term in SDS and uses a deterministic noise addition process to reduce noise during optimization, effectively overcoming the low-quality limitations of SDS in our customized generation framework. Based on CSM, we integrate visual prompt information with an attention fusion mechanism and sampling guidance techniques, forming the Visual Prompt CSM (VPCSM) algorithm. Furthermore, we introduce a Semantic-Geometry Calibration (SGC) module to enhance quality through improved integration of textual information. We present our approach as TV-3DG, with extensive experiments demonstrating its capability to achieve stable, high-quality, customized 3D generation.
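The decomposition mentioned in the abstract can be illustrated numerically. The sketch below is not the authors' implementation: it assumes one common way of writing classifier-free guidance (unconditional prediction plus a scaled conditional offset), uses random arrays in place of a real diffusion model's noise predictions, and all function and variable names are hypothetical. It only shows that the SDS residual splits exactly into a difference term and a guidance term, and that dropping the difference term leaves the guidance term alone. The deterministic noise addition that CSM also uses is not modeled here.

```python
import numpy as np


def cfg_prediction(eps_cond, eps_uncond, scale):
    """Classifier-free guided noise prediction (assumed convention)."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def split_sds_residual(eps_cond, eps_uncond, noise, scale):
    """Split the SDS residual (guided prediction minus injected noise)
    into a difference term and a classifier-free guidance term."""
    difference = eps_uncond - noise              # depends on the random noise sample
    guidance = scale * (eps_cond - eps_uncond)   # depends only on the two predictions
    return difference, guidance


# Stand-ins for a real model's outputs on a noised rendering (hypothetical shapes).
rng = np.random.default_rng(0)
shape = (4, 8, 8)
eps_cond = rng.standard_normal(shape)    # prediction given the text prompt
eps_uncond = rng.standard_normal(shape)  # prediction given the empty prompt
noise = rng.standard_normal(shape)       # noise injected into the rendering
scale = 7.5                              # guidance scale (illustrative value)

sds_residual = cfg_prediction(eps_cond, eps_uncond, scale) - noise
difference, guidance = split_sds_residual(eps_cond, eps_uncond, noise, scale)

# The two terms add up to the full SDS residual.
assert np.allclose(sds_residual, difference + guidance)

# Removing the difference term, as the abstract describes, keeps only the guidance term.
csm_like_residual = guidance
```

Under this convention the difference term is the only part that contains the sampled noise directly, which is consistent with the abstract's observation that the difference term and random noise addition are the sources of deviation.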

Figure 1. TV-3DG generates customized 3D content from a text prompt and a reference image, covering both the text-to-3D task and the 3D stylized generation task.


Customized Results of TV-3DG

Text Prompts: A DSLR photo of a standing golden retriever(, wearing red hat). ; A 3D model of an adorable cottage with a thatched roof. ; A chef is making pizza dough in the kitchen. ; A woman is doing squats with a kettlebell in a fitness studio, HD, 4K. ; A rabbit, high detail 3d model. ; A sleek, silver SUV. ; The Black Wukong as visual prompt under different text prompts. ; A DSLR photo of a classic Packard car.


Visualization

Visual results of TV-3DG with various customized text and reference visual prompts. We extend our gratitude to the Civitai community for providing some of the intricate reference images.


Qualitative Comparisons of 3D Stylized Generation Task

Comparison between our method and established baselines on the 3D stylized generation task. The results indicate that our approach produces well-stylized 3D assets. Because the VP3D baseline is not open-sourced, we compare against the results shown in its official demo; this corresponds to the example in the top-left corner: "A rabbit, high detailed 3D model". Please zoom in to view details.


Qualitative Comparisons of Text-to-3D Generation Task

Comparison between our method and existing baselines on the text-to-3D task. The results show that our method generates complex 3D content that closely follows the provided text prompts, with high fidelity and fine detail. Please zoom in to view details.