TeCH: Text-guided Reconstruction of
Lifelike Clothed Humans

1State Key Lab of CAD & CG, Zhejiang University 2Max Planck Institute for Intelligent Systems, Tübingen, Germany
3Mohamed bin Zayed University of Artificial Intelligence 4Peking University

3DV 2024

Given a single image, TeCH reconstructs a lifelike 3D clothed human. “Lifelike” refers to 1) a detailed full-body geometry, including facial features and clothing wrinkles, in both frontal and unseen regions, and 2) a high-quality texture with consistent color and intricate patterns.

Abstract

Despite recent research advancements in reconstructing clothed humans from a single image, accurately restoring the "unseen regions" with high-level details remains an unsolved challenge that lacks attention. Existing methods often generate overly smooth back-side surfaces with blurry textures. But how can we effectively capture, from a single image, enough of an individual's visual attributes to reconstruct unseen areas (e.g., the back view)? Motivated by the power of foundation models, TeCH reconstructs the 3D human by leveraging 1) descriptive text prompts (e.g., garments, colors, hairstyles), which are automatically generated via a garment parsing model and Visual Question Answering (VQA), and 2) a personalized fine-tuned Text-to-Image (T2I) diffusion model, which learns the "indescribable" appearance. To represent high-resolution 3D clothed humans at an affordable cost, we propose a hybrid 3D representation based on DMTet, which consists of an explicit body shape grid and an implicit distance field. Guided by the descriptive prompts and the personalized T2I diffusion model, the geometry and texture of the 3D human are optimized through multi-view Score Distillation Sampling (SDS) and reconstruction losses based on the original observation. TeCH produces high-fidelity 3D clothed humans with consistent, delicate textures and detailed full-body geometry. Quantitative and qualitative experiments demonstrate that TeCH outperforms state-of-the-art methods in terms of reconstruction accuracy and rendering quality. The code will be publicly available for research purposes at https://github.com/huangyangyi/TeCH.
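
As a concrete illustration of the prompt generation step, the following minimal sketch parses human attributes from the input image with BLIP VQA and prepends a DreamBooth-style identifier token. The question list, checkpoint, and function names are illustrative assumptions, not the released pipeline.

# Minimal sketch (assumptions, not the released pipeline): parse human attributes
# with BLIP VQA and assemble a prompt of the form "[V]" + parsed attributes.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

QUESTIONS = [  # pre-defined questions (illustrative)
    "What is this person wearing on the upper body?",
    "What color is the upper-body garment?",
    "What is this person's hairstyle?",
]

def build_prompt(image_path, token="[V]"):
    image = Image.open(image_path).convert("RGB")
    answers = []
    for question in QUESTIONS:
        inputs = processor(image, question, return_tensors="pt")
        output = model.generate(**inputs)
        answers.append(processor.decode(output[0], skip_special_tokens=True))
    # Identifier token for the fine-tuned T2I model, followed by the parsed attributes
    return "a photo of " + token + " person, " + ", ".join(answers)

print(build_prompt("input.jpg"))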


Intro Video (YouTube)

Intro Video (Bilibili)

Method Overview

TeCH takes an image $\mathcal{I}$ of a human as input. Text guidance is constructed by $\textbf{(a)}$ using a garment parsing model (SegFormer) and a VQA model (BLIP) to parse the human attributes $A$ with pre-defined questions $Q$, and $\textbf{(b)}$ embedding the subject-specific appearance into DreamBooth $\mathcal{D}'$ as a unique token $[V]$. Next, TeCH represents the 3D clothed human with $\textbf{(c)}$ a hybrid DMTet initialized from SMPL-X, and optimizes both geometry and texture using $\mathcal{L}_\text{SDS}$ guided by the prompt $P=[V]+P_\text{VQA}(A)$. During the optimization, $\mathcal{L}_\text{recon}$ ensures consistency with the input view, $\mathcal{L}_\text{CD}$ enforces color consistency between different views, and $\mathcal{L}_\text{normal}$ serves as a surface regularizer. Finally, the extracted high-quality textured meshes $\textbf{(d)}$ are ready to be used in various downstream applications.
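
To make the color-consistency term $\mathcal{L}_\text{CD}$ concrete, below is a minimal, self-contained sketch of a Chamfer distance in RGB space: pixel colors rendered from a novel view are pulled toward the palette of colors observed in the input view, and vice versa. The function name, weights, and tensor shapes are illustrative assumptions, not the released implementation.

# Illustrative sketch of a Chamfer distance in RGB space for color consistency.
import torch

def chamfer_rgb(colors_a, colors_b):
    """Symmetric Chamfer distance between two sets of RGB colors, shapes (N, 3) and (M, 3)."""
    dist = torch.cdist(colors_a, colors_b)                 # (N, M) pairwise color distances
    return dist.min(dim=1).values.mean() + dist.min(dim=0).values.mean()

# Example: foreground pixel colors of a rendered back view vs. the input view.
rendered_colors = torch.rand(2048, 3, requires_grad=True)  # stands in for a differentiable render
input_colors = torch.rand(4096, 3)                         # colors observed in the input image
loss_cd = chamfer_rgb(rendered_colors, input_colors)
loss_cd.backward()                                         # gradients flow back to the rendered colors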

Qualitative Results

Comparison with SOTA single-image human reconstruction methods

We qualitatively compare TeCH with the baseline methods PIFu, PaMIR, and PHORHUM on in-the-wild images from the SHHQ dataset. Our training-data-free, one-shot method generalizes well to real-world human images and creates rich details in the body textures, such as patterns on clothes and shoes, tattoos on the skin, and details of the face and hair. In contrast, PIFu and PaMIR produce blurry results, limited by the distribution gap between their training data and in-the-wild data.

Comparisons on Geometry

Comparisons on Texture

More results on in-the-wild images

Related Links

For more work on similar tasks, please check out the following papers.

Acknowledgments & Disclosure

Haven Feng contributed the core idea of "Chamfer distance in RGB space". We thank Vanessa Sklyarova for proofreading, Haofan Wang, Huaxia Li, and Xu Tang for their technical support, and Weiyang Liu and Michael J. Black for their feedback. Yuliang Xiu is funded by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 860768 (CLIPE). Hongwei Yi is supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039B. Yangyi Huang and Deng Cai are supported by the National Natural Science Foundation of China (Grant Nos. 62273302, 62036009, 61936006). Jiaxiang Tang is supported by the National Natural Science Foundation of China (Grant Nos. 61632003, 61375022, 61403005).

BibTeX


@inproceedings{huang2024tech,
  title={{TeCH: Text-guided Reconstruction of Lifelike Clothed Humans}},
  author={Huang, Yangyi and Yi, Hongwei and Xiu, Yuliang and Liao, Tingting and Tang, Jiaxiang and Cai, Deng and Thies, Justus},
  booktitle={International Conference on 3D Vision (3DV)},
  year={2024}
}