Unleashing Text-to-Image Diffusion Models for Visual Perception


Wenliang Zhao*   Yongming Rao*   Zuyan Liu*   Benlin Liu   Jie Zhou   Jiwen Lu

 Tsinghua University

[Paper (arXiv)]      [Code (GitHub)]


VPD (Visual Perception with Pre-trained Diffusion Models) is a framework that transfers the high-level and low-level knowledge of a pre-trained text-to-image diffusion model to downstream visual perception tasks. Notably, VPD attains 0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on RefCOCO val referring image segmentation, establishing new records on these two benchmarks.

Figure 1: The main idea of the proposed VPD framework. Motivated by the compelling generative semantics of text-to-image diffusion models, we propose a new framework named VPD that exploits the pre-trained knowledge in the denoising UNet to provide semantic guidance for downstream visual perception tasks.


Abstract

Diffusion models (DMs) have become the new trend of generative models and have demonstrated a powerful ability for conditional synthesis. Among them, text-to-image diffusion models pre-trained on large-scale image-text pairs are highly controllable by customizable prompts. Unlike unconditional generative models that focus on low-level attributes and details, text-to-image diffusion models contain more high-level knowledge thanks to vision-language pre-training. In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks. Instead of using the pre-trained denoising autoencoder in a diffusion-based pipeline, we simply use it as a backbone and study how to take full advantage of the learned knowledge. Specifically, we prompt the denoising decoder with proper textual inputs and refine the text features with an adapter, leading to better alignment with the pre-training stage and allowing the visual contents to interact with the text prompts. We further propose to utilize the cross-attention maps between the visual features and the text features to provide explicit guidance. Compared with other pre-training methods, we show that vision-language pre-trained diffusion models can be adapted faster to downstream visual perception tasks using the proposed VPD. Extensive experiments on semantic segmentation, referring image segmentation, and depth estimation demonstrate the effectiveness of our method. Notably, VPD attains 0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on RefCOCO val referring image segmentation, establishing new records on these two benchmarks.
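
To make the pipeline concrete, the sketch below shows one way to use a pre-trained Stable Diffusion UNet as a feature extractor: encode the image with the VAE, prompt the UNet with class-name text refined by a small adapter, and collect multi-scale decoder features in a single denoising forward pass. It assumes the Hugging Face diffusers package; the checkpoint name, hook points, and adapter design are illustrative assumptions rather than the official VPD implementation.

# Minimal sketch (not the official VPD code): a single denoising forward pass of a
# pre-trained Stable Diffusion UNet, prompted with class names, to obtain multi-scale
# features for a downstream perception head. Checkpoint name, hook points, and the
# adapter design are assumptions for illustration.
import torch
import torch.nn as nn
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
vae, unet = pipe.vae, pipe.unet
tokenizer, text_encoder = pipe.tokenizer, pipe.text_encoder


class TextAdapter(nn.Module):
    """Residual MLP that refines CLIP text features to better match the UNet's pre-training."""

    def __init__(self, dim: int = 768, gamma: float = 1e-4):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.gamma = nn.Parameter(torch.tensor(gamma))

    def forward(self, text_feats: torch.Tensor) -> torch.Tensor:
        return text_feats + self.gamma * self.mlp(text_feats)


adapter = TextAdapter()

# Record the outputs of the UNet decoder (up) blocks as a feature pyramid.
features = []
for blk in unet.up_blocks:
    blk.register_forward_hook(lambda module, inputs, output: features.append(output))


def extract_features(image: torch.Tensor, prompts: list) -> list:
    """image: (B, 3, H, W) in [-1, 1]; prompts: B strings, e.g. 'a photo of a {class}'."""
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    tokens = tokenizer(prompts, padding="max_length",
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    text_emb = adapter(text_encoder(tokens.input_ids)[0])
    features.clear()
    t = torch.zeros(latents.shape[0], dtype=torch.long)  # a single, (almost) noise-free step
    unet(latents, t, encoder_hidden_states=text_emb)
    return list(features)  # multi-scale maps for a task-specific head (e.g. Semantic FPN)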


Approach

Figure 2: The overall framework of VPD. To better exploit the semantic knowledge learned from text-to-image generation pre-training, we prompt the denoising UNet with properly designed text prompts and employ the cross-attention maps to provide both implicit and explicit guidance for downstream visual perception tasks. Our framework fully leverages both the low-level and high-level pre-trained knowledge and can be applied to a variety of visual perception tasks.
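
The explicit guidance in Figure 2 can be pictured as follows: averaged cross-attention maps between the visual features and the (adapted) class-name embeddings act as a coarse per-class prior that is fused with the UNet feature pyramid before the prediction head. The shapes and the fusion module below are illustrative assumptions, not the paper's exact design.

# Illustrative sketch of the "explicit guidance" idea: fuse averaged cross-attention
# maps (one coarse map per class) with the finest UNet feature map before the task head.
# Shapes and the fusion design are assumptions for illustration only.
import torch
import torch.nn as nn

B, K, H, W = 2, 150, 64, 64         # batch, number of classes, spatial size of finest features
C = 320                             # channel dim of the finest UNet decoder features (SD 1.5)

feat = torch.randn(B, C, H, W)      # finest map from the feature pyramid (see sketch above)
attn_maps = torch.rand(B, K, H, W)  # cross-attention maps averaged over heads/layers,
                                    # resized to the feature resolution

fuse = nn.Conv2d(C + K, C, kernel_size=1)   # project the concatenated input back to C channels
guided = fuse(torch.cat([feat, attn_maps], dim=1))
print(guided.shape)                 # torch.Size([2, 320, 64, 64]) -> fed to the prediction head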


Results

Table 1: Semantic segmentation with different methods. We compare our VPD with previous methods. VPD achieves strong results even with a more lightweight Semantic FPN head, a smaller crop size, and fewer training iterations.

Table 2: Semantic segmentation with fewer training iterations. We compare our VPD with previous models that use different architectures and pre-training methods. Performance is measured by mIoU at 4K/8K iterations.

Table 3: Referring image segmentation on RefCOCO, RefCOCO+, and G-Ref. We compare our VPD with previous methods under both the default training schedule and a fast schedule (1 epoch) on the three benchmarks. Performance is measured by overall IoU (oIoU) on the validation sets. VPD consistently outperforms previous methods on all three benchmarks.
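
For reference, overall IoU accumulates intersection and union over the whole evaluation set rather than averaging per-image IoUs; a minimal sketch (assuming boolean prediction and ground-truth masks) is given below and is illustrative only.

# Minimal sketch of overall IoU (oIoU): cumulative intersection over cumulative union
# across all samples, as opposed to a per-image mean IoU. Illustrative only.
import torch

def overall_iou(preds, gts) -> float:
    """preds, gts: iterables of boolean masks (torch.BoolTensor) with matching shapes."""
    inter, union = 0.0, 0.0
    for p, g in zip(preds, gts):
        inter += (p & g).sum().item()
        union += (p | g).sum().item()
    return inter / max(union, 1.0)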

Table 4: Depth estimation on NYUv2. We report the commonly used depth estimation metrics, including RMSE, δn, REL, and log10 (see Section 4.1 for details). VPD consistently outperforms previous state-of-the-art methods on all metrics. We also show that our model converges faster than SwinV2 pre-trained with masked image modeling under the fast training schedule.
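
As a reference for the metrics in Table 4, the sketch below computes the standard monocular depth measures (RMSE, absolute relative error, log10 error, and the δn threshold accuracies) from predicted and ground-truth depth maps; it is an illustrative implementation, not the paper's evaluation code.

# Minimal sketch of the standard depth estimation metrics reported in Table 4.
# Inputs are positive depth maps restricted to valid pixels; illustrative only.
import torch

def depth_metrics(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6) -> dict:
    pred, gt = pred.clamp(min=eps), gt.clamp(min=eps)
    rmse = torch.sqrt(((pred - gt) ** 2).mean())
    rel = ((pred - gt).abs() / gt).mean()               # absolute relative error (REL)
    log10 = (pred.log10() - gt.log10()).abs().mean()    # mean log10 error
    ratio = torch.maximum(pred / gt, gt / pred)
    deltas = {f"delta{n}": (ratio < 1.25 ** n).float().mean().item() for n in (1, 2, 3)}
    return {"rmse": rmse.item(), "rel": rel.item(), "log10": log10.item(), **deltas}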


BibTeX

@article{zhao2023unleashing,
  title={Unleashing Text-to-Image Diffusion Models for Visual Perception},
  author={Zhao, Wenliang and Rao, Yongming and Liu, Zuyan and Liu, Benlin and Zhou, Jie and Lu, Jiwen},
  journal={arXiv preprint arXiv:2303.02153},
  year={2023}
}