Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors

Ryan Burgert
Kanchana Ranasinghe
Xiang Li
Michael Ryoo


Peekaboo performs zero-shot segmentation: given an image and a caption, it finds the corresponding image region. To our knowledge, this is the first open-vocabulary, zero-shot segmentor that can comprehend detailed descriptions. We show that off-the-shelf text-to-image diffusion models can perform zero-shot segmentation without any further training. Our implementation is publicly available.

Abstract

Recently, text-to-image diffusion models have shown remarkable capabilities in creating realistic images from natural language prompts. However, few works have explored using these models for semantic localization or grounding. In this work, we explore how an off-the-shelf text-to-image diffusion model, trained without exposure to localization information, can ground various semantic phrases without segmentation-specific re-training. We introduce an inference-time optimization process capable of generating segmentation masks conditioned on natural language prompts. Our proposal, Peekaboo, is a first-of-its-kind zero-shot, open-vocabulary, unsupervised semantic grounding technique that leverages diffusion models without any training. We evaluate Peekaboo on the Pascal VOC dataset for unsupervised semantic segmentation and on the RefCOCO dataset for referring segmentation, showing promising results on both. We also demonstrate how Peekaboo can be used to generate images with transparency, even though the underlying diffusion model was trained only on RGB images; to our knowledge, we are the first to attempt this.
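For concreteness, the sketch below illustrates one way such an inference-time objective can be formed: a score-distillation-style loss computed by a frozen text-to-image diffusion model on a composited image. This is a minimal sketch, not the released Peekaboo code; the `diffusion` wrapper and its `encode`, `add_noise`, and `predict_noise` methods are hypothetical placeholders for a frozen latent diffusion model.

```python
# Hedged sketch of a score-distillation-style loss on a composited image.
# `diffusion` is a hypothetical wrapper around a frozen text-to-image latent
# diffusion model; its method names are illustrative, not the Peekaboo API.
import torch

def dream_loss(diffusion, composite_rgb, text_embeddings):
    """Return a loss whose gradient pulls the composite toward the text prompt.

    composite_rgb: (1, 3, H, W) image in [0, 1], differentiable w.r.t. the mask.
    text_embeddings: prompt embeddings from the diffusion model's text encoder.
    """
    latents = diffusion.encode(composite_rgb)                   # frozen VAE encoder
    t = torch.randint(20, 980, (1,), device=latents.device)     # random timestep
    noise = torch.randn_like(latents)
    noisy_latents = diffusion.add_noise(latents, noise, t)      # forward diffusion
    with torch.no_grad():
        pred = diffusion.predict_noise(noisy_latents, t, text_embeddings)  # frozen U-Net
    # Score-distillation trick: the gradient w.r.t. `latents` of this surrogate
    # loss is (pred - noise), so the U-Net itself is never back-propagated through.
    return ((pred - noise).detach() * latents).mean()
```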



Methodology




Overview of Peekaboo Architecture: The image to be segmented is alpha-composited with a learnable mask represented as an implicit neural image. The composite image, together with a text prompt describing the image region to be segmented, is fed to our proposed dream loss, which is optimized iteratively. At the end of optimization, the implicit neural image converges to the optimal segmentation mask. We highlight that the dream loss is used only to learn the mask, not to re-train the diffusion model.
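The sketch below mirrors this figure under illustrative assumptions: the implicit neural image is stood in for by a small coordinate MLP, the compositing background is random noise, and `dream_loss` is the placeholder defined above. Names, network sizes, and hyperparameters are hypothetical rather than those of the released implementation.

```python
# Minimal sketch of the optimization loop in the figure: a learnable mask
# (a coordinate MLP standing in for the implicit neural image) is
# alpha-composited with the input image, and the dream loss is back-propagated
# into the mask only, while the diffusion model stays frozen.
import torch
import torch.nn as nn

class ImplicitMask(nn.Module):
    """Maps (x, y) pixel coordinates to a mask value in [0, 1]."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, height, width):
        device = next(self.parameters()).device
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, height, device=device),
            torch.linspace(-1, 1, width, device=device),
            indexing="ij",
        )
        coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
        return self.net(coords).reshape(1, 1, height, width)

def segment(image, text_embeddings, diffusion, steps=300, lr=1e-3):
    """image: (1, 3, H, W) in [0, 1]; returns a soft segmentation mask."""
    _, _, h, w = image.shape
    mask_net = ImplicitMask().to(image.device)
    optim = torch.optim.Adam(mask_net.parameters(), lr=lr)
    background = torch.rand_like(image)                        # illustrative background
    for _ in range(steps):
        mask = mask_net(h, w)
        composite = mask * image + (1 - mask) * background     # alpha compositing
        loss = dream_loss(diffusion, composite, text_embeddings)
        optim.zero_grad()
        loss.backward()                                        # gradients reach only the mask
        optim.step()
    return mask_net(h, w).detach()
```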


CVPR 2023 Poster


Citation

[Bibtex]