DUNIT: Detection-based Unsupervised Image-to-Image Translation

CVPR 2020


Deblina Bhattacharjee, Seungryong Kim, Guillaume Vizier, Mathieu Salzmann

Paper Code Poster

Image-to-image translation has made great strides in recent years, with current techniques able to handle unpaired training images and to account for the multimodality of the translation problem. Despite this, most methods treat the image as a whole, which makes the results they produce for content-rich scenes less realistic. In this paper, we introduce a Detection-based Unsupervised Image-to-image Translation (DUNIT) approach that explicitly accounts for object instances in the translation process. To this end, we extract separate representations for the global image and for the instances, which we then fuse into a common representation from which we generate the translated image. This allows us to preserve the detailed content of object instances while still modeling the fact that we aim to produce an image of a single consistent scene. We introduce an instance consistency loss to maintain coherence between the detections. Furthermore, by incorporating a detector into our architecture, we can still exploit object instances at test time. As evidenced by our experiments, this allows us to outperform the state-of-the-art unsupervised image-to-image translation methods. Our approach can also be used as an unsupervised domain adaptation strategy for object detection, on which it likewise achieves state-of-the-art performance.
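
The instance consistency loss is only stated at a high level above. The snippet below is a hedged sketch of one plausible form, assuming the loss compares the detections obtained on an image with those obtained on its translation; the index-based pairing of boxes and the smooth-L1 penalty are illustrative assumptions, not necessarily the paper's exact formulation.

# Hedged sketch of an instance consistency term: it assumes the loss penalizes
# disagreement between detections on an image and on its translated version.
import torch.nn.functional as F


def instance_consistency_loss(boxes_src, boxes_trans):
    """boxes_src, boxes_trans: N x 4 and M x 4 boxes (x1, y1, x2, y2) returned
    by the same detector on the source image and on its translation."""
    n = min(len(boxes_src), len(boxes_trans))
    if n == 0:
        # No detections to compare; contribute nothing to the objective
        return boxes_src.new_zeros(())
    # Penalize coordinate differences between corresponding detections
    return F.smooth_l1_loss(boxes_trans[:n], boxes_src[:n])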

Pipeline



Overall DUNIT architecture. The instance-aware I2I translation block on the right is an exact replica of the operations taking place between the night image in domain X and the corresponding translated day image. Similarly, the global I2I translation block mirrors the operations between the day image in domain Y and its translated night image. The blue background separates our contribution from the DRIT backbone on which our work is built. The pink lines correspond to domain X and the black lines to domain Y. The global-level residual blocks have different features in domain X and domain Y and are therefore color-coded differently: the global features in domain X are shown in dark blue, those in domain Y in dark grey, the losses in green, the global operations in light orange, the instance features in domain X in yellow, the detection subnetwork in light blue, and the merged features in dark orange.
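
To make the merged features concrete, here is a minimal PyTorch-style sketch of fusing global and per-instance content features, not the authors' implementation: the RoI pooling resolution, the paste-back of instance features at their box locations, and the 1x1 fusion convolution are illustrative assumptions chosen only to show the global-plus-instance fusion idea.

# Minimal sketch (not the authors' code) of fusing global and per-instance
# content features into the common representation fed to the generator.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align


class InstanceAwareFusion(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # 1x1 convolution merging the concatenated global and instance maps
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, global_feat, boxes):
        """global_feat: B x C x H x W content features of the whole image.
        boxes: list of length B with N_i x 4 detected boxes (x1, y1, x2, y2),
        given in feature-map coordinates for simplicity."""
        B, C, H, W = global_feat.shape
        # Pool a fixed-size feature for every detected instance
        inst_feat = roi_align(global_feat, boxes, output_size=(7, 7))
        # Paste each pooled instance feature back at its box location
        inst_map = torch.zeros_like(global_feat)
        idx = 0
        for b, img_boxes in enumerate(boxes):
            for box in img_boxes:
                x1, y1, x2, y2 = [int(v) for v in box]
                x1, y1 = min(max(x1, 0), W - 1), min(max(y1, 0), H - 1)
                x2, y2 = min(max(x2, x1 + 1), W), min(max(y2, y1 + 1), H)
                patch = F.interpolate(inst_feat[idx:idx + 1],
                                      size=(y2 - y1, x2 - x1),
                                      mode='bilinear', align_corners=False)
                inst_map[b, :, y1:y2, x1:x2] = patch[0]
                idx += 1
        # Merge the two representations into a single one for the decoder
        return self.fuse(torch.cat([global_feat, inst_map], dim=1))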

Results



Qualitative comparisons conditioned on one image style for Sunny to Cloudy (first row) and Sunny to Rainy (second row). We show, from left to right, the input image in the source domain, the style image used for translation, and the outputs of MUNIT, DRIT, and DUNIT (ours), respectively.



Qualitative comparisons conditioned on different image styles on the DCM benchmark. We show the input image in the source domain on the left, followed by the style images used for translation and the corresponding outputs of DUNIT (ours).

Unsupervised Domain Adaptive Detection



Qualitative domain adaptation results. We translate Pascal VOC images to the DCM comics domain using DUNIT and apply a detector trained on the original DCM data to the translated images. (Left) Input image, (remaining columns) translated image and detections.
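
As a usage sketch of this setting, the loop below translates source-domain images with a trained generator and runs the target-domain detector on the results. Here dunit_generator, dcm_detector, and voc_loader are hypothetical names standing in for a trained DUNIT translator, a detector trained on DCM comics, and a Pascal VOC image loader.

# Hedged usage sketch of the domain-adaptation setting described above.
# `dunit_generator`, `dcm_detector` and `voc_loader` are hypothetical handles,
# not names from the released code.
import torch


@torch.no_grad()
def detect_on_translated(dunit_generator, dcm_detector, voc_loader):
    dunit_generator.eval()
    dcm_detector.eval()
    detections = []
    for images in voc_loader:                        # batches of Pascal VOC images
        translated = dunit_generator(images)         # translate to the comics domain
        detections.append(dcm_detector(translated))  # detect on the translated images
    return detections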

Bibtex


@InProceedings{Bhattacharjee_2020_CVPR,
  author    = {Bhattacharjee, Deblina and Kim, Seungryong and Vizier, Guillaume and Salzmann, Mathieu},
  title     = {DUNIT: Detection-Based Unsupervised Image-to-Image Translation},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2020}
}