Vision Transformer Adapters for Generalizable Multitask Learning

ICCV 2023


Deblina Bhattacharjee, Sabine Süsstrunk, Mathieu Salzmann

Paper · Code · Poster · Slides

We introduce the first multitasking vision transformer adapters that learn generalizable task affinities which can be applied to novel tasks and domains. Integrated into an off-the-shelf vision transformer backbone, our adapters can simultaneously solve multiple dense vision tasks in a parameter-efficient manner, unlike existing multitasking transformers that are parametrically expensive. In contrast to concurrent methods, we do not require retraining or fine-tuning whenever a new task or domain is added. We introduce a task-adapted attention mechanism within our adapter framework that combines gradient-based task similarities with attention-based ones. The learned task affinities generalize to the following settings: zero-shot task transfer, unsupervised domain adaptation, and generalization without fine-tuning to novel domains. We demonstrate that our approach outperforms not only the existing convolutional neural network-based multitasking methods but also the vision transformer-based ones.

Method Architecture



Detailed overview of our architecture. The frozen transformer encoder module (in orange) extracts a shared representation of the input image, which is then used to learn the task affinities in our novel vision transformer adapters (in purple). Each adapter layer uses gradient-based task similarities from our Task Representation Optimization Algorithm (TROA) (in yellow) and Task-Adapted Attention (TAA) to learn the task affinities, which are communicated via skip connections (in blue) between consecutive adapter layers. The task embeddings are then decoded by the fully supervised transformer decoders (in green) for the respective tasks. Note that the transformer decoders are shared but have different task heads (in grey). For clarity, only three tasks are depicted here, and TAA is explained in a separate figure below.
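To make the data flow concrete, below is a minimal PyTorch sketch of the pipeline described above: a frozen backbone, a stack of adapter layers connected by skip connections, a shared decoder, and per-task heads. The module names and shapes are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class MultitaskAdapterModel(nn.Module):
    def __init__(self, encoder, adapter_layers, decoder, task_heads):
        super().__init__()
        self.encoder = encoder                      # off-the-shelf ViT backbone
        for p in self.encoder.parameters():
            p.requires_grad = False                 # backbone stays frozen
        self.adapters = nn.ModuleList(adapter_layers)
        self.decoder = decoder                      # shared transformer decoder
        self.heads = nn.ModuleDict(task_heads)      # one lightweight head per task

    def forward(self, image):
        tokens = self.encoder(image)                # shared image representation
        for adapter in self.adapters:
            # each adapter layer refines the task embeddings; the residual
            # acts as the skip connection between consecutive adapter layers
            tokens = tokens + adapter(tokens)
        decoded = self.decoder(tokens)
        return {task: head(decoded) for task, head in self.heads.items()}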

Vision Transformer Adapter Module


Overview of our vision transformer adapter module. Our vision transformer adapters learn transferable and generalizable task affinities in a parameter-efficient way. We show two blocks to depict the skip connectivity between them. The main modules of our vision transformer adapters, TROA and TAA, are depicted below.

Task Representation Optimization Algorithm (TROA)


We show the task affinities from TROA when four tasks, namely semantic segmentation (SemSeg), depth, surface normals, and edges, are jointly learned. TROA learns the strongest affinity between gradients of the same task, for example, segmentation with segmentation, as expected. It also learns affinities between proximate tasks, such as segmentation and depth, and between non-proximate tasks, such as segmentation and surface normals. Note that task dependence is asymmetric, i.e., the effect of segmentation on surface normals differs from the effect of surface normals on segmentation. These task affinities are used by our novel task-adapted attention module, described in what follows.
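For intuition, here is a minimal sketch of one way asymmetric, gradient-based task affinities can be estimated from per-task gradients on shared adapter parameters. The projection-based measure used here is only an illustrative stand-in; the exact TROA optimization is defined in the paper, and `losses` and `shared_params` are placeholders.

import torch

def gradient_task_affinities(losses, shared_params, eps=1e-8):
    """losses: dict task_name -> scalar loss; returns affinity[src][dst]."""
    flat_grads = {}
    for task, loss in losses.items():
        grads = torch.autograd.grad(loss, shared_params,
                                    retain_graph=True, allow_unused=True)
        flat_grads[task] = torch.cat([g.flatten() for g in grads if g is not None])
    affinity = {}
    for src, g_src in flat_grads.items():
        affinity[src] = {}
        for dst, g_dst in flat_grads.items():
            # asymmetric: how aligned src's gradient is with dst's own direction,
            # relative to dst's gradient magnitude
            affinity[src][dst] = (g_src @ g_dst / (g_dst.norm() ** 2 + eps)).item()
    return affinity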

Matching the Feature Dimensions using FiLM


Detailed overview of Feature-wise Linear Modulation (FiLM), which linearly shifts and scales task representations to match the dimensions of the feature maps. The orange rectangular area is FiLM.
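Below is a minimal sketch of the standard FiLM operation in this context: the task representation is projected to per-channel scale (gamma) and shift (beta) parameters of the feature width, so the modulation matches the dimensions of the image feature maps. The exact conditioning direction and module names are assumptions for illustration.

import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Task-conditioned scale-and-shift matched to the feature-map width."""
    def __init__(self, task_dim, feature_dim):
        super().__init__()
        self.to_gamma = nn.Linear(task_dim, feature_dim)  # per-channel scale
        self.to_beta = nn.Linear(task_dim, feature_dim)   # per-channel shift

    def forward(self, task_repr, features):
        # task_repr: (B, task_dim); features: (B, N, feature_dim) image tokens
        gamma = self.to_gamma(task_repr).unsqueeze(1)     # (B, 1, feature_dim)
        beta = self.to_beta(task_repr).unsqueeze(1)
        # feature-wise affine modulation, broadcast over the N tokens
        return gamma * features + beta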

Task-Adapted Attention


Overview of our Task-Adapted Attention (TAA) mechanism, which combines task affinities with image attention. Note that the process shown in the foreground is for a single attention head and is repeated over M heads to yield the task-adapted multi-head attention.
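As a rough illustration, the sketch below folds task affinities into scaled dot-product attention for a single head (repeated for M heads to obtain the task-adapted multi-head attention). Injecting the affinities as an additive bias on the attention logits is an assumption made for clarity; the exact TAA fusion is defined in the paper.

import math
import torch

def task_adapted_attention(q, k, v, task_affinity_bias):
    # q, k, v: (B, N, d) token embeddings for one attention head
    # task_affinity_bias: (B, N, N) bias derived from the TROA task affinities
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores + task_affinity_bias   # inject the learned task affinities
    attn = scores.softmax(dim=-1)
    return attn @ v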

Multitasking Results



Multitask learning comparison on the NYUDv2 benchmark in the 'S-D-N-E' setting. Our model outperforms all the multitask baselines, i.e., ST-MTL, InvPT, Taskprompter, and MulT. For instance, our model correctly segments and predicts the surface normals of the elements within the yellow-circled region, unlike the baselines. All the methods are based on the same Swin-B V2 backbone. Best seen on screen and zoomed in. For more details and quantitative results, please refer to our paper.


Multitask learning comparison on the Taskonomy benchmark in the 'S-D-N-E' setting. Our model outperforms all the multitask baselines. For instance, our model correctly segments and predicts the surface normals of the elements within the yellow-circled region, unlike the baselines. All the methods are based on the same Swin-B V2 backbone. Best seen on screen and zoomed in. For more details and quantitative results, please refer to our paper.

Unsupervised Domain Adaptation (UDA)



Unsupervised Domain Adaptation (UDA) results on Synthia→Cityscapes. Our model outperforms the CNN-based baseline (XTAM-UDA) and the Swin-B V2-based baselines (1-task Swin-UDA, MulT-UDA). For instance, our method can predict the depth of the car tail light, unlike the baselines. Best seen on screen and zoomed in on the yellow-circled region.

Bibtex


@misc{bhattacharjee2023vision,
      title={Vision Transformer Adapters for Generalizable Multitask Learning},
      author={Deblina Bhattacharjee and Sabine Süsstrunk and Mathieu Salzmann},
      year={2023},
      eprint={2308.12372},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

This work was supported in part by the Swiss National Science Foundation via the Sinergia grant CRSII5-180359.