Fundamental Research in Super Vision Lab

We are part of DAMO Academy, Alibaba Group, dedicated to developing next-generation computer vision algorithms and technologies.

Transformers in Computer Vision

1. How to Build a Better Transformer Model

Scaled ReLU Matters for Training Vision Transformers. [AAAI’22] pdf

We verify, both theoretically and empirically, that scaled ReLU in the conv-stem not only improves training stability but also increases the diversity of patch tokens, boosting peak performance by a large margin while adding few parameters and FLOPs.

kVT: k-NN Attention for Boosting Vision Transformers. pdf

We propose a sparse attention scheme, dubbed k-NN attention, for boosting vision transformers. Specifically, instead of involving all the tokens in the attention matrix calculation, we select only the top-k most similar tokens from the keys for each query to compute the attention map. The proposed k-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations. It allows for the exploration of long-range correlations and at the same time filters out irrelevant tokens by choosing the most similar tokens from the entire image. Theoretical analysis shows that k-NN attention is powerful in distilling noise from input tokens and in speeding up training.
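The top-k selection described above can be illustrated with a minimal single-head sketch in NumPy. This is not the authors' implementation: the function name `knn_attention`, the argument `top_k`, the scaled dot-product similarity, and masking with negative infinity before the softmax are all assumptions made for illustration.

```python
import numpy as np

def knn_attention(q, k, v, top_k):
    """Single-head attention that keeps only the top_k keys per query.

    q: (n_q, d) queries, k: (n_k, d) keys, v: (n_k, d_v) values.
    Hypothetical helper for illustration; not taken from the kVT code.
    """
    d = q.shape[-1]
    # Scaled dot-product similarity between every query and every key.
    scores = q @ k.T / np.sqrt(d)                      # (n_q, n_k)

    # Per-query threshold: the top_k-th largest score in each row.
    kth = np.partition(scores, -top_k, axis=-1)[:, -top_k][:, None]

    # Keys below the threshold are masked out so they get zero weight.
    # (Exact ties at the threshold would keep more than top_k keys.)
    masked = np.where(scores >= kth, scores, -np.inf)

    # Numerically stable softmax over the surviving keys only.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ v, weights

# Usage: 4 query tokens attend to their 2 most similar of 6 key tokens.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 3))
out, w = knn_attention(q, k, v, top_k=2)
```

Setting `top_k` equal to the number of keys recovers ordinary dense softmax attention in this sketch, which makes the sparse variant easy to compare against a dense baseline.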

ELSA: Enhanced Local Self-Attention for Vision Transformer. paper code

Self-attention is powerful in modeling long-range dependencies, but it is weak at learning fine-grained local features. We comprehensively investigate local self-attention and its counterparts from two perspectives: channel setting and spatial processing. We propose enhanced local self-attention (ELSA), with Hadamard attention and the ghost head, to boost any transformer-based model without modifying its architecture or hyperparameters.

2. Applying Transformers to Vision Applications

TransReID: Transformer-based Object Re-Identification. [ICCV’21] pdf, code, presentation (in Chinese)

The first pure Transformer-based ReID method;
State-of-the-art performance on 6 ReID benchmarks, including person ReID, vehicle ReID, and occluded ReID.

CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation. pdf code

The first work to apply cross-domain Transformers for unsupervised domain adaptation;
State-of-the-art performance on several domain adaptation benchmarks.

Transformers in “2021 AI City Challenge”.

1st place in Track 2: City-Scale Multi-Camera Vehicle Re-Identification. report code
1st place in Track 3: City-Scale Multi-Camera Vehicle Tracking. report code

Effective vehicle ReID was achieved by ensembling multiple CNN and Transformer models.
