Mikhail Breslav

Self-Supervised Vision Transformers

Self-Supervised Vision Transformers (DINO):

One paper I’ve been wanting to understand, at least at a high level, is DINO. DINO is another “learn a good representation of images through self-supervision” paper. In order to understand DINO I first had to take a detour into some of the contrastive learning literature, which I wrote about here.

Let’s jump right in! Below is a figure showing the broad architecture of DINO.

DINO Architecture
Main Architecture of DINO (from the paper).

If you are familiar with BYOL you will notice that this architecture is very similar. What are the similarities?

- Both train a student network to match the output of a teacher network on different augmentations of the same image.
- In both, the teacher is not trained by backpropagation; its weights are an exponential moving average (EMA) of the student’s weights, and a stop-gradient is applied to the teacher branch (a minimal sketch of this update follows the list).
- Neither method needs negative pairs, unlike classic contrastive approaches.
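To make the teacher update concrete, here is a minimal sketch in PyTorch-style Python of how an EMA teacher can be maintained. The function and argument names (`update_teacher`, `momentum`) are my own illustrative choices, not the paper’s code.

```python
import torch

@torch.no_grad()  # the teacher is never updated by backprop
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module,
                   momentum: float = 0.996) -> None:
    """EMA update: teacher <- momentum * teacher + (1 - momentum) * student."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1.0 - momentum)
```

In the paper the momentum starts around 0.996 and is pushed toward 1 over the course of training with a cosine schedule, so the teacher changes more and more slowly.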

Some areas where DINO is different than BYOL:

- The backbone is a Vision Transformer (the framework also works with a ResNet), whereas BYOL was built on a ResNet.
- The student and teacher share the exact same architecture; there is no extra predictor head on top of the student as in BYOL.
- Collapse is avoided by centering and sharpening the teacher’s outputs rather than by relying on a predictor head.
- The outputs are passed through a softmax to form probability distributions, and the loss is a cross entropy between teacher and student distributions rather than a mean squared error, computed over a multi-crop set of global and local views.

Related to the last bullet, an interesting aspect of the loss function used in this work is that it is a sum of cross entropy losses. The teacher outputs probability distributions for the global views/augmentations, and these are compared with the probability distributions the student outputs for all of the views. This encourages the student to match the teacher even when the student only sees a small part of the image. My interpretation is that the model is forced to learn the essence of the image by predicting that essence from smaller pieces, kind of like guessing the theme of a jigsaw puzzle from a handful of its pieces.
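Here is a rough sketch of such a loss in PyTorch-style Python. It assumes the teacher has produced outputs for the global views and the student for all views (with the global views listed first), and it applies the temperature sharpening and teacher centering described in the paper; the variable names are mine, not the official implementation’s.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_outputs, teacher_outputs, center,
              student_temp=0.1, teacher_temp=0.04):
    """Sum of cross entropies between teacher distributions (global views only)
    and student distributions (all views).

    student_outputs: list of [batch, dim] tensors, one per view; the first
                     len(teacher_outputs) entries are assumed to be the same
                     global views the teacher saw.
    teacher_outputs: list of [batch, dim] tensors, one per global view.
    center: running mean of teacher outputs, subtracted to help avoid collapse.
    """
    total, n_terms = 0.0, 0
    for t_idx, t_out in enumerate(teacher_outputs):
        # Teacher: center, then sharpen with a low temperature; no gradient here.
        t_probs = F.softmax((t_out - center) / teacher_temp, dim=-1).detach()
        for s_idx, s_out in enumerate(student_outputs):
            if s_idx == t_idx:
                continue  # skip comparing a global view with itself
            s_log_probs = F.log_softmax(s_out / student_temp, dim=-1)
            total += torch.sum(-t_probs * s_log_probs, dim=-1).mean()
            n_terms += 1
    return total / n_terms
```

Because the inner loop runs over every student view, including the small local crops, the student is repeatedly asked to reproduce the teacher’s distribution from only a fragment of the image, which is exactly the “jigsaw” intuition above.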

Another interesting result from this paper is the visualization of attention maps from multiple heads. These maps come from the last self-attention layer: for each head, you take the attention weights of the [CLS] token over the image patch tokens and reshape them into a 2D map. In the figure below we see these attention maps, with different colors representing different attention heads. What’s impressive is how much the attention maps look like segmentation maps. In the first (top-left) image the model attends most strongly to the food on the plate, and different attention heads focus on different foods.

DINO Attention Maps
Attention Map Visualizations (from the paper).
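As a rough illustration of how such maps can be extracted, the sketch below reshapes the [CLS]-to-patch attention of each head in the final layer into a small 2D map. It assumes you already have the attention tensor (e.g. from a ViT implementation that exposes attention weights); the function name and arguments are illustrative, not DINO’s actual code.

```python
import torch

def cls_attention_maps(attn: torch.Tensor, patch_grid: tuple) -> torch.Tensor:
    """Turn last-layer attention weights into per-head [CLS] attention maps.

    attn: [batch, num_heads, num_tokens, num_tokens] attention weights from the
          final self-attention layer, where token 0 is the [CLS] token.
    patch_grid: (h, w) number of patches per side, e.g. (28, 28) for a 224x224
                image split into 8x8 patches.
    Returns: [batch, num_heads, h, w] maps that can be upsampled and overlaid
             on the input image.
    """
    h, w = patch_grid
    cls_to_patches = attn[:, :, 0, 1:]  # [CLS] query attending to patch tokens
    return cls_to_patches.reshape(attn.shape[0], attn.shape[1], h, w)
```

The per-head dimension is what produces the different colors in the figure; upsampling each map back to the input resolution gives the segmentation-like overlays.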

Conclusion

DINO demonstrates that you can learn a powerful representation of images by marrying the Vision Transformer with a contrastive-learning-like training scheme, without the need for labels. DINO also shows us, through its attention maps, that transformers learn representations that reflect the key/salient objects in an image.

That said, I still find it surprising that these self-supervised models rooted in contrastive learning actually work. It feels like all we are telling the model is that two different but related images should be encoded in a similar way, and somehow that is enough for the model to learn. The answer may just be that, with enough scale, simple-ish models can do amazing things.

References