Frustratingly Easy Test-Time Adaptation of Vision-Language Models

¹ University of Trento
² ENSTA Paris
³ Fondazione Bruno Kessler

Abstract

Vision-Language Models (VLMs) seamlessly discriminate among arbitrary semantic categories, yet they still suffer from poor generalization when presented with challenging examples. For this reason, Episodic Test-Time Adaptation (TTA) strategies have recently emerged as powerful techniques for adapting VLMs in the presence of a single unlabeled image. The recent TTA literature is dominated by the paradigm of prompt tuning via Marginal Entropy Minimization, which, relying on online backpropagation, inevitably slows down inference while increasing memory usage. In this work, we theoretically investigate the properties of this approach and unveil that a surprisingly strong TTA method lies dormant and hidden within it. We term this approach ZERO (TTA with "zero" temperature), whose design is both incredibly effective and frustratingly simple: augment N times, predict, retain the most confident predictions, and marginalize after setting the Softmax temperature to zero. Remarkably, ZERO requires only a single batched forward pass through the vision encoder and no backward passes. We thoroughly evaluate our approach following the experimental protocol established in the literature and show that ZERO largely surpasses or compares favorably with the state of the art, while being almost 10× faster and 13× more memory-friendly than standard Test-Time Prompt Tuning. Thanks to its simplicity and comparatively negligible computation, ZERO can serve as a strong baseline for future work in this field.

TL;DR: Don't forget about majority voting when you evaluate your TTA method :)

Takeaways


Background on Marginal Entropy Minimization. Test-Time Adaptation aims to adapt a model to a single image at inference time. Ideally, no prior or external knowledge should be employed in doing so. An established paradigm for TTA is Marginal Entropy Minimization (MEM), which works by augmenting the image N times, computing the so-called "marginal probability distribution" (i.e., the average probability distribution over the augmented views), and minimizing the entropy of this distribution.
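The quantity MEM minimizes can be sketched in a few lines of PyTorch (a minimal illustration, not the paper's code; `logits` is assumed to hold one row of class logits per augmented view):

```python
import torch

def marginal_entropy(logits):
    """Entropy of the marginal distribution over N augmented views.

    :param logits: (N, C) tensor, one row of class logits per view
    """
    probs = logits.softmax(dim=1)   # per-view distributions p(y | view_i)
    p_bar = probs.mean(dim=0)       # marginal distribution over the views
    return -(p_bar * p_bar.clamp_min(1e-12).log()).sum()
```

MEM-based methods backpropagate through this scalar to update the prompt, which is what ZERO avoids.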

Findings. We find that the argmax of the marginal distribution is invariant to MEM most of the time (and can be guaranteed to be so under certain conditions), and that predicting via this marginal distribution is already reasonably better than standard zero-shot inference, under the assumption that the model is well calibrated.

Empirical evidence for these findings is shown below (left: invariance, right: ensemble verification).

[Figures: left — invariance of the argmax under MEM; right — ensemble verification.]

Problem. Calibration does not hold on augmented data, yet we largely observe that CLIP models remain quite accurate in this regime. For example, here is what the reliability plots of CLIP-ViT-B-16 look like.

[Figure: reliability plots of CLIP-ViT-B-16 on augmented data.]
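For reference, a reliability plot bins predictions by confidence and compares each bin's average confidence to its empirical accuracy; the Expected Calibration Error (ECE) it visualizes can be sketched as follows (a generic sketch, not the paper's evaluation code; the helper name and signature are hypothetical):

```python
import torch

def expected_calibration_error(confidence, correct, n_bins=10):
    """Weighted average |confidence - accuracy| gap over confidence bins.

    :param confidence: (M,) predicted max-probabilities in [0, 1]
    :param correct: (M,) boolean tensor, True where the prediction was right
    """
    ece = torch.tensor(0.0)
    edges = torch.linspace(0, 1, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = (confidence[in_bin].mean() - correct[in_bin].float().mean()).abs()
            ece = ece + in_bin.float().mean() * gap  # weight by bin population
    return ece
```

A well-calibrated model has near-zero ECE: when it says 75%, it is right about 75% of the time.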

TTA with "zero" temperature is a direct consequence of these observations: since confidence information is unreliable, simply compute the marginal distribution after the temperature has been zeroed out! By adapting only this parameter, we are effectively marginalizing across one-hot encoded vectors... does this remind you of something?
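To see why zeroing the temperature turns marginalization into majority voting, here is a tiny self-contained example (the per-view logits are illustrative numbers, not real CLIP outputs):

```python
import torch

# three augmented views, three classes
logits = torch.tensor([[2.0, 1.0, 0.5],
                       [1.8, 2.2, 0.1],
                       [3.0, 0.2, 0.4]])

eps = torch.finfo(logits.dtype).eps      # "zero" temperature
one_hot = (logits / eps).softmax(dim=1)  # softmax collapses to a one-hot per view
votes = one_hot.sum(dim=0)               # marginalizing = counting votes per class
prediction = votes.argmax()              # class 0 wins: picked by 2 of 3 views
```

Dividing by machine epsilon makes each row's softmax numerically one-hot, so summing the rows is exactly a per-class vote count.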

Implementation

ZERO is implemented in a few lines of code. You can find a PyTorch-like implementation right here :)

      
    def zero(image, z_txt, N, gamma, temp):
        """
        :param image: input image of shape (3,H,W)
        :param z_txt: pre-computed text embeddings (C,hdim)
        :param N: number of augmented views
        :param gamma: filtering percentile (e.g., 0.3)
        :param temp: model's original temperature
        """
        views = augment(image, num_views=N)             # (N,3,H,W) augmented views
        l = model.image_encoder(views) @ z_txt.t()      # unscaled logits, (N,C)
        l_filt = confidence_filter(l, temp, top=gamma)  # retain most confident views
        zero_temp = torch.finfo(l_filt.dtype).eps       # "zero" temperature
        p_bar = (l_filt / zero_temp).softmax(dim=1).sum(dim=0)  # marginalize over one-hot predictions
        return p_bar.argmax()                           # = majority vote

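The `confidence_filter` helper is left abstract in the snippet above; here is a minimal sketch, under the assumption that confidence is measured by the entropy of the temperature-scaled per-view distributions (the most confident views have the lowest entropy):

```python
import torch

def confidence_filter(logits, temp, top=0.3):
    """Keep the `top` fraction of views with the lowest predictive entropy.

    :param logits: (N, C) unscaled per-view logits
    :param temp: temperature used to form per-view probabilities
    :param top: fraction of views to retain (at least one is kept)
    """
    probs = (logits / temp).softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)  # (N,)
    n_keep = max(1, int(top * logits.size(0)))
    keep = entropy.topk(n_keep, largest=False).indices            # most confident
    return logits[keep]
```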
Results


We evaluate ZERO on the standard TTA benchmarks, covering robustness to Natural Distribution Shifts and Fine-grained Classification. The results below use CLIP-ViT-B-16 from OpenAI and compare ZERO to TPT, PromptAlign, and RLCF.

Robustness to Natural Distribution Shifts

[Figure: results on the Natural Distribution Shift benchmarks.]

Fine-grained Classification

[Figure: results on the Fine-grained Classification benchmarks.]


We find that ZERO, in all its simplicity, establishes a new state of the art in TTA! Don't forget about majority voting when you evaluate your TTA method!! :)

Acknowledgements


The authors acknowledge the CINECA award under the ISCRA initiative for the availability of high-performance computing resources and support. Matteo Farina is supported by the PRIN project LEGO-AI (Prot. 2020TA3K9N) and the PAT project AI@TN. This work was supported by the projects EU Horizon ELIAS (No. 101120237), AI4TRUST (No. 101070190), and FAIR - Future AI Research (PE00000013), funded by NextGeneration EU, and was carried out in the Vision and Learning joint laboratory of Fondazione Bruno Kessler and the University of Trento, Italy.