[GM][TTT] Diffusion-TTA: Test-time Adaptation of Discriminative Models via Generative Feedback


1. Motivation

As diffusion-based generative models keep improving, a natural question is how to put them to good use for test-time adaptation (TTA).
- A discriminator learns the mapping $x \to y$, i.e. $p(y|x)$, whereas a generator learns $p(x|y)$ $\to$ inverting the generator with Bayes' rule gives a classifier whose inference is iterative
- Because the generator's task is harder, training it to maximize the data likelihood can give it the capacity to pick up subtle differences in the data (a nuanced understanding of the data)
Idea: build the condition of a conditional diffusion model from the discriminator's output, and raise the data likelihood under that condition.
- Inverted generative models are known to generalize well to OOD (out-of-distribution) images $\to$ a good fit for TTA
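
The Bayes-rule inversion mentioned in the first bullet can be written out as follows. The class prior $p(y)$ is notation assumed here; the notes do not name it.

\[p_{\phi}(y|x) = \frac{p_{\phi}(x|y)\,p(y)}{\sum_{y'} p_{\phi}(x|y')\,p(y')}\]

Each candidate label $y'$ needs its own likelihood evaluation under the generator, which is presumably why inference with an inverted generator is iterative rather than a single forward pass.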

The paper proposes Diffusion-TTA, which uses the diffusion loss as the TTT (test-time training) loss to jointly adapt a pre-trained discriminator (classifier, segmentor, depth predictor) and a diffusion model.
It reports substantial gains for large-scale discriminative models across several downstream tasks (classification, semantic segmentation, depth estimation).

Naive Method
- What about simply ensembling the inverted generator with the discriminator?
  \[\frac{p_{\theta}(y|x)+p_{\phi}(y|x)}{2}\]
  $\to$ In the experiments, this gave no performance gain
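
For reference, the naive ensemble above is just an element-wise average of two class-probability vectors. A minimal sketch, where the probability values are made up for illustration and `ensemble_probs` / `argmax` are hypothetical helper names:

```python
def ensemble_probs(p_disc, p_gen):
    """Average two class-probability vectors element-wise."""
    if len(p_disc) != len(p_gen):
        raise ValueError("both models must score the same set of classes")
    return [(a + b) / 2 for a, b in zip(p_disc, p_gen)]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical predictions over three classes (not from the paper).
p_theta = [0.6, 0.3, 0.1]  # discriminator, p_theta(y|x)
p_phi = [0.2, 0.7, 0.1]    # inverted generator, p_phi(y|x)

p_avg = ensemble_probs(p_theta, p_phi)
prediction = argmax(p_avg)
```

Note that nothing is updated here: both models stay fixed and only their outputs are mixed, in contrast to the adaptation described next.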
Instead, Diffusion-TTA is proposed: at test time, the discriminator and the generator are trained in the direction that lowers the generator's loss (i.e. maximizes the sample's likelihood)
- The setup resembles an encoder-decoder structure
  - Encoder: the discriminator. It predicts class scores for the image
    - Classification: ResNet, ConvNeXt; SegFormer (semantic segmentation); DenseDepth (depth prediction)
    - Diffusion: DiT (ImageNet), Stable Diffusion (open-vocabulary)
  - Decoder: the diffusion model. Conditioned on the encoder's class scores, it predicts the noise to remove
- Total Loss
  - Only the diffusion reconstruction loss is used (no joint training with a classification loss)
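
To make the encoder-decoder picture concrete, here is a toy sketch in plain Python of how the reconstruction loss reacts to the condition. Everything in it is an assumption for illustration: `ALPHA_BARS`, `PROTOTYPES`, and `toy_denoiser` are hypothetical stand-ins, and the "denoiser" simply guesses the clean image from the class scores instead of being a trained network.

```python
import math
import random

ALPHA_BARS = [0.9, 0.5, 0.1]             # hypothetical noise schedule
PROTOTYPES = [[1.0, 1.0], [-1.0, -1.0]]  # hypothetical clean "image" per class


def add_noise(x0, eps, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * xi + b * ei for xi, ei in zip(x0, eps)]


def toy_denoiser(x_t, cond, t):
    """Stand-in decoder: guess the clean image from the class scores in `cond`,
    then solve the noising equation for the noise."""
    a, b = math.sqrt(ALPHA_BARS[t]), math.sqrt(1.0 - ALPHA_BARS[t])
    x0_hat = [sum(c * p[d] for c, p in zip(cond, PROTOTYPES))
              for d in range(len(x_t))]
    return [(xt - a * xh) / b for xt, xh in zip(x_t, x0_hat)]


def diffusion_loss(x0, cond, rng, n_samples=16):
    """Monte Carlo estimate of the noise-prediction (reconstruction) loss."""
    total = 0.0
    for _ in range(n_samples):
        t = rng.randrange(len(ALPHA_BARS))          # random timestep
        eps = [rng.gauss(0.0, 1.0) for _ in x0]     # random noise
        x_t = add_noise(x0, eps, ALPHA_BARS[t])
        eps_hat = toy_denoiser(x_t, cond, t)
        total += sum((p - e) ** 2 for p, e in zip(eps_hat, eps)) / len(x0)
    return total / n_samples


x = [1.0, 1.0]  # test image that belongs to class 0
loss_good = diffusion_loss(x, cond=[1.0, 0.0], rng=random.Random(0))
loss_bad = diffusion_loss(x, cond=[0.0, 1.0], rng=random.Random(0))
```

In this toy, class scores that point at the right class give a near-zero loss and wrong scores give a clearly larger one. That gap is the signal the method relies on: since the condition comes from the encoder, lowering the reconstruction loss pushes the encoder toward better class scores.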

Generating the condition c
- As in similarity-based classification, $c$ is computed as a dot product with the text embeddings $l_j \in \mathbb{R}^D$, $j \in \{1, \dots, L\}$, of the $L$ classes seen in training
  - Classification
  - Semantic Segmentation
  - Depth Estimation: $c = y$
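
The classification case can be sketched as below, reading the "dot product" as a probability-weighted sum $c = \sum_j y_j\, l_j$ of the text embeddings. That reading is an assumption, as are the helper names `softmax` and `condition_from_probs` and all the numbers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def condition_from_probs(probs, text_embeddings):
    """c = sum_j probs[j] * text_embeddings[j], a vector of dimension D."""
    dim = len(text_embeddings[0])
    c = [0.0] * dim
    for y_j, l_j in zip(probs, text_embeddings):
        for d in range(dim):
            c[d] += y_j * l_j[d]
    return c


# Hypothetical setup: L = 3 classes, D = 2 embedding dimensions.
text_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [2.0, 0.5, -1.0]   # made-up discriminator outputs
y = softmax(logits)
c = condition_from_probs(y, text_embeddings)
```

Because `c` is a smooth function of the class probabilities, a loss computed on the diffusion side can be differentiated back to the discriminator. For semantic segmentation the same computation would presumably be applied per pixel.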
Loss
- $\theta$: parameters of the discriminator, used when generating $c$
- $\phi$: parameters of the generator
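
The loss itself is not written out in these notes. Assuming the usual conditional noise-prediction objective of diffusion models, it would read roughly as follows. The symbols $\epsilon_{\phi}$ (the denoiser), $\bar{\alpha}_t$ (the cumulative noise schedule) and $c_{\theta}(x)$ (the condition built from the discriminator's output) are notation assumed here.

\[\mathcal{L}(\theta, \phi) = \mathbb{E}_{t,\,\epsilon}\left\| \epsilon_{\phi}\left(\sqrt{\bar{\alpha}_t}\,x + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; c_{\theta}(x),\; t\right) - \epsilon \right\|^2\]

Under this form, $\theta$ enters only through the condition and $\phi$ only through the denoiser, so a single loss provides gradients to both.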
Algorithm
Implementation details
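- Larger batches were found experimentally to work better $\to$ gradient accumulation is used (x5)
- In all experiments, every parameter of the discriminator is updated
- In the open-vocabulary experiments the diffusion model is frozen; in the CLIP experiments CLIP is frozen and only the LoRA adapter weights are trained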