[MM] MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
- paper: https://arxiv.org/pdf/2403.09611
- arXiv preprint (52 citations as of 2024-07-07)
Abstract
- Analyzes how MLLM training outcomes change with different architecture components and data choices
- e.g., image captioning only vs. image captioning + text-only dataset
- Proposes a priority ordering of the key design lessons:
- Image resolution = number of image tokens > visual connector design
1. Introduction
- Examples of in-context learning ability
- Chain-of-Thought
2. Recipe for Building MM1
- Architecture
- Image encoder: ViT-L/14 with CLIP loss; resolution 336x336
- Connector: C-Abstractor / Average Pooling
- Language model: 1.2B / 3.5B
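A minimal PyTorch sketch of the average-pooling connector variant, assuming the 24x24 patch grid of ViT-L/14 at 336x336; the `AvgPoolConnector` module and its sizes are hypothetical illustrations, not the MM1 implementation.

```python
import torch
import torch.nn as nn

class AvgPoolConnector(nn.Module):
    """Average-pooling VL connector (illustrative sketch, not the MM1 code).

    Pools the ViT patch grid down to `num_tokens` visual tokens and projects
    them into the LLM embedding space.
    """

    def __init__(self, vit_dim=1024, llm_dim=2048, grid_size=24, num_tokens=144):
        super().__init__()
        self.grid_size = grid_size                 # ViT-L/14 at 336x336 -> 336/14 = 24 patches per side
        out_side = int(num_tokens ** 0.5)          # 144 tokens -> 12x12 output grid
        self.pool = nn.AdaptiveAvgPool2d(out_side)
        self.proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, patch_embeds):               # (B, grid*grid, vit_dim)
        b, n, d = patch_embeds.shape
        x = patch_embeds.transpose(1, 2).reshape(b, d, self.grid_size, self.grid_size)
        x = self.pool(x)                           # (B, vit_dim, 12, 12)
        x = x.flatten(2).transpose(1, 2)           # (B, 144, vit_dim)
        return self.proj(x)                        # (B, 144, llm_dim)

print(AvgPoolConnector()(torch.randn(2, 576, 1024)).shape)   # torch.Size([2, 144, 2048])
```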
- Data
- Captioned images (45%) + interleaved image-text documents (45%) + text-only data (10%)
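A minimal sketch of sampling each training example's source with these mixture weights; the placeholder streams stand in for real loaders, and only the 45/45/10 weights come from the notes above.

```python
import random
from itertools import count

# Placeholder streams standing in for the three pre-training sources
# (the real loaders are not part of these notes).
def make_stream(name):
    return (f"{name}-example-{i}" for i in count())

sources = {
    "captioned_images": make_stream("caption"),
    "interleaved_docs": make_stream("interleaved"),
    "text_only": make_stream("text"),
}
# Mixture weights taken from the 45% / 45% / 10% recipe above.
weights = {"captioned_images": 0.45, "interleaved_docs": 0.45, "text_only": 0.10}

def sample_batch(batch_size=8):
    """Draw one batch, choosing each example's source with the mixture weights."""
    names, probs = zip(*weights.items())
    picks = random.choices(names, weights=probs, k=batch_size)
    return [next(sources[name]) for name in picks]

print(sample_batch())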
- Training procedure
- Contrastive Loss > Reconstruction Loss
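The contrastive objective referred to here is the CLIP-style symmetric InfoNCE loss used to pre-train the image encoder; a minimal sketch with a hypothetical function name:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over matching image-text pairs (sketch).

    image_feats, text_feats: (B, D) embeddings where row i of each tensor
    comes from the same image-text pair.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature        # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)                 # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)             # text -> matching image
    return (loss_i2t + loss_t2i) / 2

print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```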
- Encoder Lesson
- Image resolution > Model size > Training data composition
- VL Connector Lesson:
- Number of visual tokens and image resolution matter
- VL connector type has no effect
- Data Lesson
- Captioning data: beneficial for zero-shot performance
- Interleaved image-text data + text-only data: beneficial for few-shot performance
- Captioned image : interleaved image-text data : text-only data = 4:4:1 gave the best performance
3. Final Training Recipe
- Image encoder: ViT-H/14 with CLIP loss; resolution 378x378
- Connector: C-Abstractor (144 tokens)
- Data: Captioned image : Interleaved image-text data : text-only data = 4:4:1
- Model scaling: the optimal learning rate is extrapolated as a function of model scale (see the sketch after this list)
- $\eta$: learning rate
- N: # of parameters
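A minimal sketch of the implied extrapolation, assuming a log-log (power-law) fit of $\eta$ against N; the (N, $\eta$) pairs below are made up for illustration and the paper's fitted coefficients are not reproduced here.

```python
import numpy as np

# Hypothetical (N, eta) pairs from small-scale LR sweeps; NOT the paper's numbers.
params = np.array([9e6, 85e6, 302e6, 1.2e9])            # model sizes N
best_lr = np.array([2.0e-4, 1.2e-4, 8.0e-5, 5.0e-5])    # best peak LR found at each size

# Fit ln(eta) = a * ln(N) + b, i.e. a power law eta = exp(b) * N**a.
a, b = np.polyfit(np.log(params), np.log(best_lr), deg=1)

def predict_lr(n_params):
    """Extrapolate the optimal peak learning rate to a larger model size."""
    return float(np.exp(a * np.log(n_params) + b))

print(predict_lr(30e9))   # e.g. extrapolate to a 30B-parameter model
```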
- MoE
- Top-2 gating provides scaling gains (only the decoder-only LLM part is changed)
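A minimal sketch of a top-2 gated MoE FFN of the kind that replaces dense FFN blocks in the decoder; the module and its sizes are hypothetical, and the expert loop is written for clarity rather than efficiency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Top-2 gated mixture-of-experts FFN (illustrative sketch, not the MM1 code)."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                               # x: (tokens, d_model)
        scores = self.gate(x)                           # (tokens, num_experts)
        top_vals, top_idx = scores.topk(2, dim=-1)      # route each token to its 2 best experts
        weights = F.softmax(top_vals, dim=-1)           # renormalise over the chosen 2
        out = torch.zeros_like(x)
        for k in range(2):                              # dense loop for clarity, not efficiency
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

print(Top2MoE()(torch.randn(16, 512)).shape)   # torch.Size([16, 512])
```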
- Final Pre-training Result
4. Supervised Fine-Tuning
- As in LLaVA-NeXT, training/validation data are collected from GPT-4V-generated captions (1.45M examples)
- Higher resolution
- Positional embedding interpolation is applied to support higher-resolution training than in pre-training $\to$ 672x672 (see the sketch after this list)
- Sub-image decomposition: a 1344x1344 image is resized to 672x672 and also split into 672x672 sub-images for processing
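A minimal sketch of the positional-embedding interpolation, assuming a patch size of 14 so the grid grows from 27x27 (378x378 pre-training) to 48x48 (672x672 fine-tuning); the helper name is hypothetical and CLS-token handling is omitted.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_side=27, new_side=48):
    """Resize a (1, old_side*old_side, dim) ViT positional-embedding grid.

    With a patch size of 14, 378x378 pre-training gives a 27x27 grid and
    672x672 fine-tuning needs 48x48. CLS-token handling is omitted for brevity.
    """
    dim = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, old_side, old_side, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_side, new_side), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_side * new_side, dim)

pos = torch.randn(1, 27 * 27, 1280)                     # e.g. ViT-H width
print(interpolate_pos_embed(pos).shape)                 # torch.Size([1, 2304, 1280])
```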
- Lesson: Pre-training contributes to better SFT performance