[Layout] PosterO: Structuring Layout Trees to Enable Language Models in Generalized Content-Aware Layout Generation


1. Motivation

Existing content-aware layout generation had two limitations
- It focused on image-centric enhancements
  - Saliency maps are fed in as input to provide layout constraints $\to$ with limited training data, the models get stuck in a local solution space and produce low-quality layout constraints
- It only considered rectangular shapes
$\to$ Why not replace these with an input that can draw out the LLM's knowledge?

Proposes the Layout Tree, a new language-based layout representation that lets the LLM's knowledge be used
- Hierarchical nodes express the “design intent” and elements of arbitrary shape
- Expressed in the SVG language
Proposes a detection model that predicts the design intent so that only intent-aligned layouts are generated
- RAG (Retrieval Augmented Generation) concept
- Refers to the layouts of poster data with a similar design intent
The generated layout is rendered through a chat-style interaction so that poster design realization is possible
SOTA on the content-aware layout generation task
Also proposes new metrics
- Intent-aware content metrics $\to$ more reliable than the existing saliency-aware metrics
Proposes PStylish7, a dataset covering not only posters but also more diverse domains (SNS, Paint, Poem, etc.)
- 152 few-shot / 100 test examples
- 7 purposes / 8 element types
- Varying difficulty (train / test category gap)
- Varied aspect ratios

Overall architecture
- 3 steps up to layout rendering
  - Layout tree construction for the query poster $\to$ layout tree generation using the layouts of k-shot candidate posters whose design intent is similar to the query poster $\to$ poster design realization through chat with the LLM M
Why language-based?
- The idea is to represent layouts in language only, so that the LLM's built-in knowledge can be put to good use
- Another reason: paired (image, layout) datasets are far scarcer than the (language-only) data used to train LLMs
How to represent layout in language?
- Represented as a Layout Tree based on the SVG language
  - Why SVG language? $\to$ it is a language that can be rendered
    - Implemented by rendering with the SVG language $\to$ could this also work in MiriCanvas?

Universal Shape Vectorization $\to$ adopts the SVG standard [9]
- Ellipse
- Rectangle
- Rotated rectangle
- Complex curve: uses cubic Bezier curve functions
- Polygon: used only to mark the design intent area
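
As a rough illustration of the shape types above, the sketch below maps each one to a standard SVG element using Python's `xml.etree.ElementTree`. The helper names (`ellipse`, `rect`, `cubic_bezier_path`, `polygon`) and the choice to express rotation as a `transform="rotate(...)"` around the rectangle's center are assumptions made for this example, not the paper's actual serialization.

```python
import xml.etree.ElementTree as ET


def ellipse(cx, cy, rx, ry):
    """Ellipse given by its center and radii."""
    return ET.Element("ellipse", cx=str(cx), cy=str(cy), rx=str(rx), ry=str(ry))


def rect(x, y, w, h, angle=0):
    """Axis-aligned rectangle; a non-zero angle rotates it about its center."""
    el = ET.Element("rect", x=str(x), y=str(y), width=str(w), height=str(h))
    if angle:
        cx, cy = x + w / 2, y + h / 2
        el.set("transform", f"rotate({angle} {cx} {cy})")
    return el


def cubic_bezier_path(start, segments):
    """Curve built from cubic Bezier segments: (control1, control2, end)."""
    d = f"M {start[0]} {start[1]}"
    for c1, c2, end in segments:
        d += f" C {c1[0]} {c1[1]}, {c2[0]} {c2[1]}, {end[0]} {end[1]}"
    return ET.Element("path", d=d)


def polygon(points):
    """Polygon, used here only for a design intent area."""
    pts = " ".join(f"{x},{y}" for x, y in points)
    return ET.Element("polygon", points=pts)


if __name__ == "__main__":
    for el in (ellipse(50, 50, 20, 10), rect(10, 20, 100, 40, angle=30)):
        print(ET.tostring(el, encoding="unicode"))
```

Every shape ends up as plain text, which is what makes it usable as LLM input and directly renderable.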
Design Intent
- Represents the regions within the given image where elements can be placed (polygon)
- The U-Net-based design intent detection model S is trained in a semi-supervised manner with an MSE loss
  - Predicts a design intent bitmap (segmentation model)
- The U-Net encoder output vector $\mathbf{Z}_D$ serves as the design intent vector and is used later for retrieval
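
To make the training objective concrete, here is a minimal pure-Python sketch of a pixel-wise MSE loss between a predicted intent map and a target bitmap, plus a thresholding step that turns the prediction into a bitmap. The function names and the 0.5 threshold are assumptions for illustration. The notes do not describe how the semi-supervised part is set up, so it is not modeled here.

```python
def mse_loss(pred, target):
    """Mean squared error over all pixels of two equally sized 2D maps."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("pred and target must have the same number of pixels")
    return sum((p - t) ** 2 for p, t in zip(flat_pred, flat_target)) / len(flat_pred)


def binarize(pred, threshold=0.5):
    """Turn a soft intent map into a 0/1 design intent bitmap."""
    return [[1 if p >= threshold else 0 for p in row] for row in pred]


if __name__ == "__main__":
    predicted = [[0.9, 0.2], [0.6, 0.1]]  # hypothetical model output
    target = [[1, 0], [1, 0]]             # hypothetical design intent bitmap
    print(mse_loss(predicted, target))
    print(binarize(predicted))
```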
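Hierarchical Node Representation
- Composed of an SVG container that accounts for the image resolution
- Sorted by underlay area
- For an underlay element, if the condition below is satisfied, it is expressed in a subtree node using relative coordinates
  - Condition
  - Relative-coordinate example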
    
    $<svg \quad x={N_a.x}\quad y={N_a.y}>{N_a'}{N_b'}</svg>$
- Each element is assigned a unique ID ${c_i}$
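
The nesting idea can be sketched as follows: an underlay and the elements grouped with it go into a nested `<svg>` container, and their coordinates are rewritten relative to that container. This is a sketch under stated assumptions: the container origin is taken to be the underlay's top-left corner, every element is a rectangle, and the grouping itself is passed in by hand because the grouping condition is not reproduced in these notes. `make_subtree`, `build_layout_tree`, and the dictionary keys are hypothetical names.

```python
import xml.etree.ElementTree as ET


def make_subtree(underlay, children):
    """Group an underlay and its children in a nested <svg> with relative coords.

    Each node is a dict with keys id, x, y, width, height (absolute pixels).
    Assumption: the nested container sits at the underlay's top-left corner.
    """
    origin_x, origin_y = underlay["x"], underlay["y"]
    group = ET.Element("svg", x=str(origin_x), y=str(origin_y))
    for node in [underlay, *children]:
        ET.SubElement(
            group,
            "rect",
            id=node["id"],
            x=str(node["x"] - origin_x),  # relative to the container origin
            y=str(node["y"] - origin_y),
            width=str(node["width"]),
            height=str(node["height"]),
        )
    return group


def build_layout_tree(canvas_w, canvas_h, subtrees):
    """Root SVG container sized to the image resolution."""
    root = ET.Element("svg", width=str(canvas_w), height=str(canvas_h))
    root.extend(subtrees)
    return root


if __name__ == "__main__":
    underlay = {"id": "underlay_0", "x": 100, "y": 200, "width": 300, "height": 80}
    text = {"id": "text_0", "x": 120, "y": 220, "width": 260, "height": 40}
    tree = build_layout_tree(513, 750, [make_subtree(underlay, [text])])
    print(ET.tostring(tree, encoding="unicode"))
```

With relative coordinates, moving the underlay only changes the container's `x` and `y`; the children's positions inside it stay the same.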

Intent-aligned Example Selection
- Previously, similar posters were retrieved based on the saliency map (or the query image), and their layouts were fed to the LLM as in-context learning examples
- This paper performs retrieval based on the latent vector $\mathbf{Z}_D$ of the design intent model
- The layouts of the retrieved candidate posters are converted into short text through a prompt template
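
The retrieval-then-prompt flow above might look roughly like the sketch below. The notes do not say which similarity measure is used, so cosine similarity is an assumption here, as are the function names, the tuple layout of a candidate, and the prompt wording (the paper's actual template is not reproduced).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, candidates, k):
    """Return the k candidates whose intent vectors are closest to the query.

    candidates: list of (poster_id, intent_vector, layout_svg) tuples.
    """
    ranked = sorted(
        candidates,
        key=lambda cand: cosine_similarity(query_vec, cand[1]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(examples, canvas_w, canvas_h):
    """Hypothetical template: list the example layouts, then ask for a new one."""
    parts = [f"Example {i}:\n{svg}" for i, (_, _, svg) in enumerate(examples, 1)]
    parts.append(f"Generate a layout for a {canvas_w}x{canvas_h} canvas:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    candidates = [
        ("a", [1.0, 0.0], "<svg>a</svg>"),
        ("b", [0.0, 1.0], "<svg>b</svg>"),
        ("c", [0.9, 0.1], "<svg>c</svg>"),
    ]
    top = retrieve_top_k([1.0, 0.0], candidates, k=2)
    print(build_prompt(top, 513, 750))
```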