A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

3 minute read

[WebAgent] A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

paper: https://arxiv.org/pdf/2307.12856
github: X
ICLR 2024 accpetedd (인용수: 252회, ‘25-05-28 기준)
downstream task: Web automation, Generalist WebAgent, Grounded Code Generation

1. Motivation

Real world에서 LLM Agent는 Simulated website와는 다르게 3가지 이슈에 봉착한다.

Predefined action space가 없는 dynamic한 (open-domain) 환경
Too long HTML document
No inductive bias for specific web domain

(미리캔버스 한정이라면 predefined action space로 한정하고, html document도 rule-base로 간략화할 수 있지 않나??)

$\to$ HTML 도메인 특화 LM을 만들어보자!

2. Contribution

Real-world에서 self-supervision 기반 2개의 LLM modular (HTML-T5)로 구성된 Web Agent를 제안
- HTML-T5: planning (instruction2sub-instruction), HTML summarization
- Flan-U-PaLM : executable python program을 제공하여, open-ended action도 생성할 수 있음
HTML Domain Expert model HTML-T5의 새로운 학습 방식 제안
- long-span-denoising objective: 핵심 내용이 sparse하게 존재하는 HTML document 특성을 반영하여, 기존에 masking (noising) spawn=3을 [8, 16] 사이로 늘리고 denoising을 학습시킴으로써, HTML document의 구조를 더잘 이해하도록 학습하는 식
- 대용량의 HTML corpus (CommonCrawl)에서 추출한 HTML 문서를 기반으로 학습
Real-world Web automation task (MiniWoB++, Mind2Web)에서 SOTA (vs. ChatGPT)

3.1 Web Automation

Web Automation task는 sequential action prediction task임
기존에는 단순화된 simulator를 통해 개발되고 있었음. 이는 실제 Web 환경과는 다르게 html code수가 15배 이상 작고, predefined된 action이 존재하여, out-of-distribution web page에는 적용하는데 한계가 있음

3.2 Program Synthesis

LLM에게 code generation 혹은 tool-use를 위한 api-call를 연구하고 있음

4. WebAgent

Overall Architecture
- HTML-T5: planning (instruction 2 sub-intructions)과 연관된 snippet HTML을 생성 (summarization) 기능
- Flan-U-PaLM: HTML-T5의 출력값을 기반으로 Web Automation을 위한 action을 executable program형태로 생성

4.1 HTML-T5

Model Architecture

Encoder
- Dense attention $\to$ Local & Global attention
  - Sparse한 HTML document 특성을 반영하여, local한 영역 (span=8)만 attention하는 local attention과 transcient global attention을 통해 HTML의 syntax와 semantic 이해능력을 높임
- 확장된 span length ($\mu$)
  - 기존에 일반적인 document에 대해 denoising 학습시에는 $\mu=3$을 썼으나, sparse한 HTML 특성을 반영하여 $\mu=8$을 사용
Decoder
- LongT5 논문을 따라 dense attention 그대로 적용

PreTraining with Mixture of Long-Span Denoising

$\mu$ span=3일 경우 (기존 방식)

</, id=, ">같은 의미없는 단어단위가 masking되어 복원되도록 학습되므로 HTML의 학습 성능이 떨어짐
$\mu$ span=8일 경우

<form class=, type="submit"같은 의미단위로 masking되어 복원되도록 학습되므로 HTML의 학습 성능이 향상됨

$\to$ $\mu \in {8,64}$ 를 중위값으로 하는 guassian sampling을 도입
input sequence length = 4096
output sequence length = 910
masking ratio = 15%
dataset = 100 WARC files (CommonCrawl)
- non-Unicode / alphanumeric-only HTML은 제거
- <label> 내의 for attribute가 있는 요소들만 추출하여 학습에 활용함 (nosie label 제거 목적)
  
  $\to$ 3.41M samples로 학습

4.2 Self-Experience Supervision for Alignment with Real Websites

Planning, summurization, program synthesis step별 라벨을 real-world에서 얻기는 매우 어려우므로, self-experience supervision을 사용함

Instruction Template

*Show me the way from to by at map website*
- key:value (start:”Seoul”) 형식으로 pre-defined 된 dictionary를 사용하여 instruction을 생성함

Scripted Planning & Prompted Programming

Scripted Planning
- Rule-baesd parser를 써서 instruction을 sub-instruction으로 분해함
  - regular expression을 활용하여 sub-instruction과 연관된 ID는 검색됨
Prompted Programming
- Flan-U-PaLM은 매 step마다 위의 sub-instruction과 ID값을 입력받아, Selenium WebDriver에 execute할 program을 생성함
$\to$ execute할 program이 execution erro가 나거나, regular expression로부터 retrieved error가 나거나, URL prefix가 에러나 날경우 환경(HTML Selenium)이 feedback 수행

Finetuning for Planning & Summarization

HTML-T5는 위 self-experience supervision 방식으로 얻어진 데이터로부터 학습을 수행함
- input
  - task insturction (please search 2 bedroom and 2+ bathroom houses in new your, ny with a max price of $7500 on real estate website)
  - sub-instruction histories (go to real estate website, type in new york into search, click on search, click on price, click on max rent)
  - raw HTML
- output
  - next sub-instruction (type in 7500 into max rent)
  - 연관된 HTML snippets에서 data-ref 추출

4.3 Grounded Program Synthesis

Real-world website의 특성상 Open-ended action space임 $\to$ act via programming paradigm으로 문제를 해결하고자 함
Model: Flan-U-PaLM
Environment: Selenium WebDriver
Input
- Sub-instructions: 주석으로 사용
- few-shot examples
  
  ex. selecting checkboxes, entering text into inputs, click on options, scrolling, etc
- next sub-instruction
- summarized HTML snippet
Output
- Executable code

5. Experiments

Datasets
- Realworld website: real-estate, social-media, map을 가지고 검증셋 구축
- WebSRC: static HTML comprehension benchmark
- MiniWoB++: simulated web benchmark
  - 56 simulated task + 100 evaluation episodes per task
  - real-world dataset의 입력과 동일함
  - 출력은 pre-defined action을 가짐.
- Mind2Web: offline task planning benchmark
  - 137 website + 2K instructions
  - real-world dataset과 비슷한 입력을 가짐 (HTML snippet, action history, task instruction)
  - 출력은 target element와 action + attribute를 가짐
Evaluation
- attribute별로 별도로 score를 메김
  
  ex. Can you search for a studio bedroom, 1+ bathroom apartments in oroville, ca for corporate housing on real estate website?
  
  이면, (1) apartment, (2) *corporate housing (3) studio bedroom (4) oroville, ca (5) 1+bedroom $\to$ (1),(2),(5)만 맞출 시, 3/5x100=60점
- real-world performance 측정을 위해 3가지 사이트별로 각각 260, 230, 410 epoisode를 추출
  - real-estate (20-steps-per-episode, at least 2 pages transition)
  - social-media (10-steps-per-episode, at least 4 pages transition)
  - map (easiest)
- 20개의 instruction template를 준비하여 success rate & score를 매김
  - Real-World dataset Result
Ablation Studies
- Mind2Web
  - HTML-T5를 해당 data로 finetuning
- MiniWoB++
  - HTML-T5를 해당 data로 finetuning
- Local&Global attention 유무 / longer span denoising 유무에 따른 성능 향상 여부
- Model scale별 성능 향상