[Layout] LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer
[MM] Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
[MM][GUI] V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM
[MM] Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
[MM] CogAgent: A Visual Language Model for GUI Agents