[MM] Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
[MM] CogAgent: A Visual Language Model for GUI Agents
[MM] VLM2VEC: Training Vision-Language Models for Massive Multimodal Embedding Tasks
[MM] UniCode: Learning a Unified Codebook for Multimodal Large Language Models
[MM] SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation