
New Multimodal AI Solution Combines Lightweight Vision Models and Large Language Models in Collaboration

Recent research suggests that an AI system can gain new sensory modalities, such as vision, without rebuilding the entire system from scratch. The Be My Eyes project by James Y. Huang and colleagues presents a framework in which different AI agents collaborate: lightweight, vision-specialized models observe, while large language models reason and decide.

The underlying problem is that large vision-language models, which jointly process images and text, are expensive to develop and demand massive computational power. Smaller vision-language models are more energy-efficient and easier to adapt, but they lack the broad general knowledge and reasoning ability of large language models.

The Be My Eyes solution sidesteps this trade-off by dividing the work between two agents. The vision agent interprets images or other visual material and turns its observations into a conversation, on the basis of which the language agent performs the actual reasoning. The interaction between the agents resembles human collaboration: one observes and describes, the other analyzes and proposes solutions.
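The division of labor described above can be sketched in a few lines of Python. This is a hypothetical illustration of the two-agent pattern, not the paper's actual implementation: the class names, interfaces, and stubbed model outputs are all assumptions made for clarity.

```python
# Minimal sketch of the observe-and-reason loop: a vision agent that only
# describes, and a language agent that asks questions and draws conclusions.
# All interfaces here are hypothetical; real agents would wrap actual models.

from dataclasses import dataclass, field


@dataclass
class Dialogue:
    """Shared conversation history between the two agents."""
    turns: list = field(default_factory=list)

    def add(self, speaker: str, text: str) -> None:
        self.turns.append((speaker, text))


class VisionAgent:
    """Stands in for a lightweight vision model: it only observes and describes."""

    def __init__(self, image_caption: str):
        # A real vision agent would run a small vision model on pixels;
        # here we stub its perception with a fixed caption.
        self.image_caption = image_caption

    def respond(self, question: str) -> str:
        # Answer the question using only visual observations.
        return f"I see: {self.image_caption}"


class LanguageAgent:
    """Stands in for a large language model: it asks and reasons."""

    def ask(self, dialogue: Dialogue) -> str:
        # Decide what visual information is still needed.
        return "What objects are visible, and what are they doing?"

    def conclude(self, dialogue: Dialogue) -> str:
        # Reason over the collected descriptions to produce a final answer.
        observations = " ".join(t for s, t in dialogue.turns if s == "vision")
        return f"Based on the observations ({observations}), here is my answer."


def collaborate(vision: VisionAgent, language: LanguageAgent, rounds: int = 1) -> str:
    """Run the question-and-description loop, then let the language agent conclude."""
    dialogue = Dialogue()
    for _ in range(rounds):
        question = language.ask(dialogue)
        dialogue.add("language", question)
        dialogue.add("vision", vision.respond(question))
    return language.conclude(dialogue)
```

The key design point the sketch captures is that the only channel between the agents is natural-language dialogue: the language model never touches pixels, so the vision side can be swapped for a different lightweight specialist without retraining the reasoner.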

The research team also presents a data-generation and training recipe for preparing the vision agent for this collaboration. Using synthetic dialogues, the model is taught how to converse with the language agent, supplying the information it needs so that its reasoning capabilities can be fully exploited.
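The idea of training on synthetic dialogues rather than bare labels can be illustrated as follows. This is a hypothetical sketch of what one such training example might look like; the field names and data format are assumptions, not the paper's actual schema.

```python
# Hypothetical construction of one synthetic training example for the
# vision agent: the supervision target is a dialogue reply, grounded in
# a caption that is known because the data is generated synthetically.

def make_training_example(caption: str, question: str) -> dict:
    """Pair a language-agent question with the reply the vision agent
    should learn to give for an image with the given (known) caption."""
    return {
        "context": [("language", question)],   # conversation so far
        "target": f"I see: {caption}",         # desired vision-agent reply
    }


example = make_training_example(
    caption="a red bicycle leaning against a wall",
    question="What objects are visible?",
)
```

Because the caption is known at generation time, the correct reply can be produced automatically at scale, which is what makes synthetic data attractive for teaching this conversational behavior.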

This approach is significant because it may allow the addition of new sensory modalities to large language models without building a massive, all-encompassing model each time. Instead, it is possible to combine lighter specialized models and leverage the dialogue between them and the language models.


This text was generated with AI assistance and may contain errors. Please verify details from the original source.

Original research: Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration
Publisher: arXiv (AI)
Authors: James Y. Huang, Sheng Zhang, Qianchu Liu, Guanghui Qin, Tinghui Zhu, Tristan Naumann, Muhao Chen, Hoifung Poon
December 24, 2025