AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之简介
导读:本文是对展示视觉和视觉语言能力的多模态基础模型的全面调查,重点关注从专业模型向通用助手的过渡。本文涵盖了五个核心主题:
>> 视觉理解:本部分探讨了用于视觉理解的视觉骨干学习方法,包括监督预训练、对比语言-图像预训练、仅图像的自监督学习,以及多模态融合、区域级和像素级预训练。
>> 视觉生成:本部分讨论了视觉生成的各个方面,如视觉生成中的人类对齐、文本到图像生成、空间可控生成、基于文本的编辑、文本提示跟随以及概念定制。
>> 统一视觉模型:本部分考察了视觉模型从封闭集到开放集模型、任务特定模型到通用模型、静态到可提示模型的演变。
>> 加持LLMs的大型多模态模型:本部分探讨了借助大型语言模型(LLMs)训练大型多模态模型,包括LLMs中的指令微调,以及经指令微调的大型多模态模型的开发。
>> 多模态智能体:本部分重点关注将多模态工具与LLMs链接以创建多模态智能体,包括对MM-REACT的案例研究、高级主题以及多模态智能体的多种应用。
本文的关键论点包括视觉在人工智能中的重要性,多模态基础模型从专业模型向通用助手的演变,以及大型多模态模型和多模态代理在各种应用中的潜力。本文旨在面向计算机视觉和视觉语言多模态社区的研究人员、研究生和专业人士,他们希望了解多模态基础模型的基础知识和最新进展。
相关文章
AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之简介
AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之视觉理解、视觉生成
AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之统一的视觉模型、加持LLMs的大型多模态模型
AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之与LLM协同工作的多模态智能体、结论和研究趋势
《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读
地址
论文地址:https://arxiv.org/abs/2309.10020
时间
2023年9月18日
作者
微软团队
Abstract
研究了分属两类的5个核心主题:成熟的研究领域(用于视觉理解的视觉主干方法+文本到图像生成)、探索性开放研究领域(受LLM启发的统一视觉模型+多模态LLM的端到端训练+将多模态工具与LLMs链接)
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal foundation models pre-trained for specific purposes, including two topics – methods of learning vision backbones for visual understanding and text-to-image generation. (ii) Then, we present recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, including three topics – unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs. The target audiences of the paper are researchers, graduate students, and professionals in computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances in multimodal foundation models.
本文提供了一份综合性的调查,涵盖了展示视觉和视觉语言能力的多模态基础模型的分类和演化,重点关注了从专业模型过渡到通用助手的过程。研究领域涵盖了五个核心主题,分为两类。
(i)我们首先回顾了一些成熟的研究领域:为特定目的预训练的多模态基础模型,包括两个主题——用于视觉理解的视觉主干学习方法和文本到图像生成。
(ii)然后,我们介绍了探索性开放研究领域的最新进展:旨在扮演通用助手角色的多模态基础模型,包括三个主题:受大型语言模型(LLMs)启发的统一视觉模型、多模态LLMs的端到端训练,以及将多模态工具与LLMs链接。本文的目标读者是计算机视觉和视觉语言多模态社区中渴望了解多模态基础模型基础知识和最新进展的研究人员、研究生和专业人员。
1 Introduction
Vision is one of the primary channels for humans and many living creatures to perceive and interact with the world. One of the core aspirations in artificial intelligence (AI) is to develop AI agents to mimic such an ability to effectively perceive and generate visual signals, and thus reason over and interact with the visual world. Examples include recognition of the objects and actions in the scenes, and creation of sketches and pictures for communication. Building foundational models with visual capabilities is a prevalent research field striving to accomplish this objective.
视觉是人类和许多生物感知世界并与世界互动的主要渠道之一。人工智能(AI)的核心目标之一是开发能模仿这种能力的AI代理,有效地感知和生成视觉信号,从而对视觉世界进行推理并与之互动。示例包括识别场景中的物体和动作,以及创作用于交流的草图和图片。构建具有视觉能力的基础模型,是致力于实现这一目标的一个热门研究领域。
特定任务的模型(需从头开始训练)→基于语言理解和生成的预训练语言模型(为下游任务适配提供基础,如BERT/GPT-2等)→大一统的大型语言模型LLMs(涌现出新能力【如ICL/CoT】,如GPT-3等)→作为通用助手的LLMs(具备有趣的能力【交互和使用工具】,如ChatGPT/GPT-4)
Over the last decade, the field of AI has experienced a fruitful trajectory in the development of models. We divide them into four categories, as illustrated in Figure 1.1. The categorization can be shared among different fields in AI, including language, vision and multimodality. We first use language models in NLP to illustrate the evolution process. (i) At the early years, task-specific models are developed for individual datasets and tasks, typically being trained from scratch. (ii) With large-scale pre-training, language models achieve state-of-the-art performance on many established language understanding and generation tasks, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), T5 (Raffel et al., 2020), DeBERTa (He et al., 2021) and GPT-2 (Radford et al., 2019). These pre-trained models serve as the basis for downstream task adaptation. (iii) Exemplified by GPT-3 (Brown et al., 2020), large language models (LLMs) unify various language understanding and generation tasks into one model. With web-scale training and unification, some emerging capabilities appear, such as in-context-learning and chain-of-thoughts. (iv) With recent advances in human-AI alignment, LLMs start to play the role of general-purpose assistants to follow human intents to complete a wide range of language tasks in the wild, such as ChatGPT (OpenAI, 2022) and GPT-4 (OpenAI, 2023a). These assistants exhibit interesting capabilities, such as interaction and tool use, and lay a foundation for developing general-purpose AI agents. It is important to note that the latest iterations of foundation models build upon the noteworthy features of their earlier counterparts while also providing additional capabilities.
在过去的十年里,AI领域在模型的发展方面经历了一个富有成效的轨迹。我们将它们分为四个类别,如图1.1所示。这种分类可以在AI的不同领域共享,包括语言、视觉和多模态。我们首先使用自然语言处理(NLP)中的语言模型来说明演化过程。
(i) 在早期,为个别数据集和任务开发了特定任务的模型,通常是从头开始训练的。
(ii) 借助大规模预训练,语言模型在许多已建立的语言理解和生成任务上取得了最先进的性能,如BERT(Devlin等人,2019)、RoBERTa(Liu等人,2019)、T5(Raffel等人,2020)、DeBERTa(He等人,2021)和GPT-2(Radford等人,2019)。这些预训练模型为下游任务的适应提供了基础。
(iii) 以GPT-3(Brown等人,2020)为代表,大型语言模型(LLMs)将各种语言理解和生成任务统一到一个模型中。通过规模庞大的训练和统一,出现了一些新兴的能力,如上下文学习ICL和思维链CoT。
(iv) 随着人类与AI对齐方面的最新进展,LLMs开始扮演通用助手的角色,遵循人类意图,在真实开放场景中完成各种语言任务,如ChatGPT(OpenAI,2022)和GPT-4(OpenAI,2023a)。这些助手展示出了有趣的能力,如交互和工具使用,并为开发通用AI代理奠定了基础。值得注意的是,最新一代基础模型在保留早期版本显著特性的同时,还提供了额外的能力。
Inspired by the great successes of LLMs in NLP, it is natural for researchers in the computer vision and vision-language community to ask the question: what is the counterpart of ChatGPT/GPT-4 for vision, vision-language and multi-modal models? There is no doubt that vision pre-training and vision-language pre-training (VLP) have attracted a growing attention since the birth of BERT, and has become the mainstream learning paradigm for vision, with the promise to learn universal transferable visual and vision-language representations, or to generate highly plausible images. Arguably, they can be considered as the early generation of multimodal foundation models, just as BERT/GPT-2 to the language field. While the road-map to build general-purpose assistants for language such as ChatGPT is clear, it is becoming increasingly crucial for the research community to explore feasible solutions to building its counterpart for computer vision: the general-purpose visual assistants. Overall, building general-purpose agents has been a long-standing goal for AI. LLMs with emerging properties have significantly reduced the cost of building such agents for language tasks. Similarly, we foresee emerging capabilities from vision models, such as following the instructions composed by various visual prompts like user-uploaded images, human-drawn clicks, sketches and mask, in addition to text prompt. Such strong zero-shot visual task composition capabilities can significantly reduce the cost of building AI agents.
受到LLMs在NLP领域的巨大成功的启发,计算机视觉和视觉语言社区的研究人员很自然地会提出这样的问题:对于视觉、视觉语言和多模态模型,ChatGPT/GPT-4的对应物是什么?
毫无疑问,自BERT诞生以来,视觉预训练和视觉语言预训练(VLP)已经引起了越来越多的关注,并已成为视觉领域的主流学习范式,有望学习普遍可转移的视觉和视觉语言表征,或生成高度可信的图像。可以说,它们可以被视为多模态基础模型的早期代表,就像BERT/GPT-2对语言领域一样。虽然建立像ChatGPT这样的语言通用助手的路线图已经明确,但对于计算机视觉的对应物——通用视觉助手,研究社区越来越需要探索可行的解决方案。
总的来说,构建通用代理一直是AI的长期目标。具有涌现能力的LLMs极大地降低了为语言任务构建这类代理的成本。同样,我们预见视觉模型也会出现新兴能力,例如除文本提示外,还能遵循由各种视觉提示(如用户上传的图像、人工绘制的点击、草图和掩码)组成的指令。这种强大的零样本视觉任务组合能力可以显著降低构建AI代理的成本。
图1.1:语言和视觉/多模态基础模型发展轨迹的示意图
Figure 1.1: Illustration of foundation model development trajectory for language and vision/multimodality. Among the four categories, the first category is the task-specific model, and the last three categories belong to foundation models, where these foundation models for language and vision are grouped in green and blue blocks, respectively. Some prominent properties of models in each category are highlighted. By comparing the models between language and vision, we are foreseeing that the transition of multimodal foundation models follows a similar trend: from the pre-trained model for specific purpose, to unified models and general-purpose assistants. However, research exploration is needed to figure out the best recipe, which is indicated as the question mark in the figure, as multimodal GPT-4 and Gemini stay private.
图1.1:语言和视觉/多模态的基础模型发展轨迹示意图。在四个类别中,第一类别是特定于任务的模型,而最后三个类别属于基础模型,其中语言和视觉的基础模型分别以绿色和蓝色方块分组。突出显示了每个类别中模型的一些显著特性。通过比较语言和视觉之间的模型,我们预见多模态基础模型的转变会遵循类似的趋势:从针对特定目的的预训练模型到统一模型和通用助手。然而,因为多模态GPT-4和Gemini仍然保密,仍需要进行研究探索以找出最佳的方法,这在图中用问号表示。
In this paper, we limit the scope of multimodal foundation models to the vision and vision-language domains. Recent survey papers on related topics include (i) image understanding models such as self-supervised learning (Jaiswal et al., 2020; Jing and Tian, 2020; Ozbulak et al., 2023), segment anything (SAM) (Zhang et al., 2023a,c), (ii) image generation models (Zhang et al., 2023b; Zhou and Shimada, 2023), and (iii) vision-language pre-training (VLP). Existing VLP survey papers cover VLP methods for task-specific VL problems before the era of pre-training, image-text tasks, core vision tasks, and/or video-text tasks (Zhang et al., 2020; Du et al., 2022; Li et al., 2022c; Ruan and Jin, 2022; Chen et al., 2022a; Gan et al., 2022; Zhang et al., 2023g). Two recent survey papers cover the integration of vision models with LLM (Awais et al., 2023; Yin et al., 2022).
在本文中,我们将多模态基础模型的范围限定在视觉和视觉语言领域。最近关于相关主题的调查论文包括
(i)图像理解模型,如自监督学习(Jaiswal等,2020;Jing和Tian,2020;Ozbulak等,2023)、分割一切模型(SAM)(Zhang等,2023a,c),
(ii)图像生成模型(Zhang等,2023b;Zhou和Shimada,2023),以及
(iii)视觉语言预训练(VLP)。现有的VLP综述论文涵盖了预训练时代之前面向特定任务VL问题的VLP方法、图像-文本任务、核心视觉任务和/或视频-文本任务(Zhang等,2020;Du等,2022;Li等,2022c;Ruan和Jin,2022;Chen等,2022a;Gan等,2022;Zhang等,2023g)。最近有两篇综述论文涵盖了视觉模型与LLM的集成(Awais等,2023;Yin等,2022)。
Among them, Gan et al. (2022) is a survey on VLP that covers the CVPR tutorial series on Recent Advances in Vision-and-Language Research in 2022 and before. This paper summarizes the CVPR tutorial on Recent Advances in Vision Foundation Models in 2023. Different from the aforementioned survey papers that focus on literature review of a given research topic, this paper presents our perspectives on the role transition of multimodal foundation models from specialists to general-purpose visual assistants, in the era of large language models. The contributions of this survey paper are summarized as follows.
>> We provide a comprehensive and timely survey on modern multimodal foundation models, not only covering well-established models for visual representation learning and image generation, but also summarizing emerging topics for the past 6 months inspired by LLMs, including unified vision models, training and chaining with LLMs.
>> The paper is positioned to provide the audiences with the perspective to advocate a transition in developing multimodal foundation models. On top of great modeling successes for specific vision problems, we are moving towards building general-purpose assistants that can follow human intents to complete a wide range of computer vision tasks in the wild. We provide in-depth discussions on these advanced topics, demonstrating the potential of developing general-purpose visual assistants.
其中,Gan等(2022)是一份关于VLP的调查,涵盖了2022年及以前的CVPR视觉与语言研究最新进展教程系列。本文总结的是2023年的CVPR视觉基础模型最新进展教程。与上述侧重于某一研究主题文献综述的调查论文不同,本文阐述了我们对大型语言模型时代多模态基础模型从专家模型向通用视觉助手角色转变的看法。本调查论文的贡献总结如下。
>> 我们对现代多模态基础模型进行了全面和及时的调查,不仅涵盖了视觉表示学习和图像生成的成熟模型,还总结了受LLMs启发的过去6个月出现的新兴主题,包括统一视觉模型、与LLMs的训练和链接。
>>本文旨在为读者提供一种视角,倡导多模态基础模型开发方式的转变。在针对特定视觉问题的建模取得巨大成功的基础上,我们正朝着构建通用助手的方向前进,这类助手能够遵循人类意图,在真实开放场景中完成广泛的计算机视觉任务。我们对这些前沿主题进行了深入讨论,展示了开发通用视觉助手的潜力。
1.1、What are Multimodal Foundation Models?什么是多模态基础模型?
两大技术背景=迁移学习(成为可能)+规模定律(变强大),2018年底的BERT标志着基础模型时代的开始
As elucidated in the Stanford foundation model paper (Bommasani et al., 2021), AI has been undergoing a paradigm shift with the rise of models (e.g., BERT, GPT family, CLIP (Radford et al., 2021) and DALL-E (Ramesh et al., 2021a)) trained on broad data that can be adapted to a wide range of downstream tasks. They call these models foundation models to underscore their critically central yet incomplete character: homogenization of the methodologies across research communities and emergence of new capabilities. From a technical perspective, it is transfer learning that makes foundation models possible, and it is scale that makes them powerful. The emergence of foundation models has been predominantly observed in the NLP domain, with examples ranging from BERT to ChatGPT. This trend has gained traction in recent years, extending to computer vision and other fields. In NLP, the introduction of BERT in late 2018 is considered as the inception of the foundation model era. The remarkable success of BERT rapidly stimulates interest in self-supervised learning in the computer vision community, giving rise to models such as SimCLR (Chen et al., 2020a), MoCo (He et al., 2020), BEiT (Bao et al., 2022), and MAE (He et al., 2022a). During the same time period, the success of pre-training also significantly promotes the vision-and-language multimodal field to an unprecedented level of attention.
正如斯坦福基础模型论文(Bommasani等,2021)所阐明的那样,随着模型的崛起(例如BERT、GPT家族、CLIP(Radford等,2021)和DALL-E(Ramesh等,2021a)),人工智能正在经历一场范式转变,这些模型是在广泛数据上训练的,可以适应各种下游任务。他们将这些模型称为基础模型,以强调其关键的核心但不完整的特征:跨研究社区的方法同质化和新能力的出现。
从技术角度来看,正是迁移学习使得基础模型成为可能,规模(定律)使得它们变得强大。基础模型的出现主要在NLP领域观察到,从BERT到ChatGPT都有例证。这一趋势近年来逐渐受到重视,扩展到计算机视觉和其他领域。在NLP领域,2018年底引入BERT被视为基础模型时代的开始。BERT的显著成功迅速激发了计算机视觉社区对自监督学习的兴趣,催生了模型如SimCLR(Chen等,2020a)、MoCo(He等,2020)、BEiT(Bao等,2022)和MAE(He等,2022a)的出现。在同一时期,预训练的成功也显著地将视觉和语言多模态领域提升到前所未有的关注水平。
图1.2:本文旨在解决的三个多模态基础模型代表性问题的示意图
Figure 1.2: Illustration of three representative problems that multimodal foundation models aim to solve in this paper: visual understanding tasks, visual generation tasks, and general-purpose interface with language understanding and generation.
图1.2:本文旨在解决的三个多模态基础模型代表性问题的示意图:视觉理解任务、视觉生成任务以及与语言理解和生成的通用接口。
In this paper, we focus on multimodal foundation models, which inherit all properties of foundation models discussed in the Stanford paper (Bommasani et al., 2021), but with an emphasis on models with the capability to deal with vision and vision-language modalities. Among the ever-growing literature, we categorize multimodal foundation models in Figure 1.2, based on their functionality and generality. For each category, we present exemplary models that demonstrate the primary capabilities inherent to these multimodal foundation models.
在本文中,我们专注于多模态基础模型,它继承了斯坦福论文中讨论的基础模型的所有属性(Bommasani et al., 2021),但重点关注具有处理视觉和视觉语言模态能力的模型。在不断增长的文献中,我们根据其功能和通用性对图1.2中的多模态基础模型进行了分类。对于每个类别,我们提供了示例模型,展示了这些多模态基础模型固有的主要能力。
(1)、Visual Understanding Models. 视觉理解模型:三级范围(图像级→区域级→像素级),三类预训练方法(按监督信号划分:标签监督【如ImageNet】/语言监督【如CLIP和ALIGN】/仅图像自监督【如对比学习等】),外加多模态融合、区域级和像素级预训练【如CoCa/Flamingo/GLIP/SAM】
(Highlighted with orange in Figure 1.2) Learning general visual representations is essential to build vision foundation models, as pre-training a strong vision backbone is fundamental to all types of computer vision downstream tasks, ranging from image-level (e.g., image classification, retrieval, and captioning), region-level (e.g., detection and grounding) to pixel-level tasks (e.g., segmentation). We group the methods into three categories, depending on the types of supervision signals used to train the models.
>>Label supervision. Datasets like ImageNet (Krizhevsky et al., 2012) and ImageNet21K (Ridnik et al., 2021) have been popular for supervised learning, and larger-scale proprietary datasets are also used in industrial labs (Sun et al., 2017; Singh et al., 2022b; Zhai et al., 2022a).
>>Language supervision. Language is a richer form of supervision. Models like CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) are pre-trained using a contrastive loss over millions or even billions of noisy image-text pairs mined from the Web. These models enable zero-shot image classification, and make traditional computer vision (CV) models to perform open-vocabulary CV tasks. We advocate the concept of computer vision in the wild, and encourage the development and evaluation of future foundation models for this.
>>Image-only self-supervision. This line of work aims to learn image representations from supervision signals mined from the images themselves, ranging from contrastive learning (Chen et al., 2020a; He et al., 2020), non-contrastive learning (Grill et al., 2020; Chen and He, 2021; Caron et al., 2021), to masked image modeling (Bao et al., 2022; He et al., 2022a).
>>Multimodal fusion, region-level and pixel-level pre-training. Besides the methods of pre-training image backbones, we will also discuss pre-training methods that allow multimodal fusion (e.g., CoCa (Yu et al., 2022a), Flamingo (Alayrac et al., 2022)), region-level and pixel-level image understanding, such as open-set object detection (e.g., GLIP (Li et al., 2022e)) and promptable segmentation (e.g., SAM (Kirillov et al., 2023)). These methods typically rely on a pre-trained image encoder or a pre-trained image-text encoder pair.
(在图1.2中以橙色突出显示)学习通用的视觉表示对于构建视觉基础模型至关重要,因为预训练一个强大的视觉骨干是所有类型计算机视觉下游任务的基础,范围从图像级(例如图像分类、检索和图像描述)、区域级(例如检测和视觉定位grounding)到像素级任务(例如分割)。根据训练模型所用监督信号的类型,我们将这些方法分为三类。
>>标签监督。像ImageNet(Krizhevsky等,2012)和ImageNet21K(Ridnik等,2021)这样的数据集一直以来都在监督学习中很受欢迎,工业实验室也使用了规模更大的专有数据集(Sun等,2017;Singh等,2022b;Zhai等,2022a)。
>>语言监督。语言是一种更丰富的监督形式。像CLIP(Radford等,2021)和ALIGN(Jia等,2021)这样的模型,使用从网络上挖掘的数百万甚至数十亿带噪图像-文本对,通过对比损失进行预训练。这些模型使零样本图像分类成为可能,并使传统的计算机视觉(CV)模型能够执行开放词汇的CV任务。我们倡导开放世界计算机视觉(computer vision in the wild)的概念,并鼓励为此开发和评估未来的基础模型。(这种图文对比预训练目标的计算方式,见本列表之后的示意代码。)
>>仅图像自监督。这一领域的工作旨在从图像本身挖掘监督信号,学习图像表示,包括对比学习(Chen等,2020a;He等,2020)、非对比学习(Grill等,2020;Chen和He,2021;Caron等,2021)以及图像掩码建模(Bao等,2022;He等,2022a)等。
>>多模态融合、区域级别和像素级预训练。除了预训练图像主干的方法外,我们还将讨论允许多模态融合(例如CoCa(Yu等,2022a)、Flamingo(Alayrac等,2022))以及区域级别和像素级别图像理解的预训练方法,例如开放式目标检测(例如GLIP(Li等,2022e))和可提示分割(例如SAM(Kirillov等,2023))。这些方法通常依赖于预训练的图像编码器或预训练的图像-文本编码器对。
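下面给出一段示意性代码,帮助理解上文提到的语言监督(CLIP/ALIGN 风格)预训练的核心思想:对一个批次内配对的图像/文本特征计算对称的对比损失(InfoNCE)。这只是依据公开资料写的最小草图,函数名、温度取值等均为示例性假设,并非某个具体开源实现;真实系统还需要图像/文本编码器和大规模图文数据。

```python
# 示意性代码:CLIP/ALIGN 风格的图像-文本对比损失(InfoNCE)最小草图
# 假设 image_features / text_features 已由各自的编码器产生,此处仅演示对称对比目标的计算
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_features, text_features: [N, D],同一下标的图文互为正样本。"""
    # L2 归一化后,内积即余弦相似度
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [N, N] 相似度矩阵,对角线为匹配的图文对
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # 对称交叉熵:图->文 与 文->图 两个方向取平均
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# 用法示意:用随机特征代替真实编码器输出
if __name__ == "__main__":
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt).item())
```

实际训练中,温度通常作为可学习参数,并且依赖超大batch与海量图文对才能学到可迁移的表征。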
(2)、Visual Generation Models. 视觉生成模型:三大技术(向量量化VAE方法/基于扩散的模型/自回归模型),两类(文本条件的视觉生成【文本到图像生成模型{DALL-E/Stable Diffusion/Imagen}+文本到视频生成模型{Imagen Video/Make-A-Video}】、人类对齐的视觉生成器)
(Highlighted with green in Figure 1.2) Recently, foundation image generation models have been built, due to the emergence of large-scale image-text data. The techniques that make it possible include the vector-quantized VAE methods (Razavi et al., 2019), diffusion-based models (Dhariwal and Nichol, 2021) and auto-regressive models.
>>Text-conditioned visual generation. This research area focuses on generating faithful visual content, including images, videos, and more, conditioned on open-ended text descriptions/prompts. Text-to-image generation develops generative models that synthesize images of high fidelity to follow the text prompt. Prominent examples include DALL-E (Ramesh et al., 2021a), DALL-E 2 (Ramesh et al., 2022), Stable Diffusion (Rombach et al., 2021; sta, 2022), Imagen (Saharia et al., 2022), and Parti (Yu et al., 2022b). Building on the success of text-to-image generation models, text-to-video generation models generate videos based on text prompts, such as Imagen Video (Ho et al., 2022) and Make-A-Video (Singer et al., 2022).
>>Human-aligned visual generator. This research area focuses on improving the pre-trained visual generator to better follow human intentions. Efforts have been made to address various challenges inherent to base visual generators. These include improving spatial controllability (Zhang and Agrawala, 2023; Yang et al., 2023b), ensuring better adherence to text prompts (Black et al., 2023), supporting flexible text-based editing (Brooks et al., 2023), and facilitating visual concept customization (Ruiz et al., 2023).
(在图1.2中以绿色突出显示)近年来,由于大规模图像-文本数据的出现,建立了基础图像生成模型。使之成为可能的技术包括向量量化VAE方法(Razavi等,2019)、基于扩散的模型(Dhariwal和Nichol,2021)和自回归模型。
>>文本条件的视觉生成。这一研究领域专注于以开放式文本描述/提示为条件,生成忠实于提示的视觉内容,包括图像、视频等。文本到图像生成致力于开发能够根据文本提示合成高保真图像的生成模型。著名的示例包括DALL-E(Ramesh等,2021a)、DALL-E 2(Ramesh等,2022)、Stable Diffusion(Rombach等,2021;sta,2022)、Imagen(Saharia等,2022)和Parti(Yu等,2022b)。在文本到图像生成模型取得成功的基础上,文本到视频生成模型基于文本提示生成视频,如Imagen Video(Ho等,2022)和Make-A-Video(Singer等,2022)等。(文生图扩散模型的一种典型调用方式,见本列表之后的示意代码。)
>>人类对齐的视觉生成器。这一研究领域专注于改进预训练的视觉生成器,使其更好地遵循人类意图。相关工作致力于解决基础视觉生成器固有的各种挑战,包括提高空间可控性(Zhang和Agrawala,2023;Yang等,2023b)、确保更好地遵循文本提示(Black等,2023)、支持灵活的基于文本的编辑(Brooks等,2023)以及促进视觉概念定制化(Ruiz等,2023)等。
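作为补充,下面给出一个调用文生图扩散模型的示意性草图,演示"文本提示 + 去噪步数 + classifier-free guidance 强度"这几个关键要素。代码按 Hugging Face diffusers 库的常见用法写成,其中模型标识与参数取值均为示例性假设,实际接口与可用权重请以官方文档为准。

```python
# 示意性代码:基于 diffusers 库调用 Stable Diffusion 做文本到图像生成的最小草图
# 假设已安装 diffusers/transformers/torch 且有可用 GPU;模型名与参数仅作演示
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # 假设的模型标识,可替换为其他文生图权重
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a watercolor painting of a cat reading a book"
image = pipe(
    prompt,
    num_inference_steps=30,   # 去噪步数:越大越精细、也越慢
    guidance_scale=7.5,       # classifier-free guidance 强度,控制对文本提示的遵循程度
).images[0]
image.save("cat_reading.png")
```

其中 guidance_scale 越大,结果越贴近文本提示但多样性下降,这正对应上文"更好地遵循文本提示"的人类对齐问题。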
(3)、General-purpose Interface通用接口:从为特定目的设计的模型走向通用模型,三个研究主题
(3)、General-purpose Interface. (Highlighted with blue in Figure 1.2) The aforementioned multimodal foundation models are designed for specific purposes – tackling a specific set of CV problems/tasks. Recently, we see an emergence of general-purpose models that lay the basis of AI agents. Existing efforts focus on three research topics. The first topic aims to unify models for visual understanding and generation. These models are inspired by the unification spirit of LLMs in NLP, but do not explicitly leverage pre-trained LLM in modeling. In contrast, the other two topics embrace and involve LLMs in modeling, including training and chaining with LLMs, respectively.
(在图1.2中以蓝色突出显示)上述多模态基础模型都是为特定目的而设计的,旨在解决一组特定的CV问题/任务。最近,我们看到了为AI代理奠定基础的通用模型的出现。现有的努力集中在三个研究主题上。第一个主题旨在统一用于视觉理解和生成的模型,这些模型受到NLP中LLMs统一精神的启发,但没有在建模中显式利用预训练的LLM。相比之下,其他两个主题则在建模中引入了LLMs,分别对应与LLMs一起训练和与LLMs链接。
用于理解和生成的统一视觉模型:采用统一的模型体系结构(如CLIP/GLIP/OpenSeg)、统一不同粒度级别的各类VL理解任务(如I/O统一方法UniTAB/Unified-IO/Pix2Seq-v2、功能统一方法GPV/GLIP-v2/X-Decoder)、更具交互性和可提示性(如SAM/SEEM)
>>Unified vision models for understanding and generation. In computer vision, several attempts have been made to build a general-purpose foundation model by combining the functionalities of specific-purpose multimodal models. To this end, a unified model architecture is adopted for various downstream computer vision and vision-language (VL) tasks. There are different levels of unification. First, a prevalent effort is to bridge vision and language by converting all closed-set vision tasks to open-set ones, such as CLIP (Radford et al., 2021), GLIP (Li et al., 2022f), OpenSeg (Ghiasi et al., 2022a), etc. Second, the unification of different VL understanding tasks across different granularity levels is also actively explored, such as I/O unification methods like UniTAB (Yang et al., 2021), Unified-IO (Lu et al., 2022a), Pix2Seq-v2 (Chen et al., 2022d) and functional unification methods like GPV (Gupta et al., 2022a), GLIP-v2 (Zhang et al., 2022b) and X-Decoder (Zou et al., 2023a). In the end, it is also necessitated to make the models more interactive and promptable like ChatGPT, and this has been recently studied in SAM (Kirillov et al., 2023) and SEEM (Zou et al., 2023b).
>>用于理解和生成的统一视觉模型。在计算机视觉领域,已经有若干尝试通过组合特定用途多模态模型的功能来构建通用基础模型。为此,对各种下游计算机视觉和视觉语言(VL)任务采用统一的模型体系结构。统一有不同的层次。首先,一个普遍的努力是通过将所有封闭集视觉任务转化为开放集视觉任务来架起视觉和语言的桥梁,例如CLIP(Radford等,2021)、GLIP(Li等,2022f)、OpenSeg(Ghiasi等,2022a)等。其次,还积极探索了不同粒度级别上各类VL理解任务的统一,如I/O统一方法(如UniTAB(Yang等,2021)、Unified-IO(Lu等,2022a)、Pix2Seq-v2(Chen等,2022d))和功能统一方法(如GPV(Gupta等,2022a)、GLIP-v2(Zhang等,2022b)、X-Decoder(Zou等,2023a))等。最后,还需要使模型像ChatGPT一样更具交互性和可提示性,最近SAM(Kirillov et al., 2023)和SEEM(Zou et al., 2023b)对此进行了研究。(可提示分割的调用方式,见下方示意代码。)
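下面是一个可提示分割(SAM 风格)的调用示意,展示"先编码整幅图像、再用点等视觉提示反复查询"的交互方式。代码按 segment-anything 开源包的公开接口写成,其中权重文件名、示例图片与点坐标均为假设,仅作概念演示,实际用法以官方仓库为准。

```python
# 示意性代码:SAM 式"可提示分割"的调用草图
# 假设已安装 segment-anything 包并下载了对应权重;文件名与坐标均为示例
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # 假设的权重文件名
predictor = SamPredictor(sam)

image = np.array(Image.open("demo.jpg").convert("RGB"))
predictor.set_image(image)  # 先对整幅图做一次图像编码,之后可用不同提示反复查询

# 一个前景点击作为视觉提示:坐标 (x, y),标签 1 表示前景、0 表示背景
point_coords = np.array([[320, 240]])
point_labels = np.array([1])

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,   # 返回多个候选掩码以应对提示歧义
)
best_mask = masks[np.argmax(scores)]
```

同一幅图只需编码一次,之后可以换用不同的点、框等提示交互式地查询,这正是上文"可提示、可交互"接口的含义。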
与LLMs的训练(遵循指令+能力扩展到多模态+端到端的训练):如Flamingo和Multimodal GPT-4
>>Training with LLMs. Similar to the behavior of LLMs, which can address a language task by following the instruction and processing examples of the task in their text prompt, it is desirable to develop a visual and text interface to steer the model towards solving a multimodal task. By extending the capability of LLMs to multimodal settings and training the model end-to-end, multimodal LLMs or large multimodal models are developed, including Flamingo (Alayrac et al., 2022) and Multimodal GPT-4 (OpenAI, 2023a).
>>与LLMs一起训练。与LLMs的行为类似(LLMs可以通过遵循指令并处理文本提示中给出的任务示例来解决语言任务),我们希望开发一个视觉与文本接口,引导模型去解决多模态任务。通过将LLMs的能力扩展到多模态场景并进行端到端训练,人们开发出了多模态LLMs或大型多模态模型,包括Flamingo(Alayrac等,2022)和Multimodal GPT-4(OpenAI,2023a)等。(这类模型常见的"视觉编码器+投影层+LLM"组合方式,见下方的概念示意代码。)
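为帮助理解这类端到端训练的大型多模态模型的常见组合方式,下面给出一个概念性的最小草图:视觉编码器产生视觉token,经投影层映射到LLM的嵌入空间后与文本指令拼接,再交给LLM生成回复。注意这并非Flamingo或GPT-4的真实实现,类名、维度与占位模块均为假设。

```python
# 示意性代码:"视觉编码器 + 投影层 + LLM"的大型多模态模型概念草图(非任何真实系统的实现)
import torch
import torch.nn as nn

class MinimalLMM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # 例如冻结的 CLIP ViT(假设)
        self.projector = nn.Linear(vision_dim, llm_dim)   # 把视觉 token 映射到 LLM 嵌入空间
        self.llm = llm                                    # 任意解码器式语言模型(假设)

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vision_tokens = self.vision_encoder(images)       # [B, N_v, vision_dim]
        vision_embeds = self.projector(vision_tokens)     # [B, N_v, llm_dim]
        # 视觉嵌入与文本(指令)嵌入拼接后送入 LLM;端到端训练时用普通语言建模损失即可
        inputs = torch.cat([vision_embeds, text_embeds], dim=1)
        return self.llm(inputs)

if __name__ == "__main__":
    # 用随机的占位模块演示张量形状:真实系统中替换为 CLIP ViT 与 LLaMA 等
    dummy_vit = nn.Sequential(nn.Flatten(1), nn.Linear(3 * 224 * 224, 1024),
                              nn.Unflatten(1, (1, 1024)))
    dummy_llm = nn.Linear(4096, 32000)   # 假装输出词表 logits
    model = MinimalLMM(dummy_vit, dummy_llm)
    out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 16, 4096))
    print(out.shape)
```

据公开资料,LLaVA 等开源工作大体采用类似的两段式做法:先做图文对齐的投影层预训练,再用视觉指令数据做指令微调。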
与LLMs链接的工具(集成LLMs和多模态的基础模型):如Visual ChatGPT/MM-REACT
>>Chaining tools with LLM. Exploiting the tool use capabilities of LLMs, an increasing number of studies integrate LLMs such as ChatGPT with various multimodal foundation models to facilitate image understanding and generation through a conversation interface. This interdisciplinary approach combines the strengths of NLP and computer vision, enabling researchers to develop more robust and versatile AI systems that are capable of processing visual information and generating human-like responses via human-computer conversations. Representative works include Visual ChatGPT (Wu et al., 2023a) and MM-REACT (Yang* et al., 2023).
>>与LLMs链接的工具。利用LLMs的工具使用能力,越来越多的研究将LLMs(如ChatGPT)与各种多模态基础模型集成起来,通过对话接口实现图像理解和生成。这种跨学科方法结合了NLP和计算机视觉的优势,使研究人员能够开发出更强大、更通用的AI系统,既能处理视觉信息,又能通过人机对话产生类似人类的回应。代表性工作包括Visual ChatGPT(Wu等,2023a)和MM-REACT(Yang*等,2023)。("LLM调度多模态工具"的控制循环思路,见下方示意代码。)
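下面用一个极简的控制循环示意"LLM调度多模态工具"的思路:LLM根据对话决定调用哪个视觉工具,工具结果回灌到上下文,直到LLM给出最终回答。其中 call_llm、caption_image、detect_objects 等均为假设的占位函数,并非 Visual ChatGPT 或 MM-REACT 的真实接口。

```python
# 示意性代码:"LLM + 多模态工具链"的极简控制循环草图(思路类似 Visual ChatGPT / MM-REACT)
from typing import Callable, Dict

def caption_image(image_path: str) -> str:
    return "a dog playing on the beach"      # 占位:真实系统中调用图像描述模型

def detect_objects(image_path: str) -> str:
    return "dog (0.98), ball (0.87)"         # 占位:真实系统中调用开放词汇检测器

TOOLS: Dict[str, Callable[[str], str]] = {
    "caption": caption_image,
    "detect": detect_objects,
}

def call_llm(prompt: str) -> str:
    # 占位:真实系统中调用 ChatGPT 等,约定返回 "TOOL: <name>" 或 "ANSWER: <text>"
    return "TOOL: caption" if "TOOL_RESULT" not in prompt else "ANSWER: 图里有一只在沙滩上玩的狗。"

def agent(user_query: str, image_path: str, max_steps: int = 3) -> str:
    prompt = f"可用工具: {list(TOOLS)}\n用户: {user_query}\n图像: {image_path}"
    for _ in range(max_steps):
        reply = call_llm(prompt)
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        tool_name = reply.split(":", 1)[1].strip()    # 解析 LLM 选择的工具
        result = TOOLS[tool_name](image_path)          # 执行工具,并把结果回灌给 LLM
        prompt += f"\nTOOL_RESULT[{tool_name}]: {result}"
    return "未能在限定步数内完成。"

print(agent("这张图里有什么?", "demo.jpg"))
```

真实系统中,工具的选择通常依靠精心设计的系统提示词(列出工具名称、用途与调用格式)来约束LLM的输出格式。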
1.2、Definition and Transition from Specialists to General-Purpose Assistants定义和从专家到通用助手的过渡
两类多模态基础模型:特定目的的预训练视觉模型=视觉理解模型(CLIP/SimCLR/BEiT/SAM)+视觉生成模型(Stable Diffusion)、通用助手=统一架构的通才+遵循人类指令
Based on the model development history and taxonomy in NLP, we group multimodal foundation models in Figure 1.2 into two categories.
>> Specific-Purpose Pre-trained Vision Models cover most existing multimodal foundation models, including visual understanding models (e.g., CLIP (Radford et al., 2021), SimCLR (Chen et al., 2020a), BEiT (Bao et al., 2022), SAM (Kirillov et al., 2023)) and visual generation models (e.g., Stable Diffusion (Rombach et al., 2021; sta, 2022)), as they present powerful transferable ability for specific vision problems.
>> General-Purpose Assistants refer to AI agents that can follow human intents to complete various computer vision tasks in the wild. The meanings of general-purpose assistants are two-fold: (i) generalists with unified architectures that could complete tasks across different problem types, and (ii) easy to follow human instruction, rather than replacing humans. To this end, several research topics have been actively explored, including unified vision modeling (Lu et al., 2022a; Zhang et al., 2022b; Zou et al., 2023a), training and chaining with LLMs (Liu et al., 2023c; Zhu et al., 2023a; Wu et al., 2023a; Yang* et al., 2023).
基于NLP中的模型发展历史和分类,我们将多模态基础模型分为两类,如图1.2所示。
>>特定目的的预训练视觉模型包括大多数现有的多模态基础模型,包括视觉理解模型(例如CLIP(Radford等,2021)、SimCLR(Chen等,2020a)、BEiT(Bao等,2022)、SAM(Kirillov等,2023))和视觉生成模型(例如Stable Diffusion(Rombach等,2021;sta,2022)),因为它们具有针对特定视觉问题的强大的可转移能力。
>>通用助手是指能够遵循人类意图、在真实开放场景中完成各种计算机视觉任务的AI代理。通用助手的含义有两层:
(i)具有统一架构的通才,可以完成不同问题类型的任务;
(ii)易于遵循人类指令,而不是取代人类。为此,已经积极探索了若干研究主题,包括统一的视觉建模(Lu等,2022a;Zhang等,2022b;Zou等,2023a)、与LLMs一起训练以及与LLMs链接(Liu等,2023c;Zhu等,2023a;Wu等,2023a;Yang*等,2023)。
1.3、Who Should Read this Paper?谁应该阅读本文?
This paper is based on our CVPR 2023 tutorial, with researchers in the computer vision and vision-language multimodal communities as our primary target audience. It reviews the literature and explains topics to those who seek to learn the basics and recent advances in multimodal foundation models. The target audiences are graduate students, researchers and professionals who are not experts of multimodal foundation models but are eager to develop perspectives and learn the trends in the field. The structure of this paper is illustrated in Figure 1.3. It consists of 7 chapters.
本文基于我们的CVPR 2023教程,以计算机视觉和视觉语言多模态社区的研究人员为主要目标受众。它回顾了文献并向那些寻求学习多模态基础模型的基础知识和最新进展的人解释了主题。目标受众是研究生,研究人员和专业人士,他们不是多模态基础模型的专家,但渴望发展观点和了解该领域的趋势。本文的结构如图1.3所示。全文共分七章。
>> Chapter 1 introduces the landscape of multimodal foundation model research, and presents a historical view on the transition of research from specialists to general-purpose assistants.
>> Chapter 2 introduces different ways to consume visual data, with a focus on how to learn a strong image backbone.
>> Chapter 3 describes how to produce visual data that aligns with human intents.
>> Chapter 4 describes how to design unified vision models, with an interface that is interactive and promptable, especially when LLMs are not employed.
>> Chapter 5 describes how to train an LLM in an end-to-end manner to consume visual input for understanding and reasoning.
>> Chapter 6 describes how to chain multimodal tools with an LLM to enable new capabilities.
>> Chapter 7 concludes the paper and discusses research trends.
>>第1章介绍了多模态基础模型研究的背景,并提供了从专家到通用助手的研究过渡的历史视图。
>>第2章介绍了使用视觉数据的不同方法,重点介绍了如何学习强大的图像主干。
>>第3章描述了如何生成与人类意图一致的视觉数据。
>>第4章描述了如何设计统一的视觉模型,具有交互和可提示的接口,特别是在不使用LLMs的情况下。
>>第5章描述了如何以端到端的方式训练LLMs以使用视觉输入进行理解和推理。
>>第6章描述了如何将多模态工具与LLMs链接以实现新的能力。
>>第7章对全文进行了总结,并对研究趋势进行了展望。
Figure 1.3: An overview of the paper’s structure, detailing Chapters 2-6.
Relations among Chapters 2-6. 第2-6章之间的关系。
Chapters 2-6 are the core chapters of this survey paper. An overview of the structure for these chapters is provided in Figure 1.2. We start with a discussion of two typical multimodal foundation models for specific tasks, including visual understanding in Chapter 2 and visual generation in Chapter 3. As the notion of multimodal foundation models is originally based on visual backbone/representation learning for understanding tasks, we first present a comprehensive review of the transition of image backbone learning methods, evolving from early supervised methods to the recent language-image contrastive methods, and extend the discussion on image representations from image-level to region-level and pixel-level (Chapter 2). Recently, generative AI is becoming increasingly popular, where vision generative foundation models have been developed. In Chapter 3, we discuss large pre-trained text-to-image models, and various ways that the community leverages the generative foundation models to develop new techniques to make them better aligned with human intents. Inspired by the recent advances in NLP that LLMs serve as general-purpose assistants for a wide range of language tasks in daily life, the computer vision community has been anticipating and attempting to build general-purpose visual assistants. We discuss three different ways to build general-purpose assistants. Inspired by the spirit of LLMs, Chapter 4 focuses on unifying different vision models of understanding and generation without explicitly incorporating LLMs in modeling. In contrast, Chapter 5 and Chapter 6 focus on embracing LLMs to build general-purpose visual assistants, by explicitly augmenting LLMs in modeling. Specifically, Chapter 5 describes end-to-end training methods, and Chapter 6 focuses on training-free approaches that chain various vision models to LLMs.
第2-6章是本调查论文的核心章节。这些章节的结构概述如图1.2所示。我们首先讨论针对特定任务的两个典型的多模态基础模型,包括第2章的视觉理解和第3章的视觉生成。由于多模态基础模型的概念最初是基于用于理解任务的视觉骨干/表示学习,因此我们首先对图像骨干学习方法的转变进行了全面回顾,从早期的监督方法到最近的语言-图像对比方法,并将对图像表示的讨论从图像级扩展到区域级和像素级(第2章)。最近,生成式人工智能变得越来越流行,其中开发了视觉生成基础模型。在第3章中,我们讨论了大型预训练的文本到图像模型,以及社区利用生成基础模型开发新技术以使其更好地与人类意图保持一致的各种方法。受到NLP领域最新进展的启发,LLMs在日常生活中可以作为各种语言任务的通用助手,计算机视觉社区一直在期望并尝试构建通用视觉助手。我们讨论了三种不同的构建通用助手的方式。受LLMs的精神启发,第4章侧重于统一不同的理解和生成视觉模型,而不在建模中明确地包含LLMs。相反,第5章和第6章侧重于通过显式地在建模中增加LLMs来构建通用的视觉助手。具体来说,第5章描述了端到端训练方法,第6章侧重于无需训练即可将各种视觉模型链接到LLMs的方法。
How to read the paper. 如何阅读本文
Different readers have different backgrounds, and may have different purposes of reading this paper. Here, we provide some guidance.
>> Each chapter is mostly self-contained. If you have a clear goal and a clear research direction that you want to focus on, then just jump to the corresponding chapter. For example, if you are interested in building a mini prototype using OpenAI’s multimodal GPT-4, then you can directly jump to Chapter 5.
>> If you are a beginner of multimodal foundation models, and are interested in getting a glimpse of the cutting-edge research, we highly recommend that you read the whole paper chapter by chapter in order, as the early chapters serve as the building blocks of later chapters, and each chapter provides the description of the key concepts to help you understand the basic ideas, and a comprehensive literature review to help you grasp the landscape and the state of the art.
>> If you already have rich experience in multimodal foundation models and are familiar with the literature, feel free to jump to specific chapters you want to read. In particular, we include in most chapters a section to discuss advanced topics and sometimes provide our own perspectives, based on the up-to-date literature. For example, in Chapter 6, we discuss several important aspects of multimodal agents in tool use, including tool creation and its connection to retrieval-augmented methods.
不同的读者有不同的背景,可能阅读本文的目的也不同。在这里,我们提供一些建议。
>> 每一章基本上都是独立的。如果你有一个明确的目标和一个你想要关注的明确的研究方向,那么就直接跳到相应的章节。例如,如果您对使用OpenAI的多模态GPT-4构建迷你原型感兴趣,那么您可以直接跳到第5章。
>> 如果您是多模态基础模型的初学者,并且有兴趣了解前沿研究的概貌,我们强烈建议您按顺序逐章阅读整篇文章,因为早期章节是后续章节的基石,每章提供了关键概念的描述,帮助您理解基本思想,并提供了全面的文献回顾,帮助您掌握领域的概貌和最新技术。
>> 如果您已经在多模态基础模型领域拥有丰富的经验,并且熟悉相关文献,可以随意跳转到您想要阅读的特定章节。特别是在大多数章节中,我们都包含了讨论高级主题的部分,有时还根据最新文献提供了我们自己的观点。例如,在第6章中,我们讨论了多模态工具在工具使用中的几个重要方面,包括工具的创建以及与检索增强方法的关联。
1.4、Related Materials: Slide Decks and Pre-recorded Talks相关资料:幻灯片和预录演讲
This survey paper extends what we present in the CVPR 2023 tutorial by covering the most recent advances in the field. Below, we provide a list of slide decks and pre-recorded talks, which are related to the topics in each chapter, for references.
Chapter 2: Visual and Vision-Language Pre-training (Youtube, Bilibili)
Chapter 3: Alignments in Text-to-Image Generation (Youtube, Bilibili)
Chapter 4: From Representation to Interface: The Evolution of Foundation for Vision Understanding (Youtube, Bilibili)
Chapter 5: Large Multimodal Models (Youtube, Bilibili)
Chapter 6: Multimodal Agents: Chaining Multimodal Experts with LLMs (Youtube, Bilibili)
本调查论文扩展了我们在CVPR 2023教程中提出的内容,涵盖了该领域最新的进展。以下是与每章主题相关的幻灯片和预录演讲的列表,供参考。
第2章:视觉和视觉语言预训练(Youtube,Bilibili)
第3章:文本到图像生成中的对齐(Youtube,Bilibili)
第4章:从表示到接口:视觉理解基础的演变(Youtube,Bilibili)
第5章:大型多模态模型(Youtube,Bilibili)
第6章:多模态智能体:将多模态专家与LLMs链接(Youtube,Bilibili)