Large Multimodal Foundation Models

ECCV 2024 Tutorial, Milan, Italy


The current discourse on technological progress underscores the interconnected roles of large multimodal foundation models. The need for this integration becomes evident when considering the complex dynamics of real-world environments. For instance, an autonomous vehicle operating in an urban setting should not rely solely on visual sensors for pedestrian detection; it must also interpret and respond to auditory signals, such as vocalized warnings. Similarly, combining visual data with linguistic context promises more adaptive robot behavior, especially in diverse settings. Acknowledging the rapid expansion of this field, the tutorial will cover the history, applications, and future directions of multimodal learning. We will also address privacy concerns surrounding multimodal data, along with the equally vital topic of safety, to help ensure that deployed systems reliably interpret and act upon both visual and linguistic inputs, minimizing potential mishaps in real-world scenarios.

Through a comprehensive examination of these topics, this tutorial seeks to foster a deeper academic understanding of the intersection of vision, language, and other modalities within large multimodal foundation models. By convening experts from interdisciplinary fields, we aim to survey current state-of-the-art methodologies, identify open challenges, and chart avenues for future research on large multimodal foundation models, ensuring the findings resonate with both academic and industrial communities.