Multimodal Foundation Models

Foundation models are a central research focus at the INSIGHT Lab, targeting multimodal learning across vision, language, and graph data. We tackle key theoretical and practical challenges, particularly robustness, adaptability, and interpretability when generalizing across varied tasks and domains.

Vision-Language Models (VLMs)

We focus on strengthening multimodal representation learning, reasoning capabilities, and interpretability in VLMs. These efforts aim to improve tasks like image captioning, visual question answering, and multimodal retrieval. By leveraging robust theoretical insights and innovative algorithms, we develop VLMs that generalize effectively across diverse real-world applications.
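As a concrete illustration (not the lab's specific method), CLIP-style VLMs cast multimodal retrieval as nearest-neighbor search in a shared embedding space: image and text encoders map their inputs to vectors, and L2-normalized dot products give cosine similarities. A minimal sketch with toy embeddings standing in for encoder outputs:

```python
import numpy as np

def retrieve(image_embs: np.ndarray, text_emb: np.ndarray) -> int:
    """Return the index of the gallery image best matching the text query.

    Embeddings are L2-normalized so the dot product equals cosine
    similarity, as in CLIP-style contrastive vision-language models.
    """
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    scores = image_embs @ text_emb  # cosine similarity per image
    return int(np.argmax(scores))

# Toy 3-image gallery in a 4-dim embedding space (hypothetical values,
# standing in for real encoder outputs).
gallery = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.7, 0.7, 0.0, 0.0]])
query = np.array([0.0, 0.9, 0.1, 0.0])  # "text" embedding nearest image 1

print(retrieve(gallery, query))  # → 1
```

The same similarity scores, passed through a softmax, drive the contrastive training objective and downstream tasks such as zero-shot classification.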

Multimodal Integration

Our researchers explore innovative ways to integrate graph-based models with sequence-based language models, bridging gaps through theoretically sound methods that enhance complex reasoning and multimodal inference. We emphasize robust pretraining strategies to ensure learned representations generalize effectively across domain shifts and diverse attribute configurations.
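One simple way such integration can work (a generic sketch, not our published architecture) is to pool a message-passing representation of the graph and a token-level representation of the text, then concatenate them into a joint vector for downstream reasoning:

```python
import numpy as np

def gnn_layer(node_feats: np.ndarray, adj: np.ndarray) -> np.ndarray:
    # One message-passing step: average each node's neighbors plus itself.
    adj_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = adj_hat.sum(axis=1, keepdims=True)  # node degrees
    return adj_hat @ node_feats / deg         # mean aggregation

def fuse(node_feats, adj, token_embs) -> np.ndarray:
    graph_vec = gnn_layer(node_feats, adj).mean(axis=0)  # pooled graph rep
    text_vec = token_embs.mean(axis=0)                   # pooled sequence rep
    return np.concatenate([graph_vec, text_vec])         # joint representation

# Toy path graph 0-1-2 with 2-dim node features, plus 2 tokens of a
# 3-dim "language model" embedding (all values hypothetical).
nodes = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
tokens = np.array([[0.5, 0.5, 0.0], [0.0, 1.0, 0.0]])

joint = fuse(nodes, adj, tokens)
print(joint.shape)  # → (5,): 2 graph dims + 3 text dims
```

In practice the concatenation would feed a learned fusion head, and both encoders would be pretrained jointly so the representations remain aligned under domain shift.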

Adversarial Robustness

We address vulnerabilities in multimodal foundation models, spanning those that handle visual data and graph structures as well as large language models (LLMs) and vision-language models (VLMs). We develop methods to detect, analyze, and mitigate adversarial attacks across diverse modalities and tasks, enhancing model security and reliability.
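To ground the kind of vulnerability studied here, the classic Fast Gradient Sign Method (FGSM) perturbs an input in the direction that increases the loss. A minimal sketch on a hand-set logistic-regression classifier (weights and inputs are hypothetical, chosen only to make the flip visible):

```python
import numpy as np

def fgsm(x: np.ndarray, w: np.ndarray, b: float, y: float, eps: float) -> np.ndarray:
    """Fast Gradient Sign Method against a logistic-regression loss.

    For p = sigmoid(w.x + b) with binary label y, the gradient of the
    cross-entropy loss w.r.t. the input x is (p - y) * w.
    """
    z = w @ x + b
    p = 1.0 / (1.0 + np.exp(-z))
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)  # small step that increases the loss

w = np.array([2.0, -1.0])
b = 0.0

def predict(x: np.ndarray) -> int:
    return int(w @ x + b > 0)

x = np.array([1.0, 0.5])               # confidently classified as 1
x_adv = fgsm(x, w, b, y=1.0, eps=0.9)  # adversarially perturbed copy

print(predict(x), predict(x_adv))  # → 1 0: the perturbation flips the label
```

The same gradient-sign principle extends to images, graph features, and token embeddings, which is why defenses must be analyzed per modality.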