Vol. 5 No. 2 (2026)

A Survey on Multimodal Foundation Models: Architectures, Training Paradigms, and Emerging Applications

Published 2026-02-28

How to Cite

Whitaker, O. (2026). A Survey on Multimodal Foundation Models: Architectures, Training Paradigms, and Emerging Applications. Journal of Computer Technology and Software, 5(2). Retrieved from https://www.ashpress.org/index.php/jcts/article/view/248

Abstract

Multimodal foundation models have emerged as a transformative paradigm in artificial intelligence, enabling unified learning across heterogeneous data modalities such as images, text, audio, and sensor signals. Unlike traditional unimodal systems, multimodal models integrate diverse information sources to perform complex perception and reasoning tasks in a way that more closely resembles human cognition. Recent advances in large-scale pretraining, transformer architectures, and cross-modal representation learning have accelerated the development of models that handle a wide range of tasks, including visual question answering, image captioning, multimodal dialogue, and embodied reasoning. This paper presents a comprehensive survey of multimodal foundation models, focusing on architectural design principles, training paradigms, and emerging application domains. We review representative architectures, including transformer-based cross-modal fusion frameworks and vision-language models that integrate perception with language reasoning. We then examine key training strategies, such as contrastive learning, multimodal pretraining, and instruction-based alignment, that enable effective cross-modal representation learning. Finally, we discuss emerging applications in domains such as healthcare, robotics, and intelligent interactive systems, and outline key challenges and future research directions for developing more reliable, scalable, and general multimodal artificial intelligence systems.