THE RISE OF MULTIMODAL AI: A QUICK REVIEW OF GPT-4V AND GEMINI
Keywords: Multimodal AI, GPT-4V, Gemini 1.5, Artificial General Intelligence (AGI), AI applications

Abstract
Multimodal artificial intelligence (AI) systems, which interpret, synthesize, and reason over heterogeneous inputs spanning text, images, audio, and video, represent a transformative frontier in contemporary AI research and application. Notable achievements in this area include OpenAI's GPT-4V (Vision) and Google DeepMind's Gemini 1.5, both of which exemplify the state of the art in cross-modal representation learning and generative reasoning. This paper offers a critical and succinct review of these two flagship models, examining their architectures, modality-fusion strategies, functionality, and performance metrics. Emphasis is placed on their performance in visual question answering, multimodal dialogue, instruction following, and other integrated reasoning tasks that require perception and intelligence to work in harmony. Moreover, we examine GPT-4V and Gemini 1.5 through the lenses of model size, scaling, fine-tuning, alignment, and generalization to downstream tasks. The discussion then turns to the major outstanding issues in multimodal AI, including hallucination, limited interpretability, and high computational cost, which remain the most significant barriers to wider adoption and trust. Finally, we study the far-reaching effects