THE RISE OF MULTIMODAL AI: A QUICK REVIEW OF GPT-4V AND GEMINI

Authors

  • Jamil Ahmed
  • Ghalib Nadeem
  • Muhammad Kashif Majeed
  • Rashid Ghaffar
  • Abdul Karim Kashif Baig
  • Syed Raheem Shah
  • Rana Abdul Razzaq
  • Talha Irfan

Keywords:

Multimodal AI, GPT-4V, Gemini 1.5, Artificial General Intelligence (AGI), AI applications

Abstract

Multimodal artificial intelligence (AI) systems, which interpret, synthesize, and reason over heterogeneous inputs such as text, images, audio, and video, represent a transformative frontier in current AI research and applications. Two notable achievements in this area are OpenAI’s GPT-4V (Vision) and Google DeepMind’s Gemini 1.5, both of which exemplify recent advances in cross-modal representation learning and generative reasoning. This paper offers a concise, critical review of these two flagship models, examining their architecture, modality fusion, functionality, and performance metrics. Emphasis is placed on their performance in visual question answering, multimodal dialogue, instruction following, and other reasoning-intensive tasks that require intelligence and perception to work in harmony. We also examine GPT-4V and Gemini 1.5 in terms of model size, scaling, fine-tuning, alignment, and generalization to downstream tasks. The discussion then turns to the major open problems of multimodal AI, including hallucinations, lack of interpretability, and high computational cost, which remain the most important barriers to wider use and trust. Finally, we study the far-reaching effects

Published

2025-06-21

How to Cite

Jamil Ahmed, Ghalib Nadeem, Muhammad Kashif Majeed, Rashid Ghaffar, Abdul Karim Kashif Baig, Syed Raheem Shah, Rana Abdul Razzaq, & Talha Irfan. (2025). THE RISE OF MULTIMODAL AI: A QUICK REVIEW OF GPT-4V AND GEMINI. Spectrum of Engineering Sciences, 3(6), 778–786. Retrieved from https://www.sesjournal.com/index.php/1/article/view/506