Vol. 1, Issue 1, Part A (2024)
Deep learning for image captioning: A comparative study
Élodie Moreau and Marc Delacroix
This study investigates the effectiveness of several deep learning architectures in generating accurate and semantically rich textual descriptions of images. The research systematically compares Convolutional Neural Network-Recurrent Neural Network (CNN-RNN) frameworks, attention-based models, and Transformer-based architectures on standard datasets, namely MSCOCO, Flickr8k, and Flickr30k. Each model was evaluated under uniform preprocessing and training protocols, using BLEU, METEOR, CIDEr, and SPICE as performance metrics to ensure consistency and fairness. The results show that while traditional CNN-LSTM architectures provide a solid baseline, they are limited by their inability to capture intricate contextual and relational semantics. Attention-based architectures significantly improved performance by allowing models to focus on salient image regions, producing more coherent and contextually aligned captions. Transformer models, particularly Meshed-Memory and Object-Aware variants, achieved the highest scores across all metrics, reflecting their superior capacity for global context modeling and object-level reasoning. Statistical analyses confirmed that the differences in model performance were highly significant, supporting the proposed hypothesis. Qualitative evaluations further showed that Transformer-based models produced captions closer to human descriptions, with greater fluency and semantic accuracy. The findings emphasize that the interplay between visual feature extraction, attention design, and sequence-level optimization determines the overall success of image captioning systems. The study concludes with practical recommendations for integrating these models into real-world applications such as assistive technologies, digital content generation, autonomous systems, and multimedia information retrieval. This research contributes a comprehensive experimental framework and a set of empirically grounded insights to guide future advances in multimodal deep learning for vision-language integration.
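As an illustration of the metric computation described in the abstract, the sketch below scores a candidate caption against reference captions with BLEU-4 using NLTK. The toolkit, the example captions, and the smoothing choice are assumptions made for illustration only; the study does not specify its evaluation implementation.

```python
# Minimal sketch of caption scoring with BLEU-4 (an assumption:
# the paper does not name its evaluation toolkit; NLTK is used here).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical human reference captions, tokenized into word lists.
references = [
    "a dog runs across a grassy field".split(),
    "a brown dog is running on the grass".split(),
]

# Hypothetical model-generated caption to evaluate.
candidate = "a dog is running through the grass".split()

# BLEU-4 with uniform n-gram weights; smoothing avoids zero scores
# on short, caption-length texts with missing higher-order n-grams.
smooth = SmoothingFunction().method1
score = sentence_bleu(
    references,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=smooth,
)
print(f"BLEU-4: {score:.3f}")
```

In practice, corpus-level scoring and complementary metrics such as METEOR, CIDEr, and SPICE would be computed over the full test split rather than a single caption, as the abstract indicates.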
Pages: 44-48