Wednesday June 24, 2026 5:00pm - 7:00pm PST

Authors - Md. Monowar Hossain, Fahima Hossain, Md. Shahidul Islam, Md. Tanvir Ahmed, Reduan Ahmed
Abstract - Automated image captioning sits at the intersection of Computer Vision (CV) and Natural Language Processing (NLP), but conventional CNN-RNN models suffer from feature loss and struggle with long-range dependencies. This study proposes a parameter-balanced multi-modal model that pairs a dual-encoder network, combining EfficientNet-B4 for hierarchical features with MobileNetV2 for geometric efficiency, with a multi-head Transformer decoder. Evaluated on Flickr8k with a dynamic scalar weight mechanism and teacher-forced optimization, the model achieved a BLEU-1 of 0.5774 and a METEOR of 0.4129. Interestingly, the ablation results showed that although the dual-encoder method is competitive, the standalone MobileNetV2 pathway slightly outperforms the fused pathway on BLEU-4 (0.2284 vs. 0.20). This suggests that one pathway may become redundant during concatenation. The study supports the feasibility of replacing RNN bottlenecks with Transformer decoders and offers important considerations for optimizing real-time feature fusion in vision tasks.
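The abstract does not specify how the dynamic scalar weight mechanism works, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes a single gate value, squashed by a sigmoid, that scales each encoder's feature vector before the two are concatenated. The names `fuse_features` and `gate_logit` and the toy vectors are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_features(feat_a, feat_b, gate_logit):
    """Scale two feature vectors by one scalar gate, then concatenate.

    Assumption: the gate gives weight alpha to encoder A and
    (1 - alpha) to encoder B. The abstract does not confirm this form.
    """
    alpha = sigmoid(gate_logit)
    scaled_a = [alpha * v for v in feat_a]
    scaled_b = [(1.0 - alpha) * v for v in feat_b]
    return scaled_a + scaled_b  # list concatenation, not element-wise sum


# Toy vectors standing in for pooled encoder outputs (hypothetical values).
efficientnet_feat = [0.2, 0.4, 0.6]
mobilenet_feat = [1.0, 0.5]

# A gate logit of 0 weights both pathways equally (alpha = 0.5).
fused = fuse_features(efficientnet_feat, mobilenet_feat, gate_logit=0.0)
```

In a trained model the gate would be a learned parameter. Under this assumed form, a gate that saturates toward one side shrinks the other pathway's contribution toward zero, which is one way a pathway could become redundant after concatenation.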
Paper Presenter
Md. Tanvir Ahmed
Virtual Room B Manila, Philippines
