Tuesday June 23, 2026 11:00am - 1:00pm PST

Authors - Nway Nway Zaw Win, Aye Nyein Mon, Win Lelt Lelt Phyu
Abstract - Generating natural language descriptions for visual content is a key task bridging Computer Vision and Natural Language Processing. Conventional CNN-based approaches often struggle to capture global contextual information, which limits the semantic consistency of the generated captions. This paper presents a multimodal video captioning framework for Myanmar-script generation based on a Vision Transformer (ViT) encoder and a Gated Recurrent Unit (GRU) decoder. Global visual representations are derived from transformer-based self-attention, while a class-prefixing mechanism is introduced to improve semantic grounding in a low-resource language setting. Experimental results evaluated with the BLEU, chrF, and TER metrics show that the proposed ViT–GRU model outperforms CNN–RNN baselines. PCA and t-SNE visualizations further support the effectiveness of transformer-based visual representations.
Virtual Room B Manila, Philippines
