Authors - Nway Nway Zaw Win, Aye Nyein Mon, Win Lelt Lelt Phyu

Abstract - Generating natural language descriptions of visual content is a key task bridging Computer Vision and Natural Language Processing. Conventional CNN-based approaches often struggle to capture global contextual information, which limits the semantic consistency of the generated captions. This paper presents a multimodal video captioning framework that generates captions in Myanmar script, based on a Vision Transformer (ViT) encoder and a Gated Recurrent Unit (GRU) decoder. Global visual representations are derived from transformer-based self-attention, and a class-prefixing mechanism is introduced to improve semantic grounding in a low-resource language setting. Experiments evaluated with the BLEU, chrF, and TER metrics show that the proposed ViT–GRU model outperforms CNN–RNN baselines. PCA and t-SNE visualizations further support the effectiveness of the transformer-based visual representations.
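The abstract lists chrF among its evaluation metrics. As a rough illustration of what that metric measures, the sketch below computes a simplified sentence-level character n-gram F-score. It is not the authors' evaluation code. The function names `char_ngrams` and `chrf` are hypothetical, the values `max_n=6` and `beta=2.0` follow commonly used chrF settings, and dropping whitespace and iterating over Unicode code points are simplifying assumptions.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring whitespace (an assumption)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF: F-beta over averaged n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # Skip orders for which either string is too short to have n-grams.
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    # beta > 1 weights recall more heavily than precision.
    return (1 + b2) * p * r / (b2 * p + r)
```

Calling `chrf(generated_caption, reference_caption)` returns a value between 0 and 1, where identical strings score 1.0 and strings with no shared characters score 0.0. Because the score is built from characters rather than words, it does not require a word segmenter, which can be convenient for scripts such as Myanmar that do not put spaces between words.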