Achieves reliable video narration by enforcing an equal-distance relationship between visual and language tokens.