The Peking University Mengchen team has open-sourced Video-LLaVA, a large multimodal model that can instantly grasp the comedic elements in funny videos. The model achieves strong performance on multiple benchmarks without requiring paired image-video data, and it understands both images and videos by mapping them into a unified visual feature space. Comparative experiments show that pre-aligning visual representations improves performance on video question-answering tasks, and that joint training on image and video data benefits the model on both image and video understanding tasks.
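The core idea of the unified visual feature space can be sketched as follows: image patches and video frames are encoded separately, but both are mapped through one shared projection into the language model's embedding space, so the LLM consumes them as the same kind of token. This is a minimal illustrative sketch with made-up dimensions and random stand-ins for the encoders, not Video-LLaVA's actual architecture or weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only (not Video-LLaVA's real sizes).
D_VIS, D_LLM = 1024, 4096

# One shared projection: the same matrix serves both modalities.
W_proj = rng.standard_normal((D_VIS, D_LLM)) * 0.01

def encode_image(image_patches: np.ndarray) -> np.ndarray:
    # Stand-in for a vision encoder: returns (num_patches, D_VIS) features.
    return image_patches

def encode_video(frame_patches: np.ndarray) -> np.ndarray:
    # Stand-in for a video encoder: (num_frames, num_patches, D_VIS)
    # is flattened into one token sequence over all frames.
    f, p, d = frame_patches.shape
    return frame_patches.reshape(f * p, d)

def to_llm_tokens(visual_features: np.ndarray) -> np.ndarray:
    # The shared projection places both modalities in the LLM embedding space.
    return visual_features @ W_proj

image_feats = encode_image(rng.standard_normal((256, D_VIS)))
video_feats = encode_video(rng.standard_normal((8, 256, D_VIS)))

img_tokens = to_llm_tokens(image_feats)   # shape (256, D_LLM)
vid_tokens = to_llm_tokens(video_feats)   # shape (2048, D_LLM)
print(img_tokens.shape, vid_tokens.shape)
```

Because both token streams live in the same space, a single language model can attend over images and videos interchangeably, which is why joint training on both data types can help each task.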