OmniTalker is a unified framework proposed by Alibaba's Tongyi Lab with the aim of generating audio and video in real time to enhance human-computer interaction experiences. Its innovation lies in solving common issues in traditional text-to-speech and speech-driven video generation methods, such as out-of-sync audio-video, inconsistent styles, and system complexity. OmniTalker adopts a dual-branch diffusion transformer architecture, achieving high-fidelity audio-video outputs while maintaining efficiency. Its real-time inference speed reaches 25 frames per second, making it suitable for various interactive video chat applications and enhancing user experiences.