The TEN Agent team recently announced the official open-source release of its two core models, **TEN Voice Activity Detection (VAD)** and **TEN Turn Detection**, providing strong technical building blocks for real-time, multimodal speech AI agents.

The release marks a significant step in the TEN framework's effort to democratize speech interaction technology through open-source collaboration. Below is AIbase's compilation of the latest updates, with an analysis of the two models' functions, advantages, and potential impact on the industry.


TEN VAD: Low-latency, High-performance Voice Activity Detection

TEN VAD is a real-time voice activity detector designed for enterprise-level applications, known for its low latency, lightweight design, and high performance. According to official information and social media feedback, TEN VAD can detect voice activity at the frame level with remarkable precision, significantly outperforming commonly used industry models such as WebRTC VAD and Silero VAD. Here are its key highlights:

- **Low computational complexity**: The TEN VAD library is small and computationally lightweight, exposing a cross-platform C API that covers Linux x64, Windows, macOS, Android, and iOS. It also provides Python bindings for Linux x64 (sketched after this list) and WASM support for the web ([Hugging Face](https://huggingface.co/TEN-framework/ten-vad)).

- **High accuracy and low latency**: Compared to Silero VAD, TEN VAD detects speech-to-non-speech transitions with lower latency, so it can pick up short pauses quickly, which suits real-time interactive scenarios. Tests show its real-time factor (RTF) is excellent across a range of CPU platforms ([Hugging Face](https://huggingface.co/TEN-framework/ten-vad)).

- **Latest open-source progress**: In June 2025, the TEN team open-sourced the ONNX model and preprocessing code, enabling deployment on any platform and hardware architecture with ONNX support. WASM+JS support further extends its reach on the web.
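
For a concrete feel of the frame-level API, here is a minimal sketch using the project's Python binding. The `TenVad` class, its `(hop_size, threshold)` constructor, and the per-frame `process()` call follow the usage shown in the ten-vad repository, but treat the exact signatures as assumptions and verify them against the README of your installed version:

```python
# Minimal sketch: frame-level voice activity detection with the TEN VAD
# Python binding (API per the ten-vad repo; verify against your version).
import scipy.io.wavfile as wavfile
from ten_vad import TenVad  # assumed import path of the Python binding

HOP_SIZE = 256    # samples per frame: 16 ms at 16 kHz
THRESHOLD = 0.5   # speech-probability cutoff for the binary flag

sr, data = wavfile.read("input_16k_mono.wav")  # 16 kHz, 16-bit mono expected
vad = TenVad(HOP_SIZE, THRESHOLD)

for i in range(data.shape[0] // HOP_SIZE):
    frame = data[i * HOP_SIZE : (i + 1) * HOP_SIZE]
    probability, is_speech = vad.process(frame)  # per-frame score and 0/1 flag
    print(f"frame {i}: p={probability:.3f} speech={bool(is_speech)}")
```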

Developers on social media have warmly welcomed the open-source release of TEN VAD, arguing that its performance surpasses traditional VAD models and that it gives real-time voice assistant development a powerful new tool.

TEN Turn Detection: Intelligent Dialogue Turn Management

**TEN Turn Detection** is an intelligent turn detection model designed for full-duplex voice communication. It tackles one of the hardest problems in human-computer dialogue: accurately determining when a user has finished speaking, and handling interruptions in a context-aware way. Here are its key features:

- **Semantic analysis capabilities**: Built on the Qwen2.5-7B Transformer model, TEN Turn Detection precisely distinguishes the "completed," "waiting," and "unfinished" states of a user's speech by analyzing the conversation's semantic context and language patterns. For example, it recognizes "Hey, I want to ask a question..." as an unfinished statement, so the AI avoids interrupting prematurely ([Hugging Face](https://huggingface.co/TEN-framework/TEN_Turn_Detection)). An inference sketch follows this list.

- **Multilingual support**: Currently supports English and Chinese, accurately identifying turn signals in both languages, which makes it suitable for global applications ([Hugging Face](https://huggingface.co/TEN-framework/TEN_Turn_Detection)).

- **Excellent performance**: On public test datasets, TEN Turn Detection outperforms other open-source turn detection models on every reported metric, and it is particularly strong in dynamic real-time conversations ([Hugging Face](https://huggingface.co/TEN-framework/TEN_Turn_Detection)).

- **Natural interaction experience**: Paired with TEN VAD, TEN Turn Detection lets AI agents wait for an appropriate moment to speak or handle user interruptions in the right context, creating a more natural conversational experience ([Agora blog](https://www.agora.io/en/blog/making-voice-ai-agents-more-human-with-ten-vad-and-turn-detection/)).
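
As a rough illustration of how such a model is queried, the sketch below loads it as a causal language model via Hugging Face `transformers` and decodes a short turn-state label. The chat-template prompt and the exact label strings are assumptions based on the model card; the Hugging Face page is the authoritative reference:

```python
# Rough sketch: asking TEN Turn Detection for a turn-state label.
# Prompt format and label strings are assumptions; see the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TEN-framework/TEN_Turn_Detection"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def turn_state(utterance: str) -> str:
    """Return the model's label for the utterance (e.g. completed / waiting / unfinished)."""
    messages = [{"role": "user", "content": utterance}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=8, do_sample=False)
    return tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)

print(turn_state("Hey, I want to ask a question..."))  # expect an "unfinished"-type label
```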

TEN Agent Ecosystem: The Foundation of Multimodal Real-time AI

TEN Agent is the showcase project of the TEN framework. It integrates core components such as TEN VAD and TEN Turn Detection and supports multimodal real-time interactions spanning voice, video, and text. Here are its roles within the ecosystem:

- **Seamless integration**: As TEN framework plugins, TEN VAD and TEN Turn Detection can be dropped into a voice-agent pipeline with simple configuration, alongside services such as Deepgram (speech-to-text) and ElevenLabs (text-to-speech); a conceptual sketch of that pipeline follows this list.

- **Multi-scenario applications**: TEN Agent supports a wide range of use cases, from intelligent customer service and real-time translation to virtual companions. For example, combined with the Google Gemini multimodal API, TEN Agent can enable real-time visual and screen-sharing detection, expanding its applications in fields such as education and healthcare.

- **Open-source collaboration**: All components of the TEN framework (except part of the TEN VAD code) are fully open-sourced, and community developers are encouraged to contribute code, fix bugs, and suggest new features. The TEN team coordinates collaboration through GitHub Issues and Projects, which has attracted broad developer participation.
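
To make the division of labor concrete, here is a hypothetical, heavily simplified event loop showing where the two models sit in such a pipeline. Every name in it (`vad`, `transcribe`, `turn_state`, `respond`) is an illustrative stand-in rather than the TEN framework's actual plugin API, which is wired up declaratively inside the framework:

```python
# Hypothetical sketch: VAD gates incoming audio frames; turn detection
# decides when the agent may respond. All names are illustrative, not
# the TEN framework's real extension API.
def agent_loop(audio_frames, vad, transcribe, turn_state, respond):
    buffered = []                                  # frames of the current utterance
    for frame in audio_frames:
        _, is_speech = vad.process(frame)          # frame-level VAD (e.g. TEN VAD)
        if is_speech:
            buffered.append(frame)                 # user is talking: keep buffering
        elif buffered:
            text = transcribe(buffered)            # STT service (e.g. Deepgram)
            if turn_state(text) == "completed":    # semantic end-of-turn check
                respond(text)                      # LLM reply + TTS (e.g. ElevenLabs)
                buffered = []
            # otherwise: likely a short pause mid-thought, so keep waiting
```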

Project: https://github.com/TEN-framework/ten-framework