NVIDIA's research team has officially released PersonaPlex-7B-v1, a full-duplex speech-to-speech dialogue model. The model breaks with the rigid "listen once, respond once" pattern of traditional AI voice assistants, aiming for a conversational experience much closer to natural human interaction.


Unlike earlier pipelines that chained multiple stages such as ASR (speech-to-text), an LLM (large language model), and TTS (text-to-speech), PersonaPlex completes the entire process of speech understanding and generation with a single Transformer. AIbase learned that this end-to-end design significantly reduces response latency and enables the AI to handle natural interruptions, overlapping speech, and immediate feedback. In short, it works like a real human conversation: the AI keeps listening while it speaks, and if the user suddenly interrupts, it can respond promptly.
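To make the full-duplex idea concrete, here is a minimal toy sketch of the interaction pattern described above. None of these names or rules come from NVIDIA's release; the only point illustrated is that one model step both consumes an incoming audio frame and emits an outgoing one, so the system never has to finish "listening" before it can "speak" or yield the floor.

```python
from dataclasses import dataclass, field
from typing import List

SILENCE = 0  # placeholder token meaning "no speech in this frame"

@dataclass
class FullDuplexModel:
    """Toy stand-in for a single speech-to-speech Transformer (hypothetical)."""
    history: List[int] = field(default_factory=list)

    def step(self, user_frame: int) -> int:
        # A real model would predict the next speech/text tokens from the
        # interleaved user+assistant streams. Toy rule: echo the user's
        # previous frame, and go silent the moment the user barges in.
        self.history.append(user_frame)
        if user_frame != SILENCE:
            return SILENCE  # user is talking: stop speaking and listen
        return self.history[-2] if len(self.history) > 1 else SILENCE

def run_dialogue(model: FullDuplexModel, user_frames: List[int]) -> List[int]:
    # Both streams advance in lock-step: one user frame in, one assistant
    # frame out on every tick, with no explicit turn-taking boundary.
    return [model.step(f) for f in user_frames]
```

Because input and output advance together, an interruption is handled naturally: a non-silent user frame arriving mid-response simply changes what the very next output frame is, with no pipeline to flush.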

The model also excels at personalization control. Through dual guidance from speech and text, users can not only define the AI's role and background but also precisely control its tone and intonation. AIbase learned that NVIDIA combined massive amounts of real call data with synthetic scenarios during training, giving the model natural language habits while keeping it strictly compliant with industry-specific business rules. Current evaluation results show that PersonaPlex-7B-v1 outperforms most open-source and closed-source systems in dialogue fluency and task completion rate.

Research: https://research.nvidia.com/labs/adlr/personaplex/

Key Points:

  • 🎙️ Full-duplex Interaction: PersonaPlex-7B-v1 processes speech streams in real time, allowing users to interject or talk over the AI while it is speaking and still receive a rapid response.

  • 🧠 Single Model Architecture: It abandons the cumbersome cascaded pipeline in favor of a single Transformer that predicts text and speech tokens jointly, improving dialogue naturalness at the source.

  • 🎭 Deep Personalization: It supports system prompts of up to 200 tokens plus dedicated speech embeddings, enabling flexible customization of the AI's persona, business knowledge, and emotional tone.
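The personalization bullet above can be sketched in code. This is an illustration only: the shapes, names, and the idea of prepending a voice embedding to the embedded prompt are assumptions for demonstration, not PersonaPlex's actual interface; only the 200-token prompt budget comes from the article.

```python
import numpy as np

MAX_PROMPT_TOKENS = 200  # per the article: system prompts of up to 200 tokens
EMBED_DIM = 16           # toy embedding dimension (assumption)

def build_persona_context(prompt_tokens: list,
                          voice_embedding: np.ndarray,
                          embed_table: np.ndarray) -> np.ndarray:
    """Combine a persona prompt and a voice embedding into one
    conditioning context (hypothetical scheme: voice slot first)."""
    if len(prompt_tokens) > MAX_PROMPT_TOKENS:
        raise ValueError("persona prompt exceeds the 200-token budget")
    prompt_embeds = embed_table[prompt_tokens]          # shape (T, EMBED_DIM)
    return np.vstack([voice_embedding[None, :], prompt_embeds])

rng = np.random.default_rng(0)
table = rng.normal(size=(1000, EMBED_DIM))              # toy embedding table
ctx = build_persona_context([1, 2, 3], rng.normal(size=EMBED_DIM), table)
# ctx has one voice slot plus three prompt-token rows
```

The design point this illustrates is that role background (text tokens) and vocal identity (a continuous embedding) are separate knobs, so the same persona text can be paired with different voices and vice versa.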