The advanced voice mode of OpenAI's GPT-4o has recently received significant updates. It now supports more natural voice interactions and has gained a notable new "singing" function. Although the singing still sounds immature, the feature opens up new possibilities for multimodal AI interaction. AIbase has gathered the latest information to analyze the recent developments and potential of GPT-4o's voice mode.

Singing Function Launched: AI Can Also “Sing”

According to recent reports, GPT-4o's advanced voice mode now supports singing. Users can ask the AI to sing through voice commands, and the songs it performs have included some copyrighted tracks. The function lets GPT-4o generate melodies and lyrics or imitate specific singing styles on request, which adds fun to the interaction. Although the "performance" still needs work, AIbase sees the addition as a new step for GPT-4o in audio generation.

Multimodal Interaction Upgraded: More Natural and Emotional

GPT-4o's advanced voice mode is known for its end-to-end voice processing. Unlike traditional voice modes, which convert speech to text, generate a text reply, and then convert that reply back to speech, the new mode processes audio input directly. This significantly reduces response delay, to an average of only 320 milliseconds. GPT-4o can also pick up non-verbal cues such as speaking speed and tone, and respond with a more emotionally expressive voice. Users can interrupt it at any time, which makes the conversation feel close to talking with a person.
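The latency argument above can be made concrete with a small sketch. It models a cascaded voice pipeline as a sequence of stages whose delays add up, and compares that with a single end-to-end stage. The stage names and the per-stage numbers for the cascaded pipeline are hypothetical placeholders chosen for illustration, not measurements; only the 320 ms average comes from the figures reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One processing step in a voice pipeline."""
    name: str
    latency_ms: float


def total_latency_ms(stages):
    """Stages run one after another, so their latencies add up."""
    return sum(stage.latency_ms for stage in stages)


# Hypothetical per-stage delays for a cascaded pipeline (illustrative only).
cascaded = [
    Stage("speech_to_text", 300.0),
    Stage("text_model", 500.0),
    Stage("text_to_speech", 400.0),
]

# A single audio-in, audio-out stage, using the 320 ms average reported above.
end_to_end = [Stage("audio_model", 320.0)]

print(f"cascaded:   {total_latency_ms(cascaded):.0f} ms")
print(f"end-to-end: {total_latency_ms(end_to_end):.0f} ms")
```

The point of the sketch is structural rather than numerical: a cascaded design pays for every stage in sequence, and the intermediate text step also discards tone and pacing, while a single audio model has only one delay to minimize.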

Feature Highlights: From Laughter to Crying

Besides singing, GPT-4o's advanced voice mode can produce laughter, crying, and other emotional expressions on request, which broadens the range of interaction scenarios. For example, users can ask the AI to respond in a dramatic or humorous tone, or in the voice of a specific character, such as an animated character or a celebrity. This flexibility gives it considerable potential in entertainment, education, and creative content generation.

Current Limitations: Singing Still Needs Refinement

Although the singing function has been added, GPT-4o's singing has not yet reached a professional standard. In testing, the AI can sound less fluid on complex melodies or high notes. Some users have also reported that its voice quality seems slightly inferior to that of other AI voice products (such as Pi AI or Siri), with a lower sampling rate making the audio sound slightly compressed. OpenAI stated that the singing function is intended to explore the boundaries of audio generation and that its performance will continue to be optimized.
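To illustrate why a lower sampling rate can make audio sound compressed, the sketch below applies the Nyquist limit: digital audio sampled at a given rate can only represent frequencies up to half that rate. The rates listed are common reference values used for illustration; the sampling rate GPT-4o actually uses is not stated here, so none of these numbers should be read as describing its output.

```python
def nyquist_limit_hz(sample_rate_hz: float) -> float:
    """Highest frequency representable at a given sampling rate (half the rate)."""
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    return sample_rate_hz / 2.0


# Common reference sampling rates (illustrative; not GPT-4o's actual settings).
EXAMPLE_RATES_HZ = {
    "telephone-band speech": 8_000,
    "wideband speech": 16_000,
    "CD audio": 44_100,
}

for label, rate in EXAMPLE_RATES_HZ.items():
    limit = nyquist_limit_hz(rate)
    print(f"{label}: {rate} Hz sampling keeps content up to {limit:.0f} Hz")
```

Singing depends on high-frequency overtones more than plain speech does, so a rate that is adequate for conversation can still make a sung voice sound dull or narrow.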

Safety and Copyright Considerations: Innovation Within Limits

To respect copyright, OpenAI has applied strict filters to GPT-4o's voice output that limit its generation of copyrighted music. Even so, recent reports indicate that some users have succeeded in getting the AI to sing copyrighted songs, which has prompted discussion about where the copyright boundaries lie. GPT-4o also has a high refusal rate on certain audio tasks (such as automatic singing scoring or voice synthesis), possibly to avoid generating unauthorized content or because such tasks lack objective standards.

A New Chapter for Voice AI

The update to GPT-4o's advanced voice mode, especially the addition of singing, marks continued progress by OpenAI in multimodal AI. Although the singing needs improvement, the mode's low latency, natural interaction, and emotional expressiveness already put it significantly ahead of traditional voice assistants such as Siri and Alexa. AIbase believes that as OpenAI continues to improve sound quality and its copyright handling mechanisms, GPT-4o could spark a new wave of applications in education, entertainment, and customer service.

Conclusion

The singing function in GPT-4o's advanced voice mode brings more fun and new possibilities to AI interaction. The technology still needs refinement, but the innovation is significant. From low-latency dialogue to emotional expression, GPT-4o is redefining the boundaries of human-computer interaction.