The Tongyi Qwen team at Alibaba recently released Qwen3-ASR-Toolkit, an open-source Python command-line tool for audio and video transcription. Its main purpose is to work around the three-minute audio duration limit of the Qwen3-ASR-Flash API, so that recordings lasting several hours can be transcribed quickly. The tool is aimed at users who need to transcribe audio at scale.
Qwen3-ASR-Flash is the latest speech recognition model in the Tongyi Qianwen (Qwen) series, trained on massive multimodal data and ASR data on the scale of tens of millions of hours. The model provides high-accuracy speech recognition, and with the toolkit it can be applied to long audio and video content that would otherwise exceed the API limit.
Qwen3-ASR-Toolkit uses voice activity detection (VAD) to split long audio at pauses instead of mid-sentence, so sentences stay intact in the transcript. It also automatically resamples audio of any sampling rate to 16 kHz mono to improve recognition results. The resulting segments are uploaded in parallel across multiple threads, which significantly reduces total processing time.
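The segmentation and parallel upload described above can be illustrated with a minimal Python sketch. It is not the toolkit's actual code: the function names, the `(start, end)` segment format, and the placeholder `transcribe_chunk` (which stands in for a real API call) are all assumptions made for illustration. The idea is to group VAD speech segments into chunks that stay under the three-minute limit, cutting only in the silence between segments, and then process the chunks concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CHUNK_SECONDS = 180.0  # the three-minute API limit mentioned in the article


def plan_chunks(speech_segments, max_len=MAX_CHUNK_SECONDS):
    """Group (start, end) speech segments into chunks of at most max_len seconds.

    Cuts fall only between segments, i.e. in silence, so no sentence is split.
    A single segment longer than max_len is kept whole in this sketch;
    a real tool would need an extra rule for that case.
    """
    chunks = []
    chunk_start = chunk_end = None
    for start, end in speech_segments:
        if chunk_start is None:
            chunk_start, chunk_end = start, end
        elif end - chunk_start <= max_len:
            chunk_end = end  # segment still fits, extend the current chunk
        else:
            chunks.append((chunk_start, chunk_end))
            chunk_start, chunk_end = start, end
    if chunk_start is not None:
        chunks.append((chunk_start, chunk_end))
    return chunks


def transcribe_chunk(chunk):
    """Hypothetical stand-in for an API request; returns a placeholder string."""
    start, end = chunk
    return f"[{start:.0f}-{end:.0f}s]"


def transcribe_all(chunks, workers=4):
    """Process chunks concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transcribe_chunk, chunks))


segments = [(0, 50), (60, 170), (175, 300), (310, 400)]  # made-up VAD output
chunks = plan_chunks(segments)
results = transcribe_all(chunks)
```

Because `pool.map` preserves input order, the partial transcripts can be joined directly into the final text even though the requests finish at different times.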
For media formats, Qwen3-ASR-Toolkit relies on FFmpeg and therefore handles almost all mainstream audio and video formats, including mp4, mov, mkv, mp3, wav, and m4a. Users can pass in whichever file type they have without worrying about format compatibility.
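As a rough illustration of how FFmpeg can normalize any of these inputs to 16 kHz mono audio, the Python sketch below builds a standard `ffmpeg` command line. The helper names are hypothetical, and the exact options the toolkit passes to FFmpeg are not stated here, so treat this as one plausible way to do the conversion and not as the tool's implementation.

```python
import subprocess


def build_resample_command(src, dst):
    """Return an ffmpeg command that converts src to 16 kHz mono audio at dst."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", src,       # input file: any container or codec ffmpeg can read
        "-vn",           # drop the video stream, keep audio only
        "-ac", "1",      # downmix to a single (mono) channel
        "-ar", "16000",  # resample to 16 kHz
        dst,
    ]


def resample(src, dst):
    """Run the conversion; requires the ffmpeg binary to be on PATH."""
    subprocess.run(build_resample_command(src, dst), check=True)


cmd = build_resample_command("meeting.mp4", "meeting.wav")
```

Calling `resample("meeting.mp4", "meeting.wav")` would then extract the audio track of a video file and write it as a 16 kHz mono WAV file.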
GitHub: https://github.com/QwenLM/Qwen3-ASR-Toolkit
Key Points:
📌 Alibaba's Tongyi team has released Qwen3-ASR-Toolkit, which works around the three-minute audio limit of the Qwen3-ASR-Flash API and supports transcription of hours-long recordings.
🎤 The tool is built on the latest Qwen3-ASR-Flash model, which provides high-accuracy speech recognition.
💻 Through FFmpeg it supports almost all mainstream audio and video formats, so users can transcribe files without converting them first.