Israeli tech company Lightricks recently announced the release of LTX-2, its latest audiovisual synthesis system. The system is notable for its computational efficiency: from a brief text description, it can directly generate high-definition video up to 20 seconds long with fully synchronized audio.
Unlike traditional visual synthesis methods, LTX-2 breaks with the conventional "video first, audio later" pipeline. The development team points out that decoupling audio from video in separate processing stages cannot reproduce the natural joint distribution of real environments. LTX-2 therefore adopts a dual-stream parallel computing architecture in which 19 billion parameters jointly model the visual and acoustic scene: 14 billion parameters are allocated to the video stream and 5 billion to the audio stream. This asymmetric split mirrors the difference in information density between visual and auditory signals in real life.
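
The architecture can be pictured as two transformer stacks of different widths that exchange information at every layer. The PyTorch sketch below is purely illustrative and is not Lightricks' actual implementation; all dimensions, class names, and the joint-attention wiring are assumptions chosen to show how an asymmetric dual-stream block with shared attention might look.

```python
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """Toy dual-stream layer: video and audio tokens keep separate,
    asymmetrically sized feed-forward paths but mix in one joint
    attention pass, so sound events can line up with the frames
    they belong to."""

    def __init__(self, video_dim=2048, audio_dim=1024, joint_dim=1024, heads=16):
        super().__init__()
        # Wider feed-forward path for video, mirroring the larger share
        # of parameters the article attributes to the video stream.
        self.video_ff = nn.Sequential(
            nn.Linear(video_dim, 4 * video_dim), nn.GELU(),
            nn.Linear(4 * video_dim, video_dim),
        )
        self.audio_ff = nn.Sequential(
            nn.Linear(audio_dim, 4 * audio_dim), nn.GELU(),
            nn.Linear(4 * audio_dim, audio_dim),
        )
        # Project both modalities into a shared width so their tokens
        # can attend to each other in a single attention call.
        self.video_in = nn.Linear(video_dim, joint_dim)
        self.audio_in = nn.Linear(audio_dim, joint_dim)
        self.video_out = nn.Linear(joint_dim, video_dim)
        self.audio_out = nn.Linear(joint_dim, audio_dim)
        self.joint_attn = nn.MultiheadAttention(joint_dim, heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens):
        # Joint self-attention over the concatenated token sequences.
        joint = torch.cat([self.video_in(video_tokens),
                           self.audio_in(audio_tokens)], dim=1)
        mixed, _ = self.joint_attn(joint, joint, joint)
        n_video = video_tokens.shape[1]
        video_tokens = video_tokens + self.video_out(mixed[:, :n_video])
        audio_tokens = audio_tokens + self.audio_out(mixed[:, n_video:])
        # Per-modality feed-forward with residual connections.
        return (video_tokens + self.video_ff(video_tokens),
                audio_tokens + self.audio_ff(audio_tokens))

# Toy usage: one sample with 128 video tokens and 64 audio tokens.
block = DualStreamBlock()
v, a = block(torch.randn(1, 128, 2048), torch.randn(1, 64, 1024))
print(v.shape, a.shape)  # torch.Size([1, 128, 2048]) torch.Size([1, 64, 1024])
```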

In practical performance testing, the system demonstrated striking synthesis speed. On mainstream enterprise-grade graphics cards, generating 720p audiovisual content takes only 1.22 seconds per step. Reported figures put its throughput at up to 18 times that of comparable products, and its 20-second generation limit also exceeds that of similar tools from Google and other major laboratories.
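
To put the per-step figure in context: end-to-end latency is the per-step time multiplied by the sampler's step count, which the announcement does not state. The snippet below uses a hypothetical 30-step sampler purely for illustration.

```python
# Back-of-the-envelope latency from the reported per-step time.
# The announcement does not state the sampler's step count, so the
# 30 steps below are a hypothetical value for illustration only.
SECONDS_PER_STEP = 1.22  # reported per-step time at 720p
STEPS = 30               # assumed diffusion sampler step count

total = SECONDS_PER_STEP * STEPS
print(f"~{total:.1f} s to generate a 20-second 720p clip "
      f"under these assumptions")  # ~36.6 s
```
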
To interpret complex language instructions accurately, the system integrates a multilingual text-parsing engine and introduces a "preprocessing buffer" mechanism, giving the model room to resolve the prompt's logic before the final synthesis is executed. Through a cross-modal association mechanism, the system can precisely match the moment an object collides on screen with the corresponding sound effect.
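
One common way to achieve this kind of audiovisual alignment is to stamp tokens from both modalities with embeddings of their absolute timestamps and let audio tokens attend to video tokens. The sketch below illustrates that general idea only; the frame rates, dimensions, and attention setup are assumptions, not details confirmed by Lightricks.

```python
import torch
import torch.nn as nn

def time_embedding(timestamps, dim=256):
    """Sinusoidal embedding of absolute time in seconds, so tokens from
    both modalities that occur at the same instant receive similar codes."""
    freqs = torch.exp(torch.linspace(0.0, -8.0, dim // 2))
    angles = timestamps[:, None] * freqs[None, :] * 2 * torch.pi
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

# Hypothetical token grids: 24 fps video frames, 50 Hz audio latents.
video_t = torch.arange(0, 20, 1 / 24)  # 480 frame timestamps over 20 s
audio_t = torch.arange(0, 20, 1 / 50)  # 1000 audio timestamps over 20 s

dim = 256
video_tokens = torch.randn(1, len(video_t), dim) + time_embedding(video_t)
audio_tokens = torch.randn(1, len(audio_t), dim) + time_embedding(audio_t)

# Cross-attention from audio to video: each audio token can look up what
# is happening on screen at its own timestamp, e.g. the collision frame
# that should trigger an impact sound.
cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
aligned_audio, weights = cross_attn(audio_tokens, video_tokens, video_tokens)
print(weights.shape)  # (1, 1000, 480): audio steps attending over frames
```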

Despite the system's technical lead, the development team acknowledges that it occasionally misattributes voices when handling rare dialects or multi-character dialogue, and that sequences longer than 20 seconds can still exhibit subtle drift in the timeline.
Zeev Farbman, founder of Lightricks, stated that the choice to open-source the system's code rather than keep it as a closed service was based on considerations of "technological control." He believes content creators should control the technology on their own hardware rather than outsource decision-making power to a few interest groups. The system's complete code and training framework have been released on an open platform and are deeply optimized for the latest consumer-grade high-performance graphics cards.