A user wearing AR glasses asks, "What is the building opposite?" Within 10 ms, the backend MLLM-SC framework generates a semantic attention heatmap: the building's outline is marked in deep red and assigned the highest transmission priority, while the rest of the scene is compressed. High-dimensional multimodal data no longer consume resources uniformly: the semantic engine routes task-relevant pixels, speech, and coordinates onto a "fast lane," while irrelevant content is automatically downgraded, freeing up 30% more available bandwidth on the 6G air interface.
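The prioritization step can be pictured as importance-weighted bit allocation. The sketch below is not from the source; the function name `allocate_bandwidth`, the 5% background floor, and the patch saliency values are illustrative assumptions for how attention scores from the MLLM might steer the bit budget.

```python
import numpy as np

def allocate_bandwidth(importance: np.ndarray, total_bits: int,
                       floor_ratio: float = 0.05) -> np.ndarray:
    """Split a bit budget across feature blocks in proportion to semantic importance."""
    scores = np.clip(importance.astype(float), 0.0, 1.0)
    background = scores < floor_ratio * scores.max()
    scores[background] = floor_ratio              # background keeps a thin pipe, not zero
    weights = scores / scores.sum()
    return np.floor(weights * total_bits).astype(int)

# Toy example: 16 image patches, three of them covering the queried building.
saliency = np.full(16, 0.02)
saliency[[5, 6, 9]] = [0.9, 1.0, 0.8]             # task-relevant patches ("deep red")
bits = allocate_bandwidth(saliency, total_bits=120_000)
print(bits[[5, 6, 9]], bits.min())                # most of the budget goes to the building
```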
This "device-edge" collaborative system embeds a large multimodal model into edge servers. When users input images, speech, and task requests, prompt engineering and context learning first decode the intent, then drive a dual-path semantic encoder—important features take the high-quality path, while secondary information enters the low-resolution channel. Even if the channel suddenly degrades, key areas remain 1080 P with fidelity. At the receiving end, VAE performs rough reconstruction, and conditional diffusion models do fine-tuning. It can also dynamically switch between "high-definition reconstruction" or "AI frame interpolation" modes based on terminal computing power, enabling real-time synthesis of high-quality holographic images even in poor network conditions.
In lab tests, AR navigation, immersive meetings, and vehicular-network 3D maps ran simultaneously in a 500 MHz millimeter-wave cell: with MLLM-SC, average end-to-end delay dropped from 28 ms to 18 ms, and the block error rate fell by 42%. The team's next step is to integrate reinforcement learning into semantic decision-making, letting multiple agents "optimize strategies while communicating" in scenarios such as cooperative driving and the city-scale metaverse, with the goal of raising the "experience density" of 6G by another order of magnitude.
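The reinforcement-learning layer is only described as a plan, so the sketch below is a generic illustration of "optimizing strategy while communicating": an epsilon-greedy bandit over made-up semantic compression presets with a made-up reward (task score minus a latency penalty). None of these names or numbers come from the source.

```python
import random

# Hypothetical semantic compression presets an agent could pick per transmission.
ACTIONS = ["keep-all", "drop-background", "drop-background+low-res-speech"]

class SemanticRatePolicy:
    """Epsilon-greedy bandit over compression presets (illustrative only)."""

    def __init__(self, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.value = {a: 0.0 for a in ACTIONS}   # running mean reward per action
        self.count = {a: 0 for a in ACTIONS}

    def choose(self) -> str:
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)            # explore
        return max(self.value, key=self.value.get)   # exploit best-known preset

    def update(self, action: str, task_score: float, latency_ms: float) -> None:
        reward = task_score - 0.01 * latency_ms      # illustrative accuracy/latency trade-off
        self.count[action] += 1
        self.value[action] += (reward - self.value[action]) / self.count[action]

# Toy loop: feed back observed task quality and latency after each transmission.
policy = SemanticRatePolicy()
action = policy.choose()
policy.update(action, task_score=0.92, latency_ms=18.0)
```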