AI Daily: AutoGLM agents can automatically order takeout; Minshen releases a major update of the Flux version of the IC-Light model; ByteDance's PersonaTalk enables precise AI voiceovers

Welcome to the AI Daily section! Here is your daily guide to exploring the world of artificial intelligence. Every day, we bring you the latest hot topics in the AI field, focusing on developers, helping you understand technological trends and innovative AI product applications.


1. Zhipu AI Launches AutoGLM Agent: Input Commands to Simulate Human Mobile Operations

Zhipu's technology team recently launched AutoGLM, a new product built on the GLM team's research. It is an agent that simulates how a person operates a mobile phone in order to carry out tasks. The launch marks a step forward for AI "Phone Use" and brings AI applications closer to everyday life.

AiBase Highlights:
🚀 AutoGLM is an agent launched by the Zhipu Technology Team based on GLM technology research, capable of simulating human mobile operations to perform tasks.
💡 AutoGLM has a wide range of applications and can complete various tasks on platforms like WeChat, Taobao, Ctrip, 12306, and Meituan without the need for complex workflows.
🔧 AutoGLM is built on a self-developed intermediate interface that decouples task planning from action execution, together with an online curriculum reinforcement learning framework, addressing the challenges in both planning and execution.
Details Link: https://xiao9905.github.io/AutoGLM
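To illustrate the idea of decoupling planning from execution, here is a minimal sketch, not AutoGLM's actual code. All names (`Action`, `plan`, `ground`), the rule-based planner, and the screen coordinates are hypothetical stand-ins: a real system would use a language model to plan and a vision model to locate elements on screen.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """Intermediate interface: an abstract step with no screen coordinates."""
    kind: str       # "tap" or "type"
    target: str     # natural-language description of the UI element
    text: str = ""  # text to enter, used only when kind == "type"


def plan(task: str) -> list[Action]:
    """Hypothetical planner: turns a user command into abstract steps."""
    # A hard-coded rule stands in for a model-based planner.
    if task.startswith("order "):
        item = task[len("order "):]
        return [
            Action("tap", "search box"),
            Action("type", "search box", item),
            Action("tap", "first result"),
            Action("tap", "checkout button"),
        ]
    return []


def ground(action: Action, screen: dict[str, tuple[int, int]]) -> str:
    """Hypothetical executor: resolves a described element to coordinates."""
    x, y = screen[action.target]
    if action.kind == "type":
        return f"type({x}, {y}, {action.text!r})"
    return f"tap({x}, {y})"


# Made-up element positions for one app screen.
screen = {
    "search box": (540, 120),
    "first result": (540, 400),
    "checkout button": (900, 1800),
}
commands = [ground(step, screen) for step in plan("order iced latte")]
for command in commands:
    print(command)
```

Because the planner only emits element descriptions, the same plan could be grounded against a different screen layout by swapping the `screen` mapping, which is the practical benefit of keeping the two stages separate.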

2. Minshen Releases Major Update for Flux Version of IC-Light Model: 16-Channel VAE Breaks Performance Limits, Astonishing Detail Retention!

IC-Light V2, based on the Flux architecture, has emerged with revolutionary image processing breakthroughs. The 16-channel VAE and high-resolution features have taken its detail retention and accuracy to new heights, demonstrating excellent adaptability.

AiBase Highlights:
✨ Revolutionary image processing breakthrough: IC-Light V2 uses 16-channel VAE and high-resolution features, breaking performance limits with astonishing detail retention.
🌟 Multi-scenario adaptability: IC-Light V2 is a versatile tool capable of handling oil painting and anime-style images, maintaining the original essence and performing excellently.
💡 Powerful function support: IC-Light V2 features low-light processing and shadow adjustment capabilities, providing strong support for photography post-processing and professional image processing.
Details Link: https://github.com/lllyasviel/IC-Light/discussions/98

3. Farewell to Voice Actors? ByteDance's PersonaTalk Makes AI Accurate Dubbing, Even Facial Expressions Perfectly Replicated!

ByteDance's latest AI model, PersonaTalk, achieves precise video dubbing: the new voice is synchronized with the speaker's mouth movements while the character's original traits are retained, making the video look more realistic and natural. The model uses an attention-based two-stage framework and delivers highly personalized dubbing with excellent visual quality. It still has limitations with non-human avatars and large facial movements. ByteDance plans to limit access to the core model to prevent misuse of the technology.

AiBase Highlights:
🔊 Voice synchronized with mouth movements: PersonaTalk ensures that characters' mouth movements in the video match the new voice.
👤 Retain character traits: PersonaTalk retains the original characteristics of the characters, including speaking style, face shape, and expressions, maintaining the realism of the video.
🤖 Applicable to different characters: PersonaTalk does not require extensive data to train each character individually, adapting to diverse scenarios, providing flexibility and convenience.
Details Link: https://grisoon.github.io/PersonaTalk/

4. Meta Open-Sources Long Video LLM Project LongVU: Can Filter Duplicate Frames Efficiently and Accurately Understand Long Video Content

Meta's AI team has launched LongVU, a new spatial-temporal adaptive compression mechanism aimed at improving language understanding of long videos. The technique uses DINOv2 features to eliminate redundant frames and applies selective feature compression guided by cross-modal queries. It performs excellently on a range of video understanding benchmarks and surpasses other methods, especially on long-video understanding tasks. The rapid growth of long video content calls for more efficient processing methods, and LongVU opens up new possibilities for multi-modal understanding.

AiBase Highlights:
📽️ LongVU is a new spatial-temporal adaptive compression mechanism aimed at improving language understanding of long videos.
🔍 The technology uses DINOv2 features to eliminate redundant frames and achieves selective feature compression through cross-modal queries.
🚀 LongVU performs excellently in various video understanding benchmark tests, especially in long video understanding tasks, surpassing other methods.
Details Link: https://vision-cair.github.io/LongVU/
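The redundant-frame step can be sketched as follows. This is an illustrative simplification, not LongVU's implementation: the function name `drop_redundant_frames`, the 0.95 threshold, and the three-dimensional toy vectors are assumptions, and in practice the per-frame features would come from a vision encoder such as DINOv2.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def drop_redundant_frames(features: list[list[float]],
                          threshold: float = 0.95) -> list[int]:
    """Return indices of frames to keep.

    A frame is kept only if it is sufficiently different from the most
    recently kept frame (similarity below the threshold).
    """
    kept: list[int] = []
    for i, feat in enumerate(features):
        if not kept or cosine(features[kept[-1]], feat) < threshold:
            kept.append(i)
    return kept


# Toy per-frame features: frames 1 and 3 nearly repeat frames 0 and 2.
frames = [
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.05],
    [0.0, 0.0, 1.0],
]
kept = drop_redundant_frames(frames)
print(kept)  # [0, 2, 4]
```

Comparing each frame against the last kept frame, rather than its immediate predecessor, prevents a slow drift of near-identical frames from all being retained.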

5. AI Latte Here! Google Gemini AI Provides Support, But the Recipe Looks a Bit Dark

In Manila, Philippines, the café Commune has collaborated with Google Philippines to launch an AI-assisted Bibingka latte that blends the flavors of a traditional festive food and showcases the possibilities of modern beverage innovation. The drink lets people feel a strong festive atmosphere, evokes nostalgia for traditional cuisine, and has attracted the attention of coffee enthusiasts.

AiBase Highlights:
☕️ Beverage fusion with espresso, steamed milk, salted egg, and other local specialty ingredients, presenting authentic flavors.
🌿 AI technology combined with barista craftsmanship, showcasing the possibilities of modern beverage innovation.
🤖 Commune demonstrates how to integrate cultural elements into products, highlighting the brand's creativity in seasonal products and showcasing the potential of AI in food and beverage creativity.

6. Break Free from Manual Annotation! ByteDance's MaskGCT Model Uses 100,000 Hours of Data to Teach AI to Speak on Its Own

ByteDance has released a new text-to-speech (TTS) model called MaskGCT, which breaks with the conventional TTS approach by learning on its own without relying on manual annotation. The model adopts a masked generative encoder-decoder Transformer architecture that lets the AI flexibly control speech duration, producing synthesized speech with high quality, strong speaker similarity, and natural rhythm.

AiBase Highlights:
🔥 No manual annotation needed: the model is trained on 100,000 hours of unlabeled speech data and learns on its own.
💡 Uses a Transformer architecture, converting speech into semantic features, then predicting acoustic features, achieving high-quality speech synthesis.
🚀 Can flexibly control speech duration, mimic different speaker styles, and even perform cross-language speech translation, showing a level comparable to that of a real person.
Details Link: https://huggingface.co/spaces/amphion/maskgct

7. Meta Releases Open-Source Version of NotebookLM, "NotebookLlama"

Meta recently launched NotebookLlama, an open-source take on the popular podcast generation feature in Google's NotebookLM. NotebookLlama can convert user-uploaded files into interactive podcast-style summaries, but the generated audio quality is currently low, with mechanical-sounding and overlapping voices. AI-generated podcasts may also contain false information, a challenge common to all AI projects.

AiBase Highlights:
🎧 NotebookLlama is an open-source podcast generation tool released by Meta, using the Llama model to process user-uploaded files.
🤖 The tool converts text into podcast-style summaries, but the sound quality is low, with mechanical and overlapping sound issues.
📉 AI-generated podcasts may still contain false information, a common challenge for AI projects.
Details Link: https://github.com/meta-llama/llama-recipes/tree/main/recipes/quickstart/NotebookLlama

8. AI Transcription Tool Whisper Exposed to Serious "Hallucinations"

An AI transcription tool powered by OpenAI's Whisper has recently gained popularity in the medical industry, but research has found that "hallucinations" occur in about 1% of transcriptions, with the tool even fabricating content. OpenAI says it is working to improve the tool's performance, especially by reducing hallucinations.

AiBase Highlights:
🌟 Whisper transcription tool widely used in the medical industry, recording 7 million medical dialogues.
⚠️ Research found that Whisper has "hallucinations" in about 1% of transcriptions, sometimes generating meaningless content.
🔍 OpenAI continues to work on improving tool performance, especially in reducing hallucination phenomena.

9. Google Develops AI Tool "Project Jarvis," Easily Control Your Computer and Browser!

Google's latest AI tool, "Project Jarvis," is set to change how people interact with computers, making AI applications simpler and more convenient. Users only need to enter simple commands, and the AI automatically completes various online tasks, lowering the barrier to use. Privacy and security also need attention, however, and Google will have to strengthen its safeguards to protect user data.

AiBase Highlights:
🤖 Google's "Project Jarvis" AI tool can take over browsers and computers, simplifying the operation process.
🖥️ Users can automatically complete online tasks through simple commands, improving work efficiency.
🔒 Google needs to strengthen privacy and security protection, establishing comprehensive measures to address potential risks.

10. Apple's New AI System Ferret-UI 2 Redefines UI Interaction Experience

Apple's new-generation AI system, Ferret-UI 2, has made significant advances in UI element recognition. Its defining feature is understanding user intent, so that interfaces can be operated through natural language commands. The architecture adapts to multiple platforms and uses an adaptive algorithm to adjust image resolution, keeping computation efficient. In the competitive field of UI-interaction AI, Apple's CAMPHOR framework further strengthens the system's ability to handle complex tasks, pointing toward more intelligent human-computer interaction.

AiBase Highlights:
🚀 Ferret-UI 2 has made significant advances in UI element recognition, outperforming GPT-4V in tests.
🔍 Ferret-UI 2 understands user intent and operates the interface through natural language commands, improving the user experience.
⚙️ Ferret-UI 2's architecture adapts to multiple platforms, with an adaptive algorithm adjusting image resolution to keep computation efficient.

11. Cohere Launches the First Integrated Image and Text Search Model, Embed 3

Cohere's latest Embed 3 search model achieves seamless integration of image search and text retrieval, bringing revolutionary changes to enterprises. The new system adopts a unified storage architecture to solve the problem of maintaining multiple independent databases, supports mainstream image formats, and converts business data into vector representations, significantly improving retrieval efficiency. The updated model supports over 100 languages and has strong cross-platform compatibility.

AiBase Highlights:
🔍 Seamless integration of image search and text retrieval, bringing revolutionary changes to enterprise search methods.
💾 Unified storage architecture solves the problem of maintaining multiple independent databases, supporting mainstream image formats.
⚙️ Business data converted into vector representations, improving retrieval efficiency. Supports over 100 languages, strong cross-platform compatibility.
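The idea of a single store holding both image and text vectors can be sketched as below. This is a toy illustration, not Cohere's API: the `index` contents, the `search` helper, and the three-dimensional vectors are invented for the example, and real embeddings would be produced by an embedding model and have far more dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# One unified index: each entry is (modality, name, embedding).
# The vectors are hand-made placeholders, not real model output.
index = [
    ("text", "Q3 sales report", [0.9, 0.1, 0.0]),
    ("image", "sales_chart.png", [0.8, 0.2, 0.1]),
    ("text", "office party memo", [0.0, 0.2, 0.9]),
    ("image", "team_photo.jpg", [0.1, 0.1, 0.9]),
]


def search(query_vec: list[float], top_k: int = 2) -> list[tuple[str, str]]:
    """Rank every item, regardless of modality, by similarity to the query."""
    ranked = sorted(index,
                    key=lambda item: cosine(query_vec, item[2]),
                    reverse=True)
    return [(kind, name) for kind, name, _ in ranked[:top_k]]


# A placeholder query vector standing in for an embedded "sales" query.
results = search([1.0, 0.1, 0.0])
print(results)
# [('text', 'Q3 sales report'), ('image', 'sales_chart.png')]
```

Because both modalities share one vector space and one index, a single query returns a document and a chart together, with no need to query and merge separate text and image databases.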

12. GPT-4 Surpasses Human Analysts, Financial Forecast Accuracy Rate Reaches 60%

站长之家

This article is from AIbase Daily
