Apple dropped a bombshell on Hugging Face, releasing a demo of the 4M model from its paper published last year. The model can process and generate content across multiple modalities, including text, images, and 3D scenes, and a single model can extract a wide range of information from an image, such as depth maps and line drawings. AIbase tested it with previously generated ancient-style imagery, and the results are indeed impressive. After an image was uploaded, the demo quickly produced the following breakdown:


Simply by uploading a photo, you can obtain a range of information about it, such as its main contours, the dominant colors in the scene, and the image dimensions.

This marks a bold departure from Apple's traditionally secretive approach to research and development. The company has not only showcased its AI capabilities on Hugging Face's open-source stage but also extended an olive branch to developers, hoping to build an ecosystem around 4M. The multi-modal architecture of 4M hints at more coherent and versatile AI applications across the Apple ecosystem, such as a Siri that handles complex queries more intelligently, or a Final Cut Pro that edits video automatically based on spoken instructions.

However, the introduction of 4M also raises challenges around data practices and AI ethics. Apple has long positioned itself as a guardian of user privacy; will that stance be tested by such a data-intensive AI model? Apple will have to strike a careful balance, pushing technology forward without compromising users' trust.

Let's take a brief look at the technical principles behind 4M. Its biggest highlight is its "large-scale multi-modal masked modeling" training method, which handles multiple visual modalities simultaneously by converting images, semantics, and geometric information into a unified token representation, enabling seamless integration across modalities.
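To make the idea concrete, here is a minimal sketch, not 4M's actual implementation, of how several modalities could be mapped into one shared token format. The tokenizer functions and token ids below are placeholder assumptions standing in for learned tokenizers.

```python
# Minimal sketch: map each modality to discrete tokens in a shared format.
# The tokenizer functions and token ids are placeholders, not 4M's actual
# tokenizers.

def tokenize_rgb(image):
    return [101, 57, 88, 12]      # would come from a learned image tokenizer

def tokenize_depth(depth_map):
    return [301, 44, 9]           # would come from a geometry tokenizer

def tokenize_caption(text):
    return [801, 802, 803]        # would come from an ordinary text tokenizer

def to_unified_tokens(sample):
    """Flatten every modality into (modality, token id) pairs so that a
    single sequence model can consume them interchangeably."""
    tokens = []
    tokens += [("rgb", t) for t in tokenize_rgb(sample["rgb"])]
    tokens += [("depth", t) for t in tokenize_depth(sample["depth"])]
    tokens += [("caption", t) for t in tokenize_caption(sample["caption"])]
    return tokens

sample = {"rgb": "raw pixels", "depth": "depth map", "caption": "a temple at dusk"}
print(to_unified_tokens(sample))
```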

During training, 4M employs a clever approach: it randomly selects a portion of the tokens across modalities as input and another portion as the prediction target, which keeps the training objective scalable. To 4M, both images and text are just sequences of discrete tokens, which greatly improves the model's generality.
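As a rough illustration of that input/target sampling, the sketch below randomly partitions a toy unified token sequence into a visible input set and a disjoint prediction set. The budgets and token layout are illustrative assumptions, not 4M's exact sampling scheme.

```python
import random

def masked_modeling_split(tokens, input_budget, target_budget, rng):
    """Randomly partition a unified token sequence into a visible input set
    and a disjoint prediction-target set."""
    shuffled = tokens[:]
    rng.shuffle(shuffled)
    inputs = shuffled[:input_budget]
    targets = shuffled[input_budget:input_budget + target_budget]
    return inputs, targets

# Toy unified sequence: (modality, token id) pairs from several modalities.
tokens = ([("rgb", i) for i in range(8)]
          + [("depth", i) for i in range(4)]
          + [("caption", i) for i in range(4)])

inputs, targets = masked_modeling_split(tokens, input_budget=6,
                                        target_budget=6, rng=random.Random(0))
print("input tokens :", inputs)
print("target tokens:", targets)
```

Because the partition is redrawn at every step, any modality can serve as conditioning or as a prediction target, which is what lets one model cover so many tasks.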

The training data and methodology of 4M are also noteworthy. It uses CC12M, one of the largest open-source image datasets, which is rich in images but sparse in annotations. To address this, the researchers adopted weakly supervised pseudo-labeling: models such as CLIP and Mask R-CNN were used to predict annotations across the dataset, and these predictions were then converted into tokens, laying the foundation for 4M's multi-modal compatibility.
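The pseudo-labeling step can be pictured roughly as follows. Every function here is a hypothetical stand-in: the CLIP and Mask R-CNN calls are stubbed, and the tokenizers are toy placeholders rather than the actual pipeline.

```python
# Schematic pseudo-labeling sketch: off-the-shelf predictors annotate an
# unlabeled CC12M image, and each prediction is converted into tokens.
# Every function here is a hypothetical stand-in, not a real library call.

def clip_pseudo_caption(image):
    """Stand-in for a CLIP-based caption/label prediction."""
    return "a stone bridge over a river"

def maskrcnn_pseudo_instances(image):
    """Stand-in for Mask R-CNN instance predictions."""
    return [{"label": "bridge", "box": (12, 40, 200, 180)}]

def tokenize_caption(text):
    # Toy tokenizer: hash each word into a small shared vocabulary.
    return [hash(w) % 1000 for w in text.split()]

def tokenize_instances(instances):
    # Toy tokenizer: one token per label plus coarsely quantized box coords.
    tokens = []
    for inst in instances:
        tokens.append(hash(inst["label"]) % 1000)
        tokens.extend(c // 16 for c in inst["box"])
    return tokens

def pseudo_label(image):
    """Turn one unlabeled image into multi-modal token targets."""
    return {
        "caption_tokens": tokenize_caption(clip_pseudo_caption(image)),
        "instance_tokens": tokenize_instances(maskrcnn_pseudo_instances(image)),
    }

print(pseudo_label("raw CC12M image"))
```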

Extensive experiments and testing show that 4M can perform multi-modal tasks directly, without large amounts of task-specific pre-training or fine-tuning. It is like handing AI a multi-modal Swiss Army knife, letting it respond flexibly to a wide variety of challenges.

Demo link: https://huggingface.co/spaces/EPFL-VILAB/4M