Google has launched StreetReaderAI, a prototype system that helps blind and low-vision users explore Google Street View independently through natural language interaction. The system combines computer vision, geographic information systems, and large language models to deliver a real-time, conversational, multimodal street view experience that goes beyond traditional voice announcements and gives users more freedom to explore cities accessibly.
At the Midea Group Visionaries Conference, Steve Zhou, co-founder of Zhiyuan Robots, predicted that AI is rapidly moving toward artificial general intelligence (AGI), which may first be achieved after GPT-6. He reviewed the development of AI over the past decade, from the application of computer vision in 2015 to the emergence of an AGI prototype in 2025, highlighting the pace of progress.
Amazon is developing AI smart glasses for delivery drivers to enhance efficiency and safety. The glasses use AI sensors and computer vision to display hazards, navigation, and package info, enabling hands-free scanning and delivery confirmation.
Apple CEO Tim Cook announced in Shanghai that Apple Intelligence, the company's AI technology, is fully entering the Chinese market. He emphasized that AI will profoundly change how people live and can even save lives, and urged the public not to be overly concerned. Cook believes AI will have a positive impact and is not worried about computers thinking like humans.
Auron AI turns your computer into an intelligent companion, helping you manage tasks, automate operations, and communicate naturally.
Runable is a general-purpose automation agent that can automate any digital task that humans perform on a computer.
A virtual computer assistant that can perform tasks such as searching or creating images.
Vy represents the future of computer interfaces, using advanced artificial intelligence technology to change human-computer interaction.
noctrex
Gelato-30B-A3B is a state-of-the-art (SOTA) model fine-tuned for GUI computer-use tasks; this release offers a quantized version for more efficient deployment. The model is designed to understand and process tasks involving graphical user interfaces.
timm
This is a vision Transformer model based on the DINOv3 framework, distilled from the DINOv3 ViT-7B model on the LVD-1689M dataset. It is designed for image feature encoding and efficiently extracts image feature representations for a range of computer vision tasks.
This is a vision Transformer model based on the DINOv3 architecture, using a small configuration and trained through knowledge distillation on the LVD-1689M dataset. This model is specifically designed for efficient image feature extraction and supports various computer vision tasks such as image classification, feature map extraction, and image embedding.
Piero2411
This is a computer vision model based on the YOLOv8s architecture, specifically designed for barcode and QR code detection. The model has been fine-tuned on a comprehensive dataset containing more than 5000 images, supporting accurate detection and classification of multiple barcode types (such as EAN13, Code128, etc.) and QR codes.
logasanjeev
A powerful computer vision tool capable of classifying, detecting, and extracting text from Indian ID card documents.
lmstudio-community
An image-text to text generation model based on the Transformer architecture, designed specifically for computer/GUI-related scenarios, with intelligent agent capabilities.
Zeta-LLM
Zeta 2 is a small language model (SLM) with approximately 460 million parameters, built on consumer-grade computers, with support for multiple languages.
Kar1hik
This model, based on the DINOv2 architecture, is fine-tuned to classify diseases from skin lesion images.
onnx-community
This is the ONNX format version of the facebook/dinov2-base model, suitable for computer vision tasks.
nvidia
The first hybrid computer vision model combining the strengths of Mamba and Transformer. It improves the efficiency of visual feature modeling by redesigning the Mamba formulation and adds self-attention modules in the final layers of the Mamba architecture to improve long-range spatial dependency modeling.
MambaVision is the first hybrid computer vision model combining the strengths of Mamba and Transformer. It enhances visual feature modeling by redesigning the Mamba formulation and incorporates self-attention modules in the final layers of the Mamba architecture to improve long-range spatial dependency modeling.
The first hybrid computer vision model combining the strengths of Mamba and Transformer, enhancing visual feature modeling capability by redesigning the Mamba formulation.
The first hybrid computer vision model combining the strengths of Mamba and Transformer. It improves the efficiency of visual feature modeling through a redesigned Mamba formulation and adds self-attention modules at the end of the Mamba architecture to improve long-range spatial dependency modeling.
mestrevh
This is a Vision Transformer (ViT) model fine-tuned on a legume dataset for identifying disease conditions in legume leaves.
ETH-CVG
LightGlue is an efficient keypoint detection and matching model for feature matching and pose estimation problems in computer vision.
cortexso
Deepscaler is an advanced AI model developed based on DeepScaleR-1.5B-Preview, focusing on improving the efficiency and scalability of machine learning tasks. This model provides high-quality predictive analysis and data processing capabilities, suitable for complex scenarios such as natural language processing and computer vision, and has wide applications in industries such as finance, healthcare, and entertainment.
Bojun-Feng
Qwen2.5 0.5B Instruct GGUF - llamafile is an open-source large language model solution based on the Qwen2.5 0.5B model. It uses llamafile technology to achieve single-file operation and can be deployed and used on local computers without installation. This model performs excellently in coding, mathematics, instruction following, and multilingual support.
kucher7serg
A Flux LoRA model trained on a local computer using Fluxgym for generating images of young male figures.
AIM-v2 is an efficient image encoder implemented based on the timm library, suitable for various computer vision tasks.
AIM-v2 is an efficient image encoder model compatible with the timm framework, suitable for computer vision tasks.
Computer control and automation components for MCP servers.
A showcase of integrating computer vision tools with language models through MCP.
MCP-DBLP is a service based on the Model Context Protocol (MCP) that provides large language models with the ability to access the DBLP computer science literature database, including functions such as search, citation processing, and BibTeX export.
An MCP-based OpenAI agent server that provides specialized agents (such as web search, file search, and computer operation) and a multi-agent coordinator, and that can interact with clients (such as the Claude desktop application) over MCP.
A TypeScript-based MCP server for file system editing tools, ported from the Anthropic computer use demo.
An MCP-based computer vision server built on Ultralytics, supporting object detection, image segmentation, and pose estimation.
An MCP server for seamless integration with computer peripherals, providing a unified API to control, monitor, and manage hardware devices, including cameras, printers, audio devices, and screens.
An MCP-based server that queries prices of computer components on the CoolPC website in Taiwan and automatically generates computer configuration quotes.
The YOLO MCP Service is a powerful computer vision service that integrates with Claude AI through the Model Context Protocol (MCP), providing functions such as object detection, segmentation, classification, and real-time camera analysis.
A TypeScript-based MCP server for interacting with DAOs on the Internet Computer.
Computer control and automation components of the MCP server.
Desktop Commander MCP is a service that enables the Claude desktop application to execute terminal commands on the user's computer and manage processes through the Model Context Protocol (MCP). It provides terminal command execution, process management, file system operations, and code editing functions, supporting long-running commands and differential file editing.
An MCP server based on nut.js that provides comprehensive control functions for the computer screen, mouse, and keyboard, including screenshot, mouse operation, keyboard input, window management, and clipboard access.
An MCP service that allows Claude to control audio playback on the computer.
An MCP server that provides computer control functions, including mouse and keyboard control, OCR, and window management, implemented with PyAutoGUI and RapidOCR and requiring no external dependencies.
Claude Desktop Commander MCP is a server tool that allows the Claude desktop application to execute terminal commands and manage processes on the user's computer. It is built based on the Model Context Protocol (MCP) and provides file system operations and code editing functions.
Koppla is a natural language-based Active Directory management tool that enables querying and operating on user, group, and computer objects through an AI agent.
Android-MCP is a lightweight open-source project that serves as a bridge between AI agents and Android devices. It enables real-world task operations such as app navigation, UI interaction, and automated testing through the MCP server, without relying on traditional computer vision or pre-set scripts.
An MCP server that provides audio input/output functions, letting AI assistants such as Claude interact with the computer's audio system, including recording and playing audio files.
termiAgent is an LLM-based command-line assistant that supports plugin role settings and MCP server connection, aiming to simplify computer operations.