The AI news app Particle introduces a podcast clips feature that condenses 45 minutes of audio into 45 seconds, bridging text and audio and offering a more efficient way to consume news.
DeepSeek OCR2 introduces a novel visual encoder that simulates human scanning patterns, replacing traditional CLIP components with a lightweight language model for adaptive content focus in document and image processing.
Artificial intelligence company Clipto.AI has completed its Pre-A++ funding round, with a valuation exceeding $250 million. The round was led by EnvisionX Capital and Palm Drive Capital, with existing shareholders including Sequoia China and Hengan Innovation investing alongside. The funds will be primarily used for the research and development of edge-side multimodal AI models and systems.
ByteDance and Nanyang Technological University jointly developed the StoryMem system, aimed at solving the problem of inconsistent character appearance in AI-generated videos. The system stores key frames and references them in subsequent scenes to ensure that characters and environments remain consistent across multiple scenes. Existing models such as Sora perform well in short clip generation, but still face challenges when assembling multi-scene stories.
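As a purely illustrative sketch of the idea (not ByteDance's actual StoryMem implementation), such a key-frame memory can be modeled as a store keyed by character or scene identifiers; all names below are hypothetical:

```python
# Illustrative sketch only: a hypothetical key-frame memory, not ByteDance's StoryMem code.
from dataclasses import dataclass, field

@dataclass
class KeyFrameMemory:
    """Stores reference frames per character/scene so later shots can reuse them."""
    frames: dict = field(default_factory=dict)  # entity name -> list of reference frames

    def store(self, name: str, frame) -> None:
        # Remember a key frame (e.g., a rendered image) for a character or environment.
        self.frames.setdefault(name, []).append(frame)

    def retrieve(self, name: str):
        # Return previously stored frames to condition the next scene's generation.
        return self.frames.get(name, [])

memory = KeyFrameMemory()
memory.store("protagonist", "frame_0001.png")   # key frame captured from scene 1
references = memory.retrieve("protagonist")      # referenced when generating scene 2
```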
FragCut: an AI-powered tool that generates game clips 10 times faster and automatically detects highlight moments in Valorant.
Easily capture the best moments from videos and quickly create popular video clips; suitable for video enthusiasts and creators.
Create Christmas videos with AI in just a few minutes. Select a template and click generate to get a video clip with music.
Powered by AI, it transforms long-form videos into high-quality viral clips for sharing across multiple platforms.
Baidu: Input tokens/M -, Output tokens/M -, Context Length 32
Tencent: Input tokens/M $3, Output tokens/M $9, Context Length 16
sd2-community
Stable Diffusion v2-1-unclip is a diffusion model fine-tuned based on Stable Diffusion 2.1. It can accept text prompts and CLIP image embeddings, and is used to create image variants or be used in combination with the text-to-image CLIP prior.
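A minimal usage sketch with the diffusers library, assuming the stabilityai/stable-diffusion-2-1-unclip checkpoint, a CUDA device, and a local input image:

```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from PIL import Image

# Load the unCLIP image-variation pipeline (checkpoint name assumed).
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.jpg").convert("RGB")

# The pipeline embeds the image with CLIP and generates a variation,
# optionally steered by a text prompt.
result = pipe(init_image, prompt="a watercolor rendition").images[0]
result.save("variation.png")
```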
AbstractPhil
MM-VAE Lyra is a multimodal variational autoencoder specifically designed for text embedding conversion, using geometric fusion technology. It combines the CLIP-L and T5-base models and can effectively handle the encoding and decoding tasks of text embeddings, providing an innovative solution for multimodal data processing.
bn22
This is a Transformer model published on the Hugging Face model hub. The model card is automatically generated by the system, and specific model information needs to be further supplemented.
birder-project
This is a ViT-L14 image encoder based on the PE-Core model by Bolya et al., which has been converted to the Birder format for image feature extraction. The model retains the original weights and architecture but removes the CLIP projection layer to output raw image embeddings. It is a general-purpose visual backbone network suitable for image classification and detection tasks.
anhquanlam
This is an automatically generated 🤗 Transformers model card, lacking specific model information.
redlessone
DermLIP is a dermatological vision-language model trained on the Derm1M dataset. It uses a CLIP-style contrastive learning method and is specifically optimized for dermatological images and texts, supporting various application scenarios such as zero-shot classification and few-shot learning.
DermLIP is a vision-language model specifically designed for the field of dermatology, trained on the largest dermatology image-text corpus, Derm1M. This model adopts a CLIP-style architecture and can perform various dermatology-related tasks, including zero-shot classification, few-shot learning, cross-modal retrieval, and concept annotation.
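A zero-shot classification sketch in the usual OpenCLIP style; the hub id and label prompts below are assumptions, not the project's documented interface:

```python
import torch
import open_clip
from PIL import Image

# Hypothetical checkpoint id; the actual DermLIP repo name may differ.
repo = "hf-hub:redlessone/DermLIP_ViT-B-16"
model, _, preprocess = open_clip.create_model_and_transforms(repo)
tokenizer = open_clip.get_tokenizer(repo)

image = preprocess(Image.open("lesion.jpg")).unsqueeze(0)
labels = ["melanoma", "basal cell carcinoma", "benign nevus"]
text = tokenizer([f"a dermoscopic image of {label}" for label in labels])

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```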
ibm-esa-geospatial
Llama3-MS-CLIP is the first vision-language model in the CLIP family that can understand multispectral images. It is trained on one million image-text pairs from the SSL4EO-S12-v1.1 dataset together with generated captions, and outperforms other RGB-based models on most benchmarks.
amildravid4292
Based on the OpenCLIP-ViT-L-14 model, it introduces test-time registers to improve the model's interpretability and downstream task performance.
A vision-language model based on the OpenCLIP-ViT-B-16 architecture. By introducing test-time registers to optimize the internal representation, it solves the problem of feature map artifacts.
lukahh
Vision-language model fine-tuned based on CLIP-ViT-B/32, suitable for image-text matching tasks
UCSC-VLAA
OpenVision is a fully open-source, cost-effective advanced visual encoder family designed for multimodal learning, with performance matching or surpassing OpenAI CLIP.
EduFalcao
A vision-language model fine-tuned based on the CLIP architecture, specifically designed for zero-shot classification of plant diseases
SpursgoZmy
Table LLaVA 7B is an open-source multimodal chatbot specifically designed to understand table images and can perform various table-related tasks such as table question answering, table cell description, and structure understanding. This model is based on the LLaVA-v1.5 architecture, using CLIP-ViT-L-336px as the visual encoder and Vicuna-v1.5-7B as the base large language model.
LEAF-CLIP
A feature extraction model fine-tuned based on openai/clip-vit-large-patch14, optimizing the image and text encoders
it-just-works
This is a speaker segmentation model based on powerset encoding, capable of processing 10-second audio clips and identifying multiple speakers as well as overlapping speech.
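A minimal inference sketch with pyannote.audio, assuming a powerset segmentation checkpoint such as pyannote/segmentation-3.0; the repo id and access token are placeholders:

```python
from pyannote.audio import Inference, Model

# Load a powerset speaker segmentation checkpoint (repo id assumed; gated models need a token).
model = Model.from_pretrained("pyannote/segmentation-3.0", use_auth_token="hf_...")

# Slide a 10-second window over the recording; each chunk yields per-speaker activity,
# so overlapping speech appears as several simultaneously active speakers.
inference = Inference(model, duration=10.0, step=5.0)
segmentation = inference("meeting.wav")

print(segmentation.data.shape)  # (chunks, frames, speaker classes)
```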
kshitij3188
PHOENIX is a domain-adaptive model based on CLIP/ViT, designed to enhance patent image retrieval capabilities, particularly suitable for retrieving semantically or hierarchically related images rather than exact matches.
epchannel
viⓍTTS is a voice generation model capable of cloning voices into different languages using a 6-second short audio clip.
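A voice-cloning sketch via the Coqui TTS API, using the stock multilingual XTTS v2 checkpoint as a stand-in for viⓍTTS, which is assumed to be loadable in the same way:

```python
from TTS.api import TTS

# Stock XTTS v2 used here as a stand-in checkpoint.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the voice from a ~6-second reference clip and synthesize in a target language.
tts.tts_to_file(
    text="Hello, this voice was cloned from a six-second sample.",
    speaker_wav="reference_6s.wav",
    language="en",
    file_path="cloned_output.wav",
)
```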
vidi-deshp
This is a fine-tuned version of CLIP-GPT2 for real-time image captioning tasks, designed to assist visually impaired individuals in understanding image content.
Jialuo21
SciScore is a fine-tuned scientific scoring model based on the CLIP-H model, used to evaluate the scientific alignment between implicit prompts and generated images.
Clippy is a macOS terminal clipboard tool that supports copying file references, GUI pasting, managing recently downloaded files, handling piped data, and AI integration via an MCP server, improving work efficiency.
A tool that provides MCP services for the Windows API, supporting functions such as media control, notification sending, window management, screenshot, display control, theme setting, start menu, and clipboard operations.
A Video Editing MCP Server based on FFmpeg that supports performing operations such as video clipping, merging, and format conversion through natural language commands
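As an illustration of the kind of operation such a server wraps, a clip command can shell out to ffmpeg roughly like this (function name and arguments are hypothetical):

```python
import subprocess

def clip_video(src: str, start: str, duration_s: str, dst: str) -> None:
    """Cut a segment from src starting at `start` (HH:MM:SS) lasting `duration_s` seconds."""
    subprocess.run(
        ["ffmpeg", "-y", "-ss", start, "-i", src, "-t", duration_s, "-c", "copy", dst],
        check=True,
    )

clip_video("input.mp4", "00:01:30", "20", "highlight.mp4")
```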
SeekCode is a modern desktop application focused on code snippet management and clipboard integration. It supports multi-language syntax highlighting and tag search, and has a built-in MCP server for automated access by AI assistants.
An MCP server for Windows systems, providing functions such as media control, notification sending, window management, screenshot taking, monitor control, theme setting, file/URL opening, and clipboard operations.
An implementation of a password generation server based on the Model Context Protocol (MCP), supporting both character and word password generation methods, and capable of copying the generated password to the clipboard.
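A minimal sketch of such a server using the official MCP Python SDK and pyperclip; the tool name and parameters are illustrative, not the project's actual interface:

```python
import secrets
import string

import pyperclip
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("password-generator")

@mcp.tool()
def generate_password(length: int = 16, copy: bool = True) -> str:
    """Generate a random character password and optionally copy it to the clipboard."""
    alphabet = string.ascii_letters + string.digits + string.punctuation
    password = "".join(secrets.choice(alphabet) for _ in range(length))
    if copy:
        pyperclip.copy(password)
    return password

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```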
A Perplexity AI desktop application based on Electron, with full system permissions and features, including clipboard operations, drag-and-drop functionality, voice and media permissions, etc.
The Maccy Clipboard MCP Server is a service tool that exposes Maccy's clipboard history to AI assistants such as Claude. It supports searching, viewing, and managing clipboard content, including image support and data statistics functions. However, be aware of the risk of sensitive data leakage.
A powerful YouTube content access MCP server that provides full access to video transcription, metadata, comments, screenshots, and audio clips, supporting both the desktop and web versions of Claude.
This is a fashion recommendation system based on CLIP. It detects clothing in user-uploaded images with YOLO, encodes the detections with CLIP, and recommends similar products. The FastAPI server, database connection, and a basic front-end UI are complete; the next steps are improving CLIP label accuracy and system integration.
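A sketch of just the CLIP encoding and nearest-neighbour retrieval step (YOLO detection and the FastAPI layer omitted); the model id and catalogue shape are assumptions:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image: Image.Image) -> torch.Tensor:
    # Encode a clothing crop into a unit-normalized CLIP image embedding.
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feat = model.get_image_features(**inputs)
    return feat / feat.norm(dim=-1, keepdim=True)

def recommend(crop: Image.Image, catalogue_embeddings: torch.Tensor, top_k: int = 5):
    # catalogue_embeddings: assumed (N, 512) tensor of pre-computed product embeddings.
    query = embed(crop)
    scores = (query @ catalogue_embeddings.T).squeeze(0)
    return scores.topk(top_k).indices.tolist()
```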
An MCP server based on nut.js that provides comprehensive control functions for the computer screen, mouse, and keyboard, including screenshot, mouse operation, keyboard input, window management, and clipboard access.
An MCP server implemented via the FFmpeg command line, supporting functions such as local video search, clipping, splicing, and playback.
An MCP server for retrieving clipboard content, currently supporting only image content on macOS.
An MCP server that provides access to the macOS clipboard via OSAScript
A cross-platform clipboard MCP server that supports macOS, Windows, and Linux systems, providing functions to read and set clipboard content
Bridge MCP is a Windows PC control server based on the Model Context Protocol (MCP), allowing any AI to fully control the computer through a local proxy program, including application control, mouse and keyboard operations, screen capture, system command execution, browser automation, and clipboard management.
An MCP server based on the GLM-4.5V model that provides intelligent image analysis, supporting image acquisition from file paths or the clipboard. It is specifically designed for code content extraction, architecture analysis, error detection, and documentation generation.
MCP System Bridge is a bridging tool that implements the Model Context Protocol (MCP), providing access to native operating system functions, such as clipboard management, URL handling, and date information retrieval.
A fully functional MCP server that offers 73 tools covering 11 modules including file system, diagnostics, scripts, time management, network, context, Git operations, user input, version control, clipboard, and text conversion.
A fashion recommendation system based on CLIP that provides similar-product recommendations through image recognition and CLIP encoding.