A family of advanced machine learning models for image and text data, built with a focus on training-data quality and transparency.
facebook
MetaCLIP 2 (worldwide) is a multilingual zero-shot image classification model based on the Transformer architecture. It supports vision-language understanding across languages worldwide and classifies images against arbitrary label sets without task-specific training (a zero-shot usage sketch follows these entries).
A 7-billion-parameter Vision Transformer trained on 8 billion samples of MetaCLIP data with the DINOv2 self-supervised learning framework, requiring no language supervision
A Vision Transformer trained at 224 resolution on 2 billion MetaCLIP images using the DINOv2 self-supervised learning method
A 3-billion-parameter Vision Transformer trained on 2 billion carefully curated MetaCLIP images using the DINOv2 self-supervised learning framework (a feature-extraction sketch follows these entries)
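To make the zero-shot classification workflow above concrete, here is a minimal sketch using the Hugging Face transformers zero-shot-image-classification pipeline. The checkpoint id and image path are placeholders assumed for illustration; check the Hub for the exact MetaCLIP 2 (worldwide) model name.

```python
# Minimal zero-shot image classification sketch (checkpoint id is assumed, not confirmed).
from transformers import pipeline

classifier = pipeline(
    task="zero-shot-image-classification",
    model="facebook/metaclip-2-worldwide-huge-quickgelu",  # placeholder id; verify on the Hub
)

# Candidate labels are supplied at inference time, so no task-specific training is needed,
# and labels may be written in different languages.
results = classifier(
    "cat.jpg",  # placeholder local image path or URL
    candidate_labels=["a photo of a cat", "a photo of a dog", "una foto de un gato"],
)
print(results)  # ranked labels with scores
```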
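The language-free DINOv2-style encoders listed above are typically used as frozen feature extractors. Below is a minimal sketch with transformers; the checkpoint id is an assumption for illustration only and should be verified on the Hub.

```python
# Minimal feature-extraction sketch for a language-free SSL encoder (checkpoint id assumed).
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_id = "facebook/webssl-dino1b-full2b-224"  # placeholder id; verify on the Hub
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

image = Image.open("cat.jpg")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# [CLS]-token embedding, usable for linear probes, retrieval, clustering, etc.
features = outputs.last_hidden_state[:, 0]
print(features.shape)
```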
timm
A dual-framework compatible vision model trained on the MetaCLIP-2.5B dataset, supporting both OpenCLIP and timm frameworks
A dual-purpose vision-language model trained on the MetaCLIP-2.5B dataset, supporting zero-shot image classification tasks
A dual-framework compatible vision model trained on the MetaCLIP-2.5B dataset, supporting zero-shot image classification tasks
A Vision Transformer model trained on the MetaCLIP-400M dataset, supporting zero-shot image classification tasks
A dual-framework compatible vision model trained on the MetaCLIP-2.5B dataset, supporting both OpenCLIP and timm frameworks
A dual-framework compatible vision model trained on the MetaCLIP-400M dataset, supporting both OpenCLIP and timm frameworks
A Vision Transformer model trained on the MetaCLIP-2.5B dataset, compatible with both the open_clip and timm frameworks (see the dual-framework loading sketch below)
A vision-language model trained on the MetaCLIP-400M dataset, supporting zero-shot image classification tasks
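To illustrate the dual-framework support mentioned in these entries, the sketch below loads a MetaCLIP-2.5B checkpoint through both OpenCLIP (full image + text model) and timm (image tower only). The hub id and model name are assumptions for illustration; confirm them against the actual model listings.

```python
# Loading the same MetaCLIP-2.5B ViT-B/16 weights two ways (names assumed; verify on the Hub).
import open_clip
import timm

# 1) OpenCLIP: image + text model for zero-shot classification and retrieval.
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:timm/vit_base_patch16_clip_224.metaclip_2pt5b"  # assumed hub id
)
tokenizer = open_clip.get_tokenizer("hf-hub:timm/vit_base_patch16_clip_224.metaclip_2pt5b")

# 2) timm: image tower only, e.g. as a feature backbone (num_classes=0 drops the head).
vision_backbone = timm.create_model(
    "vit_base_patch16_clip_224.metaclip_2pt5b",  # assumed model name
    pretrained=True,
    num_classes=0,
)
data_cfg = timm.data.resolve_model_data_config(vision_backbone)
transform = timm.data.create_transform(**data_cfg, is_training=False)
```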
MetaCLIP is a vision-language model trained on CommonCrawl data that constructs a shared image-text embedding space.
MetaCLIP is a large-scale vision-language model trained on 2.5 billion image-text pairs curated from CommonCrawl (CC), developed to reveal CLIP's data curation methodology
MetaCLIP is a vision-language model trained on CommonCrawl data that constructs a shared image-text embedding space
MetaCLIP is an implementation of the CLIP framework applied to CommonCrawl data, aiming to reveal CLIP's training data filtering methods
MetaCLIP is a vision-language model based on CommonCrawl data, improving CLIP model performance through enhanced data filtering methods
MetaCLIP is a vision-language model trained on 2.5 billion image-text pairs from CommonCrawl (CC) to construct a shared image-text embedding space.
The MetaCLIP base model is a vision-language model trained on CommonCrawl data that constructs a shared image-text embedding space (see the similarity sketch below).
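As a sketch of how the shared image-text embedding space is used, the snippet below scores one image against a few captions with the transformers CLIP classes; the checkpoint id is an assumption for illustration.

```python
# Image-text similarity in MetaCLIP's shared embedding space (checkpoint id assumed).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "facebook/metaclip-b32-400m"  # placeholder id; verify on the Hub
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("cat.jpg")  # placeholder image path
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity logits between the image and each caption, normalized to probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```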