ViTPose is an open-source pose estimation model that excels at recognizing human postures. Its standout feature is simplicity and efficiency: rather than a complex, task-specific network structure, it uses a plain Vision Transformer as the backbone, a powerful 'skeleton' that extracts key features from images.
A collection of ViTPose models built on the Transformer architecture.
usyd-community
ViTPose++ is a vision Transformer-based foundational model for human pose estimation, achieving an outstanding performance of 81.1 AP on the MS COCO keypoint test set.
ViTPose++ is a vision Transformer-based foundation model for human pose estimation, achieving an outstanding performance of 81.1 AP on the MS COCO keypoint test set.
ViTPose++ is a vision Transformer-based human pose estimation model, achieving outstanding performance of 81.1 AP on the MS COCO keypoint detection benchmark.
stanfordmimi
SynthPose is a keypoint detection model based on the ViTPose Huge backbone, fine-tuned with synthetic data to predict 52 human keypoints, suitable for kinematic analysis.
SynthPose is a 2D human pose estimation model based on ViTPose Base, fine-tuned with synthetic data and capable of predicting 52 anatomical keypoints.
ViTPose is a vision Transformer-based human pose estimation model that achieves an outstanding performance of 81.1 AP on the MS COCO keypoint detection benchmark with a simple design.
ViTPose is a human pose estimation model based on Vision Transformer, achieving outstanding performance on benchmarks like MS COCO through simple architectural design.
A vision Transformer-based human pose estimation model achieving an outstanding performance of 81.1 AP on the MS COCO keypoint test set.
ViTPose is a human pose estimation model based on Vision Transformer, achieving 81.1 AP on the MS COCO keypoint test set, with advantages such as model simplicity, scalable size, and flexible training.
danelcsb
ViTPose is a baseline model for human pose estimation based on plain vision transformers, achieving high-performance keypoint detection with a simple architecture.
onnx-community
A lightweight pose estimation model based on the ViT architecture for human keypoint detection.
nielsr
This is a keypoint detection model based on transformers, used to identify keypoint positions in images.
shauray
This model is used to detect keypoints in images or videos, suitable for tasks such as human pose estimation and facial landmark detection.