Tarsier

Tarsier is a large video language model developed by ByteDance that generates high-quality video descriptions.

Tags: Common Product, Video, Video Description, Video Understanding
Tarsier is a family of large-scale video-language models developed by the ByteDance research team, designed to generate high-quality video descriptions and to support general video understanding. The models are trained with a two-stage strategy, multi-task pre-training followed by multi-granularity instruction fine-tuning, which improves the accuracy and detail of the generated descriptions. Its main strengths are precise video description, the ability to understand complex video content, and state-of-the-art (SOTA) results on multiple video understanding benchmarks. Tarsier targets the shortcomings in detail and accuracy of existing video-language models through extensive training on high-quality data. No pricing is currently listed; the model is aimed mainly at academic research and commercial applications, and suits scenarios that require high-quality understanding and description of video content.

Tarsier Visit Over Time

Monthly Visits: 474,564,576
Bounce Rate: 36.20%
Pages per Visit: 6.1
Visit Duration: 00:06:34
