SageAttention

Public

Quantized attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.

attention cuda efficient-attention inference-acceleration llm mlsys quantization triton video-generation

Created: 2024-10-03T17:33:18

Updated: 2025-03-27T08:33:32

2.0K

Stars


Related projects

Annotated_deep_learning_paper_implementations

attention

60+ implementations/tutorials of deep learning papers with side-by-side notes, including transformers (original, xl, switch, feedback, vit, ...), optimizers (adam, adabelief, sophia, ...), gans (cyclegan, stylegan2, ...), reinforcement learning (ppo, dqn), capsnet, distillation, ...

62098

3 months ago

+23 today

Vllm

Hot

amd

A high-throughput and memory-efficient inference and serving engine for LLMs

53050

1 year ago

+131 today

Vit Pytorch

artificial-intelligence

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

23453

3 months ago

+10 today

Sglang

cuda

SGLang is a fast serving framework for large language models and vision language models.

16256

3 months ago

+31 today

Numpy Ml

attention

Machine learning, in numpy

16130

3 months ago

Leedl Tutorial

bert

Hung-yi Lee Deep Learning Tutorial (recommended by Professor Hung-yi Lee; also known as the "Apple Book"). PDF download: https://github.com/datawhalechina/leedl-tutorial/releases

15502

4 months ago

+6 today

Kaldi

c-plus-plus

kaldi-asr/kaldi is the official location of the Kaldi project.

15000

4 months ago

Nlp Tutorial

attention

Natural Language Processing Tutorial for Deep Learning Researchers

14674

11 months ago

+1 today

RWKV LM

attention-mechanism

RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can also be directly trained like a GPT transformer (parallelizable). We are at RWKV-7 "Goose". It combines the best of RNN and transformer: great performance, linear time, constant space (no kv-cache), fast training, infinite ctx_len, and free sentence embedding.

13813

4 months ago

+9 today