Differential-Transformer-PyTorch
PyTorch implementation of the Differential Transformer architecture for sequence modeling, built as a decoder-only model in the style of large language models (LLMs). The architecture incorporates the Differential Attention mechanism, a multi-head structure, RMSNorm, and SwiGLU. A minimal sketch of the differential attention computation is shown below.
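The following is a minimal, self-contained sketch of a single differential attention head, based on the mechanism described in the Differential Transformer paper: two softmax attention maps are computed and their difference, weighted by a learnable scalar lambda, is applied to the values. Class, parameter, and argument names here are illustrative and are not taken from this repository's code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentialAttention(nn.Module):
    """Single-head sketch of differential attention (names are illustrative).

    Two independent softmax attention maps are computed and their difference,
    weighted by a learnable scalar lambda, is applied to V.
    """

    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.8):
        super().__init__()
        # Two sets of query/key projections; one value projection.
        self.w_q = nn.Linear(d_model, 2 * d_head, bias=False)
        self.w_k = nn.Linear(d_model, 2 * d_head, bias=False)
        self.w_v = nn.Linear(d_model, d_head, bias=False)
        self.w_o = nn.Linear(d_head, d_model, bias=False)
        self.d_head = d_head
        # Learnable re-parameterisation of lambda (assumed, following the paper).
        self.lambda_q1 = nn.Parameter(torch.randn(d_head) * 0.1)
        self.lambda_k1 = nn.Parameter(torch.randn(d_head) * 0.1)
        self.lambda_q2 = nn.Parameter(torch.randn(d_head) * 0.1)
        self.lambda_k2 = nn.Parameter(torch.randn(d_head) * 0.1)
        self.lambda_init = lambda_init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        bsz, seq_len, _ = x.shape
        q1, q2 = self.w_q(x).chunk(2, dim=-1)   # (batch, seq_len, d_head) each
        k1, k2 = self.w_k(x).chunk(2, dim=-1)
        v = self.w_v(x)

        scale = 1.0 / math.sqrt(self.d_head)
        # Causal mask, since the model is decoder-only.
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), 1
        )

        def attn_map(q, k):
            scores = (q @ k.transpose(-2, -1)) * scale
            scores = scores.masked_fill(mask, float("-inf"))
            return F.softmax(scores, dim=-1)

        # Scalar lambda from dot products of the learnable vectors.
        lam = (torch.exp(self.lambda_q1 @ self.lambda_k1)
               - torch.exp(self.lambda_q2 @ self.lambda_k2)
               + self.lambda_init)

        # Differential attention: subtract the second map from the first.
        out = (attn_map(q1, k1) - lam * attn_map(q2, k2)) @ v
        return self.w_o(out)


# Example usage (shapes are arbitrary):
# layer = DifferentialAttention(d_model=64, d_head=32)
# y = layer(torch.randn(2, 16, 64))   # -> (2, 16, 64)
```

The subtraction of the two attention maps is intended to cancel attention noise that both maps share, sharpening focus on relevant context; in the full architecture this head would be replicated across multiple heads and combined with RMSNorm and SwiGLU feed-forward blocks.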
Created: 2024-10-08T21:48:40
Updated: 2025-03-21T17:35:01
Stars: 73