2023-10-18 09:19:57 · AIbase · 2.2k
Stanford PhD Develops Flash-Decoding Method to Speed Up LLM Inference by 8 Times
The FlashAttention team has developed Flash-Decoding, a technique that accelerates inference in large Transformer models, achieving up to an 8x speedup. Flash-Decoding speeds up decoding by splitting the Key and Value cache into chunks, loading and processing them in parallel, and then combining the partial attention results. Benchmarks show that Flash-Decoding makes long-sequence decoding up to 8x faster while scaling better as context length grows. The method offers an efficient solution for large Transformer models, particularly for long-context inference.
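To make the split-and-combine idea concrete, here is a minimal PyTorch sketch of split-KV attention for a single decoding step: the KV cache is divided into chunks, attention is computed per chunk (in the real kernel, in parallel across GPU thread blocks), and the partial outputs are merged with a log-sum-exp rescaling. This is an illustrative sketch, not the actual Flash-Decoding kernel; the function name `split_kv_attention`, the tensor shapes, and the `num_splits` parameter are assumptions for the example.

```python
import math
import torch

def split_kv_attention(q, k_cache, v_cache, num_splits=4):
    """q: (heads, d); k_cache, v_cache: (seq_len, heads, d)."""
    heads, d = q.shape
    scale = 1.0 / math.sqrt(d)

    partial_out = []   # per-chunk attention outputs, each (heads, d)
    partial_lse = []   # per-chunk log-sum-exp of the scores, each (heads,)

    for k_chunk, v_chunk in zip(k_cache.chunk(num_splits, dim=0),
                                v_cache.chunk(num_splits, dim=0)):
        # Attention scores of the single query against this chunk of keys.
        scores = torch.einsum("hd,shd->hs", q, k_chunk) * scale   # (heads, chunk_len)
        lse = torch.logsumexp(scores, dim=-1)                     # (heads,)
        weights = torch.softmax(scores, dim=-1)                   # (heads, chunk_len)
        out = torch.einsum("hs,shd->hd", weights, v_chunk)        # (heads, d)
        partial_out.append(out)
        partial_lse.append(lse)

    # Reduction step: rescale each chunk's output by its share of the global
    # softmax normalizer, recovered from the per-chunk log-sum-exps.
    lse_all = torch.stack(partial_lse, dim=0)       # (splits, heads)
    global_lse = torch.logsumexp(lse_all, dim=0)    # (heads,)
    coeffs = torch.exp(lse_all - global_lse)        # (splits, heads)
    out_all = torch.stack(partial_out, dim=0)       # (splits, heads, d)
    return (coeffs.unsqueeze(-1) * out_all).sum(dim=0)            # (heads, d)

# Usage with hypothetical cache length, head count, and head dimension.
q = torch.randn(8, 64)
k_cache = torch.randn(4096, 8, 64)
v_cache = torch.randn(4096, 8, 64)
out = split_kv_attention(q, k_cache, v_cache, num_splits=8)
```

The key point of the design is that each chunk can be processed independently, so a long KV cache keeps the GPU fully occupied even when only one query token is being decoded; the cheap log-sum-exp merge at the end yields exactly the same result as attending over the whole cache at once.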