PowerInfer
A high-speed inference engine for local deployment of large language models
Common Product · Productivity · Language Model · Inference Engine
PowerInfer is an engine for high-speed inference of large language models on consumer-grade GPUs in personal computers. It exploits the high locality of LLM inference by preloading hot-activated (frequently activated) neurons onto the GPU, significantly reducing GPU memory requirements and CPU-GPU data transfer. PowerInfer also integrates adaptive predictors and neuron-aware sparse operators to make neuron-activation prediction and sparse computation more efficient. On a single NVIDIA RTX 4090 GPU it achieves an average generation speed of 13.20 tokens per second, only 18% slower than a top-tier server-grade A100 GPU, while maintaining model accuracy.
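The hot/cold neuron split can be pictured with a small sketch (illustrative only, not PowerInfer's actual code or API): given per-neuron activation frequencies from offline profiling, the most frequently activated neurons are pinned to GPU memory until a budget is exhausted, and the remainder stay on the CPU. All names and numbers below are hypothetical.

```python
# Illustrative sketch of hot/cold neuron placement (not PowerInfer's actual API).
# Assumes per-neuron activation frequencies from offline profiling and a fixed
# GPU memory budget; function names and figures here are hypothetical.

def partition_neurons(activation_freq, bytes_per_neuron, gpu_budget_bytes):
    """Assign the most frequently activated ("hot") neurons to the GPU
    until the memory budget is exhausted; the rest run on the CPU."""
    # Sort neuron indices by activation frequency, hottest first.
    order = sorted(range(len(activation_freq)),
                   key=lambda i: activation_freq[i], reverse=True)
    gpu, cpu, used = [], [], 0
    for i in order:
        if used + bytes_per_neuron <= gpu_budget_bytes:
            gpu.append(i)          # hot neuron: preloaded onto the GPU
            used += bytes_per_neuron
        else:
            cpu.append(i)          # cold neuron: evaluated on the CPU
    return gpu, cpu

# Example: 8 neurons, 4 bytes each, and a 12-byte GPU budget (hypothetical).
freqs = [0.90, 0.05, 0.70, 0.10, 0.85, 0.02, 0.30, 0.60]
hot, cold = partition_neurons(freqs, bytes_per_neuron=4, gpu_budget_bytes=12)
print("GPU (hot):", hot)    # the three hottest neurons: [0, 4, 2]
print("CPU (cold):", cold)  # the remaining neurons: [7, 6, 3, 1, 5]
```

Because neuron activations in LLMs follow a power-law-like distribution, a small hot set placed on the GPU this way covers most activations, which is what lets PowerInfer cut GPU memory use and CPU-GPU traffic.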
PowerInfer Website Traffic
Monthly Visits: 513,197,610
Bounce Rate: 36.07%
Pages per Visit: 6.1
Visit Duration: 00:06:32