In recent years, large language models have developed rapidly, with BERT emerging as one of the most popular and effective models. However, its size and computational cost pose challenges for training and deployment. To address these issues, compression techniques such as knowledge distillation, quantization, and pruning have been employed, with knowledge distillation among the most widely used. This technique trains a smaller student model to mimic the behavior of a larger teacher model, thereby achieving model compression. DistilBERT is distilled from BERT and trained with a triple loss that combines a masked language modeling loss, a distillation loss over the teacher's soft predictions, and a cosine embedding (similarity) loss between the student's and teacher's hidden states; it is reported to be about 40% smaller and 60% faster than BERT while retaining roughly 97% of its language-understanding performance. Its architecture follows BERT but halves the number of Transformer layers and removes the token-type embeddings and pooler, which makes deployment on resource-constrained devices feasible. Through knowledge distillation, DistilBERT thus substantially compresses a large language model while largely preserving its performance.
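
To make the triple loss concrete, the sketch below shows one way it could be computed in PyTorch. The function name, tensor shapes, and loss weights are illustrative assumptions for this sketch, not the exact configuration used to train DistilBERT.

```python
import torch
import torch.nn.functional as F

def distillation_triple_loss(student_logits, teacher_logits,
                             student_hidden, teacher_hidden,
                             labels, temperature=2.0,
                             w_ce=1.0, w_mlm=1.0, w_cos=1.0):
    """DistilBERT-style triple loss (illustrative): distillation + MLM + cosine embedding.

    Shapes assumed: logits (batch, seq_len, vocab), hidden states (batch, seq_len, dim),
    labels (batch, seq_len) with -100 marking positions to ignore.
    The weights w_ce, w_mlm, w_cos are placeholder defaults, not the published values.
    """
    # 1) Distillation loss: KL divergence between temperature-softened
    #    student and teacher distributions over the vocabulary.
    t = temperature
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t ** 2)

    # 2) Masked language modeling loss: cross-entropy against the ground-truth
    #    tokens, ignoring unmasked positions labeled -100.
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # 3) Cosine embedding loss: align the directions of the student's and
    #    teacher's hidden-state vectors (target = 1 means "make them similar").
    target = torch.ones(
        student_hidden.size(0) * student_hidden.size(1),
        device=student_hidden.device,
    )
    loss_cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )

    return w_ce * loss_ce + w_mlm * loss_mlm + w_cos * loss_cos
```

Scaling the KL term by the squared temperature follows the standard convention from soft-target distillation, keeping its gradient magnitude comparable to that of the hard-label loss as the temperature changes.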