DeepGEMM is a CUDA library for high-performance FP8 general matrix multiplication (GEMM) with fine-grained scaling. It targets the Tensor Cores of the NVIDIA Hopper architecture and draws its speed from techniques such as Hopper's Tensor Memory Accelerator (TMA), persistent warp specialization, and a fully just-in-time (JIT) design. The library is aimed at deep learning and high-performance computing workloads that depend on efficient matrix operations. Its design is deliberately concise: the core is roughly 300 lines of code, which makes it approachable to learn and use, yet its performance matches or exceeds that of expert-optimized libraries across a variety of matrix shapes. DeepGEMM is free and open source, which makes it a practical choice for researchers and developers working on deep learning optimization.
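
To make "fine-grained scaling" concrete, here is a minimal pure-Python sketch of the idea. It is not DeepGEMM's API or kernel code. The helper names (`round_to_e4m3`, `quantize_blocks`, `dequantize_blocks`, `mean_abs_error`), the block size of 128, and the crude software emulation of FP8 E4M3 rounding are all assumptions made for illustration. The sketch compares one scale factor for the whole vector against one scale factor per block, and shows why the per-block variant preserves small values that sit next to large ones.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def round_to_e4m3(x: float) -> float:
    """Crude emulation of E4M3 rounding (hypothetical helper, not a library call).

    Keeps 4 significant bits (1 implicit + 3 stored), uses a fixed step of
    2**-9 in the subnormal range, and clamps to +-448.
    """
    if x == 0.0:
        return 0.0
    _, exp = math.frexp(x)      # x = m * 2**exp with 0.5 <= |m| < 1
    exp = max(exp, -5)          # below 2**-6 the spacing stays at 2**-9
    step = 2.0 ** (exp - 4)     # spacing between representable values
    rounded = round(x / step) * step
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, rounded))


def quantize_blocks(values, block=128):
    """Quantize with one scale per block of `block` consecutive elements."""
    quantized, scales = [], []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        # Map the largest magnitude in the block onto the FP8 maximum.
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.extend(round_to_e4m3(v / scale) for v in chunk)
    return quantized, scales


def dequantize_blocks(quantized, scales, block=128):
    """Undo the scaling: multiply each element by its block's scale."""
    return [q * scales[i // block] for i, q in enumerate(quantized)]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Illustrative data: one block of tiny values followed by one block of large values.
values = [0.001 * (i % 7 + 1) for i in range(128)] + \
         [100.0 * (i % 5 + 1) for i in range(128)]

# Coarse: a single scale for the whole vector (block spans everything).
coarse_q, coarse_scales = quantize_blocks(values, block=len(values))
coarse = dequantize_blocks(coarse_q, coarse_scales, block=len(values))

# Fine-grained: one scale per 128-element block.
fine_q, fine_scales = quantize_blocks(values, block=128)
fine = dequantize_blocks(fine_q, fine_scales, block=128)

# Compare the reconstruction error on the block of tiny values only.
coarse_err = mean_abs_error(values[:128], coarse[:128])
fine_err = mean_abs_error(values[:128], fine[:128])
print(f"error on small block, single scale:     {coarse_err:.3e}")
print(f"error on small block, per-block scales: {fine_err:.3e}")
```

With a single scale, the large values dictate the scale factor and the tiny values fall into the coarse subnormal range of the format. With per-block scales, each block uses the full FP8 range on its own. In a real FP8 GEMM the scale factors are applied while partial products are accumulated, which this sketch does not model.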