MoMask is an innovative 3D human motion generation model that utilizes a hierarchical quantization scheme to represent actions, consisting of a base layer and progressively added residual markers. The model architecture introduces Masked Transformer and Residual Transformer to predict the mask action markers for the base layer and gradually predict higher-level markers. In text-to-motion generation tasks, MoMask demonstrates exceptional performance, achieving an FID of 0.045 on the HumanML3D dataset, significantly surpassing other models.