This product is a self-rewarding language model trained with LLM-as-a-Judge prompting, using reward signals generated by the model itself. Through iterative Direct Preference Optimization (DPO) training, the model improves not only its ability to follow instructions but also the quality of the self-rewards it generates. After three iterations of fine-tuning, it surpasses many existing systems, including Claude 2, Gemini Pro, and GPT-4 0613, on the AlpacaEval 2.0 leaderboard. While this is preliminary research, it opens the door to models that can continually improve along both axes: instruction following and self-reward quality.
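
To make the training loop concrete, the sketch below outlines one self-rewarding iteration under stated assumptions: `generate`, `judge_score`, and `dpo_update` are hypothetical helper names standing in for the actual generation, LLM-as-a-Judge scoring, and DPO update steps, which are not specified in this summary.

```python
def self_rewarding_iteration(model, prompts, num_samples=4):
    """One hypothetical self-rewarding iteration: the model generates candidate
    responses, judges them itself, and is fine-tuned on the resulting
    preference pairs with DPO. Helper functions are illustrative stand-ins."""
    preference_pairs = []
    for prompt in prompts:
        # 1. Sample several candidate responses from the current model.
        candidates = [generate(model, prompt) for _ in range(num_samples)]
        # 2. Score each candidate with the same model acting as an LLM-as-a-Judge.
        scores = [judge_score(model, prompt, c) for c in candidates]
        # 3. Keep the best- and worst-scored responses as a (chosen, rejected) pair.
        chosen = candidates[scores.index(max(scores))]
        rejected = candidates[scores.index(min(scores))]
        if chosen != rejected:
            preference_pairs.append((prompt, chosen, rejected))
    # 4. Fine-tune the model on its own self-generated preference pairs via DPO,
    #    producing the model used in the next iteration.
    return dpo_update(model, preference_pairs)
```

Repeating this loop, with each iteration's output model serving as the generator and judge for the next, is what "three iterations of fine-tuning" refers to above.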