CulturaX is a large multilingual dataset built for the research and development of large language models (LLMs). It covers 167 languages and has been cleaned and deduplicated to provide high-quality training data for multilingual LLMs. Progress in LLMs depends on both large model sizes and large training datasets, yet current multilingual work is held back by uneven data quality and the scarcity of multilingual data. The public release of CulturaX helps address these problems by giving researchers and developers a ready resource for building multilingual LLMs.