Recently, DeepSeek released an updated version of its R1 reasoning model, which posted strong results on several math and coding benchmarks. However, DeepSeek did not disclose the source of its training data, prompting some AI researchers to speculate that the model may have been partially trained on outputs from Google's Gemini family.
Sam Paech, a developer based in Melbourne, claimed to have found many similarities in word choice and expression between DeepSeek's R1-0528 model and Google's Gemini 2.5 Pro. While this is not direct evidence, another developer, the anonymous founder of the SpeechMap project, likewise noted that the "thought trajectories" the DeepSeek model produces during reasoning read much like Gemini's. The finding has once again sparked discussion about whether DeepSeek used competitors' data in training.
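For illustration, here is a minimal sketch of the kind of lexical comparison such claims rest on: gather responses from two models to the same prompts and compare their word-frequency profiles. This is not Paech's actual methodology; the sample texts and function names below are placeholders.

```python
# Sketch: compare word-choice profiles of two models' outputs.
# The sample texts are placeholders, not real model responses.
from collections import Counter
import math
import re

def word_freqs(texts):
    """Lowercase word-frequency distribution over a list of responses."""
    counts = Counter()
    for t in texts:
        counts.update(re.findall(r"[a-z']+", t.lower()))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(p[w] * q.get(w, 0.0) for w in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

model_a_outputs = ["Let us delve into the nuances of this problem..."]
model_b_outputs = ["We should delve into the nuances of the question..."]

score = cosine_similarity(word_freqs(model_a_outputs),
                          word_freqs(model_b_outputs))
print(f"lexical similarity: {score:.3f}")  # closer to 1.0 = more similar phrasing
```

A high score on its own proves little, which is why observers treat such overlaps as circumstantial rather than direct evidence.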
As early as last December, DeepSeek drew criticism because its V3 model frequently identified itself as OpenAI's ChatGPT, suggesting it may have been trained on ChatGPT chat logs. Earlier this year, OpenAI told the media it had found evidence that DeepSeek had used "distillation," a technique for training a new model on the outputs of a larger, more capable one. Bloomberg reported that at the end of 2024 Microsoft detected large volumes of data being exfiltrated through OpenAI developer accounts that may be linked to DeepSeek.
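For context, distillation in this sense usually means harvesting a stronger "teacher" model's responses and using them as supervised training data for a smaller "student" model. A minimal sketch of that workflow is below; query_teacher is a hypothetical stand-in for a real model API call, and the prompts are placeholders.

```python
# Sketch: build a synthetic fine-tuning dataset from a teacher model's
# outputs. query_teacher is hypothetical, not any vendor's real API.
import json

def query_teacher(prompt: str) -> str:
    """Hypothetical call to a large teacher model's API."""
    return f"(teacher response to: {prompt})"

def build_distillation_dataset(prompts, path="distill_data.jsonl"):
    """Collect (prompt, teacher_response) pairs in the JSONL format
    commonly used for supervised fine-tuning of a student model."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            record = {"prompt": prompt, "response": query_teacher(prompt)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

prompts = [
    "Prove that the sum of two even integers is even.",
    "Write a function that reverses a linked list.",
]
build_distillation_dataset(prompts)
```

The resulting file can be fed to a standard fine-tuning pipeline, which is precisely the use of model outputs that API providers' terms typically restrict.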
Although "distillation" technology is not uncommon in the AI community, OpenAI explicitly prohibits users from building competitive products using its model outputs. It should be noted that due to the abundance of low-quality content on the open web, many AI models often mistakenly mimic each other's word choices and phrasing during training. This makes it more complex to deeply analyze the source of training data.
AI researcher Nathan Lambert believes it is entirely plausible that DeepSeek trained its models on Gemini data, noting that the company has ample funds to call the best available API models and generate synthetic data. To guard against distillation, AI companies are also tightening their security measures: OpenAI now requires organizations to complete identity verification before accessing certain advanced models, while Google is hardening its AI Studio platform and restricting access to models' raw reasoning traces.