Google Research recently proposed a novel active learning curation process aimed at drastically reducing the amount of training data required to fine-tune large language models. According to its experiments, the method can cut the training data to as little as 1/10,000 of the original volume while improving the model's agreement with human experts by up to 65%. In practical applications such as advertising content classification and financial data security analysis, the demand for high-fidelity training data has always been high, yet selecting data that meets the bar is both difficult and extremely expensive.
The new method starts from an initial model prompted in a zero-shot or few-shot fashion. Users define the target content through a prompt, for example asking whether an advertisement is "clickbait." The initial model then labels each ad as clickbait or benign, generating a large labeled dataset. However, this initial dataset often suffers from severe class imbalance, leaving the model weak at accurately identifying the target class.
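A minimal sketch of this zero-shot labeling step is below. The `call_model` callable is a placeholder for whatever LLM API is in use, and the prompt wording and label set are illustrative choices, not Google's exact setup.

```python
from typing import Callable

# Illustrative prompt; the real task definition is supplied by the user.
PROMPT = (
    "You are reviewing advertisements for policy compliance.\n"
    "Is the following ad clickbait? Answer with one word, "
    "'clickbait' or 'benign'.\n\nAd: {ad_text}"
)

def label_ads(ads: list[str],
              call_model: Callable[[str], str]) -> list[tuple[str, str]]:
    """Use the initial model to produce a large (but noisy) labeled set."""
    labeled = []
    for ad in ads:
        answer = call_model(PROMPT.format(ad_text=ad)).strip().lower()
        # Default to 'benign' unless the model clearly says 'clickbait'.
        labeled.append((ad, "clickbait" if "clickbait" in answer else "benign"))
    return labeled
```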
To address this issue, the researchers clustered the examples the model labeled as clickbait alongside those it labeled benign, and found that some clusters overlapped, indicating content on which the model tends to err. They could then select sample pairs from these overlapping clusters and send them to experts for evaluation, controlling review cost while prioritizing pairs that cover a variety of situations. The resulting samples are both highly informative and representative of the model's most likely error scenarios, as the sketch below illustrates.
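The following is a simplified sketch of that overlap-based selection, assuming precomputed ad embeddings as NumPy arrays. The cluster count, the overlap test (a centroid-distance percentile), and the pairing rule are illustrative choices, not the exact procedure from Google's paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def select_review_pairs(emb_clickbait, emb_benign, n_clusters=10, n_pairs=50):
    """Find cross-label example pairs in regions where the two classes overlap."""
    cb = KMeans(n_clusters=n_clusters, n_init="auto").fit(emb_clickbait)
    bn = KMeans(n_clusters=n_clusters, n_init="auto").fit(emb_benign)

    # Treat a clickbait/benign cluster pair as "overlapping" when their
    # centroids are unusually close (here: the closest 10% of pairs).
    centroid_dist = cdist(cb.cluster_centers_, bn.cluster_centers_)
    threshold = np.percentile(centroid_dist, 10)

    pairs = []
    for i, j in zip(*np.where(centroid_dist <= threshold)):
        cb_idx = np.where(cb.labels_ == i)[0]
        bn_idx = np.where(bn.labels_ == j)[0]
        # Pick the closest cross-label pair inside the overlapping region:
        # these are the examples the model most plausibly confuses.
        d = cdist(emb_clickbait[cb_idx], emb_benign[bn_idx])
        a, b = np.unravel_index(np.argmin(d), d.shape)
        pairs.append((cb_idx[a], bn_idx[b]))
    return pairs[:n_pairs]
```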
During fine-tuning, the expert-labeled data is split into two sets: one for evaluating the model's agreement with the experts, the other for fine-tuning the model itself. The process repeats until the model's performance reaches a level comparable to that of the human experts.
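A high-level sketch of this iterative loop follows. All helpers (`fine_tune`, `predict`, `expert_label`, `select_confusable`) are hypothetical stand-ins injected by the caller; Cohen's kappa serves as the agreement metric, and the 0.8 target is illustrative.

```python
from sklearn.metrics import cohen_kappa_score

def curation_loop(model, pool, fine_tune, predict, expert_label,
                  select_confusable, kappa_target=0.8, max_rounds=10):
    for _ in range(max_rounds):
        # Experts label the most confusable example pairs found this round.
        batch = expert_label(select_confusable(model, pool))

        # Split the expert labels: one half is held out to score
        # model-expert agreement, the other half fine-tunes the model.
        half = len(batch) // 2
        eval_split, train_split = batch[:half], batch[half:]
        model = fine_tune(model, train_split)

        kappa = cohen_kappa_score(
            [label for _, label in eval_split],
            [predict(model, text) for text, _ in eval_split],
        )
        if kappa >= kappa_target:  # agreement comparable to experts
            break
    return model
```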
Google's experiments used the Gemini Nano-1 and Nano-2 models, tested on two tasks of differing complexity. Each task had roughly 100,000 crowdsourced labels, though the classes were severely imbalanced. The results showed very high agreement among the experts themselves, but relatively low agreement between the crowdsourced labels and expert judgments. With the new method, the 3.25-billion-parameter model showed a marked improvement in expert alignment on the lower-complexity task while using only 250 to 450 expert-labeled examples instead of the original 100,000.
In summary, Google's new method demonstrates that with a small amount of high-quality data, and provided expert annotation consistency exceeds 0.8, large models can be trained to excellent performance.
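As a concrete illustration of that 0.8 consistency bar, read here as pairwise Cohen's kappa between two expert annotators (a standard chance-corrected agreement metric), the toy labels below fall short of it despite agreeing on 4 of 5 items:

```python
from sklearn.metrics import cohen_kappa_score

# Toy annotations from two hypothetical experts on five ads.
expert_a = ["clickbait", "benign", "benign", "clickbait", "benign"]
expert_b = ["clickbait", "benign", "clickbait", "clickbait", "benign"]
print(cohen_kappa_score(expert_a, expert_b))  # ~0.62: below the 0.8 bar
```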
Key Points:
📉 The amount of training data can be reduced to 1/10,000 of the original while improving the model's alignment with human experts.
🤝 The new method relies on expert judgment and model iteration to ensure sample quality.
📊 Experiments show that a small amount of high-quality data can match or even exceed the results of traditional large-scale datasets.