LAUD: Integrating Large Language Models with Active Learning Overcomes Zero-Shot Limitations, Enhancing Commodity Name Classification

Large language models excel at adapting to new tasks, and further refinement through training typically boosts performance significantly, yet acquiring sufficient labeled data remains a major obstacle for many applications. Tzu-Hsuan Chou and Chun-Nan Chou, from CMoney Technology Corporation, address this challenge with a new learning framework, LAUD, which combines the power of large language models with active learning techniques. LAUD overcomes the initial hurdle of needing labeled data by first leveraging the model’s existing knowledge, then strategically selecting the most informative data points to label, markedly improving performance on tasks such as commodity name classification. The results demonstrate that language models enhanced by LAUD surpass those relying on traditional zero-shot or few-shot learning approaches, offering a practical and efficient solution for real-world applications where labeled data is scarce.

LLMs and Active Learning for Data Efficiency

This research introduces LAUD, a framework designed to effectively utilize large language models (LLMs) for tasks where labeled data is scarce. The core idea pairs a key strength of LLMs, their ability to make predictions without task-specific training, with the efficiency of active learning, which strategically selects the most informative data points for labeling. This creates high-performing models with minimal labeling effort. Experiments in commodity name classification demonstrate that models trained with LAUD significantly outperform LLMs without task-specific refinement and even surpass models trained with standard supervised learning techniques.

The LAUD framework first uses the LLM’s inherent capabilities to make predictions on unlabeled data, then iteratively selects the most informative unlabeled data points for annotation based on uncertainty or other criteria. The LLM is retrained with the newly labeled data, and the process repeats until the desired performance is achieved. This approach efficiently derives high-performing LLMs, surpassing zero-shot and few-shot baselines on commodity name classification tasks. Deployment in a real-world ad-targeting system showed a 50% relative improvement in click-through rate (CTR) compared to a keyword-based baseline, demonstrating practical applicability. The authors identify several areas for future research, including investigating the reasons for variations in inferred positive data points across different categories, exploring the impact of category scarcity on the learning framework, and further analysis of oracle selection strategies.
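The iterative loop can be sketched in a few lines of Python. This is a minimal illustration under assumptions of our own, not the authors’ implementation: the model interface (`predict_proba`, `retrain`), the entropy-based uncertainty score, the batch size, and the label-budget stopping criterion are all placeholders chosen for clarity.

```python
import math
from typing import Callable, List, Tuple

def entropy(probs: List[float]) -> float:
    """Prediction entropy: higher means the model is less certain."""
    return -sum(p * math.log(p + 1e-12) for p in probs)

def active_learning_loop(
    pool: List[str],                                           # unlabeled commodity names
    labeled: List[Tuple[str, str]],                            # seed set from the zero-shot phase
    predict_proba: Callable[[List[str]], List[List[float]]],   # assumed model interface
    retrain: Callable[[List[Tuple[str, str]]], None],          # fine-tune or rebuild few-shot prompt
    annotate: Callable[[str], str],                            # oracle (human annotator)
    batch_size: int = 32,
    max_rounds: int = 10,
    label_budget: int = 500,
) -> List[Tuple[str, str]]:
    """Sketch of the iterative loop: train, score the unlabeled pool,
    annotate the most uncertain items, and repeat until a stopping
    criterion (here, a simple label budget) is met."""
    pool = list(pool)
    for _ in range(max_rounds):
        retrain(labeled)
        if len(labeled) >= label_budget or not pool:
            break
        scores = [entropy(p) for p in predict_proba(pool)]
        ranked = sorted(range(len(pool)), key=scores.__getitem__, reverse=True)
        top = set(ranked[:batch_size])
        labeled += [(pool[i], annotate(pool[i])) for i in top]
        pool = [x for i, x in enumerate(pool) if i not in top]
    return labeled
```

Entropy is only one possible acquisition function; margin- or confidence-based scores would slot into the same loop without changing its structure.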

Learning From Unlabeled Data With LLMs

This work pioneers a learning framework, LAUD, that integrates large language models (LLMs) with active learning to derive task-specific LLMs from unlabeled datasets, effectively addressing the challenge of limited labeled data. The methodology overcomes the traditional cold-start problem of active learning by initializing the process with zero-shot learning from off-the-shelf LLMs, eliminating the need for pre-existing labeled sets or random sampling. Initially, the team employed zero-shot learning to predict labels for each data point, then collected annotations only for predictions exhibiting high confidence, ensuring a balanced initial labeled set without manual evaluation of every data point. The core of LAUD is an iterative active learning loop in which a task-specific LLM is derived in each iteration from the accumulated annotations, including those from the initial phase. The team either fine-tuned the LLM when computational resources allowed or leveraged in-context few-shot learning to transform a general LLM into a task-specific model. In each iteration, a stopping criterion was evaluated to determine whether the active learning loop could be terminated.
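A rough sketch of that cold-start initialization follows. The `zero_shot_classify` interface, the confidence threshold, and the per-class cap are illustrative assumptions introduced here; the paper’s exact selection and balancing criteria may differ.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

def build_seed_set(
    unlabeled: List[str],
    zero_shot_classify: Callable[[str], Tuple[str, float]],  # assumed: returns (label, confidence)
    confidence_threshold: float = 0.9,
    per_class_cap: int = 20,
) -> List[Tuple[str, str]]:
    """Keep only high-confidence zero-shot predictions, capped per class so
    the seed set stays roughly balanced; annotators then confirm or correct
    these candidates instead of reviewing the whole pool."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for item in unlabeled:
        label, confidence = zero_shot_classify(item)
        if confidence >= confidence_threshold and len(buckets[label]) < per_class_cap:
            buckets[label].append(item)
    return [(item, label) for label, items in buckets.items() for item in items]
```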

LLMs Learn From Unlabeled Data Efficiently

Scientists have developed a new learning framework, LAUD, which integrates large language models (LLMs) with active learning to create task-specific LLMs from unlabeled datasets. This work addresses a critical challenge in applying LLMs: the need for extensive labeled data, which is often expensive and time-consuming to obtain. LAUD overcomes this limitation by strategically selecting which data points require annotation, minimizing the overall annotation cost. The team demonstrated that LAUD effectively mitigates the “cold-start” problem, a common issue in active learning where initial performance is poor due to a lack of labeled examples.

The methodology begins by applying zero-shot learning with LLMs to create an initial labeled set from the unlabeled data, focusing on high-confidence predictions to ensure a balanced distribution. This initial set then fuels an iterative active learning loop in which a task-specific LLM is derived through fine-tuning or in-context learning. LAUD was further validated within a real-world ad-targeting system, where the task-specific LLMs it produced yielded substantial improvements in click-through rate (CTR), confirming the practical benefits of this approach. The result is a cost-effective method for adapting LLMs to specific tasks, reducing reliance on large labeled datasets and opening new possibilities for applying these powerful models in diverse applications.
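When fine-tuning is not feasible, the alternative described above is in-context few-shot learning. The sketch below shows one plausible way to fold the accumulated annotations into a classification prompt for commodity names; the prompt wording, the example count, and the `call_llm` placeholder are assumptions, not the authors’ exact setup.

```python
from typing import Callable, List, Tuple

def build_fewshot_prompt(
    examples: List[Tuple[str, str]],      # (commodity name, category) pairs from the labeled set
    categories: List[str],
    query: str,
    max_examples: int = 8,
) -> str:
    """Assemble a few-shot classification prompt from the accumulated annotations."""
    lines = [
        "Classify each commodity name into one of the categories: "
        + ", ".join(categories) + "."
    ]
    for name, category in examples[:max_examples]:
        lines.append(f"Commodity: {name}\nCategory: {category}")
    lines.append(f"Commodity: {query}\nCategory:")
    return "\n\n".join(lines)

def classify(query: str,
             examples: List[Tuple[str, str]],
             categories: List[str],
             call_llm: Callable[[str], str]) -> str:
    """Send the prompt to any text-completion LLM; call_llm is a placeholder."""
    return call_llm(build_fewshot_prompt(examples, categories, query)).strip()
```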

Language Models Learn From Limited Data

This research presents LAUD, a learning framework integrating large language models with active learning to improve performance on tasks with limited labeled data. The team mitigated the challenge of a “cold start” by initially leveraging the zero-shot learning capabilities of the language models. Experimental results demonstrate that language models refined using LAUD outperform those relying on traditional zero-shot or few-shot learning approaches, specifically in the context of commodity name classification. The impact of this work extends beyond laboratory evaluation, as demonstrated by a real-world deployment within an ad-targeting system.

Here, language models generated by LAUD achieved approximately a 50% relative improvement in click-through rates compared to existing keyword-based classifiers. This suggests a potential pathway for automating and scaling the annotation process, reducing reliance on human input. The authors acknowledge that further investigation is needed to understand how variations in data across different categories influence the interplay between language models, active learning, and the selection of optimal data for training. They also note the importance of exploring whether the scarcity of a category impacts the effectiveness of the proposed learning framework.

👉 More information
🗞 LAUD: Integrating Large Language Models with Active Learning for Unlabeled Data
🧠 ArXiv: https://arxiv.org/abs/2511.14738

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Acsa Enables Granular Sentiment Analysis with Zero-Shot Learning and Unified Meaning Representation across Domains

December 23, 2025
Sturm-liouville Operators Achieve Explicit Bounds for All Eigenfunction Nodes

December 23, 2025
Quantum Machine Learning Achieves Effective Unlearning across Iris, MNIST, and Fashion-MNIST Datasets

December 23, 2025