Kunal Kotian, Indranil Bhattacharya, Shikhar Gupta, Kaushik Pavani, Naval Bhandari, & Sunny Dasgupta (2023). TAAL: Target-aware active learning. In KDD 2023 Workshop on e-Commerce and NLP (ECNLP 6).
Traditional active learning techniques have proven effective at producing accurate classifiers with far fewer labels than random sampling. For instance, fraud detection teams building classifiers can use active learning to surface the suspicious events whose labels, once obtained, improve overall classifier performance the most. However, standard active learning methods have a blind spot: they overlook class-specific business targets. Consider a fraud detection classifier that does well at catching card-not-present fraud and fake accounts, but not account takeovers. This can happen when training data collection is agnostic to fraud type, and it is especially harmful if account takeovers cause more monetary harm than other fraud types. The problem arises because standard active learning methods do not care which class is being labeled, as long as labeling an unknown event improves the overall classifier. The result is low precision and recall on minority classes.
This is where target-aware active learning comes in. The framework turns any active learning strategy into a "target-aware" one by considering the gap between each class's current estimated accuracy and its corresponding business target. By sampling data points from fraud categories where the model is underperforming (e.g., account takeovers), target-aware active learning allocates annotation resources (investigators) efficiently while working toward the performance targets for all fraud types.
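To make the idea concrete, here is a minimal sketch of gap-weighted selection in Python. It is not the algorithm from the TAAL paper; the weighting rule, the entropy-based base score, the function names, and the toy data are illustrative assumptions.

```python
import numpy as np

def class_weights_from_targets(per_class_accuracy, targets):
    """Turn per-class performance gaps into sampling weights.

    Classes that fall short of their business target get proportionally
    more weight; classes at or above target keep only a small floor weight.
    """
    acc = np.asarray(per_class_accuracy, dtype=float)
    tgt = np.asarray(targets, dtype=float)
    gaps = np.clip(tgt - acc, 0.0, None)   # only shortfalls matter
    weights = gaps + 1e-3                  # small floor so no class is fully ignored
    return weights / weights.sum()

def target_aware_selection(probs, per_class_accuracy, targets, budget):
    """Select `budget` unlabeled points by combining uncertainty with class gaps.

    probs: (n_unlabeled, n_classes) predicted class probabilities.
    """
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # base uncertainty score
    predicted = probs.argmax(axis=1)                        # model's best guess per point
    w = class_weights_from_targets(per_class_accuracy, targets)
    scores = entropy * w[predicted]                         # boost points from lagging classes
    return np.argsort(-scores)[:budget]                     # indices to send for labeling

# Toy usage: class 2 (say, "account takeover") lags its target, so its points are favored.
rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[1, 1, 1], size=1000)
picked = target_aware_selection(
    probs,
    per_class_accuracy=[0.95, 0.92, 0.70],
    targets=[0.90, 0.90, 0.90],
    budget=20,
)
print(picked)
```

The key design choice in this sketch is that classes already at or above their target receive only a small floor weight, so nearly all of the labeling budget flows toward the lagging classes.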
Building Responsible AI
Ethical AI development requires fair and inclusive representation across all data classes. This is especially crucial in tasks like loan approvals, where underrepresented data points can lead to biased decisions with negative societal impacts.
Imagine an AI system used for loan approvals that classifies applicants into multiple categories based on loan type (e.g., mortgage, auto loan, small business loan). Let's say the system performs very well for common loan types requested by individuals with high credit scores and stable financial backgrounds (represented by the majority class in the training data). However, the system underperforms for less frequent loan types, like small business loans requested by first-time entrepreneurs or minority-owned businesses (represented by the minority class).
An AI-based loan approver that performs poorly on less frequent loan types can have negative societal implications, limiting access to capital for minority groups and eroding aspiring entrepreneurs' trust in the financial system. Because AI systems are often opaque, it is also difficult for loan applicants to understand why their application was rejected. A multi-class loan approval classifier that underperforms on a minority class can therefore raise serious ethical concerns.
Target-Aware Active Learning: A Solution
Target-aware active learning offers a powerful solution to this challenge. By actively prioritizing underrepresented classes for human labeling, it mitigates bias in data selection and produces a more balanced training set that does not favor common patterns over potentially risky but less frequent ones. The result is a system that learns from a broader range of data points, promoting fairer and more inclusive outcomes in loan approvals and many other applications.
Q/A Session With Shikhar
Q: How does target-aware active learning improve traditional uncertainty sampling in imbalanced datasets?
A: Traditional uncertainty sampling (a common active learning approach) focuses on identifying the most confusing samples overall, neglecting the specific classes that would benefit most from labeling. Target-aware active learning addresses this by considering the target precision for each class and prioritizing confusing samples that are predicted to help the lagging classes, so that those classes improve with a minimum of additional labeled data. This lifts the model's performance on under-represented classes in imbalanced data. A further advantage of the framework is that it is plug-and-play with any active learning algorithm. Its greatest benefit lies in settings with constrained annotation budgets: by prioritizing informative samples for under-represented classes, it aims to reach good performance with fewer labels, reducing the overall annotation cost.
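To illustrate the plug-and-play aspect, the following hedged sketch splits the labeling budget across classes in proportion to each class's shortfall against its target and lets an arbitrary base strategy rank points within each class. The margin-based scorer, the allocation rule, and all names here are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def margin_uncertainty(probs):
    """One example base strategy: smaller margin between top-2 classes = more uncertain."""
    part = np.sort(probs, axis=1)
    return 1.0 - (part[:, -1] - part[:, -2])

def target_aware_wrap(base_score_fn, probs, per_class_accuracy, targets, budget):
    """Split the labeling budget across classes by how far each lags its target,
    then let the base strategy pick the most informative points within each class."""
    gaps = np.clip(np.asarray(targets) - np.asarray(per_class_accuracy), 0.0, None) + 1e-3
    alloc = np.floor(budget * gaps / gaps.sum()).astype(int)  # per-class budget
    alloc[np.argmax(gaps)] += budget - alloc.sum()            # leftover goes to the worst class
    scores = base_score_fn(probs)
    predicted = probs.argmax(axis=1)
    chosen = []
    for c, k in enumerate(alloc):
        idx = np.where(predicted == c)[0]          # points the model assigns to class c
        top = idx[np.argsort(-scores[idx])[:k]]    # base strategy ranks within the class
        chosen.extend(top.tolist())
    return chosen

# Any base scorer (entropy, margin, gradient-embedding scores, ...) can be dropped in unchanged.
rng = np.random.default_rng(1)
probs = rng.dirichlet([2, 2, 2], size=500)
picked = target_aware_wrap(margin_uncertainty, probs,
                           per_class_accuracy=[0.93, 0.88, 0.65],
                           targets=[0.90, 0.90, 0.90], budget=30)
print(len(picked), picked[:10])
```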
Q: Beyond classification, what other potential applications are there for this framework?
A: The framework is generally applicable to any scenario where data collection is expensive and there is an inherent imbalance across the classes of interest. This could be relevant in various domains beyond what we discussed in the paper.
- Medical diagnosis: In medical imaging tasks like chest X-ray analysis, certain diseases are much rarer than others. Standard active learning can improve the overall accuracy of detecting a disease from an X-ray, but it might neglect rare diseases. Target-aware active learning can help prioritize these rare diseases for radiologist review.
- Large language models (LLMs): During the "human-in-the-loop" phase of LLM training, where humans refine the model's outputs, data collection can be expensive. Target-aware active learning can be used to prioritize the selection of prompts for human evaluation, focusing on those that would benefit the model's performance across all categories. For instance, in "LAMM: Language-aware active learning for multilingual models," the researchers show that a similar active learning strategy can improve accuracy on low-resource languages with limited training data, helping ensure multilingual models perform well on all languages of operation.
- Fraud detection: Fraudulent events are inherently rare compared to legitimate transactions. Target-aware active learning could be applied to select the most informative transactions for human review, improving fraud detection accuracy for each fraud type while reducing the cost of manual investigation.
- Ethical considerations in training data: The framework could be adapted to incorporate ethical dimensions into data selection. By considering factors like potential biases in the training data, it could help ensure fairer and more responsible model development.
Q: Is there a publicly available library for the target-aware active learning framework?
A: Shikhar's team did not release a public library for their specific framework. However, they point to existing repositories such as distil (https://github.com/decile-team/distil) that offer implementations of base active learning algorithms. This publicly available resource can serve as a foundation for research and experimentation with target-aware approaches.
Q: Does the potential of using large language models (LLMs) as annotators render active learning obsolete?
A: There is an interesting dynamic between LLMs and active learning:
- Active learning aims to reduce human annotation costs by selecting the most informative data points for labeling.
- Using LLMs as annotators has been explored in several research works, including "LLMaAA: Making Large Language Models as Active Annotators" and "AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators." If LLMs can provide high-quality labels, they could potentially eliminate the need for human annotation altogether; in that scenario, the limiting factor becomes the computational cost of running LLM inference rather than the annotation budget.
However, there are challenges:
- Current LLMs are not perfect annotators and their accuracy can vary across different tasks.
- There is ongoing research exploring the use of active learning at the foundational model stage (training on internet data) to improve the subsequent fine-tuning process with human annotations or LLM-generated labels.
Therefore, while LLMs offer an exciting prospect for reduced human effort, active learning might still play a role in specific stages or when LLM performance needs improvement. The field is expected to see further advancements and research in the coming months, potentially reshaping the landscape of LLM training.
Q: Concluding our discussion, what broader societal implication surrounding active learning would you like to emphasize?
A: The need for responsible data collection practices in the field. While the technical aspects of active learning are crucial, it's equally important to consider its broader societal implications. We must be mindful of the potential biases inherent in datasets and how they can lead to unfair outcomes for certain groups. By actively addressing these biases during data collection and development, we can ensure that active learning contributes to a more equitable and ethical future for all.
As we wrap up this edition, I'd like to thank you all for being readers. In the coming months, I'll be starting to share related job opportunities in the industry and highlighting products and tools you can use in your everyday work. Stay tuned!
If you found this discussion informative, please consider forwarding it to your teammates, friends, or anyone you think might benefit from it. Sharing knowledge is key to fostering a vibrant and collaborative community!