AI Now Protects Your Questions Without Losing Crucial Information

Researchers are increasingly focused on identifying personally identifiable information (PII) in user queries to protect privacy in question-answering systems. Mariia Ponomarenko, Sepideh Abedini, and Masoumeh Shafieinejad of the University of Waterloo and the Vector Institute, together with D. B. Emerson, Shubhankar Mohapatra, Xi He, and colleagues, present CAPID, a novel approach that addresses the limitations of current PII redaction techniques. This work is significant because it moves beyond blanket PII removal: the system discerns which PII is contextually relevant, improving response quality while safeguarding user privacy. The team achieves this through a fine-tuned small language model (SLM) and a newly generated synthetic dataset designed to capture the nuanced relevance of PII, demonstrably outperforming existing methods in accuracy and downstream utility.

Contextual PII redaction using synthetic data for improved question answering

Researchers have developed a new approach to protecting user privacy in question-answering systems by selectively redacting personally identifiable information (PII). Current privacy tools often remove all PII, which can degrade the quality of responses; this work introduces CAPID, a system designed to discern which PII is contextually relevant to a user’s query.
CAPID employs a locally hosted small language model (SLM) to filter sensitive information before it reaches the larger language models used to generate answers, ensuring data remains within a secure environment. A key challenge in building such a system is the lack of suitable training data: existing datasets fail to capture the nuanced, context-dependent relevance of PII.

To overcome this limitation, the team constructed a synthetic dataset leveraging the capabilities of large language models to generate a diverse and comprehensive collection of examples. This dataset spans multiple PII types and varying levels of relevance, enabling the fine-tuning of an SLM to accurately detect PII spans, classify their types, and crucially, assess their contextual importance.
Experiments demonstrate that this relevance-aware PII detection, powered by the fine-tuned SLM, significantly surpasses existing baseline methods in span, relevance, and type accuracy. Notably, the research reveals that anonymization using CAPID preserves a substantially higher degree of downstream utility compared to traditional methods.

Evaluations, conducted using an LLM-as-a-judge approach and real user queries from Reddit, confirm that retaining contextually relevant PII leads to more informative and accurate responses. The team has made both the dataset and the model openly available, facilitating further research and development in privacy-preserving question-answering technologies. This innovation promises to balance robust privacy protection with the delivery of high-quality, contextually appropriate answers in conversational AI systems.

Construction and application of a context-aware personally identifiable information dataset

A synthetic data generation pipeline underpinned this work, leveraging large language models to create a diverse dataset encompassing multiple PII types and relevance levels. This pipeline addressed the existing gap in resources for training models to discern context-dependent PII relevance. The generated dataset, named CAPID, was specifically designed to support fine-tuning and evaluation of context-aware models capable of reasoning about both the presence and appropriate masking of PII within question-answering tasks.

Researchers then fine-tuned small language models, specifically Llama-3.1-8B and Llama-3.2-3B, to detect PII spans, classify their types, and estimate contextual relevance using the CAPID dataset. The models were trained to identify not only whether a span contained PII, but also to assess its relevance to the user’s question.
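The three-part prediction task can be pictured with a small output schema; the field names and types below are illustrative assumptions, not the paper's actual format:

```python
from dataclasses import dataclass


@dataclass
class PIISpan:
    text: str       # surface form of the detected span
    start: int      # character offset into the query
    end: int
    pii_type: str   # e.g. "NAME", "LOCATION", "MEDICAL_CONDITION"
    relevant: bool  # does answering the question require this attribute?


def parse_detection(query: str, raw_spans: list[dict]) -> list[PIISpan]:
    """Convert raw model output (a list of dicts) into typed spans,
    dropping any span whose offsets do not match the query text."""
    spans = []
    for s in raw_spans:
        if query[s["start"]:s["end"]] == s["text"]:
            spans.append(PIISpan(s["text"], s["start"], s["end"],
                                 s["pii_type"], s["relevant"]))
    return spans
```

Validating offsets against the query text guards against the hallucinated or shifted spans that generative extractors sometimes produce.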

Performance was evaluated by comparing the fine-tuned SLMs against GPT-4.1-mini on PII relevance classification, where accuracy improved from 0.68 to 0.79. To assess downstream utility, experiments were conducted on real user queries collected from Reddit. These queries were subjected to different masking strategies, and the resulting LLM-generated answers were scored using an LLM-as-a-judge approach.
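An LLM-as-a-judge evaluation of this kind can be sketched as follows; the rubric wording and the 1–5 scale are illustrative assumptions, not the study's actual template:

```python
import re


def judge_prompt(question: str, answer: str) -> str:
    """Build a rubric-style prompt asking a judge model to rate an
    answer's utility on a 1-5 scale."""
    return (
        "Rate how useful and accurate the following answer is for the "
        "question, from 1 (useless) to 5 (fully addresses it). "
        "Reply with only the number.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )


def parse_score(judge_output: str) -> int:
    """Extract the first digit 1-5 from the judge model's reply."""
    match = re.search(r"[1-5]", judge_output)
    if match is None:
        raise ValueError("judge reply contained no score")
    return int(match.group())
```

Averaging the parsed scores per masking strategy yields the kind of utility comparison the study reports.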

This methodology allowed for a quantitative comparison of how relevance-aware anonymization, achieved with the fine-tuned SLM, preserved more downstream answer utility compared to existing anonymization baselines. The resulting code and dataset were made openly available to facilitate further research in this area.

Relevance-enhanced PII detection via synthetic data and small language model fine-tuning

A new approach to privacy-preserving personally identifiable information (PII) detection achieves substantial improvements in span, relevance, and type accuracy. This work introduces CAPID, a system that fine-tunes a small language model (SLM) to filter sensitive information before it is processed by larger language models for question answering.

Experiments demonstrate that relevance-aware PII detection using the fine-tuned SLM significantly outperforms existing baseline methods while maintaining higher downstream utility after anonymization. The research addresses a critical gap in existing datasets by proposing a synthetic data generation pipeline.

This pipeline leverages larger language models to create a diverse and domain-rich dataset encompassing multiple PII types and varying levels of relevance. The generated dataset was then used to train the SLM to not only detect PII spans and classify their types, but also to estimate contextual relevance, a key component of the improved performance.

Specifically, the fine-tuned SLM demonstrates superior performance in accurately identifying PII spans, classifying their types, and assessing their contextual relevance. This capability allows for more nuanced anonymization, preserving information crucial for response quality in question-answering systems.

The synthetic data generation process utilizes parameters such as a temperature of 1.0 and top_p of 1.0 during model interactions via the OpenAI Responses API. The study employed gpt-5-chat-latest for synthetic data generation and reasoning evaluations, and gpt-4.1-mini for question answering during utility evaluation.
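A minimal sketch of those settings with the OpenAI Python SDK might look like this; the prompt text is a placeholder, not the paper's template:

```python
def generation_request(prompt: str, model: str = "gpt-5-chat-latest") -> dict:
    """Assemble keyword arguments for client.responses.create(...),
    using the sampling settings reported in the study."""
    return {
        "model": model,
        "input": prompt,
        "temperature": 1.0,
        "top_p": 1.0,
    }


# Actual call (requires the openai package and an OPENAI_API_KEY):
# from openai import OpenAI
# client = OpenAI()
# response = client.responses.create(**generation_request("Generate a topic"))
# print(response.output_text)
```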

Prompt templates were designed to generate topics, subtopics, and realistic situations incorporating PII, ensuring the creation of a comprehensive and challenging dataset for training and evaluation. The generated situations specifically include “I” statements to simulate personal experiences and maintain a sentence length between 20 and 35 words.
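Those two constraints are straightforward to check mechanically; a sketch, assuming a simple whitespace word count:

```python
import re


def valid_situation(text: str) -> bool:
    """Check the constraints described above: the situation contains a
    first-person 'I' statement and runs 20-35 words."""
    word_count = len(text.split())
    if not 20 <= word_count <= 35:
        return False
    # "I" as a standalone word; the boundary also matches contractions
    # such as "I'm" and "I've"
    return re.search(r"\bI\b", text) is not None
```

A validator like this can filter generated situations before they enter the training set.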

Selective PII masking enhances utility in downstream natural language processing

A context-aware approach to personally identifiable information (PII) detection and anonymization has been developed, addressing limitations in existing systems that indiscriminately mask all personal information. This method models both the identification of PII spans and an assessment of whether each attribute is essential for downstream task performance, enabling selective preservation of high-relevance information while masking unnecessary data.
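In code, the selective-preservation step might look like the sketch below, assuming detected spans arrive as dicts carrying character offsets, a type label, and a binary relevance flag (an assumed format, not the paper's exact schema):

```python
def selective_mask(query: str, spans: list[dict]) -> str:
    """Replace each irrelevant PII span with a type placeholder while
    leaving relevant spans intact."""
    # Process spans right-to-left so earlier character offsets stay valid.
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        if not s["relevant"]:
            query = query[:s["start"]] + f"[{s['pii_type']}]" + query[s["end"]:]
    return query
```

Only the masked query is forwarded to the answering model, so irrelevant attributes never leave the local environment.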

Experiments using both synthetic and naturally occurring data from Reddit demonstrate substantial performance improvements in span detection, type assignment, and relevance classification when employing a fine-tuned language model compared to established baselines. Relevance-aware masking consistently yields higher answer utility than complete anonymization, indicating that retaining contextually important PII significantly improves model performance when that information is needed to complete the downstream task accurately.

The system achieves a 22% improvement in utility on Reddit and a 28% improvement on a dedicated CAPID test set, demonstrating its effectiveness in preserving information crucial for question-answering. Current models encounter difficulties with very long sequences, and the accuracy of relevance estimation diminishes as context length and information density increase.

While techniques like chunking or summarization may offer mitigation, enhancing long-context reasoning remains a key area for future development. Furthermore, relevance prediction is more accurate when questions contain linguistic cues indicating informational needs, and becomes more challenging in neutral formulations lacking such hints.

The framework currently operates in a domain-agnostic manner, and adapting it to incorporate domain-specific rules and expert knowledge would broaden its applicability in specialised fields. Finally, while the system reduces PII exposure to large language models, it currently employs binary sensitivity scoring, and extending this to continuous scores could further limit privacy leakage.
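One natural way to realize the continuous-score extension mentioned above is a tunable masking threshold; this sketch assumes each span carries a relevance score in [0, 1], which is an illustrative design rather than part of the paper:

```python
def mask_by_score(query: str, spans: list[dict],
                  keep_threshold: float = 0.5) -> str:
    """Mask spans whose relevance score falls below the threshold;
    higher-scoring spans are kept verbatim."""
    # Process spans right-to-left so earlier character offsets stay valid.
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        if s["score"] < keep_threshold:
            query = query[:s["start"]] + f"[{s['pii_type']}]" + query[s["end"]:]
    return query
```

Raising `keep_threshold` masks more aggressively, turning the privacy-utility trade-off into a single dial.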

👉 More information
🗞 CAPID: Context-Aware PII Detection for Question-Answering Systems
🧠 ArXiv: https://arxiv.org/abs/2602.10074

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
