LADFA Achieves Automated Privacy Policy Analysis of Personal Data Flows with Large Language Models

Understanding how organisations handle personal data remains a significant challenge for individuals, despite the crucial role privacy policies play in outlining these practices. Haiyue Yuan, Nikolay Matyunin, Ali Raza, and Shujun Li, from the University of Kent and Honda Research Institute Europe, address this issue with a novel framework called LADFA. Their research introduces a method for automatically analysing privacy policies, extracting complex data flows and representing them visually as a graph. This work advances the field by integrating large language models with retrieval-augmented generation and a specialised knowledge base, offering a more comprehensive and accurate approach than previous attempts. The resulting LADFA framework promises to facilitate deeper insights into data handling practices and is adaptable for use beyond the specific context of privacy policy analysis.

Extracting Data Flows From Privacy Policies

Privacy policies are fundamental to informing individuals about how organisations process their personal data, detailing collection, storage, and sharing practices. However, these policies are often difficult to understand because of complex legal language and inconsistencies between sectors and organisations. To enable automated, large-scale analysis, researchers are increasingly turning to machine learning and natural language processing techniques, particularly large language models (LLMs). This work contributes LADFA, an end-to-end computational framework designed to extract personal data flows from privacy policies and construct comprehensive data flow graphs.

The research team combines LLMs with retrieval-augmented generation (RAG) and a curated knowledge base derived from existing studies. LADFA works in three stages: a pre-processor prepares the unstructured text of a privacy policy, an LLM-based processor extracts key data flow information, and a post-processor constructs a data flow graph for detailed analysis. The approach goes beyond simple extraction: it aims to identify the actors involved, the attributes of the data, and the governing transmission principles, which are the essential components of a data flow under the established privacy theory of contextual integrity. The analysis therefore considers not only what data is collected and shared, but also with whom and under what conditions.
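The three stages can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `DataFlow` schema, the function names, and the sample sentence are all hypothetical, and a regular expression stands in for the LLM-based processor so that the example runs without any model.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFlow:
    """One personal data flow in contextual-integrity terms (hypothetical schema)."""
    sender: str
    recipient: str
    data_type: str
    principle: str  # transmission principle, e.g. a purpose or condition


def pre_process(policy_text: str) -> list[str]:
    """Stage 1: split unstructured policy text into sentence-like segments."""
    parts = re.split(r"(?<=[.!?])\s+", policy_text.strip())
    return [p for p in parts if p]


# Assumption: this pattern replaces the LLM call purely for illustration.
_SHARE_PATTERN = re.compile(
    r"We share your (?P<data>[\w ]+?) with (?P<recipient>[\w ]+?) "
    r"(?:for|to) (?P<principle>[\w ]+)\."
)


def extract_flows(segment: str) -> list[DataFlow]:
    """Stage 2 stand-in: pull sender, recipient, data type and principle from a segment."""
    return [
        DataFlow("organisation", m["recipient"], m["data"], m["principle"])
        for m in _SHARE_PATTERN.finditer(segment)
    ]


def post_process(flows: list[DataFlow]) -> dict[tuple[str, str], list[DataFlow]]:
    """Stage 3: group flows by (sender, recipient) edge, ready for graph construction."""
    edges: dict[tuple[str, str], list[DataFlow]] = {}
    for flow in flows:
        edges.setdefault((flow.sender, flow.recipient), []).append(flow)
    return edges


def run_pipeline(policy_text: str) -> dict[tuple[str, str], list[DataFlow]]:
    """Chain the three stages over one policy text."""
    flows = [f for seg in pre_process(policy_text) for f in extract_flows(seg)]
    return post_process(flows)


if __name__ == "__main__":
    sample = (
        "We share your location data with insurance partners for fraud prevention. "
        "We value your privacy."
    )
    for edge, flows in run_pipeline(sample).items():
        print(edge, [(f.data_type, f.principle) for f in flows])
```

Keeping each stage behind its own function mirrors the modular design the article describes: the extraction stage could be swapped for a real model call without touching segmentation or graph construction.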

Experiments demonstrate LADFA’s effectiveness and accuracy through a case study examining ten privacy policies from the automotive industry, a sector increasingly reliant on data collection from connected vehicles. The framework’s ability to construct data flow graphs facilitates insight discovery, offering a clearer understanding of an organisation’s data handling practices. LADFA distinguishes itself from prior work by integrating LLMs with RAG and a custom knowledge base, enabling the automated extraction of comprehensive data flows aligned with contextual integrity principles. This research establishes a robust and flexible framework, capable of analysing complex privacy policies and providing valuable insights into data processing practices. Beyond privacy policy analysis, the adaptable design of LADFA positions it as a versatile tool for a wide range of text-based analytical tasks, opening possibilities for applications in legal compliance, data governance, and security auditing.

LADFA Framework for Privacy Policy Data Flow Analysis

The research team engineered LADFA, an end-to-end computational framework designed to dissect and analyse privacy policies, moving beyond simple classification to detailed data flow analysis. This work addresses the challenge of comprehending lengthy and complex legal language by combining the power of large language models (LLMs) with Retrieval-Augmented Generation (RAG) and a bespoke knowledge base. Scientists developed this knowledge base by synthesising findings from multiple existing studies, enabling a nuanced understanding of data handling practices detailed within the policies. LADFA operates through a three-stage process beginning with a pre-processor that prepares unstructured text from privacy policies for analysis.
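The pairing of an LLM with retrieval over a knowledge base can be illustrated as follows. This is a sketch under stated assumptions: the knowledge-base entries are invented for the example, token overlap (Jaccard similarity) stands in for whatever similarity measure a real RAG setup would use, and the prompt wording is hypothetical rather than taken from the paper.

```python
import re

# Hypothetical knowledge-base entries; the actual LADFA knowledge base is not reproduced here.
KNOWLEDGE_BASE = [
    {"label": "location data", "text": "GPS position, route history and vehicle location"},
    {"label": "contact data", "text": "name, email address, phone number and postal address"},
    {"label": "driving behaviour", "text": "speed, braking, acceleration and trip duration"},
]


def tokens(text: str) -> set[str]:
    """Lower-case alphabetic tokens of a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, k: int = 2) -> list[dict]:
    """Return the k entries with the highest Jaccard overlap with the query.

    Assumption: token overlap replaces embedding similarity for illustration.
    """
    q = tokens(query)

    def score(entry: dict) -> float:
        e = tokens(entry["label"] + " " + entry["text"])
        union = q | e
        return len(q & e) / len(union) if union else 0.0

    return sorted(KNOWLEDGE_BASE, key=score, reverse=True)[:k]


def build_prompt(segment: str, k: int = 2) -> str:
    """Assemble a prompt that places retrieved entries before the policy segment."""
    context = "\n".join(f"- {e['label']}: {e['text']}" for e in retrieve(segment, k))
    return (
        "Known data categories:\n"
        f"{context}\n\n"
        "Policy segment:\n"
        f"{segment}\n\n"
        "List each personal data flow as sender, recipient, data type, "
        "transmission principle."
    )


if __name__ == "__main__":
    print(build_prompt("We collect the GPS position and speed of your vehicle"))
```

The point of the retrieval step is that the model sees only the knowledge-base entries relevant to the segment at hand, which keeps prompts short and grounds the extracted categories in a shared vocabulary.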

The core of the system is an LLM-based processor that extracts personal data flows, identifying what data is collected, with whom it is shared, and for what purposes. The extracted information is then passed to a data flow post-processor, which constructs a personal data flow graph that visually represents the movement of information. The team validated the framework's effectiveness through a case study examining ten privacy policies from the automotive industry. A key methodological innovation lies in the customisation of the knowledge base, which differentiates this study from approaches that rely on a single dataset such as OPP-115.
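To make the post-processing step concrete, the sketch below builds a small directed graph from extracted flows and runs two simple queries over it. The flow tuples, party names, and function names are hypothetical examples, not outputs from the study, and the plain-dictionary graph is an assumption chosen to avoid external dependencies.

```python
from collections import defaultdict

# Hypothetical extracted flows: (sender, recipient, data type, purpose).
FLOWS = [
    ("user", "manufacturer", "location data", "navigation"),
    ("user", "manufacturer", "contact data", "account management"),
    ("manufacturer", "insurer", "location data", "risk assessment"),
    ("manufacturer", "analytics provider", "driving behaviour", "product improvement"),
]


def build_graph(flows):
    """Directed multigraph: sender -> list of (recipient, data type, purpose) edges."""
    graph = defaultdict(list)
    for sender, recipient, data_type, purpose in flows:
        graph[sender].append((recipient, data_type, purpose))
    return graph


def recipients_of(graph, data_type):
    """All parties that receive the given data type on any edge."""
    return sorted(
        {r for edges in graph.values() for r, d, _ in edges if d == data_type}
    )


def downstream(graph, start, data_type):
    """Parties reachable from `start` by following only edges that carry `data_type`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for recipient, d, _ in graph.get(node, []):
            if d == data_type and recipient not in seen:
                seen.add(recipient)
                stack.append(recipient)
    return sorted(seen)


if __name__ == "__main__":
    g = build_graph(FLOWS)
    print(recipients_of(g, "location data"))
    print(downstream(g, "user", "contact data"))
```

A reachability query of this kind is one way a graph view can surface onward sharing that is easy to miss when reading a policy sentence by sentence.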

Researchers curated this resource by integrating taxonomies and insights from diverse sources, giving the LLM broad coverage of privacy regulations and data practices. According to the study, this approach enables more accurate analysis of modern privacy policies, which often differ from the data used to train existing models. The system's flexibility allows it to be applied to a range of text-based analysis tasks beyond privacy policy evaluation. Among related work, one automated completeness-checking approach identified 45 of 47 incompleteness issues against the General Data Protection Regulation (GDPR) in a corpus of 24 privacy policies. Other recent work on LLM-based privacy policy concept classifiers, using both prompt engineering and LoRA fine-tuning, achieved high performance; LADFA extends beyond classification to reconstruct data flows. By focusing on the relationships between data elements, LADFA provides a more granular and actionable understanding of privacy practices, supporting insight discovery and potentially helping to identify legal or reputational risks.

LADFA Extracts Data Flows From Policies

Scientists have developed LADFA, a novel end-to-end computational framework designed for automated analysis of privacy policies. The work centres on extracting personal data flows and constructing comprehensive data flow graphs, utilising large language models (LLMs) combined with retrieval-augmented generation (RAG) and a customised knowledge base. This framework consists of a pre-processor, an LLM-based processor, and a data flow post-processor, enabling detailed examination of unstructured text within privacy policies. Researchers demonstrated the effectiveness of LADFA through a case study involving ten privacy policies from the automotive industry, focusing on connected-vehicle mobile applications from different original equipment manufacturers.

Three domain experts validated LADFA's outputs, and their assessments showed high inter-rater reliability. Gwet's AC1 scores reached 0.94 for identifying data types and 0.96 for identifying data flows, while percentage agreement measured 0.82 and 0.86 respectively. These measurements support LADFA's capability to process unstructured text accurately and to extract comprehensive data flows from complex privacy documentation. Average ratings on a 7-point Likert scale fell between 6 and 7 across most evaluation tasks, indicating strong expert agreement with the framework's results.
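For readers unfamiliar with the statistic, the sketch below computes Gwet's AC1 for items rated by several raters, using the standard definition AC1 = (p_a - p_e) / (1 - p_e), where p_a is the average share of agreeing rater pairs and p_e = (1 / (Q - 1)) * sum of pi_q * (1 - pi_q) over the Q categories. The ratings are invented for illustration and are not the study's data. The article does not say how percentage agreement was defined, so the `full_agreement` helper (all raters identical) is an assumption.

```python
from collections import Counter


def gwet_ac1(ratings: list[list[str]]) -> float:
    """Gwet's AC1 for items each rated by two or more raters (no missing ratings).

    Assumption: the category set is taken from the observed ratings.
    """
    categories = sorted({c for item in ratings for c in item})
    q = len(categories)
    if q < 2:
        return 1.0  # assumption: a single observed category counts as full agreement
    n = len(ratings)
    agree = 0.0
    prevalence = Counter()
    for item in ratings:
        r = len(item)
        counts = Counter(item)
        # Share of rater pairs that agree on this item.
        agree += sum(c * (c - 1) for c in counts.values()) / (r * (r - 1))
        # Accumulate the per-item proportion of ratings in each category.
        for cat, c in counts.items():
            prevalence[cat] += c / r
    p_a = agree / n
    p_e = sum(
        (prevalence[c] / n) * (1 - prevalence[c] / n) for c in categories
    ) / (q - 1)
    return (p_a - p_e) / (1 - p_e)


def full_agreement(ratings: list[list[str]]) -> float:
    """Share of items on which every rater chose the same category."""
    return sum(len(set(item)) == 1 for item in ratings) / len(ratings)


if __name__ == "__main__":
    # Invented ratings: four items, three raters each.
    example = [
        ["yes", "yes", "yes"],
        ["yes", "yes", "no"],
        ["no", "no", "no"],
        ["yes", "yes", "yes"],
    ]
    print(round(gwet_ac1(example), 2), full_agreement(example))
```

On the invented ratings above, AC1 comes out at about 0.70 and full agreement at 0.75. AC1 is often preferred over kappa-type statistics when one category dominates, because its chance-agreement term stays small in that situation.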

Data flow graph analysis, facilitated by LADFA, delivered insights into privacy- and security-related issues often overlooked in manual reviews. The framework's ability to construct and analyse these graphs allows for a deeper understanding of how personal data is collected, shared, and utilised. The result is a flexible and customisable tool suitable for a range of text-based analysis tasks beyond privacy policy analysis. The research advances the automation of understanding complex legal documents and uncovering critical data handling practices.

LADFA Automates Privacy Policy Data Flow Analysis

This work details the development of LADFA, an end-to-end framework designed to automate the analysis of privacy policies using large language models and retrieval-augmented generation. The framework effectively processes unstructured text, extracts personal data flows, and constructs data flow graphs to reveal insights regarding data handling practices. Through a case study involving ten privacy policies from the automotive industry, researchers demonstrated LADFA’s ability to accurately understand complex legal language and generate comprehensive data flows. LADFA’s significance lies in its potential to assist consumers in comprehending the often-opaque details of privacy policies, a task typically requiring considerable time and effort.

The framework’s modular design, incorporating a pre-processor, LLM-based processor, and data flow post-processor, allows for flexibility and adaptation to various text-based analysis tasks beyond privacy policy examination. The authors acknowledge limitations inherent in relying on LLMs, noting potential inaccuracies or biases in the generated outputs, and highlight the need for ongoing validation and refinement. Future research will focus on applying LADFA to diverse document types and analytical tasks, expanding its utility across different domains. The framework’s adaptable components, such as the text segmentation tool and knowledge bases, facilitate customisation for specific needs. This work represents a substantial contribution to the field of automated privacy analysis, offering a promising tool for enhancing transparency and accountability in data processing practices.

👉 More information
🗞 LADFA: A Framework of Using Large Language Models and Retrieval-Augmented Generation for Personal Data Flow Analysis in Privacy Policies
🧠 ArXiv: https://arxiv.org/abs/2601.10413

Rohail T.
