Semantic Clustering of Civic Proposals Enables Actionable Data from Citizen Input at Scale

Governments increasingly rely on digital platforms to gather citizen input, yet effectively organising this vast volume of contributions remains a major challenge. Ronivaldo Ferreira from Universidade Federal do Pará, Guilherme da Silva and Carla Rocha from Universidade de Brasília, along with Gustavo Pinto, address this problem by presenting a new method for automatically clustering civic proposals submitted through Brazil’s National Participation Platform. Their approach combines a powerful language model with carefully chosen seed words and an automated validation process, allowing the system to generate coherent and relevant topics with minimal human intervention. This achievement transforms unstructured citizen input into actionable data, offering governments a scalable solution for incorporating public perspectives into public policy and maximising the value of digital participation initiatives.

Topic Modeling of Citizen Governance Proposals

This research addresses the challenge of automatically classifying and understanding public proposals submitted through Brasil Participativo, a Brazilian participatory governance platform. The goal is to transform these citizen contributions into actionable insights for policymakers. The core problem lies in efficiently processing a large volume of text data, identifying key themes, and aligning those themes with existing governmental structures and priorities. To do this, the researchers employed topic modeling, specifically the BERTopic technique. They experimented with several pre-trained language models as sources of text embeddings, including BERTimbau, LaBSE, LiBERT-SE, and GovBERT-BR, a model trained specifically on Brazilian governmental data.

A crucial aspect of the methodology involved incorporating domain knowledge through the use of seed words, keywords extracted from a governmental vocabulary, to guide the topic modeling process and ensure alignment with existing categories. Large Language Models (LLMs) were also used to assist in validating and labeling the generated topics, reducing the need for manual effort. The results demonstrate that the choice of embedding significantly impacts the quality of the topic modeling, with GovBERT-BR showing particularly promising results. Incorporating seed words from the governmental vocabulary proved crucial for aligning the generated topics with existing structures and ensuring relevance.
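The idea of steering topics with seed words can be illustrated with a deliberately small sketch. It is not the authors' pipeline and does not use BERTopic: it swaps dense embeddings for plain word counts, and the category names and seed words are hypothetical placeholders rather than terms from the governmental vocabulary. The article does not say which BERTopic mechanism the authors used to inject seed words.

```python
import math
from collections import Counter

# Hypothetical seed words per institutional category (illustrative only;
# the real vocabulary comes from the government's own taxonomy).
SEED_TOPICS = {
    "health": ["hospital", "health", "vaccine"],
    "education": ["school", "teacher", "university"],
}


def tokenize(text):
    """Lowercase whitespace tokenizer that strips basic punctuation."""
    tokens = (word.strip(".,;:!?").lower() for word in text.split())
    return [token for token in tokens if token]


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_seeded_topic(text, seed_topics=SEED_TOPICS):
    """Return (category, score) for the closest seed list, or (None, 0.0)."""
    doc = Counter(tokenize(text))
    scores = {name: cosine(doc, Counter(words))
              for name, words in seed_topics.items()}
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] > 0 else (None, 0.0)
```

A proposal that shares no vocabulary with any seed list comes back as `(None, 0.0)`. In a real system such a proposal would be left for the unsupervised clustering step rather than forced into a category.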

LLMs effectively reduced the manual effort required for topic validation and labeling. The team successfully developed a functional pipeline that can automatically classify public proposals and provide actionable insights for policymakers. This research offers a practical solution for processing citizen contributions and improving the efficiency of participatory governance. The automated pipeline can handle a large volume of text data, making it suitable for large-scale participatory platforms. The findings highlight the importance of using domain-specific language models for tasks involving specialized terminology.
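One way to picture LLM-assisted validation is a function that asks a model whether a candidate label fits a topic's keywords. The sketch below is an assumption about how such a check could be wired, not the authors' prompt or code. The prompt wording is invented, and `ask_llm` is a hypothetical callable standing in for whatever model client is used, so no real LLM API is invoked here.

```python
def build_validation_prompt(keywords, label):
    """Compose a yes/no question about a topic label (hypothetical wording)."""
    return (
        "Topic keywords: " + ", ".join(keywords) + "\n"
        f"Proposed label: {label}\n"
        "Does the label describe the keywords? Answer YES or NO."
    )


def validate_topic(keywords, label, ask_llm):
    """Return True if the model's reply starts with YES.

    `ask_llm` is any callable that takes a prompt string and returns the
    model's text reply; it is injected so the check stays model-agnostic.
    """
    answer = ask_llm(build_validation_prompt(keywords, label))
    return answer.strip().upper().startswith("YES")
```

Passing the model in as a plain callable keeps the check easy to test with a stub and easy to point at a different model later.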

This research demonstrates the effectiveness of integrating domain knowledge into topic modeling, and the approach is adaptable to other contexts. The work builds on previous research applying Natural Language Processing (NLP) to governmental documents and participatory platforms, such as studies using Legal-BERT for legal document analysis and the LiPSet dataset of Brazilian public bidding documents. The researchers suggest a feedback cycle in which human experts refine the model's seed words and parameters so that it stays relevant. Manual validation remains necessary where the model has low confidence in its predictions. Overall, the pipeline combines NLP with domain knowledge to classify public proposals automatically and make participatory governance more efficient.
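The split between automatic classification and manual validation can be sketched as a simple threshold on prediction confidence. This is a minimal illustration under stated assumptions: the tuple layout, the function name, and the 0.5 threshold are hypothetical, since the article does not report how confidence is computed or where the cutoff sits.

```python
def route_by_confidence(predictions, threshold=0.5):
    """Split predictions into auto-accepted and manual-review lists.

    `predictions` is an iterable of (proposal_id, topic, confidence)
    tuples; the 0.5 default threshold is an illustrative assumption.
    """
    auto, manual = [], []
    for proposal_id, topic, confidence in predictions:
        target = auto if confidence >= threshold else manual
        target.append((proposal_id, topic))
    return auto, manual
```

Corrections made during manual review could then feed the expert feedback cycle described above, for example by adjusting the seed words.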

Citizen Proposals Classified Using Automated Topic Modeling

This work presents a novel methodology for organizing and interpreting the substantial volume of citizen input received through the Brasil Participativo platform, a digital initiative designed to gather public contributions for the national Plano Plurianual (PPA). Recognizing the challenges of manually classifying the large number of contributions, researchers developed an automated pipeline based on the BERTopic model. This approach aims to transform raw citizen input into actionable data for public policy formulation. The team investigated strategies to enhance the performance of BERTopic, including the incorporation of “seed words” extracted from institutional vocabulary and the use of a Large Language Model (LLM) for automatic topic validation.

These semi-supervised techniques guide the model to align with official taxonomies and reduce the need for manual intervention. Researchers measured the quality of the resulting topic models using metrics focused on semantic coherence, thematic diversity, and alignment with established institutional categories. Experiments demonstrate the effectiveness of this approach in processing large-scale citizen contributions. The developed pipeline successfully organizes the data, enabling a more efficient and consistent categorization of proposals. By leveraging natural language processing techniques, this work provides a scalable solution for governments seeking to harness the power of digital participation and incorporate citizen input into policy-making processes, paving the way for improved transparency, responsiveness, and legitimacy in governmental actions.
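Of the three metric families mentioned, thematic diversity is the simplest to illustrate. A common definition in the topic modeling literature is the share of unique words among the top words of all topics. The sketch below assumes that definition, because the article does not give the exact formula the authors used.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    `topics` is a list of word lists ordered by importance. A value of
    1.0 means no word is shared between topics; values near 0 mean
    heavy overlap. Returns 0.0 when there are no words at all.
    """
    top_words = [word for words in topics for word in words[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)
```

For example, two three-word topics that share one word yield five unique words out of six, a diversity of about 0.83.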

Citizen Feedback Organised With BERTopic Models

This research presents a novel methodology for automatically organizing large volumes of text generated through digital participation platforms, such as Brasil Participativo. By combining the BERTopic model with carefully selected seed words and an automated validation process, the team successfully demonstrates a system capable of identifying coherent and institutionally aligned topics within citizen contributions. This approach addresses a significant challenge for governments seeking to utilize public input effectively.

👉 More information
🗞 Semantic Clustering of Civic Proposals: A Case Study on Brazil’s National Participation Platform
🧠 ArXiv: https://arxiv.org/abs/2509.21292

Rohail T.