Tamazight, a language spoken by millions across North Africa, has historically been underrepresented in digital technologies, creating a significant barrier to access and preservation. Alp Öktem from Col·lectivaT and Farida Boudichat from the Awal Team now present a detailed analysis of efforts to address this imbalance through Awal, a community-powered initiative launched in 2024. Their work examines the challenges of building computational resources for Tamazight and offers insight into the complexities of engaging speakers in language data creation. The research shows that while there is strong enthusiasm for digitalizing Tamazight, standard crowdsourcing methods are limited by factors such as low confidence in written forms and ongoing standardization debates, which ultimately restrict the scale of contributed data. The study provides guidance for future initiatives aiming to empower under-resourced languages in the digital age, and the collected resources are being used to build improved machine translation models.
This initiative employs a community-driven approach, focusing on collecting data, creating language corpora, and developing potential machine translation models through participatory research. The project centers on Amazigh communities in Catalonia and North Africa, recognizing the importance of local involvement. A core challenge lies in the scarcity of available data, compounded by issues of data quality, inconsistencies, and variations in writing systems, impacted by the ongoing process of standardizing Amazigh and its dialectal differences.
The project explicitly addresses the need to decolonize Natural Language Processing practices, ensuring that language technology benefits Amazigh communities and does not perpetuate existing inequalities. Understanding the social and cultural context of the language is paramount, recognizing the importance of dialectal variations and community involvement. Ethical considerations, such as data ownership, privacy, and potential misuse of technology, are central to the project’s approach, emphasizing participatory research, data justice, and the preservation of cultural diversity through language technology. Successful development requires both technical skills and a deep understanding of the language’s social and cultural context, prioritizing quality over quantity in data collection.
Community-Driven Tamazight Translation Data Collection
Launched in 2024, the Awal project addresses the scarcity of digital resources for Tamazight through a community-driven initiative that collects both translation and voice data. Initial efforts involved manual collection of translated phrases, but the team quickly moved to a platform-based approach to broaden participation and scale data collection for machine translation development. The awaldigital.org platform serves as a central hub where users can access project information and use an integrated machine translation application. A dedicated interface lets them contribute translations in either direction between Tamazight and languages including Catalan, Spanish, French, Moroccan Arabic, and English.
To enhance efficiency, the platform features a “Pre-translate” option that automatically translates the source text; users then correct and refine the output before submission, creating a post-editing workflow. A gamification system awards points for character input, fostering friendly competition and encouraging sustained participation, and users can track their ranking on a leaderboard. Quality control relies on a peer-validation system in which users review translations submitted by others, assessing meaning, fluency, and grammatical accuracy; two approvals are required to move an entry into the validated corpus. The team acknowledges Tamazight’s dialectal diversity by categorizing contributions into five variants, avoiding strict standardization and welcoming contributions from all speakers. Voice data collection complements the translation effort through Mozilla’s Common Voice platform, which required translating the platform interface into Tamazight and creating a relevant collection of sentences for recording.
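The validation and scoring flow described above can be illustrated with a minimal Python sketch. It is not the Awal platform's actual code: the class names, the one-point-per-character scoring rule, the ban on self-approval, and the placeholder variant label are all assumptions made for illustration. Only the two-approval threshold and the idea of character-based points come from the article.

```python
from dataclasses import dataclass, field

APPROVALS_REQUIRED = 2  # from the article: two approvals validate an entry


@dataclass
class Entry:
    source: str
    target: str
    variant: str  # placeholder label; the article does not name the five variants
    author: str
    approvers: set = field(default_factory=set)

    @property
    def validated(self) -> bool:
        return len(self.approvers) >= APPROVALS_REQUIRED


class Corpus:
    """Hypothetical in-memory store for pending and validated translation pairs."""

    def __init__(self):
        self.pending = []
        self.validated = []
        self.points = {}

    def submit(self, entry: Entry) -> None:
        # Assumed scoring rule: one point per character of the submitted translation.
        self.points[entry.author] = self.points.get(entry.author, 0) + len(entry.target)
        self.pending.append(entry)

    def approve(self, entry: Entry, reviewer: str) -> None:
        # Assumption: contributors may not approve their own entries.
        if reviewer == entry.author:
            raise ValueError("authors cannot approve their own entries")
        entry.approvers.add(reviewer)  # a set ignores repeat approvals by one reviewer
        if entry.validated and entry in self.pending:
            self.pending.remove(entry)
            self.validated.append(entry)

    def leaderboard(self):
        # Highest score first.
        return sorted(self.points.items(), key=lambda kv: kv[1], reverse=True)


# Usage with hypothetical users and placeholder text.
corpus = Corpus()
entry = Entry("Good morning", "placeholder translation", "variant-1", "user_a")
corpus.submit(entry)
corpus.approve(entry, "user_b")
corpus.approve(entry, "user_c")
print(len(corpus.validated), corpus.leaderboard())
```

Storing approvers as a set means a second approval from the same reviewer does not count twice, which is one simple way to make sure the two approvals come from different people.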
Tamazight Language Data, Community Engagement, Challenges
The Awal initiative represents a significant step towards addressing the underrepresentation of Tamazight in digital spaces, launching in 2024 as a community-powered platform for language resource development. This work involved a detailed review of the current landscape of computational resources for Tamazight, identifying a critical need for community-driven approaches to overcome persistent data scarcity. The platform enables speakers to directly contribute to the creation of translation and voice data, fostering a collaborative environment for language preservation and technological advancement. An 18-month analysis of community engagement revealed both positive reception and key challenges, including limited confidence in written Tamazight and ongoing complexities surrounding standardization.
Despite these challenges, the project successfully collected 6,421 translation pairs and 3 hours of speech data, demonstrating the potential of community involvement, even within complex sociolinguistic contexts. The collected data is now being utilized to develop improved open-source machine translation models, paving the way for more accurate and accessible language technologies. This research highlights the importance of participatory approaches in low-resource language technology, acknowledging the need to address linguistic diversity and empower communities to shape their digital futures. The team is actively working to refine methods for encouraging broader participation, recognizing that sustained engagement requires addressing both technical and cultural barriers.
Tamazight Data Collection, Community Engagement, Challenges
The Awal initiative represents a significant step towards addressing the digital underrepresentation of Tamazight, a language historically lacking in computational resources. The project successfully established a platform and gathered a foundational dataset comprising over six thousand translation pairs and three hours of speech data, demonstrating the feasibility of community-driven language technology development for Tamazight. However, the study highlights that achieving broad participation requires careful consideration of existing linguistic barriers and community dynamics. Researchers found that limited confidence in written Tamazight, coupled with ongoing debates surrounding standardization and dialect representation, significantly constrained data contribution from the wider public.
While the initiative was well received, most contributions came from linguists, activists, and others already engaged in language preservation efforts. The team identified micro-translation tasks as a promising way to overcome creative barriers and encourage wider participation, and is currently developing improved machine translation models using the collected data. Future work will need to address the tensions between linguistic diversity and standardization, and to build confidence in written Tamazight within the broader community, in order to unlock the full potential of collaborative language technology development.
👉 More information
🗞 Awal — Community-Powered Language Technology for Tamazight
🧠 ArXiv: https://arxiv.org/abs/2510.27407
