Evaluation of 55 large language models on Maltese, a low-resource language, reveals generally poor performance, especially on generative tasks. Smaller, fine-tuned models often outperform larger counterparts, with pre-training exposure to Maltese proving critical. Fine-tuning, despite its higher upfront cost, delivers superior performance and lower inference costs.
The challenge of applying advanced natural language processing (NLP) to languages with limited digital resources remains a significant hurdle in achieving genuinely inclusive technology. While large language models (LLMs) excel in high-resource languages, their performance diminishes on languages for which less training data is available. Researchers at the University of Malta, led by Kurt Micallef and Claudia Borg, address this issue in their work, ‘MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP’. They present a new benchmark, MELABenchv1, comprising 11 tasks designed to evaluate 55 publicly available LLMs on Maltese, a low-resource language, and compare their performance against smaller, specifically fine-tuned models. Their analysis reveals that prior exposure to Maltese during model training is a critical determinant of success and that, contrary to current trends, fine-tuning smaller models often outperforms prompting larger ones, offering a more cost-effective solution for low-resource language processing.
Multilingual Model Performance on a Low-Resource Language
This study presents a comprehensive evaluation of 55 publicly available large language models (LLMs) applied to Maltese, a low-resource language, utilising a newly constructed benchmark encompassing 11 discriminative and generative natural language processing (NLP) tasks. Results demonstrate that many LLMs exhibit suboptimal performance, particularly on generative tasks, and smaller, fine-tuned models frequently outperform larger models across all tasks assessed. The research investigates multiple factors influencing model performance, revealing prior exposure to Maltese during pre-training and instruction-tuning as the most significant determinant of success.
Researchers established a robust methodology for evaluating LLMs on a low-resource language, addressing a critical gap in current research. The new benchmark comprises eleven diverse NLP tasks, encompassing both discriminative and generative challenges, to provide a comprehensive assessment of model capabilities. Discriminative tasks involve assigning labels to input data (e.g., sentiment analysis), while generative tasks involve creating new text (e.g., machine translation).
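To make the distinction concrete, the sketch below shows how a single model might be prompted for one task of each type. It is purely illustrative and not the authors' evaluation harness: the model name, prompt wording, and example sentences are assumptions chosen for clarity.

```python
# Illustrative sketch: prompting one model on a discriminative task (sentiment
# classification) and a generative task (translation). Not the benchmark's
# actual templates or model list.
from transformers import pipeline

MODEL_NAME = "bigscience/bloomz-560m"  # placeholder; any model under evaluation could be swapped in
generator = pipeline("text-generation", model=MODEL_NAME)

# Discriminative: the model only has to assign a label to existing text.
sentiment_prompt = (
    "Classify the sentiment of the following Maltese sentence as positive or negative.\n"
    "Sentence: Dan il-film kien tajjeb ħafna!\n"
    "Label:"
)
print(generator(sentiment_prompt, max_new_tokens=3)[0]["generated_text"])

# Generative: the model must produce new Maltese text of its own.
translation_prompt = (
    "Translate the following English sentence into Maltese.\n"
    "English: The weather is beautiful today.\n"
    "Maltese:"
)
print(generator(translation_prompt, max_new_tokens=30)[0]["generated_text"])
```

The same model is queried in both cases; what changes is how much the output is constrained, which is one reason generative tasks expose weaknesses that label-assignment tasks can mask.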
Performance analysis reveals that prior exposure to Maltese during pre-training and instruction-tuning constitutes the most significant determinant of model efficacy. Models incorporating Maltese within their training data consistently achieve higher scores, underscoring the importance of linguistic inclusivity in LLM development. This highlights a critical limitation of models trained predominantly on high-resource languages when applied to underrepresented linguistic contexts, and emphasises the need for greater diversity in training datasets.
Comparative analysis reveals a trade-off between fine-tuning and prompting strategies. While fine-tuning necessitates a greater initial investment of resources, it delivers superior performance and reduced inference costs. One-shot prompting – providing a single example to guide the model – with English instructions improves performance on generative tasks, but fine-tuning consistently yields more substantial gains, suggesting that task-specific adaptation remains a valuable approach.
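The following minimal sketch shows what one-shot prompting with an English instruction looks like for an English-to-Maltese translation task. The instruction wording and the example pair are assumptions for illustration, not the exact templates used in the benchmark.

```python
# Illustrative one-shot prompt: an English instruction, a single worked example,
# then the new input. The example pair is hypothetical.

instruction = "Translate the following English sentence into Maltese."

# The single in-context example is what distinguishes one-shot from zero-shot prompting.
example_source = "Good morning, how are you?"
example_target = "Bonġu, kif int?"

def one_shot_prompt(source_sentence: str) -> str:
    """Build a prompt: English instruction, one worked example, then the new input."""
    return (
        f"{instruction}\n\n"
        f"English: {example_source}\n"
        f"Maltese: {example_target}\n\n"
        f"English: {source_sentence}\n"
        f"Maltese:"
    )

print(one_shot_prompt("The weather is beautiful today."))
```

Because the instruction and example are sent with every request, prompting shifts cost to inference time, whereas a fine-tuned model pays its adaptation cost once.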
Figures examining the relationship between model size, multilinguality, and performance indicate that both larger models and those trained on a greater number of languages generally perform better. However, including or excluding models specifically trained on Portuguese, Italian, and related data significantly alters the observed trends, suggesting these models exhibit a distinct performance profile. This underscores the need for nuanced analysis when evaluating multilingual models, taking into account the specific language combinations involved, and highlights the potential for cross-lingual transfer learning – leveraging knowledge from one language to improve performance in another. The researchers observed that models trained on Romance languages demonstrated a particular advantage on Maltese, suggesting that shared linguistic features facilitate knowledge transfer.
Further investigation explored the relationship between model size and zero-shot performance – the ability to perform a task without any specific training examples. Results reveal a non-linear relationship: larger models generally perform better, but the gains diminish beyond a certain scale, so simply increasing model size does not guarantee sustained improvement. Increasing the number of languages a model is trained on yields similar benefits that also plateau, indicating the same holds for expanding multilingual capacity. The researchers found that the most effective model size and language coverage depend on the specific task and target language.
The study suggests that a more targeted approach to model development, focusing on fine-tuning smaller models on specific languages and tasks, may be more effective than simply scaling up model size.
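As a rough illustration of what such targeted adaptation involves, the sketch below fine-tunes a small pre-trained encoder on a Maltese classification task with the Hugging Face Trainer. The model name (MLRS/BERTu, a Maltese BERT model used here purely as an example), the CSV files, and the hyperparameters are assumptions, not the paper's exact setup.

```python
# Minimal fine-tuning sketch under assumed data and hyperparameters;
# not the authors' training configuration.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "MLRS/BERTu"  # a Maltese BERT model; any small encoder could be substituted

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical dataset with "text" and "label" columns; replace with a real Maltese corpus.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    # Tokenise and pad each example to a fixed length for batching.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```

A model of this size can be trained on a single GPU, which is part of why the paper finds fine-tuning attractive despite its upfront cost.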
Researchers analysed the performance of different models across the various NLP tasks, revealing key insights into their strengths and weaknesses. Discriminative tasks generally yielded higher scores than generative tasks, suggesting that LLMs are better at understanding and classifying existing text than at generating new text, particularly in low-resource settings.
The study advocates for more inclusive technologies and encourages researchers working with low-resource languages to consider established modelling techniques alongside the latest advancements in large language models. Researchers emphasise the importance of prioritising linguistic diversity in training datasets and developing models that are tailored to the specific characteristics of underrepresented languages. The study also encourages the development of open-source tools and resources that can facilitate research on low-resource languages.
Researchers acknowledge certain limitations of the study, including the relatively small size of the Maltese dataset and the limited scope of the NLP tasks evaluated. They also note that the study focused primarily on English-to-Maltese translation, and that further research is needed to explore other language pairs. Despite these limitations, the study provides valuable insights into the performance of LLMs on a low-resource language and offers practical recommendations for future research. Researchers plan to expand the dataset, evaluate a wider range of NLP tasks, and explore other language pairs in future studies.
The study concludes that fine-tuning smaller models on specific languages and tasks represents a promising approach to developing effective NLP solutions for low-resource languages. Researchers emphasise the importance of prioritising linguistic diversity in training datasets and fostering collaboration among researchers. By embracing a more targeted and inclusive approach to model development, we can unlock the full potential of NLP for all languages. The findings of this study have important implications for a wide range of applications, including machine translation, text summarisation, and information retrieval.
👉 More information
🗞 MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP
🧠 DOI: https://doi.org/10.48550/arXiv.2506.04385
