Scientists are increasingly focused on the early identification of security bug reports to minimise potential vulnerabilities. Farnaz Soltaniani, Shoaib Razzaq, and Mohammad Ghafari, all from Technische Universität Clausthal, Germany, have evaluated the effectiveness of both prompting pre-trained Large Language Models (LLMs) and fine-tuning smaller models for predicting such reports. Their research highlights a crucial trade-off: while proprietary LLMs demonstrate superior sensitivity in identifying security bug reports, achieving a G-measure of 77%, this comes at the expense of precision, which reaches only 22%. Conversely, fine-tuned models offer higher precision (75%) but reduced recall (36%). This work is significant because it demonstrates that, despite the initial investment, fine-tuned models can offer substantially faster inference, up to 50 times quicker on large datasets, suggesting a viable pathway for efficient and accurate security vulnerability prediction.
The research team assessed both prompt-based approaches utilising proprietary models and fine-tuning techniques with smaller, open-source LLMs to determine the most effective strategy for identifying security risks within bug reports.
Findings reveal a clear trade-off between these two methods, with proprietary models exhibiting high sensitivity, achieving an average G-measure of 77% and recall of 74% across datasets, but suffering from a significantly higher false-positive rate, resulting in a precision of only 22%. Conversely, fine-tuned models demonstrated greater precision, attaining 75% on average, but at the expense of recall, which measured only 36%, and a lower overall G-measure of 51%.
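In the security bug report (SBR) prediction literature, the G-measure is usually taken as the harmonic mean of recall (probability of detection) and the complement of the false-positive rate. The sketch below shows how precision, recall, and G-measure can be computed from confusion-matrix counts; this definition and the example counts are assumptions for illustration, not the paper's data.

```python
# Illustrative sketch: precision, recall, and G-measure for SBR prediction.
# G-measure is taken here as the harmonic mean of recall and (1 - false
# positive rate); the counts below are made up, not the study's results.
def sbr_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # probability of detection
    fpr = fp / (fp + tn) if (fp + tn) else 0.0         # probability of false alarm
    g_denominator = recall + (1 - fpr)
    g_measure = 2 * recall * (1 - fpr) / g_denominator if g_denominator else 0.0
    return {"precision": precision, "recall": recall, "g_measure": g_measure}

# Hypothetical confusion-matrix counts, only to show the calculation.
print(sbr_metrics(tp=30, fp=105, tn=900, fn=10))
```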
Although building fine-tuned models requires an initial investment, inference on the largest dataset was shown to be up to 50 times faster than with proprietary models. This speed advantage offers a practical benefit for large-scale vulnerability assessment. The study utilised datasets including Chromium, Derby, Camel, Ambari, and Wicket, with varying numbers of bug reports and proportions of security-related issues.
Experiments involved prompting proprietary LLMs, specifically GPT and Gemini, with bug report descriptions and requesting identification of security implications. Gemini consistently prioritised sensitivity, achieving a recall of 0.74, while GPT adopted a more conservative approach with a recall of 0.23.
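The paper's exact prompt is not reproduced here, so the sketch below only shows a plausible shape for this kind of query, using the OpenAI Python client as a stand-in for either proprietary model; the prompt wording, model name, and YES/NO parsing are assumptions.

```python
# Hedged sketch: asking a proprietary LLM whether a bug report has security
# implications. Prompt text, model name, and answer parsing are illustrative
# assumptions, not the authors' actual setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_bug_report(report_text: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-capable model could be used
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You triage software bug reports for security impact."},
            {"role": "user",
             "content": ("Does the following bug report describe a security "
                         "vulnerability? Answer YES or NO.\n\n" + report_text)},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

print(classify_bug_report("Crash when parsing an overly long HTTP header."))
```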
Further investigation focused on fine-tuning encoder-only models like BERT and DistilBERT, and decoder-only models such as DistilGPT and Qwen, adapting them for bug report prediction. DistilBERT consistently outperformed other fine-tuned models, achieving a G-measure of 0.51, though still lagging behind the best proprietary model performance.
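A typical way to adapt such a model is to treat SBR prediction as binary sequence classification. The sketch below fine-tunes DistilBERT with Hugging Face Transformers; the toy data, column names, and hyperparameters are assumptions for illustration, not the paper's configuration.

```python
# Hedged sketch: fine-tuning DistilBERT as a binary SBR classifier.
# Toy data, column names, and hyperparameters are illustrative assumptions.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

train_df = pd.DataFrame({"text": ["Buffer overflow in URL parser",
                                  "Button label misaligned"],
                         "label": [1, 0]})   # 1 = SBR, 0 = non-SBR (toy examples)
test_df = train_df.copy()

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
test_ds = Dataset.from_pandas(test_df).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sbr-distilbert",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=test_ds,
)
trainer.train()
```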
These results highlight the need for continued research into harnessing the power of LLMs for security bug report (SBR) prediction, particularly in identifying factors that contribute to successful classification and balancing sensitivity with precision to minimise both false positives and missed vulnerabilities. The research team utilised five established datasets, Chromium, Ambari, Wicket, Derby, and Camel, detailed in Table I, each containing bug report features including ID, description, and security status.
They consolidated bug descriptions and summaries into a single text field for each report, streamlining the input for the LLMs. To address the temporal mismatch common in real-world scenarios, the study employed a sequential dataset partitioning method. Each dataset was sorted chronologically and divided into two equal halves; the first half served as the training set, comprising historical bug reports, while the second half constituted the test set, representing future reports.
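A minimal sketch of that preprocessing, assuming each project is a pandas DataFrame with summary, description, creation-date, and label columns (the column names are assumptions), might look like this:

```python
# Hedged sketch: merging summary and description into one text field, then
# splitting each project chronologically in half. Column names are assumptions.
import pandas as pd

def chronological_split(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    df = df.copy()
    df["text"] = df["summary"].fillna("") + " " + df["description"].fillna("")
    df = df.sort_values("created_at")      # oldest reports first
    midpoint = len(df) // 2
    train = df.iloc[:midpoint]             # historical reports -> training set
    test = df.iloc[midpoint:]              # later reports -> test set
    return train, test

# Usage (hypothetical): train_df, test_df = chronological_split(chromium_df)
```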
This approach simulates the predictive task more realistically than random train-test splits. Researchers then trained and evaluated both proprietary LLMs and fine-tuned models on this temporally split data. The proprietary models' high sensitivity, however, came at the cost of a precision of only 22%, indicating a high false-positive rate.
The work demonstrates a clear trade-off between sensitivity and precision when employing these models for vulnerability mitigation. Experiments revealed that fine-tuned LLMs exhibited contrasting behaviour, attaining a lower overall G-measure of 51% but achieving substantially higher precision of 75%, alongside a reduced recall of 36%.
Though requiring an initial investment in model building, inference on the largest dataset with fine-tuned models proved up to 50 times faster than with proprietary models. This speed advantage could be crucial for real-time security monitoring and rapid response. Data shows the datasets used in the study comprised varying proportions of security bug reports; for example, the Chromium dataset contained 41,940 bug reports, of which 808 were SBRs.
The team partitioned each dataset chronologically, using the first half for training and the second half for testing to simulate real-world temporal mismatches. Table II details the distribution of SBRs and non-SBRs within the training and testing sets for each project, with Chromium’s training set containing 371 SBRs and 20,599 non-SBRs.
Measurements confirm that the best fine-tuned model achieved 53 percentage points higher precision than the proprietary models, while also offering significantly lower inference latency. The study employed Gemini and GPT, accessing them via API for seamless integration and utilising their reasoning capabilities. The results reveal a trade-off between these two approaches, with proprietary models like Gemini achieving a higher G-measure of 77% but exhibiting a substantial false positive rate and increased inference latency.
Conversely, fine-tuned models, such as DistilBERT, demonstrated a lower G-measure of 51% but significantly improved precision of 75%, alongside faster inference speeds, up to 50 times quicker than proprietary models. This work highlights the potential of LLMs for SBR prediction, while also demonstrating the contrasting strengths of prompted versus fine-tuned models.
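The latency gap is straightforward to check empirically: time a batch of reports through a locally hosted fine-tuned classifier and through an API-backed proprietary model, then compare per-report averages. A minimal, hedged sketch, with the two classifier functions standing in for whatever models are actually deployed, is below.

```python
# Hedged sketch: comparing per-report inference latency of two classifiers.
# classify_local and classify_remote are placeholders for a locally hosted
# fine-tuned model and an API-backed proprietary model, respectively.
import time

def mean_latency(classify, reports):
    start = time.perf_counter()
    for report in reports:
        classify(report)
    return (time.perf_counter() - start) / len(reports)

reports = ["Crash when parsing an overly long HTTP header."] * 20   # toy batch
# print("local: ", mean_latency(classify_local, reports))
# print("remote:", mean_latency(classify_remote, reports))
```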
Proprietary models offer higher sensitivity to SBRs, but at a cost, while fine-tuned models prioritise precision and computational efficiency. The authors acknowledge limitations including the dataset-specific nature of optimal parameters and the need for replication across diverse projects to ensure generalisability. Future research should focus on understanding the reasons for the limited performance observed and exploring alternative model variants to further enhance SBR prediction accuracy.
👉 More information
🗞 Evaluating Large Language Models for Security Bug Report Prediction
🧠 ArXiv: https://arxiv.org/abs/2601.22921
