Detecting vulnerabilities in computer code remains a significant challenge even for the most advanced artificial intelligence systems, despite recent progress in large language models (LLMs) for coding tasks. Researchers led by Md Abdul Hannan from Colorado State University, along with Ronghao Ni and Chi Zhang from Carnegie Mellon University, investigate how best to ‘teach’ these LLMs to identify security flaws. Their work centres on ‘in-context learning’, where providing a few relevant examples can dramatically improve performance; crucially, which examples are selected matters greatly. The team evaluates whether examples the LLM itself gets wrong, or examples most similar to the code being examined, prove more effective at enhancing vulnerability detection across multiple open-source datasets, with the ultimate aim of building more secure and reliable software.
Few-Shot Learning for Vulnerability Detection
This study pioneers a new approach to enhance code vulnerability detection using large language models through carefully chosen examples, a technique known as in-context learning. Researchers developed two algorithmic methods to construct these example sets, focusing on a model’s past performance and semantic similarity to the code under analysis. The first method, Learn-from-Mistakes (LFM), assesses a language model’s consistency when evaluating potential examples; examples where the model consistently errs are added to the example set, with the hypothesis that addressing these errors improves overall performance. Variations of LFM also explored adding examples where the model consistently performs correctly, aiming to reinforce desired behavior.
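A minimal sketch of this mistake-based selection is shown below, assuming a generic classify helper that returns the model’s vulnerable/not-vulnerable prediction; the helper name, repeat count, and demonstration budget are illustrative, not the authors’ exact implementation.

```python
def learn_from_mistakes(classify, candidates, repeats=5, max_demos=8):
    """Add candidate examples that the model consistently gets wrong.

    classify(code) -> predicted label (e.g. 1 = vulnerable, 0 = not vulnerable);
    candidates is a list of (code, true_label) pairs.
    """
    demos = []
    for code, label in candidates:
        # Query the model several times on the same example to gauge consistency.
        preds = [classify(code) for _ in range(repeats)]
        # "Consistently errs": every repeated prediction disagrees with the true label.
        if all(p != label for p in preds):
            demos.append((code, label))
        if len(demos) >= max_demos:
            break
    return demos
```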
The second method, Learn-from-Nearest-Neighbors (LFNN), leverages code embedding models to quantify semantic similarity between potential examples and the code being tested. These models encode programs as vectors, allowing researchers to compute similarity using metrics such as cosine similarity; the most similar examples are then added to the example set. To further refine performance, the team proposed three methods combining LFM and LFNN, creating algorithms that balance a model’s past mistakes with semantic relevance. Experiments employed Qwen2.5-Coder-7B-Instruct, Gemma-3-4B-it, and GPT-5-mini, evaluating performance across four datasets containing code in C/C++, Python, and JavaScript. The study consistently measured precision, recall, and F1 score to assess the effectiveness of each method. Results indicate that individual strategies, particularly LFNN, significantly enhance baseline vulnerability detection capabilities, while the combined methods deliver a robust and balanced performance profile, optimizing both accuracy and F1 score across diverse models and datasets.
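A corresponding sketch of the nearest-neighbor selection follows, where embed stands in for whichever code embedding model is used; the function name and demonstration budget are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def learn_from_nearest_neighbors(query_code, candidates, embed, n_demos=8):
    """Pick the candidate examples most semantically similar to the query code.

    embed(code) -> a fixed-size vector; candidates is a list of (code, label) pairs.
    """
    query_vec = embed(query_code)
    scored = [(cosine_similarity(query_vec, embed(code)), code, label)
              for code, label in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)  # most similar first
    return [(code, label) for _, code, label in scored[:n_demos]]
```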
LLM Vulnerability Detection via Informed Example Selection
This work presents significant advances in detecting code vulnerabilities using large language models (LLMs) through improved in-context learning (ICL). Researchers developed new methods for selecting the most informative examples to include in the prompts given to LLMs, substantially enhancing their ability to identify security flaws in code. The core innovation lies in two algorithmic methods, Learn-from-Mistakes (LFM) and Learn-from-Nearest-Neighbors (LFNN), which guide the selection of these examples. LFM operates on the principle that LLMs learn best from examples they initially struggle with, adding samples to the prompt only if the model consistently makes errors on them.
Variations of LFM also explore adding examples where the model consistently performs correctly, reinforcing desired behavior. Experiments revealed that LFM introduces a strong bias, consistently favoring positive predictions, but it still provides valuable insight into model learning. LFNN, conversely, focuses on semantic similarity, adding the examples most closely related to the code being analyzed, using code embedding models and cosine similarity to determine relatedness. Evaluations using Qwen2.5-Coder-7B-Instruct, Gemma-3-4B-it, and GPT-5-mini across four datasets containing C/C++, Python, and JavaScript code demonstrate the effectiveness of these strategies.
Individual strategies, particularly LFNN, significantly improved baseline vulnerability detection capabilities. Combining LFM and LFNN yielded even more robust performance, optimizing both accuracy and F1-score across diverse models and datasets. These combined methods represent a balanced approach, leveraging both semantic similarity and the model’s learning history to enhance vulnerability detection. The results demonstrate a clear pathway toward deploying LLMs for practical software security applications.
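The paper proposes three specific combination methods; as a rough illustration of the underlying idea (not the authors’ exact algorithms), one simple way to blend the two criteria is to interleave the similarity-ranked examples with the mistake-based ones.

```python
from itertools import zip_longest

def combine_lfnn_lfm(lfnn_demos, lfm_demos, n_demos=8):
    """Alternate between similarity-based and mistake-based picks, skipping duplicates."""
    combined, seen = [], set()
    for pair in zip_longest(lfnn_demos, lfm_demos):
        for item in pair:
            if item is None:
                continue
            code, label = item
            if code in seen:
                continue
            combined.append((code, label))
            seen.add(code)
            if len(combined) >= n_demos:
                return combined
    return combined
```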
Combining Example Selection Improves Vulnerability Detection
This work investigates methods for improving the performance of large language models in detecting code vulnerabilities. Researchers explored the effectiveness of selecting appropriate examples for in-context learning, a technique where models learn from a few demonstrations alongside the query. Two primary criteria were examined: choosing examples where the model consistently makes mistakes, and selecting examples most similar to the code being analyzed. Evaluations demonstrated that combining these criteria yields substantial improvements, particularly for open-source models. While these models initially perform less effectively than closed-source alternatives, the combined methods enable them to approach comparable levels of performance.
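For readers unfamiliar with in-context learning, the selected examples are simply prepended to the query as labeled demonstrations. A minimal sketch of such a prompt builder is given below; the instruction wording and label format are illustrative, not taken from the paper.

```python
def build_prompt(demos, query_code):
    """Pack selected demonstrations and the query function into a single prompt."""
    lines = ["Decide whether each function contains a security vulnerability.", ""]
    for code, label in demos:
        verdict = "vulnerable" if label == 1 else "not vulnerable"
        lines += ["Example:", code, f"Answer: {verdict}", ""]
    lines += ["Function to analyze:", query_code, "Answer:"]
    return "\n".join(lines)
```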
The research also revealed that simpler methods for example selection can struggle with certain datasets, highlighting the need for more robust strategies. Furthermore, the study suggests that selecting challenging examples tends to prioritize identifying all vulnerabilities, even at the cost of increased false positives. The authors acknowledge limitations inherent in current vulnerability datasets, including potential noise and duplication, and incorporated a carefully curated dataset to mitigate these concerns. They also note that function-level vulnerability detection, while a valuable first step, has inherent limitations compared to analyzing entire code repositories. Future work could focus on addressing these limitations and exploring more sophisticated methods for example selection and vulnerability analysis.
👉 More information
🗞 On Selecting Few-Shot Examples for LLM-based Code Vulnerability Detection
🧠 ArXiv: https://arxiv.org/abs/2510.27675
