Researchers are increasingly turning to Large Language Models (LLMs) to automate and assist with software development, particularly code generation. However, a significant gap exists in understanding how best to instruct these models to produce reliable and effective code: developers lack concrete guidelines for crafting effective prompts. Now, Alessandro Midolo (University of Catania), Alessandro Giagnorio (USI Università della Svizzera italiana), Fiorella Zampetti (University of Sannio), et al., present an empirical characterisation of prompt engineering for LLMs, deriving ten actionable guidelines through rigorous, test-driven refinement. This work is significant because it moves beyond simply demonstrating LLM potential, offering developers practical, evidence-based advice on specifying inputs, clarifying requirements, and ultimately improving the quality of generated code, with surprising insights into the disconnect between perceived usefulness and actual implementation.
The team employed an innovative, test-driven approach, automatically refining prompts through iterative cycles and meticulously analysing the resulting improvements that led to successful test outcomes. This process allowed them to identify key elements crucial for effective prompt engineering, ultimately eliciting a set of ten actionable guidelines focused on enhancing clarity and detail.
The study began with code generation tasks sourced from three established Python benchmarks, BigCodeBench, HumanEval+, and MBPP+, and initially tested prompts with four state-of-the-art LLMs: GPT-4o-mini, Llama 3.3 70B Instruct, Qwen2.5 72B Instruct, and DeepSeek Coder V2 Instruct. When these LLMs consistently failed to generate code passing the benchmark tests, the researchers implemented an automated refinement process, iteratively modifying the prompts until test success was achieved. Through careful manual analysis of the initial and refined prompts, alongside detailed test logs, they identified textual elements consistently added during the refinement process, forming the basis of their ten prompt improvement dimensions. These dimensions encompass crucial aspects such as better specification of input/output formats, pre- and post-conditions, illustrative examples, and clarification of ambiguities.
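To make the loop concrete, the sketch below illustrates a test-driven refinement cycle of the kind described above: generate code from the current prompt, run the benchmark tests, and, on failure, ask the model to enrich the prompt using the test log. This is only a minimal illustration under those assumptions; the `llm` and `run_tests` callables, the meta-prompt wording, and all function names are hypothetical stand-ins, not the authors' actual pipeline.

```python
from typing import Callable, Tuple

def refine_until_passing(
    llm: Callable[[str], str],                     # prompt -> model output (code or revised prompt)
    run_tests: Callable[[str], Tuple[bool, str]],  # code -> (passed, test log)
    initial_prompt: str,
    max_iterations: int = 5,
) -> Tuple[str, str, bool]:
    """Iteratively refine a code-generation prompt until the tests pass."""
    prompt, code = initial_prompt, ""
    for _ in range(max_iterations):
        code = llm(prompt)                 # generate candidate code
        passed, log = run_tests(code)      # execute the benchmark test suite
        if passed:
            return prompt, code, True
        # On failure, ask the model to enrich the prompt itself, feeding back
        # the test log so it can add missing I/O formats, conditions, or examples.
        prompt = llm(
            "Improve the following code-generation prompt so that the produced "
            "code passes the failing tests. Add any missing details such as "
            "input/output formats, pre-/post-conditions, or examples.\n\n"
            f"Current prompt:\n{prompt}\n\nFailing test log:\n{log}"
        )
    return prompt, code, False
```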
To validate these guidelines, the researchers conducted an assessment involving 50 software development practitioners, gathering data on their current usage of prompt optimisation techniques and their perceptions of the usefulness of the newly proposed guidelines. Results revealed varied adoption of the different patterns: participants frequently refined I/O formats and conditions, but were less inclined to use “by example” approaches or linguistic improvements. Interestingly, the practitioners perceived the guidelines related to adding I/O examples as particularly useful, despite reporting less frequent current usage, suggesting a potential for increased adoption with greater awareness. This work not only provides valuable insights for developers and educators but also lays the groundwork for creating more intelligent LLM-aided software development tools. The implications extend beyond immediate practical application, offering a pathway towards automated recommendation systems capable of identifying missing elements within a prompt and suggesting targeted improvements based on the specific context of the task.
The manual analysis of initial and refined prompts yielded a taxonomy of ten code generation prompt improvement dimensions, encompassing aspects such as I/O specification, pre- and post-conditions, illustrative examples, and clarification of ambiguities. The study pioneered a method for systematically deriving actionable guidelines directly from successful prompt revisions, moving beyond general prompt engineering advice. To validate these guidelines, the team conducted a survey involving 50 practitioners from their professional network, assessing both their current usage of prompt optimisation patterns and the perceived usefulness of the guidelines. Participants reported their adoption of various techniques, revealing a tendency to refine I/O formats and conditions, but less frequent use of “by example” approaches or linguistic improvements.
Interestingly, practitioners rated patterns they already used as highly useful, but also saw significant potential in less frequently employed techniques, such as providing I/O examples. The refinement process yielded a set of ten guidelines focused on enhancing prompt clarity, specifically regarding I/O specification, pre- and post-conditions, illustrative examples, detailed information, and the resolution of ambiguities. The experiments indicate that these prompt refinements significantly affect the effectiveness of LLM-aided code generation.
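To see what such refinements look like in practice, consider the invented example below, which contrasts a terse prompt with one enriched along several of the guideline dimensions (I/O specification, pre- and post-conditions, an example, and ambiguity resolution). The task and wording are hypothetical illustrations, not prompts taken from the benchmarks or the study.

```python
# Invented illustration: a terse prompt versus one refined along several of the
# guideline dimensions. The task and wording are hypothetical, not from the study.

VAGUE_PROMPT = "Write a function that parses a date."

REFINED_PROMPT = """
Write a Python function `parse_date(text: str) -> tuple[int, int, int]`.

- Input format: a string of the form 'DD/MM/YYYY'.                (I/O specification)
- Output: a `(year, month, day)` tuple of ints.                   (I/O specification)
- Pre-condition: `text` is non-empty; raise ValueError otherwise. (pre-condition)
- Post-condition: 1 <= month <= 12 and 1 <= day <= 31.            (post-condition)
- Example: parse_date('07/03/2024') returns (2024, 3, 7).         (by example)
- Use only the standard library; do not guess other date formats. (ambiguity resolution)
"""
```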
The team conducted a comprehensive assessment involving 50 practitioners, recording their existing usage of prompt improvement patterns and their perceptions of usefulness. Results demonstrated a discrepancy between perceived usefulness and actual usage prior to exposure to the newly developed guidelines, highlighting a gap in developer awareness. Data shows that participants frequently refined I/O formats and pre/post conditions, but less often employed “by example” approaches or linguistic improvements. However, they consistently rated the addition of I/O examples as particularly useful, even if they didn’t routinely implement it.
Researchers utilised code generation tasks from three Python benchmarks, BigCodeBench, HumanEval+, and MBPP+, to initiate the study. They began with simple prompts for four state-of-the-art LLMs: GPT-4o-mini, Llama 3.3 70B Instruct, Qwen2.5 72B Instruct, and DeepSeek Coder V2 Instruct, identifying instances where the models consistently failed benchmark test cases. Through automated iterative refinement, the team generated prompts capable of producing test-passing code, enabling a detailed analysis of the changes made. Measurements confirm that the automated process consistently identified and incorporated elements that improved code generation success rates.
By manually analysing initial and final prompts alongside test logs, the researchers elicited a taxonomy of ten code generation prompt improvement dimensions. The study recorded varying levels of usage for each pattern, with participants indicating a strong preference for refining I/O formats and pre/post conditions. The result is a practical catalogue of guidelines for developers, offering actionable insights to enhance prompt engineering for code generation tasks. These guidelines can serve as valuable support for both software practitioners and educators, and potentially form the basis for automated recommender systems capable of identifying missing elements and suggesting prompt improvements.
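The recommender idea mentioned above can be pictured as a simple audit pass over a prompt that flags guideline dimensions with no apparent coverage. The toy heuristic below, with invented keyword checks, is only a sketch of that concept and not the authors' proposed system.

```python
import re

# Toy heuristic illustrating the recommender idea: flag guideline dimensions
# that a prompt does not obviously cover. The keyword patterns are simplistic,
# invented placeholders, not the authors' proposed system.
CHECKS = {
    "I/O specification": r"\b(input|output|returns?|format)\b",
    "Pre-/post-conditions": r"\b(precondition|postcondition|assume|raise|must)\b",
    "Illustrative examples": r"\b(example|for instance)\b",
}

def missing_elements(prompt: str) -> list[str]:
    """Return the guideline dimensions not obviously covered by the prompt."""
    return [name for name, pattern in CHECKS.items()
            if not re.search(pattern, prompt, flags=re.IGNORECASE)]

if __name__ == "__main__":
    print(missing_elements("Write a function that sorts a list."))
    # -> ['I/O specification', 'Pre-/post-conditions', 'Illustrative examples']
```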
Ten Guidelines for Effective LLM Code Prompts
Scientists have developed ten guidelines to help developers optimise prompts for large language models (LLMs) used in code generation. Through an iterative, test-driven approach, researchers automatically refined code generation prompts and analysed the resulting improvements to identify key elements leading to successful test outcomes. These elements were then used to formulate guidelines focused on better specifying input/output, pre- and post-conditions, providing illustrative examples, adding detailed information, and resolving ambiguities. A subsequent assessment involving fifty practitioners revealed their usage of these improvement patterns, alongside their perceived usefulness, which interestingly didn’t always align with actual usage prior to learning the guidelines.
The findings suggest these guidelines could benefit not only software developers and educators, but also those designing LLM-aided software development tools. The authors acknowledge limitations related to participant subjectivity and potential self-selection bias within the study group. Furthermore, the generalisability of the findings may be limited by the choice of programming benchmarks and the focus on the Python language. Future research could explore the application of these guidelines to other programming languages and benchmarks, potentially leading to the discovery of additional improvement patterns. The team suggests creating a decision tree to guide developers on which prompt elements to improve and when, offering a practical application of the research. Ultimately, this work establishes a valuable reference point for developers interacting with LLMs, and provides a foundation for incorporating effective prompt engineering into software engineering curricula.
👉 More information
🗞 Guidelines to Prompt Large Language Models for Code Generation: An Empirical Characterization
🧠 ArXiv: https://arxiv.org/abs/2601.13118
