Guidelines Advance Large Language Models for Superior Code Generation

Researchers are increasingly exploring the potential of Large Language Models (LLMs) to automate and enhance software development, particularly in code generation. Alessandro Midolo from the University of Catania, Alessandro Giagnorio and Rosalia Tufano from the USI Università della Svizzera italiana, alongside Fiorella Zampetti, Gabriele Bavota and Massimiliano Di Penta from the University of Sannio, present a crucial empirical characterization of how best to prompt these models for effective code creation. Their work addresses a significant gap, namely the lack of clear, evidence-based guidelines for developers seeking to maximise LLM performance, and details ten actionable guidelines derived from an iterative, test-driven approach. This research not only informs best practice for practitioners and educators, but also offers valuable insights for building the next generation of LLM-assisted software development tools.

The team employed an innovative, test-driven approach, automatically refining prompts through iterative cycles and meticulously analysing the resulting improvements that led to successful test outcomes. This process allowed them to identify key elements crucial for effective prompt engineering, ultimately eliciting a set of ten actionable guidelines focused on areas such as precise I/O specification, clear pre- and post-conditions, illustrative examples, detailed information, and ambiguity resolution.
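
The authors' refinement pipeline is not reproduced here, but the loop they describe can be sketched in a few lines of Python. In this minimal sketch, `generate_code`, `run_tests`, and `refine_prompt` are hypothetical callables standing in for the LLM call, the benchmark test harness, and the prompt-rewriting step; none of them are APIs from the paper.

```python
# Minimal sketch of the test-driven refinement loop described above.
# The three callables are hypothetical stand-ins (not APIs from the paper):
#   generate_code(prompt) -> code produced by the LLM
#   run_tests(code) -> (passed, log) from the benchmark test suite
#   refine_prompt(prompt, code, log) -> a rewritten, more precise prompt
from typing import Callable, List, Optional, Tuple


def refine_until_passing(
    prompt: str,
    generate_code: Callable[[str], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    refine_prompt: Callable[[str, str, str], str],
    max_iterations: int = 5,
) -> Tuple[Optional[str], List[tuple]]:
    """Refine a code-generation prompt until the generated code passes
    the benchmark tests, or the iteration budget is exhausted."""
    history = []
    for _ in range(max_iterations):
        code = generate_code(prompt)
        passed, log = run_tests(code)
        history.append((prompt, code, passed, log))
        if passed:
            # The textual delta between the first and last prompt is what
            # the study analyses to extract improvement patterns.
            return prompt, history
        # Feed the failing test log back to the model and ask it to rewrite
        # the prompt itself, not the code.
        prompt = refine_prompt(prompt, code, log)
    return None, history
```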

The study leveraged code generation tasks from established benchmarks (BigCodeBench, HumanEval+, and MBPP+) and initially tested prompts with four state-of-the-art LLMs: GPT-4o-mini, Llama 3.3 70B Instruct, Qwen2.5 72B Instruct, and DeepSeek Coder V2 Instruct. The researchers focused on instances where the LLMs consistently failed to generate code passing the benchmark tests, then used automated refinement to create prompts that achieved success. Through careful analysis of the initial and refined prompts alongside test logs, they identified textual elements added during the refinement process, forming the basis of their ten-dimensional taxonomy of prompt improvement. This work establishes a clear connection between specific prompt characteristics and successful code generation, moving beyond general prompt engineering advice.
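
To make the evaluation step concrete, the snippet below shows a deliberately simplified harness for checking a generated solution against HumanEval+/MBPP+-style assert-based tests. It is an illustration only: the official benchmark harnesses additionally sandbox execution and enforce timeouts, which this sketch omits.

```python
# Deliberately simplified check of a generated solution against
# HumanEval+/MBPP+-style assert-based tests. Real benchmark harnesses
# additionally sandbox execution and enforce per-test timeouts.

def passes_benchmark_tests(candidate_code: str, test_code: str) -> tuple[bool, str]:
    """Execute the candidate solution and the benchmark's assert-based
    test snippet in a shared namespace; report whether all assertions hold."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function(s)
        exec(test_code, namespace)       # run the assertions
        return True, "all tests passed"
    except AssertionError as exc:
        return False, f"test failure: {exc}"
    except Exception as exc:  # syntax or runtime error in the candidate
        return False, f"execution error: {type(exc).__name__}: {exc}"


# Toy MBPP+-style task: "return the n-th Fibonacci number".
candidate = (
    "def fib(n):\n"
    "    a, b = 0, 1\n"
    "    for _ in range(n):\n"
    "        a, b = b, a + b\n"
    "    return a\n"
)
tests = "assert fib(0) == 0\nassert fib(5) == 5\nassert fib(10) == 55"
print(passes_benchmark_tests(candidate, tests))  # (True, 'all tests passed')
```
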
To validate these guidelines, the team conducted an assessment involving 50 practitioners from their professional network. Participants reported on their existing usage of the identified prompt improvement patterns, as well as their perceived usefulness, revealing a discrepancy between current practices and the potential benefits of the newly developed guidelines. Results showed that practitioners frequently refine I/O formats and pre/post-conditions, but less often employ techniques like providing examples or linguistic improvements. Interestingly, participants perceived even those patterns they currently use infrequently, such as adding I/O examples, as highly valuable, suggesting a significant opportunity for adoption.

This research has implications for software developers, educators, and those designing LLM-aided software development tools. The elicited guidelines can serve as a practical resource for improving prompt quality, while also laying the groundwork for automated recommendation systems capable of identifying missing elements in prompts and suggesting targeted improvements. Initially, the study employed code generation tasks sourced from three established Python benchmarks: BigCodeBench, HumanEval+ and MBPP+. Four state-of-the-art LLMs (GPT-4o-mini, Llama 3.3 70B Instruct, Qwen2.5 72B Instruct, and DeepSeek Coder V2 Instruct) were utilised, focusing on instances where initial prompts consistently failed benchmark test cases.

The core of the work involved an automated process where the LLMs iteratively refined prompts until they generated code passing the benchmark tests. Researchers then meticulously analysed both the initial and refined prompts, alongside detailed test logs, to pinpoint specific textual elements added during the refinement process. This detailed analysis enabled the creation of a taxonomy encompassing ten distinct dimensions of code generation prompt improvement, offering a structured understanding of effective prompting techniques. To validate these guidelines, a survey study was conducted with 50 practitioners from the researchers’ professional network, assessing both their current usage of prompt optimisation patterns and their perceived usefulness.

Participants reported varying levels of adoption for different patterns; for example, refining input/output formats and pre/post-conditions was common, while “by example” approaches and linguistic improvements were less frequently employed. Interestingly, practitioners rated patterns they currently used as highly useful, but also indicated high potential in patterns they used less often, such as adding I/O examples. This discrepancy highlights a gap between perceived value and actual implementation, suggesting the guidelines could drive wider adoption of beneficial techniques. The resulting guidelines provide valuable support for software developers and educators, and also lay the groundwork for future automated recommender systems capable of identifying missing elements within prompts and suggesting targeted improvements.
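
As a rough illustration of what such a recommender might look like, the sketch below flags guideline dimensions that a prompt does not obviously cover. The dimension names loosely follow those reported in the study, but the keyword heuristics are invented for this example and are not the authors' method.

```python
import re

# Toy prompt checker in the spirit of the recommender systems mentioned above:
# it flags guideline dimensions a prompt does not obviously cover. The
# dimension names loosely follow the study; the keyword heuristics are
# invented for this illustration and are not the authors' method.
CHECKS = {
    "input/output specification": r"\b(input|output|returns?|parameters?|arguments?)\b",
    "pre-/post-conditions": r"\b(assume|must|should|guarantee|precondition|postcondition|raise)\b",
    "illustrative I/O example": r"(e\.g\.|for example|example:|->|=>)",
    "expected output format": r"\b(format|json|list|dict|string|integer|float)\b",
}


def missing_elements(prompt: str) -> list[str]:
    """Return the guideline dimensions not obviously covered by the prompt."""
    return [
        name
        for name, pattern in CHECKS.items()
        if not re.search(pattern, prompt, flags=re.IGNORECASE)
    ]


vague_prompt = "Write a function that processes customer transactions."
for dimension in missing_elements(vague_prompt):
    print(f"Consider adding: {dimension}")
```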

Ten Guidelines for Effective LLM Code Prompts

Scientists have developed ten guidelines to help developers optimise prompts for large language models (LLMs) used in code generation. Through an iterative, test-driven approach, researchers automatically refined code generation prompts and analysed the resulting improvements to identify key elements leading to successful test outcomes. These elements were then used to formulate the guidelines, which focus on aspects such as clearly specifying input/output, pre- and post-conditions, providing illustrative examples, adding detailed information, and resolving ambiguities. An assessment involving fifty practitioners showed that, although developers reported intending to adopt the guidelines, their prior usage of the underlying patterns did not always align with how useful they perceived those patterns to be, suggesting a gap between awareness and implementation.
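
The prompts below were written for this article (they are not taken from the paper) to illustrate how several of these dimensions, such as precise I/O specification, pre-/post-conditions, and an illustrative example, can turn a vague request into a testable one.

```python
# Illustrative prompts written for this article (not taken from the paper),
# showing how precise I/O specification, pre-/post-conditions, and an
# illustrative example can turn a vague request into a testable one.

VAGUE_PROMPT = "Write a Python function that finds duplicates in some data."

REFINED_PROMPT = """\
Write a Python function `find_duplicates(items: list[int]) -> list[int]`.

Input/output specification:
- `items` is a list of integers (possibly empty).
- Return the values that occur more than once, each listed once,
  in the order of their first appearance.

Pre-/post-conditions:
- Do not modify the input list.
- If no value repeats, return an empty list.

Example:
- find_duplicates([3, 1, 3, 2, 1, 3]) -> [3, 1]
"""
```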

The findings have implications for software developers, educators, and those creating LLM-aided software development tools, offering a reference point for prompt engineering and potential inclusion in curricula. The authors acknowledge limitations related to participant subjectivity, potential self-selection bias in the study group, and the generalisability of findings based on the specific benchmarks and programming language (Python) used. Future work could explore the application of these guidelines to other programming languages and benchmarks, as well as the development of a decision tree to guide developers in selecting appropriate optimisation techniques for different scenarios.

👉 More information
🗞 Guidelines to Prompt Large Language Models for Code Generation: An Empirical Characterization
🧠 ArXiv: https://arxiv.org/abs/2601.13118

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Sod2d Achieves 0.69 Performance Portability across NVIDIA GPUs for CFD Simulations
January 22, 2026

Hoverai Achieves 0.90 Command Recognition Accuracy with Aerial Conversational Agents
January 22, 2026

Glioma MRI Segmentation Achieves 0.929 Dice Score with Reduced Resources
January 22, 2026