Evaluating the efficacy of automatically generated text remains a significant challenge, particularly when that text is deployed in contexts demanding informed decision-making. Current evaluation methods, focused on linguistic similarity or perceived fluency, often fail to correlate with actual performance. Yu-Shiang Huang, Chuan-Ju Wang, and colleagues address this issue in their research, detailed in the article ‘Decision-oriented Text Evaluation’. They propose a novel framework that directly assesses the impact of generated text on decision outcomes, utilising both human investors and large language model agents operating within a simulated market environment. Their investigation, employing financial market digests, reveals the limitations of relying solely on summary information, but demonstrates the potential for improved performance when human expertise is combined with the analytical capabilities of artificial intelligence.
Evaluating natural language generation (NLG) presents a considerable challenge, particularly when assessing its impact on complex decision-making. Current research therefore investigates methods that move beyond traditional metrics to focus on real-world outcomes, specifically within the financial domain. This study rigorously examines the efficacy of NLG in investment scenarios, measuring its influence on both human investors and large language models (LLMs) and revealing critical insights into effective evaluation frameworks.
The research establishes a decision-oriented approach, directly linking generated text to financial performance and challenging conventional NLG evaluation methods. Participants, including professional investors, make investment decisions based exclusively on provided summaries, allowing researchers to quantify the impact of the generated text on trade performance. This contrasts sharply with traditional methods that often rely on indirect measures of text quality and fail to correlate strongly with real-world efficacy. The core principle is to move beyond assessing linguistic features like fluency or coherence, and instead measure the downstream effect on financial outcomes.
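A minimal sketch makes this protocol concrete: a batch of summaries is scored by the trading decisions it induces rather than by any linguistic metric. The names below (`evaluate_decisions`, `decide`, the long/flat/short position encoding) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def evaluate_decisions(summaries, decide, returns):
    """Score generated text by the trading outcomes it induces.

    `decide` is any agent (human or LLM) mapping a summary to a position;
    `returns` holds the realised asset return for each period.
    """
    # Each decision maps a summary to a position: +1 long, 0 flat, -1 short.
    positions = np.array([decide(s) for s in summaries])
    # Per-period profit and loss: position taken times the realised return.
    pnl = positions * np.asarray(returns, dtype=float)
    # The text is scored by the cumulative return it produced, not its fluency.
    return pnl, float(pnl.sum())
```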
Results demonstrate that neither human investors nor LLM agents consistently achieve superior performance when relying solely on concise summaries, suggesting that simple distillation of market information does not guarantee improved outcomes. The study highlights the potential for NLG to augment human capabilities and enhance decision-making processes, rather than replacing human expertise. This suggests that the value of NLG lies not in automating investment decisions, but in providing investors with better information to inform their own judgements.
The findings underscore the importance of evaluating NLG not by its inherent linguistic qualities, but by its ability to facilitate synergistic decision-making between humans and LLMs. This advocates for a shift towards outcome-based evaluation frameworks that directly measure the impact of generated text on complex tasks, such as financial investment. Such frameworks require defining clear metrics for success, such as portfolio returns or risk-adjusted performance, and assessing how NLG influences these metrics.
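Two such metrics could look as follows in a minimal sketch, applied to the per-period strategy returns from an evaluation run; the 252-trading-day annualisation convention is an assumption, not a detail taken from the paper.

```python
import numpy as np

def cumulative_return(pnl):
    # Compound the per-period strategy returns into a total portfolio return.
    return float(np.prod(1.0 + np.asarray(pnl, dtype=float)) - 1.0)

def sharpe_ratio(pnl, periods_per_year=252):
    # Risk-adjusted performance: mean return over volatility, annualised.
    pnl = np.asarray(pnl, dtype=float)
    vol = pnl.std(ddof=1)
    if vol == 0:
        return 0.0
    return float(pnl.mean() / vol * np.sqrt(periods_per_year))
```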
The study actively compares GPT-4o and Gemini-2.0-Flash, demonstrating that the quality of generated text directly influences decision-making and justifying the focus on GPT-4o for detailed human evaluation. GPT-4o consistently produces summaries that lead to better investment choices than those generated by Gemini-2.0-Flash, supporting the notion that advancements in LLM capabilities translate to tangible improvements in the quality of information provided to investors. This suggests that investing in more sophisticated LLMs can yield measurable benefits in financial applications.
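A comparison of this kind can be expressed as a small harness that holds the decision agent and market data fixed while only the generator varies, so any performance gap is attributable to summary quality. All names here (`compare_generators`, `summarise_with_gpt4o`, `summarise_with_gemini`, `llm_agent`) are hypothetical placeholders, not the study's code.

```python
def compare_generators(articles, returns, generators, decide):
    """Run each generator through the same decision pipeline and
    compare the cumulative outcome its summaries induce."""
    scores = {}
    for name, generate in generators.items():
        summaries = [generate(article) for article in articles]
        positions = [decide(s) for s in summaries]
        # Identical agent and market data: the only varying factor
        # is the generated text itself.
        scores[name] = sum(p * r for p, r in zip(positions, returns))
    return scores

# e.g. compare_generators(articles, daily_returns,
#          {"gpt-4o": summarise_with_gpt4o,
#           "gemini-2.0-flash": summarise_with_gemini},
#          decide=llm_agent)
```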
Future work should investigate the specific characteristics of analytical commentaries that drive improved collaborative performance and identify the key elements of effective financial narratives, such as nuanced risk assessments or insightful market interpretations. Optimising NLG systems requires a deeper understanding of what constitutes high-quality analytical content and how it can best be presented to human investors. Further research could also explore the impact of different user interfaces and interaction paradigms on the effectiveness of human-LLM collaboration.
Expanding the scope of evaluation to encompass a wider range of financial instruments and market conditions will strengthen the generalisability of the findings. Assessing NLG performance across different asset classes and economic cycles will provide a more comprehensive understanding of its capabilities and limitations. Additionally, investigating the potential for NLG to personalise investment recommendations based on individual investor profiles could unlock new opportunities for enhancing financial outcomes.
The study establishes a clear path forward for evaluating NLG in complex decision-making contexts and advocates for a shift towards outcome-based metrics that directly measure real-world impact. By focusing on how NLG can augment human capabilities and improve financial outcomes, researchers can unlock the full potential of this technology and create more effective and intelligent financial systems. This research provides valuable insights for developers, investors, and policymakers alike, paving the way for a future where NLG plays a central role in shaping the financial landscape.
👉 More information
🗞 Decision-oriented Text Evaluation
🧠 DOI: https://doi.org/10.48550/arXiv.2507.01923
