Predicting stock movements is a significant challenge for even the most sophisticated analytical tools, and recent advances in large language models (LLMs) have yet to fully address this complex financial task. Xueyuan Lin, Cehao Yang, and colleagues from The Hong Kong University of Science and Technology (Guangzhou), together with Ye Ma, Ming Li, and Rongjunchen Zhang from Hithink RoyalFlush Information Network Co., Ltd, and Yang Ni, now demonstrate a method that substantially improves LLM performance in this area. Their work reveals that existing LLMs often mimic analyst opinions rather than applying independent logical reasoning, and struggle to weigh conflicting evidence effectively, hindering accurate prediction. To overcome these limitations, the team introduces Reflective Evidence Tuning (RETuning), a technique that encourages LLMs to construct a robust analytical framework, systematically evaluate evidence, and derive predictions from logical reasoning rather than contextual biases. The approach, validated on a newly created large-scale dataset covering all of 2024 for over five thousand stocks, unlocks the reasoning potential of LLMs in finance and maintains reliable performance even as market conditions evolve and on previously unseen stocks.
Financial Prediction with Reasoning LLMs
This study presents a compelling comparison of two large language models, DeepSeek-14B and DeepSeek-14B-SFT, as they tackle the complex task of financial prediction. The models were challenged to predict the opening price change for a specific stock, requiring them to understand market dynamics, technical analysis, and current events. Crucially, the models were also expected to clearly articulate their reasoning process, detailing the data analysis and evidence scoring that led to their predictions. The research reveals key differences in how these models approach the task, with the SFT version prioritizing clarity and conciseness.
Both models provide detailed analyses covering macroeconomic factors, company fundamentals, technical indicators, news events, evidence scoring, and risk assessment. However, DeepSeek-14B-SFT delivers a more concise and focused prediction, streamlining the information so it is easier to understand and act on. The study addresses the tendency of these models to mimic analyst opinions rather than independently analyzing information and critically evaluating conflicting evidence. RETuning counters this by encouraging the dynamic construction of an analytical framework, prompting the model to organize and score evidence for potential price increases or decreases and to derive its prediction through reflective analysis.
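As a rough illustration of the evidence-scoring idea described above (not the authors' implementation; the field names, score range, and aggregation rule are assumptions), a prediction could be derived by sorting evidence into bullish and bearish buckets, scoring each item, and comparing the totals:

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    """A single piece of evidence drawn from the context (news item, indicator, analyst note)."""
    summary: str
    direction: str  # "up" or "down"
    score: float    # strength assigned during reflective analysis, e.g. in [0, 1]

@dataclass
class AnalyticalFramework:
    """Schematic container for dynamically organized evidence."""
    bullish: list[Evidence] = field(default_factory=list)
    bearish: list[Evidence] = field(default_factory=list)

    def add(self, ev: Evidence) -> None:
        (self.bullish if ev.direction == "up" else self.bearish).append(ev)

    def predict(self) -> str:
        """Weigh both sides of the evidence instead of echoing a single analyst view."""
        up = sum(e.score for e in self.bullish)
        down = sum(e.score for e in self.bearish)
        if abs(up - down) < 0.1:  # nearly balanced evidence: no confident call
            return "hold"
        return "rise" if up > down else "fall"

# Usage: conflicting evidence is scored and aggregated into a final prediction.
fw = AnalyticalFramework()
fw.add(Evidence("Earnings beat consensus", "up", 0.8))
fw.add(Evidence("Sector-wide macro headwinds", "down", 0.5))
print(fw.predict())  # "rise"
```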
Experiments demonstrate that RETuning unlocks the reasoning ability of the language model within the financial domain and serves as an effective cold-start method: rather than relying on contextual biases, the model is guided to construct an analytical framework and to dynamically organize and score evidence for potential price increases or decreases. To support this, the researchers built Fin-2024, a large-scale dataset covering all of 2024 for 5,123 stocks and totaling 209,063 samples with long contexts of up to 32,000 tokens; it integrates six key information sources, overcoming the limited diversity of prior datasets.
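For concreteness, one Fin-2024 sample might bundle the per-stock context roughly as sketched below; the paper's exact schema and its six information sources are not enumerated in this summary, so the field names (drawn from the sources the article mentions) are illustrative assumptions only:

```python
from dataclasses import dataclass

@dataclass
class StockSample:
    """Hypothetical layout of one long-context sample (up to ~32k tokens)."""
    ticker: str
    date: str                  # trading day in 2024
    price_history: str         # recent price/volume series rendered as text
    technical_indicators: str  # e.g. moving averages, momentum signals
    fundamentals: str          # company financial statements
    news: str                  # recent headlines and articles
    analyst_opinions: str      # published analyst views
    macro_indicators: str      # macroeconomic context
    label: str                 # realized opening price movement, e.g. "rise" / "fall"
```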
Experiments demonstrate that RETuning effectively unlocks prediction ability, improving performance over strong baseline models. Furthermore, the research shows that RETuning benefits substantially from inference-time scaling, with predictions improving as additional reasoning is sampled at inference. The method also generalizes beyond stock movement prediction, yielding improvements in other financial tasks and demonstrating robust performance on out-of-distribution stocks. These results lay the groundwork for deploying trustworthy, reasoning-driven LLMs in financial applications.
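Inference-time scaling is commonly realized by sampling several independent reasoning paths and aggregating their answers; the self-consistency-style sketch below (with a generic `generate` callable standing in for the LLM, not the authors' setup) shows the idea:

```python
from collections import Counter
from typing import Callable

def predict_with_scaling(generate: Callable[[str], str], prompt: str, n_samples: int = 8) -> str:
    """Sample several reasoning paths and return the majority-vote prediction.

    `generate` maps a prompt to a single movement call ("rise"/"fall"/"hold"),
    e.g. one LLM generation with temperature > 0; larger n_samples trades
    extra compute for more stable predictions.
    """
    votes = Counter(generate(prompt) for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```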
Reflective Tuning Improves Stock Prediction Accuracy
This research demonstrates a significant advancement in applying large language models to financial forecasting. Scientists discovered that existing models tend to mimic analysts’ opinions rather than independently analyzing information, and struggle to weigh conflicting evidence effectively. The researchers also constructed a comprehensive dataset encompassing all of 2024 for over five thousand stocks, incorporating diverse data sources including price history, news, analyst opinions, and macroeconomic indicators. While acknowledging that financial prediction remains challenging, the team highlights the model’s ability to assess sample difficulty, suggesting potential for more efficient training strategies.
👉 More information
🗞 RETuning: Upgrading Inference-Time Scaling for Stock Movement Prediction with Large Language Models
🧠 ArXiv: https://arxiv.org/abs/2510.21604
