Researchers are tackling the persistent challenge of structural accuracy in Vision-Language Models (VLMs) when generating code from chart images. Minggui He and Mingchen Dai from the University of Science and Technology of China, working with Jian Zhang, Osamu Yoshie, and Yuya Ieiri from Waseda University, Japan, and Yilun Liu and Shimin Tao from Nankai University, China, present a novel approach called Chart Specification. This method introduces a structured intermediate representation designed to move training beyond simple text imitation towards semantically grounded supervision of chart structure. By filtering syntactic noise and employing a ‘Spec-Align Reward’ for verifiable feedback, the team demonstrates significantly improved performance on three public benchmarks, achieving gains of up to 61.7% over existing methods with strong data efficiency, and establishing new state-of-the-art results. This work highlights precise structural supervision as a key pathway to high-fidelity chart-to-code translation.
Scientists have developed a new method for converting chart images into executable plotting code with unprecedented fidelity. Accurate reconstruction of visual data from charts has proven difficult for existing vision-language models, which often imitate code tokens rather than understand the underlying chart structure, frequently producing outputs with inconsistencies or entirely fabricated data. The research introduces Chart Specification, a structured intermediate representation that refocuses training on semantically grounded supervision rather than simple text imitation. Chart Specification filters out irrelevant syntactic details to create a more balanced training dataset and incorporates a ‘Spec-Align Reward’ system. This reward provides detailed, verifiable feedback on structural accuracy, allowing reinforcement learning to enforce consistent plotting logic. Experiments on three established benchmarks show consistent outperformance of previous approaches. Notably, the method is highly data efficient, surpassing leading baselines by up to 61.7% on complex charts using only 3,000 training samples; scaling the training data to 4,000 samples establishes new state-of-the-art results across all evaluated metrics.

Charts are a ubiquitous means of communicating quantitative information, appearing in scientific publications, business dashboards, and public reports. Automatically interpreting and repurposing these visualizations is therefore a critical capability for advanced document processing systems. The task of chart-to-code generation, producing executable plotting code from a static chart image, requires models to recover layout structure, data mappings, and numerical relationships. The core issue lies in the mismatch between how charts encode information and how plotting code is structured.
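To make the task concrete, here is a minimal sketch of our own (not an example from the paper): a model must emit executable plotting code, such as the string below, after recovering the chart type, data values, and axis labels from an image. The chart contents and field values are invented for illustration. The final check also illustrates the coarse "binary execution feedback" signal that the work argues is too blunt on its own, since a script can compile while plotting the wrong structure.

```python
# Hypothetical recovered structure for a simple bar chart
# (categories, values, and labels are illustrative only).
categories = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 135, 150, 142]

# The kind of target output a chart-to-code model must generate:
generated_code = f"""\
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar({categories!r}, {revenue!r}, color="steelblue")
ax.set_xlabel("Quarter")
ax.set_ylabel("Revenue (millions)")
ax.set_title("Quarterly Revenue")
fig.savefig("reconstructed_chart.png")
"""

# Binary execution-style feedback: does the generated code even parse?
# This passes or fails as a whole and says nothing about whether the
# plotted structure matches the source chart.
compiles = True
try:
    compile(generated_code, "<generated>", "exec")
except SyntaxError:
    compiles = False
print(compiles)
```

A syntactically valid script earns full marks from this check even if it draws a line chart instead of bars, which is precisely the gap a structural reward is meant to close.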
Charts present continuous, spatially organised visual data, while code is discrete, symbolic, and procedural. Existing methods struggle to bridge this gap, often prioritising syntactic correctness over visual or structural accuracy, which leads to structural hallucinations and a lack of global structural awareness. The new work addresses these shortcomings with Chart Specification, a representation that explicitly encodes the structural logic of a chart, abstracting essential visual factors such as layout, coordinate systems, and data bindings. This allows supervision signals to be transmitted more effectively across modalities and supports a more robust and accurate conversion process.

The structured intermediate representation underpins the whole methodology. The approach deliberately moves away from direct text imitation and focuses instead on semantically grounded supervision of chart structure. A key step is filtering syntactic noise from the training data to create a structurally balanced dataset, reducing ambiguity during training. This curated dataset enables the Spec-Align Reward, a mechanism that provides fine-grained, verifiable feedback on the structural correctness of generated code. Reinforcement learning, guided by the Spec-Align Reward, then enforces consistent plotting logic and improves the fidelity of the generated code. The reward assesses whether the generated code accurately reflects the underlying chart structure, in contrast with purely supervised methods, which can be overly sensitive to individual training examples. Experiments on three publicly available benchmarks achieve competitive results with only 3,000 training samples, and scaling to 4,000 samples establishes state-of-the-art performance across all evaluated metrics.
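The paper does not publish its exact schema, but the kind of structured intermediate representation described above can be sketched as a plain data object that records chart type, axes, and data bindings while ignoring code-level syntax. All class and field names below (`ChartSpec`, `AxisSpec`, and so on) are our own assumptions, not the paper's actual specification:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisSpec:
    """Hypothetical axis description: label plus scale type."""
    label: str
    scale: str = "linear"  # e.g. "linear" or "log"


@dataclass(frozen=True)
class ChartSpec:
    """Hypothetical chart specification: structure, not code syntax."""
    chart_type: str  # e.g. "bar", "line", "scatter"
    x_axis: AxisSpec
    y_axis: AxisSpec
    series: tuple    # ((series_name, ((x, y), ...)), ...)


# Two syntactically different plotting scripts that draw the same bar
# chart should reduce to the same (equal) specification:
spec_a = ChartSpec(
    chart_type="bar",
    x_axis=AxisSpec(label="Quarter"),
    y_axis=AxisSpec(label="Revenue"),
    series=(("revenue", (("Q1", 120), ("Q2", 135))),),
)
spec_b = ChartSpec(
    chart_type="bar",
    x_axis=AxisSpec(label="Quarter"),
    y_axis=AxisSpec(label="Revenue"),
    series=(("revenue", (("Q1", 120), ("Q2", 135))),),
)
print(spec_a == spec_b)  # equal specs despite any code-level differences
```

Because equality is defined over structural fields rather than source text, a renamed variable or a reordered call in the plotting script leaves the specification unchanged, which is the invariance property the representation is designed around.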
The resulting code and dataset have been made publicly available to encourage further research and facilitate reproducibility. A 61.7% performance gain on complex benchmarks with only 3,000 training samples demonstrates both the effectiveness and the data efficiency of the Chart Specification approach, and further scaling to 4,000 samples produced new state-of-the-art results across all evaluated metrics, indicating a positive correlation between data volume and performance. The research thereby establishes a pathway to high-fidelity chart-to-code generation through precise structural supervision.

Chart Specification abstracts essential visual factors from chart images, encompassing layout composition, coordinate systems, data bindings, and functional relationships, while remaining invariant to syntactic differences in plotting code. By filtering syntactic noise, the study constructed ChartStruct, a structurally balanced dataset designed to address the long-tail distribution of chart types and generative patterns. The Spec-Align Reward provides fine-grained, verifiable feedback on structural correctness, enabling reinforcement learning to enforce consistent plotting logic and addressing the limitations of binary execution feedback and unreliable pixel-level comparisons. The research demonstrates that focusing on structural fidelity, rather than token imitation, yields significant improvements in chart-to-code generation. Scientists are increasingly focused on enabling machines to ‘read’ and interpret visual information, and this work represents a step forward in that endeavour.
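The fine-grained, verifiable feedback described above can be illustrated with a toy scoring function of our own devising; the paper's actual Spec-Align Reward is not reproduced here, and the field names are assumptions. Unlike binary execution feedback, a field-by-field comparison against a reference specification yields partial credit that pinpoints which structural elements are wrong:

```python
def spec_align_reward(pred: dict, ref: dict) -> float:
    """Toy structural-alignment reward (illustrative, not the paper's):
    fraction of structural fields on which the predicted specification
    matches the reference specification."""
    fields = ["chart_type", "x_label", "y_label", "series"]
    matches = [pred.get(f) == ref.get(f) for f in fields]
    return sum(matches) / len(fields)


ref = {"chart_type": "bar", "x_label": "Quarter",
       "y_label": "Revenue", "series": (("Q1", 120), ("Q2", 135))}

# Prediction with one structural error: the y-axis label is wrong.
pred = {"chart_type": "bar", "x_label": "Quarter",
        "y_label": "Sales", "series": (("Q1", 120), ("Q2", 135))}

print(spec_align_reward(pred, ref))  # 0.75: three of four fields align
```

A graded signal like this gives a reinforcement-learning policy a usable gradient between "runs" and "perfect", whereas binary execution feedback scores both predictions above identically as long as the code executes.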
Previous attempts to translate charts and graphs into code often prioritised mimicking the surface form of plotting code rather than understanding the underlying data and its relationships, leading to inaccurate outputs. By introducing a structured intermediate representation, the researchers have shifted the focus from superficial imitation to genuine comprehension: the Chart Specification acts as a filter, removing noise and ensuring the model concentrates on the core structural elements. The resulting improvements in accuracy, even with a relatively small training dataset, are substantial. However, the reliance on pre-defined chart types remains a limitation. Real-world data rarely conforms neatly to established templates, and handling novel or complex visualisations will require further refinement. Moreover, while the model excels at reconstructing charts as code, it does not yet offer any capacity for critical analysis or inference. Looking ahead, this technique could be integrated into accessibility tools, recovering the underlying data of charts for visually impaired users. Beyond that, the principles of ‘structural supervision’ could be applied to other areas of visual data analysis, from medical imaging to satellite imagery, paving the way for more reliable and trustworthy artificial intelligence.
👉 More information
🗞 Chart Specification: Structural Representations for Incentivizing VLM Reasoning in Chart-to-Code Generation
🧠 ArXiv: https://arxiv.org/abs/2602.10880
