Researchers are tackling the notoriously difficult electric vehicle routing problem with time windows (EVRPTW), a key challenge in sustainable logistics. Mertcan Daysalilar (University of Miami), Fuat Uyguroglu (Cyprus International University), Gabriel Nicolosi (Missouri University of Science and Technology), and colleagues present a novel curriculum-based deep reinforcement learning framework that improves both the speed and reliability of solutions. Existing deep reinforcement learning models often falter under this problem's complex constraints, but the new approach uses a phased learning system that gradually increases difficulty, ensuring stable training and impressive generalisation: the model handles problems with up to 100 customers despite being trained on much smaller instances. This work offers a significant step towards practical, efficient, and dependable electric vehicle routing in real-world applications.
The research team designed a structured three-phase curriculum that progressively increases problem complexity, allowing the agent to first master distance and fleet optimization, then battery management, and finally the complete EVRPTW scenario. This staged approach circumvents the sparse reward signals that typically plague end-to-end DRL models, fostering stable learning and preventing policy collapse.
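The staged progression described above can be sketched as a simple training schedule. The `Phase` structure, the constraint flags, and the fixed 100-epoch stage length below are illustrative assumptions, not details taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class Phase:
    """One stage of the training curriculum (hypothetical structure)."""
    name: str
    battery_constraints: bool  # enforce energy feasibility?
    time_windows: bool         # enforce customer time windows?

# Phase A: distance and fleet-size optimisation only.
# Phase B: adds battery management.
# Phase C: the full EVRPTW with all constraints active.
CURRICULUM = [
    Phase("A", battery_constraints=False, time_windows=False),
    Phase("B", battery_constraints=True,  time_windows=False),
    Phase("C", battery_constraints=True,  time_windows=True),
]

def active_phase(epoch: int, phase_len: int = 100) -> Phase:
    """Advance through the curriculum in fixed-length stages."""
    return CURRICULUM[min(epoch // phase_len, len(CURRICULUM) - 1)]
```

Because each phase only switches constraints on, earlier skills (short routes, small fleets) remain useful when the reward landscape tightens in later phases.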
To ensure consistent learning across each phase, the team implemented a modified proximal policy optimization algorithm, carefully tuning hyperparameters, employing value and advantage clipping, and utilizing adaptive learning-rate scheduling. The core of the model is a heterogeneous attention encoder, enhanced with both global-local attention mechanisms and feature-wise linear modulation. This specialized architecture is designed to explicitly capture the distinct characteristics of depots, customers, and, crucially, charging stations, allowing the agent to make routing decisions that account for energy constraints. Trained initially on small instances with only N=10 customers, the model demonstrated remarkable generalization, successfully handling unseen instances ranging from N=5 to N=100.
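The clipped objectives mentioned here can be sketched in a few lines with NumPy. The function name, the epsilon values, and the particular value-clipping form (the widely used clipped value loss) are assumptions for illustration, not the authors' exact formulation:

```python
import numpy as np

def ppo_losses(ratio, advantage, value, value_old, value_target,
               eps_policy=0.2, eps_value=0.2):
    """Clipped PPO policy and value losses (illustrative sketch).

    ratio        -- pi_new(a|s) / pi_old(a|s) per sample
    advantage    -- estimated advantage
    value        -- current value prediction
    value_old    -- value prediction recorded at rollout time
    value_target -- bootstrapped return target
    """
    # Policy loss: pessimistic (clipped) surrogate objective.
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps_policy, 1 + eps_policy) * advantage
    policy_loss = -np.minimum(unclipped, clipped).mean()

    # Clipped value loss: the update may not move the prediction
    # more than eps_value away from the rollout-time estimate.
    v_clipped = value_old + np.clip(value - value_old, -eps_value, eps_value)
    value_loss = np.maximum((value - value_target) ** 2,
                            (v_clipped - value_target) ** 2).mean()
    return policy_loss, value_loss
```

Phase-specific hyperparameters would simply mean choosing `eps_policy`, `eps_value`, and the learning rate per curriculum phase rather than globally.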
Experiments reveal that this curriculum-guided approach achieves high feasibility rates and competitive solution quality on out-of-distribution instances, significantly surpassing standard DRL baselines that often fail under dense constraints. The team’s work effectively bridges the gap between the speed of neural networks and the operational reliability demanded in real-world logistics. By decomposing the problem into manageable phases, the CB-DRL framework enables the agent to learn a robust policy capable of navigating the complexities of the EVRPTW, offering a promising solution for sustainable and efficient delivery operations. This innovation has the potential to significantly improve the planning and execution of electric vehicle fleets in dynamic, real-time environments.
The study establishes a clear pathway for applying deep reinforcement learning to complex combinatorial optimization problems, particularly those with stringent constraints. Researchers trained the model exclusively on small instances with N=10 customers, yet it exhibited robust generalization to unseen instances ranging from N=5 to N=100, demonstrating a significant improvement over standard baseline methods on medium-scale problems. Experimental results confirm that the curriculum-guided approach achieves high feasibility rates and competitive solution quality on out-of-distribution instances, effectively addressing the challenges of sparse rewards and unstable training that often hinder standard DRL applications. This framework offers a viable path toward achieving both speed and operational reliability in electric vehicle routing, paving the way for more sustainable and efficient logistics solutions.
Three-Phase Curriculum for EVRPTW Learning Delivers Progressive Skill Acquisition
Scientists developed a curriculum-based deep reinforcement learning (CB-DRL) framework to address instability in solving the electric vehicle routing problem with time windows (EVRPTW). The study pioneered a structured three-phase curriculum, progressively increasing problem complexity to enhance training stability and generalisation capabilities. Initially, the agent learns distance and fleet optimisation in Phase A, followed by battery management in Phase B, and culminating in the full EVRPTW in Phase C. This phased approach tackles the challenge of dense constraints often encountered in complex routing problems.
To ensure stable learning across each phase, researchers employed a modified proximal policy optimization algorithm, carefully tuning hyperparameters for each stage. Value and advantage clipping were implemented alongside adaptive learning-rate scheduling, further refining the learning process. The policy network itself is built upon a heterogeneous attention encoder, enhanced by both global-local attention and feature-wise linear modulation, a crucial architectural innovation. This specialised design explicitly captures the distinct properties of depots, customers, and charging stations, allowing the model to differentiate their roles within the routing problem.
The team engineered a heterogeneous graph attention encoder to represent the EVRPTW as a graph, acknowledging the differing functions of each node type. Unlike standard attention models, this encoder uses separate query-projection parameters for each node type (W_Q^cust, W_Q^station, and W_Q^depot), enabling the model to learn distinct relational dynamics between nodes. For example, the distance between a customer and a charging station is weighted differently from the distance between two stations, reflecting its importance for feasibility. The resulting embeddings are then processed by a global-local attention edge encoder, which fuses local neighbourhood information with global routing context to aggregate features across spatial scales.
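The type-specific query projections can be illustrated with a single attention head. Sharing the key and value projections across node types, as well as the dimensions used, are assumptions made here for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension (illustrative)

# Separate query projections per node type, mirroring the paper's
# W_Q^cust / W_Q^station / W_Q^depot: each node type learns its own
# way of "asking" about other nodes.
W_Q = {t: rng.normal(scale=D ** -0.5, size=(D, D))
       for t in ("depot", "cust", "station")}
W_K = rng.normal(scale=D ** -0.5, size=(D, D))  # shared keys (assumption)
W_V = rng.normal(scale=D ** -0.5, size=(D, D))  # shared values (assumption)

def hetero_attention(x, node_types):
    """Single-head attention where each node's query depends on its type."""
    q = np.stack([x[i] @ W_Q[t] for i, t in enumerate(node_types)])
    k, v = x @ W_K, x @ W_V
    scores = q @ k.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

x = rng.normal(size=(5, D))  # embeddings for 5 nodes
types = ["depot", "cust", "cust", "station", "cust"]
out = hetero_attention(x, types)
```

With shared projections, a customer and a station at the same position would attend identically; the per-type W_Q matrices are what let the model treat a nearby station as feasibility-relevant rather than just geometrically close.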
Experiments employed instances with N=10 customers for training, demonstrating robust generalisation to unseen instances ranging from N=5 to N=100. The model significantly outperformed standard baseline methods on medium-scale problems, achieving high feasibility rates and competitive solution quality on out-of-distribution instances where conventional DRL approaches failed. This curriculum-guided approach effectively bridges the gap between computational speed and operational reliability, showcasing the power of structured learning in complex optimisation tasks.
Curriculum learning stabilises electric vehicle route planning
Scientists have developed a curriculum-based deep reinforcement learning (CB-DRL) framework to address instability issues in solving the electric vehicle routing problem with time windows (EVRPTW). The research team tackled the challenge of optimizing routes for electric vehicles while respecting customer time constraints, battery usage, and fleet size, a notoriously complex problem in sustainable logistics. The framework uses a structured three-phase curriculum that progressively increases problem complexity, beginning with distance and fleet optimization (Phase A), followed by battery management (Phase B), and culminating in the full EVRPTW (Phase C). The team measured significant improvements in training stability by employing a modified proximal policy optimization algorithm with phase-specific hyperparameters, value and advantage clipping, and adaptive learning-rate scheduling.
Results demonstrate that this approach effectively bridges the gap between the speed of neural networks and the operational reliability required for real-world applications. The policy network is built upon a heterogeneous attention encoder enhanced by global-local attention and feature-wise linear modulation, explicitly capturing the unique characteristics of depots, customers, and charging stations. Trained exclusively on small instances with N=10 customers, the model exhibited robust generalization to unseen instances ranging from N=5 to N=100. Data shows a substantial performance increase on medium-scale problems compared to standard baseline models.
Specifically, the curriculum-guided approach achieves high feasibility rates and competitive solution quality on out-of-distribution instances where conventional DRL baselines consistently fail. Measurements confirm that the CB-DRL framework successfully navigates the sparse reward signal inherent in the EVRPTW, avoiding the instability caused by frequent constraint violations, such as depleted batteries or missed deadlines, that plague standard end-to-end reinforcement learning models. The key idea is to disentangle the learning of routing topology from feasibility under complex constraints, allowing the agent to first learn feasible routes before optimizing delivery timing. Tests show that the three-phase curriculum enables the neural policy to achieve near-optimal performance and zero-shot generalization on benchmark instances. The objective function, defined as minimizing total travel distance plus fleet size weighted by a factor λ, was successfully optimized, demonstrating the framework's ability to balance cost and efficiency. This work establishes a foundation for more robust and scalable solutions to the EVRPTW, paving the way for more sustainable and efficient logistics operations.
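The λ-weighted objective described above is straightforward to write out. The function name, the depot convention (index 0), and Euclidean distances are illustrative assumptions; λ is the paper's weighting factor, with its value unspecified here:

```python
import math

def evrptw_objective(routes, coords, lam=1.0):
    """Total travel distance plus lam * fleet size (sketch).

    routes -- list of routes; each route is a sequence of node indices,
              starting and ending at the depot (index 0)
    coords -- {node_index: (x, y)} Euclidean coordinates
    lam    -- weighting factor trading distance against fleet size
    """
    dist = sum(
        math.dist(coords[a], coords[b])      # leg length between stops
        for route in routes
        for a, b in zip(route, route[1:])
    )
    return dist + lam * len(routes)          # one vehicle per route
```

A larger λ pushes the policy toward fewer vehicles at the cost of longer routes, which is the cost/efficiency balance the passage refers to.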
👉 More information
🗞 A Curriculum-Based Deep Reinforcement Learning Framework for the Electric Vehicle Routing Problem
🧠 ArXiv: https://arxiv.org/abs/2601.15038
