Researchers are tackling the complex problem of enabling robots to navigate human spaces not just safely, but also politely. Zilin Fang, Anxing Xiao, and David Hsu, from the School of Computing at the National University of Singapore, alongside Gim Hee Lee, present a novel framework that combines traditional path planning with contextual social reasoning. Their work represents a significant advance because it moves beyond simple obstacle avoidance, allowing robots to anticipate and adhere to unwritten social rules during navigation. By integrating a vision-language model (VLM) to evaluate potential paths based on social expectations, the team has developed a system capable of real-time adaptation and demonstrably improved performance in diverse human-robot interaction scenarios, achieving reduced personal space violations and more natural interactions.
This work addresses the limitations of traditional collision-avoidance systems by incorporating an understanding of social norms and ongoing human activities.
The system generates geometrically feasible paths and then employs a fine-tuned vision-language model to evaluate these options, selecting a route optimised for social acceptability. A key innovation lies in distilling social reasoning from large foundation models into a smaller, more efficient model, facilitating real-time adaptation during human-robot interactions.
Experiments in four distinct social navigation contexts show that the method outperforms the compared baselines. It achieved the lowest personal space violation duration, meaning the robot spent the least time inside pedestrians' comfort zones. It also recorded minimal pedestrian-facing time, meaning the robot rarely pointed directly at people, and no social zone intrusions.
This framework formulates social robot navigation as a multi-objective optimisation problem, balancing geometric feasibility with in-context semantic understanding. The architecture adopts a hierarchical approach, featuring asynchronous modules for path planning, socially compliant selection, and safe reactive control.
Candidate paths are initially sampled based on geometric constraints before the vision-language model assesses their social appropriateness in a receding-horizon fashion. This pipeline distills complex social reasoning into a compact model, allowing for rapid inference and real-time responsiveness. Real-world testing on a Boston Dynamics Spot legged robot confirms that this approach surpasses representative baseline methods, including group-based, reinforcement learning-based, and other vision-language model-based techniques, as well as a foundation model for visual navigation.
Ablation studies validate the system’s ability to generate both collision-free and socially compliant paths, a capability not consistently guaranteed by direct vision-language model path prediction. The research quantifies social performance using interruption-related metrics, reducing reliance on subjective assessments and enabling a more objective evaluation of social compliance in dynamic environments.
Social path selection via geometrically-informed vision-language modelling
A hierarchical architecture underpins the social robot navigation framework, integrating geometric planning with contextual social reasoning. Candidate paths are initially sampled subject to geometric constraints imposed by obstacles and detected humans. Subsequently, a fine-tuned vision-language model (VLM) evaluates these geometrically feasible paths, leveraging contextually grounded social expectations to select a socially optimised path for the controller.
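The sample-then-select idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Circle` type, the `is_feasible` check, and the `social_score` callable (a stand-in for the fine-tuned VLM) are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Path = List[Point]


@dataclass
class Circle:
    """A static obstacle or detected human, approximated as a disc (assumption)."""
    center: Point
    radius: float


def is_feasible(path: Sequence[Point], discs: Sequence[Circle]) -> bool:
    """Return True if no waypoint lies inside any disc.

    Only waypoints are checked, not the segments between them, so a real
    planner would need a denser check.
    """
    for x, y in path:
        for d in discs:
            cx, cy = d.center
            if (x - cx) ** 2 + (y - cy) ** 2 < d.radius ** 2:
                return False
    return True


def select_path(
    candidates: Sequence[Path],
    discs: Sequence[Circle],
    social_score: Callable[[Path], float],
) -> Optional[Path]:
    """Stage 1: drop infeasible paths. Stage 2: keep the best-scoring one.

    `social_score` is a placeholder for the VLM-based evaluator; higher
    means more socially acceptable.
    """
    feasible = [p for p in candidates if is_feasible(p, discs)]
    if not feasible:
        return None
    return max(feasible, key=social_score)
```

The geometric filter runs before the scorer, so the (potentially slow) social evaluation only ever sees paths that are already collision-free.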
This task-specific VLM distills social reasoning from large foundation models into a smaller, more efficient model, facilitating real-time adaptation within diverse human-robot interaction contexts. The system employs a receding-horizon approach, where the selected path is fed back to the path planning module as a reference for generating new trajectories.
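A receding-horizon loop of this kind might look like the sketch below. It is a simplified, synchronous illustration under stated assumptions: the article describes asynchronous modules, and the `plan` and `select` callables are hypothetical stand-ins for the path planner and the VLM-based selector.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Path = List[Point]


def receding_horizon(
    start: Point,
    steps: int,
    plan: Callable[[Point, Optional[Path]], Sequence[Path]],
    select: Callable[[Sequence[Path]], Optional[Path]],
) -> List[Point]:
    """Replan at every step, executing only the first move of the chosen path.

    The remainder of the chosen path is handed back to `plan` as a
    reference for the next round of candidate generation.
    """
    position = start
    reference: Optional[Path] = None
    visited = [start]
    for _ in range(steps):
        candidates = plan(position, reference)
        chosen = select(candidates)
        if chosen is None or len(chosen) < 2:
            break  # nothing usable to follow; stop rather than guess
        position = chosen[1]    # execute one waypoint only
        reference = chosen[1:]  # feed the rest back as the new reference
        visited.append(position)
    return visited
```

Executing only one waypoint per cycle means every later waypoint is re-evaluated against the newest observations before the robot commits to it.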
A modified version of the ORCA (Optimal Reciprocal Collision Avoidance) algorithm then handles pedestrian avoidance during execution. The pipeline relies on a decomposable objective: it assumes that the socially optimal path lies within the set of geometrically optimal candidates, so the geometry-aware path planning and socially compliant path selection modules can run asynchronously.
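One way to write this decomposition is shown below. The notation is an assumption made here for illustration and is not taken from the paper: $\mathcal{P}$ is the set of collision-free paths, $J_g$ a geometric cost, $J_s$ a social cost supplied by the vision-language model, and $\epsilon \ge 0$ a hypothetical tolerance that defines which paths count as geometrically acceptable.

```latex
\mathcal{P}_g = \Bigl\{\, p \in \mathcal{P} \;:\; J_g(p) \le \min_{q \in \mathcal{P}} J_g(q) + \epsilon \,\Bigr\},
\qquad
p^{\star} = \arg\min_{p \in \mathcal{P}_g} J_s(p)
```

Under this reading, the planner only needs to produce $\mathcal{P}_g$ and the selector only needs to rank its members, which is what allows the two stages to run at different rates.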
Experiments were conducted using a Boston Dynamics Spot legged robot in four distinct social navigation contexts, designed to reflect common patterns of social behaviour and minimise subjective evaluation. Performance was quantified using interruption-related metrics, specifically measuring personal space violation duration, pedestrian-facing time, and instances of social zone intrusions.
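As a rough illustration of how two of these metrics could be computed from logged trajectories, consider the sketch below. The definitions are assumptions, not the paper's: the personal-space `radius`, the facing `max_angle` and `max_range` thresholds, and the function names are hypothetical, and "facing" is taken to mean that the robot's heading points at a nearby pedestrian.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def personal_space_violation_duration(
    robot: Sequence[Point],
    pedestrians: Sequence[Sequence[Point]],
    dt: float,
    radius: float = 1.2,  # assumed personal-space radius in metres
) -> float:
    """Total time the robot is closer than `radius` to any pedestrian."""
    violated_steps = 0
    for r, peds in zip(robot, pedestrians):
        if any(math.dist(r, p) < radius for p in peds):
            violated_steps += 1
    return violated_steps * dt


def pedestrian_facing_time(
    robot: Sequence[Point],
    headings: Sequence[float],  # robot yaw per timestep, in radians
    pedestrians: Sequence[Sequence[Point]],
    dt: float,
    max_angle: float = math.radians(15.0),  # assumed angular threshold
    max_range: float = 3.0,                 # assumed range threshold in metres
) -> float:
    """Total time the robot's heading points at a pedestrian within range."""
    facing_steps = 0
    for (rx, ry), heading, peds in zip(robot, headings, pedestrians):
        for px, py in peds:
            bearing = math.atan2(py - ry, px - rx)
            # Wrap the angular difference into [-pi, pi].
            diff = math.atan2(math.sin(bearing - heading),
                              math.cos(bearing - heading))
            if abs(diff) <= max_angle and math.hypot(px - rx, py - ry) <= max_range:
                facing_steps += 1
                break  # count each timestep at most once
    return facing_steps * dt
```

Both functions count timesteps and multiply by the sampling interval `dt`, so they assume trajectories logged at a fixed rate.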
Results demonstrate that the proposed method achieves the best overall performance, exhibiting the lowest personal space violation duration, minimal pedestrian-facing time, and zero social zone intrusions. The work acknowledges that while social behaviour is multimodal and shaped by individual preferences, many situations follow common patterns that minimise disruption.
Collision avoidance and social compliance in robot navigation using vision-language modelling
Researchers developed a social robot navigation framework achieving collision-free movement without social zone intrusions. The system consistently demonstrated strong performance across all tested scenarios, maintaining low durations of personal space violation and time facing pedestrians. Quantitative comparisons reveal the method achieves the best overall performance, with no observed social zone intrusions during trials.
The work uses a two-stage strategy: it first samples paths that respect human motion and obstacle constraints, then reasons about social context with a fine-tuned vision-language model. This design enables dynamic adaptation to environmental changes and ensures compliance with social norms, generalising effectively across diverse contexts.
Predictive future path referencing mitigates any latency induced by the vision-language model, further enhancing performance. In the Walking-talk scenario, the VLM-Social-Nav baseline blocked a pedestrian by stopping and rotating directly in front of them, while the proposed method navigated successfully.
The Queuing scenario saw VLM-Social-Nav repeatedly turning right before becoming stuck, whereas the presented system maintained a clear path. GSON performed well in static scenarios but struggled with dynamic interactions, reacting too late to moving agents. Ablation studies examined the impact of encoding temporal information and prediction horizon.
Using only the shortest visible path segment yielded selections closely matching the full system, validating the planning approach. Analysis of prediction horizon across ten scenes showed that longer horizons reduced the generation of socially inappropriate paths, decreasing interruptions by 39.17% compared to shorter horizons. These results demonstrate the effectiveness of incorporating predictive motion data into the navigation process.
Socially compliant navigation via geometric planning and contextual understanding
A novel social robot navigation framework integrates geometric planning with contextual social reasoning to enable efficient and compliant movement in human environments. The system extracts both static obstacles and dynamic human behaviour, generating geometrically feasible paths before employing a vision-language model to evaluate these options.
This model, refined for task-specific social understanding, assesses paths based on contextual social expectations, ultimately selecting the most appropriate trajectory for the robot. Experiments conducted across four distinct social navigation scenarios demonstrate superior performance, characterised by reduced personal space violations, minimal time spent facing pedestrians, and the complete avoidance of social zone intrusions.
This approach facilitates collision-free, socially aware, and efficient robot navigation in varied human-centred settings by fusing motion prediction with geometric constraints and interpreting social norms via a specialised language model. Validation through ablation studies confirms the effectiveness of the framework’s design choices.
The authors acknowledge limitations including the model’s reliance on single-frame analysis, potentially missing extended social context, difficulties with ambiguous or simultaneous human actions, and the possibility of excluding optimal paths during post-processing. Future work could address these issues by incorporating memory mechanisms to track longer-term social interactions, improving the model’s ability to disambiguate complex scenarios, and refining the path selection process to balance safety with optimality.
👉 More information
🗞 From Obstacles to Etiquette: Robot Social Navigation with VLM-Informed Path Selection
🧠 ArXiv: https://arxiv.org/abs/2602.09002
