Researchers are tackling the challenge of enabling robots to reliably interpret and execute free-form human instructions for manipulation tasks. Archit Sharma, Dharmendra Sharma, and John Rebeiro, from the Indian Institute of Technology Mandi, alongside Peeyush Thakur et al., present Instruct2Act, a novel pipeline that translates natural language into sequenced robot actions. The work is significant because it achieves 91.5% sub-action prediction accuracy on a proprietary dataset and deterministic, real-time manipulation using a lightweight, fully on-device system that requires no cloud connectivity. Evaluations on a robotic platform across four common tasks demonstrate a 90% success rate, with sub-action inference completed in under 3.8 seconds, paving the way for practical robotic applications in resource-constrained environments.
Scientists have developed a system to translate high-level commands into reliable manipulation. Their approach comprises two stages: the instruction-to-actions module (Instruct2Act) and the robot action network (RAN). Instruct2Act is a compact BiLSTM with a multi-head-attention autoencoder that parses an instruction into an ordered sequence of atomic actions, such as reach, grasp, move, and place.
The robot action network (RAN) utilises the dynamic adaptive trajectory radial network (DATRN) alongside a vision-based environment analyser (YOLOv8) to generate precise control trajectories for each sub-action. The entire system runs on modest hardware without requiring cloud services. On a custom proprietary dataset, Instruct2Act achieves 91.5% sub-action prediction accuracy while retaining a small footprint.
Mapping Natural Language Instructions to Robotic Actions via Multimodal Perception
Scientists are increasingly deploying robots in real-world environments to assist humans with complex tasks, particularly in industrial and healthcare settings where precise object manipulation is essential. Human instructions can be conveyed to robots via different modalities, including speech, text, gestures, and demonstrations, with text-based instructions representing one of the simplest and most effective approaches.
Natural language processing enables robots to understand and respond to human language, offering flexibility and user-friendliness compared to fixed menus or rigid programming. Recent advances, such as large language models and vision language models, have significantly advanced human-robot interaction by enabling robots to grasp and react to free-form natural language commands.
Vision-language models integrate text and visual inputs to learn rich, multimodal representations, and recent vision-language action pipelines map free-form instructions directly to object-aware perception and trajectory generation, allowing robots to follow commands in real time. Implementing end-to-end vision-language action stacks with only an eye-in-hand camera configuration, however, remains challenging.
Relying on only an eye-in-hand camera in changing environments is brittle, as the arm and gripper frequently occlude key objects, viewpoints shift rapidly, and limited global context induces pose and frame ambiguity, degrading manipulation reliability. Lightweight vision-language action variants often rely on wide-angle multi-camera setups or large multimodal encoders with long context windows, introducing latency and increasing maintenance costs in real-world environments.
These issues indicate practical deployment challenges rather than flaws in the vision-language action methods, motivating the approach taken in this work. The primary objective is to develop a framework capable of executing sophisticated tasks without increasing system complexity. Researchers developed a methodology that converts free-form natural language commands into robot actions and executes them entirely on-device.
The result is a two-stage pipeline that integrates instruction-to-action prediction and robot action execution. They introduced Instruct2Act, a compact BiLSTM with a multi-head attention autoencoder, to decompose language commands into atomic sub-actions. This lightweight design aims to produce fine-grained action sequences with low computational overhead.
The prediction module is linked to a robot action network that creates precise, adaptive motion trajectories for each sub-action. The robot action network consists of a dynamic adaptive trajectory radial network, which generates object-aware motion paths and adjusts them in real time using the integrated depth camera and proprioception of the manipulator.
The researchers compared their dynamic adaptive trajectory radial network-based trajectory generation against dynamic movement primitives, aiming to overcome the latter's limitations, such as iterative hyperparameter searches, phase/gain scheduling, and long fitting times. The entire pipeline operates with low computational overhead, demonstrating its suitability for resource-constrained environments.
The reliability of this methodology was validated in controlled laboratory experiments and preliminary trials in real-world healthcare environments. This work demonstrates a practical pathway toward autonomous robots that can operate effectively in real-world settings without reliance on massive, off-board computational resources.
The practical motivation behind this work is to address the following challenges of robot manipulator execution: accuracy and precision, robustness with eye-in-hand sensing, and real-time processing. The primary contributions of this paper are: (1) a methodology that integrates a learning module and a robot action execution module, bridging the gap between natural-language parsing and precise robot control; (2) a custom instruction-to-action dataset and a lightweight Instruct2Act framework that lowers computational overhead while accurately extracting fine-grained sub-actions from human instructions; and (3) a robot action network that couples a dynamic adaptive trajectory radial network with a vision-based environment analyzer to generate smooth, precise manipulation, placement, and interaction within the robot’s workspace.
The proposed methodology consists of two modules: Instruct2Act, which translates spoken or written instructions into a sequence of sub-actions that form a task plan for the robot, and the robot action network, which converts these sub-actions into precise control commands. For spoken input, an offline speech-to-text model first transcribes the instruction into text, ensuring reliable operation without internet connectivity.
Table I details the sub-action vocabulary for robot task execution, including actions such as reach, grasp, lift, move, tilt, give, release, place, wipe, stir, and retract, along with their descriptions. Instruct2Act translates natural-language task descriptions into ordered sequences of sub-actions.
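To make the mapping concrete, the hypothetical pairs below illustrate how a free-form instruction could decompose into an ordered sub-action sequence drawn from this vocabulary; the actual instruction-action pairs are proprietary and are not reproduced here.

```python
# Hypothetical illustrations of instruction-to-sub-action decomposition.
# The action vocabulary follows Table I; the specific pairings are examples,
# not entries from the authors' proprietary dataset.
SUB_ACTION_VOCAB = ["reach", "grasp", "lift", "move", "tilt", "give",
                    "release", "place", "wipe", "stir", "retract"]

example_pairs = {
    "Pick up the red cup and place it on the tray":
        ["reach", "grasp", "lift", "move", "place", "retract"],
    "Pour the water from the bottle into the glass":
        ["reach", "grasp", "lift", "move", "tilt", "place", "retract"],
    "Wipe the table with the cloth":
        ["reach", "grasp", "lift", "move", "wipe", "place", "retract"],
}
```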
To capture richer contextual semantics, each instruction is first embedded using the BERT-large-uncased model, which is employed solely for task embedding extraction. Instruct2Act itself is built on a bidirectional LSTM with a multi-head attention autoencoder that maps these embeddings to sub-action sequences, improving the robot’s ability to understand and perform complex tasks.
The data preparation involved creating a proprietary dataset in English for fine-grained robotic manipulation comprising 2,850 natural-language instructions. Of these, 2,280 are used for model development with an 80/20 split (1,792 train, 448 validation), and a held-out test set of 570 unseen instructions is reserved to assess generalisation to novel phrasings and task compositions.
Each instruction is paired with a structured sequence of sub-actions and associated objects. The corpus spans multiple task types (pick and place, pick and pour, table cleaning, give, and compositional variants), at least ten distinct workspace objects, and diverse linguistic styles (synonyms, paraphrases, and multi-clause commands).
Task embedding extraction uses BERT to convert task descriptions into numerical representations that capture their semantic meaning. Specifically, the bert-large-uncased model is used to generate embeddings, tokenizing each task description into word pieces and processing them through its transformer layers to capture contextual significance.
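A minimal sketch of this embedding step with the Hugging Face Transformers library is shown below; mean pooling over the final hidden states is an assumption made here for illustration, as the paper specifies the model (bert-large-uncased) but not the pooling strategy.

```python
# Task-embedding extraction with bert-large-uncased (Hugging Face Transformers).
# Mean pooling over token states is assumed; the paper does not state the pooling.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
bert = AutoModel.from_pretrained("bert-large-uncased").eval()

def embed_task(description: str) -> torch.Tensor:
    # WordPiece tokenisation, then contextual encoding through BERT's layers
    inputs = tokenizer(description, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state   # (1, seq_len, 1024)
    return hidden.mean(dim=1).squeeze(0)            # (1024,) task embedding

print(embed_task("Pick up the cup and place it on the tray").shape)  # torch.Size([1024])
```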
To represent sequences of sub-actions as numerical data, each sequence is padded to a uniform length L, and each sub-action is converted into a one-hot encoded vector, transforming symbolic action sequences into fixed-size numerical tensors for neural network training.
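A small sketch of this encoding step is given below; the padding token, the value of L, and the vocabulary ordering are assumptions, since the paper describes the mechanism rather than the exact implementation.

```python
# Pad each sub-action sequence to length L and one-hot encode each step.
# The <pad> token, L = 8, and vocabulary ordering are illustrative assumptions.
import numpy as np

VOCAB = ["<pad>", "reach", "grasp", "lift", "move", "tilt",
         "give", "release", "place", "wipe", "stir", "retract"]
IDX = {a: i for i, a in enumerate(VOCAB)}
L = 8

def encode_sub_actions(actions):
    padded = actions[:L] + ["<pad>"] * max(0, L - len(actions))
    one_hot = np.zeros((L, len(VOCAB)), dtype=np.float32)
    for t, action in enumerate(padded):
        one_hot[t, IDX[action]] = 1.0
    return one_hot  # fixed-size (L, vocab) target tensor for training

print(encode_sub_actions(["reach", "grasp", "lift", "move", "place"]).shape)  # (8, 12)
```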
The learning architecture begins with BERT embeddings derived from task descriptions, where each embedding has a dimension of d. These embeddings capture semantic relationships and contextual information from the input text. Each task embedding is repeated across the output sequence length and then processed by a bidirectional LSTM that reads the sequence in both forward and backward directions.
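One plausible PyTorch realisation of this architecture is sketched below: the repeated BERT embedding (d = 1024 for bert-large-uncased) is encoded by a bidirectional LSTM, refined with multi-head self-attention, and decoded into per-step sub-action logits. The hidden size, head count, and linear decoder are assumptions, as the paper does not report these hyperparameters.

```python
# Rough sketch of a BiLSTM + multi-head-attention sub-action predictor.
# Hyperparameters (hidden=256, heads=4) are assumed, not taken from the paper.
import torch
import torch.nn as nn

class Instruct2ActSketch(nn.Module):
    def __init__(self, embed_dim=1024, hidden=256, heads=4, vocab=12, L=8):
        super().__init__()
        self.L = L
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.decoder = nn.Linear(2 * hidden, vocab)

    def forward(self, task_emb):                         # task_emb: (B, embed_dim)
        x = task_emb.unsqueeze(1).repeat(1, self.L, 1)   # repeat the embedding L times
        h, _ = self.bilstm(x)                            # (B, L, 2 * hidden)
        a, _ = self.attn(h, h, h)                        # multi-head self-attention
        return self.decoder(a)                           # (B, L, vocab) per-step logits

logits = Instruct2ActSketch()(torch.randn(2, 1024))
print(logits.shape)  # torch.Size([2, 8, 12])
```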
Real-time robotic manipulation from natural language via on-device sub-action prediction
Instruct2Act attains 91.5% sub-action prediction accuracy on a custom dataset while maintaining a small computational footprint. Real-robot evaluations across pick-place, pick-pour, wipe, and pick-give tasks achieved 90% overall success, and sub-action inference completes in under 3.8 seconds.
End-to-end task executions take between 30 and 60 seconds, varying with the complexity of the assigned task. This work presents a two-stage pipeline converting natural-language commands into reliable robot manipulation, operating entirely on-device. The instruction-to-action module, Instruct2Act, utilises a BiLSTM with a multi-head-attention autoencoder to parse instructions into ordered action sequences.
This lightweight design facilitates fine-grained action sequence production with minimal computational demand. The robot action network (RAN) then generates precise control trajectories for each sub-action, leveraging the dynamic adaptive trajectory radial network (DATRN) and vision-based environment analysis with YOLOv8.
The DATRN within the RAN generates object-aware motion paths and adjusts them in real time, integrating depth camera data and manipulator proprioception. This approach overcomes limitations found in dynamic movement primitives, such as iterative hyperparameter searches and lengthy fitting times. The entire system is designed for resource-constrained environments, enabling practical deployment in settings like healthcare facilities. This methodology provides a pathway towards deterministic, real-time manipulation using a single camera setup.
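For the perception side, the sketch below illustrates the general YOLOv8-plus-depth idea described here: detecting workspace objects with the Ultralytics YOLOv8 API and back-projecting each detection's centre to a 3-D point using the depth image and pinhole intrinsics. The intrinsics and the coupling to DATRN trajectory generation are assumptions; the authors' actual RAN implementation is not published.

```python
# Hedged sketch: object detection with YOLOv8 plus depth-based 3-D localisation.
# Camera intrinsics below are placeholders; use your camera's calibration values.
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                    # any YOLOv8 detection weights
FX, FY, CX, CY = 615.0, 615.0, 320.0, 240.0   # assumed pinhole intrinsics

def locate_objects(rgb_image, depth_image):
    """Return (label, xyz_in_camera_frame) for each detected object."""
    results = model(rgb_image, verbose=False)[0]
    objects = []
    for box in results.boxes:
        u, v = box.xywh[0][:2].tolist()            # bounding-box centre (pixels)
        z = float(depth_image[int(v), int(u)])     # depth at the centre (metres)
        x, y = (u - CX) * z / FX, (v - CY) * z / FY
        objects.append((results.names[int(box.cls)], np.array([x, y, z])))
    return objects
```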
Deterministic robotic manipulation from natural language instructions
Researchers have developed a fully on-device robotic system capable of interpreting free-form human instructions and executing corresponding manipulation tasks. This pipeline comprises two key modules: Instruct2Act, which parses instructions into a sequence of actions, and the Robot Action Network (RAN), which generates precise control trajectories using dynamic adaptive trajectory radial networks and vision-based environmental analysis.
The entire system operates without reliance on cloud services, enabling real-time performance on resource-constrained hardware. Evaluations across pick-place, pick-pour, wipe, and pick-give tasks demonstrate 90% overall success, with sub-action inference completed in under 3.8 seconds and complete task execution within 30-60 seconds depending on complexity.
The Instruct2Act model achieves 91.5% accuracy in predicting sub-actions, highlighting the effectiveness of the combined approach in achieving deterministic manipulation with a single camera. Analysis of failures revealed that issues primarily arise from perception inaccuracies, system lags, and, to a lesser extent, incorrect sub-action sequence prediction.
The system currently struggles to maintain accuracy on very long or highly complex instructions; future work will therefore focus on expanding the training dataset to include more diverse and lengthy commands. Exploration of lightweight transformer-based architectures is also planned to improve sub-action prediction for multi-step activities while preserving computational efficiency. These developments aim to enhance the system’s ability to handle increasingly sophisticated instructions and broaden its applicability in real-world scenarios, including healthcare environments.
👉 More information
🗞 Instruct2Act: From Human Instruction to Actions Sequencing and Execution via Robot Action Network for Robotic Manipulation
🧠 ArXiv: https://arxiv.org/abs/2602.09940
