Johns Hopkins Researchers Tackle Errors in Robot Programming with Large Language Models
Researchers from Johns Hopkins University have been investigating the use of Large Language Models (LLMs) in robot programming. While LLMs have made robot programming more accessible, the code they generate can be error-prone due to their nondeterministic nature. The researchers aimed to identify common errors and propose strategies to reduce them. They found that LLMs often forget information provided in the system prompt, leading to execution errors. However, reinforcing task constraints and storing numerical task contexts in data structures can significantly reduce these errors. The team used three language models – ChatGPT, Bard, and LLaMA2 – in their study.

Introduction to Large Language Models in Robot Programming

Researchers from Johns Hopkins University, Juo-Tung Chen and Chien-Ming Huang, have been exploring the use of Large Language Models (LLMs) in robot programming. LLMs offer a new way to program robot applications through code generation via prompting. However, the code generated by LLMs is susceptible to errors. This research aims to empirically characterize common errors produced by LLMs in robot programming and propose strategies to reduce these errors.

The Problem with LLMs in Robot Programming

LLMs have been used to lower the barriers to robot programming, allowing end users to develop custom robot applications without substantial engineering training. However, the code generated by LLMs is not error-free because the models themselves are nondeterministic: given the same prompt, they may produce inconsistent and occasionally incorrect code outputs. Existing research often focuses on general-purpose code-generation benchmarks, which may not fully capture the specific nuances and intricacies of code for a specialized domain such as robotics.

Research Questions and Methodology

The researchers sought to explore two research questions: What are the common errors produced by LLMs in end-user robot programming? And what practical strategies can be employed to mitigate and reduce these errors? To answer these questions, they designed a sequential manipulation task and tested three language models – ChatGPT, Bard, and LLaMA2 – to assess their capabilities in generating code to complete the task.

Key Findings

The key finding of the research is that LLMs are forgetful: they do not treat information provided in the system prompt as hard fact, and this forgetfulness leads to errors in code execution. Beyond execution errors, LLMs also produce syntax errors and omit necessary library imports, causing failures in code interpretation. However, simple strategies, such as reinforcing task constraints in the objective prompt and extracting numerical task contexts from the system prompt into data structures, appear to notably reduce execution errors caused by LLM forgetfulness.
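The two mitigation strategies above can be illustrated with a short sketch. This is not the authors' code; the dictionary keys, constraint wording, and helper function are hypothetical, showing one way numerical task context could be pinned in a data structure and restated in the objective prompt rather than left to the model's memory of the system prompt:

```python
# Hypothetical numerical task context, stored in a data structure
# instead of being buried in the system prompt (values are illustrative).
task_context = {
    "pour_volume_ml": 50,
    "cylinder_capacity_ml": 100,
    "pour_angle_deg": 90,
}

# Task constraints to be reinforced in every objective prompt.
constraints = [
    "Use only the numerical values given in the task context.",
    "Do not exceed cylinder_capacity_ml when pouring.",
]

def build_objective_prompt(objective: str) -> str:
    """Restate the numerical context and constraints alongside the
    objective, so the model need not recall the system prompt."""
    lines = [objective, "", "Task context (authoritative values):"]
    lines += [f"- {key} = {value}" for key, value in task_context.items()]
    lines += ["", "Constraints (must be satisfied):"]
    lines += [f"- {c}" for c in constraints]
    return "\n".join(lines)

prompt = build_objective_prompt(
    "Pick up the graduated cylinder and pour its contents into the beaker."
)
```

Because the values travel with each request, a forgetful model cannot silently drop them between turns.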

Experiment: Identifying Common Errors

To assess the code generation ability and performance of LLMs in robot programming, the researchers set up a sequential manipulation task. The task involves a robot picking up a graduated cylinder and pouring its contents into a beaker, a common step in biochemical lab tests. The researchers used a descriptive prompt to enhance the quality of LLM-generated responses.
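The kind of program the LLMs were asked to produce for this task can be sketched as follows. The robot API names (`move_to`, `grasp`, `pour`, `release`) are assumptions for illustration, not the interface used in the study; a mock robot stands in for real hardware:

```python
class MockRobot:
    """Stand-in for a real robot arm that records each command issued."""
    def __init__(self):
        self.log = []
    def move_to(self, pose):
        self.log.append(("move_to", pose))
    def grasp(self):
        self.log.append(("grasp",))
    def pour(self, angle_deg):
        self.log.append(("pour", angle_deg))
    def release(self):
        self.log.append(("release",))

def pick_and_pour(robot, cylinder_pose, beaker_pose):
    """Pick up the graduated cylinder and pour its contents into the beaker."""
    robot.move_to(cylinder_pose)   # approach the cylinder
    robot.grasp()                  # close the gripper
    robot.move_to(beaker_pose)     # carry it over the beaker
    robot.pour(angle_deg=90)       # tilt to empty the contents
    robot.move_to(cylinder_pose)   # return the cylinder
    robot.release()                # set it down

robot = MockRobot()
pick_and_pour(robot, "cylinder_pose", "beaker_pose")
```

Even in a sequence this short, a model that forgets a constraint (e.g. the pour angle) produces code that parses but fails at execution time, which is the class of error the study targets.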

Large Language Models Used

In the experiments, three language models were used: ChatGPT, Bard, and LLaMA2. Given the stochastic nature of these LLMs, each model was tested ten times while keeping the prompts and sequential manipulation task the same across trials.
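The repeated-trial protocol can be sketched as a small harness: the same prompt is issued a fixed number of times and each generated program is classified by outcome. The error labels and the identity validator below are toy stand-ins, not the study's actual instrumentation:

```python
from collections import Counter

def run_trials(generate, validate, n_trials=10):
    """Query the model n_trials times with an identical prompt and
    tally the outcome category of each generated program."""
    tally = Counter()
    for _ in range(n_trials):
        code = generate()           # same prompt every trial
        tally[validate(code)] += 1  # e.g. "ok", "syntax_error", ...
    return tally

# Toy stand-ins: canned outcomes in place of a real model call,
# and an identity function in place of a real validator.
outcomes = iter(["ok", "ok", "syntax_error", "ok", "execution_error",
                 "ok", "ok", "ok", "syntax_error", "ok"])
tally = run_trials(lambda: next(outcomes), lambda code: code)
```

Fixing the prompts and task across trials isolates the variance contributed by the model's stochastic sampling, which is what makes the per-model error tallies comparable.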

Conclusion and Future Directions

The research provides valuable insights into the common errors produced by LLMs in robot programming and proposes practical strategies to mitigate these errors. The researchers call for further benchmarking of LLM-powered end-user development of robot applications.

“Forgetful Large Language Models: Lessons Learned from Using LLMs in Robot Programming” is an article authored by Juo-Tung Chen and Chien-Ming Huang, published on January 22, 2024. The article, which appears in the Proceedings of the AAAI Symposium Series, explores the use of Large Language Models (LLMs) in the field of robot programming and the lessons learned from this application.