Generating descriptive language from the sense of touch remains a significant challenge, yet it holds immense potential for applications ranging from virtual reality to assistive technologies for people with visual impairments. Guimin Hu and Daniel Hershcovich from the University of Copenhagen, together with Hasti Seifi from Arizona State University, present HapticLLaMA, a novel model that translates vibrations into natural language descriptions. This research addresses a gap in multimodal artificial intelligence, which has largely prioritised vision and audio, by developing a system capable of interpreting haptic signals and expressing them in meaningful language. The team demonstrates that HapticLLaMA not only achieves high scores on automated evaluation metrics but also generates captions that humans find intuitive and that accurately reflect the perceived sensation, paving the way for more immersive and accessible sensory experiences.
Haptic signals, representing the sense of touch, have remained comparatively under-explored within the field of multimodal artificial intelligence. To address this gap, researchers formalise the task of haptic captioning and propose HapticLLaMA, a novel multimodal sensory language model. This model interprets vibration signals and translates them into descriptive text within a specified sensory, emotional, or associative category. The research investigates two distinct types of haptic tokenisers, a frequency-based approach and a method utilising EnCodec, to convert haptic signals into sequences of discrete units, thereby enabling their integration with the language model.
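The paper's tokenisation code is not reproduced here, but the frequency-based approach can be illustrated with a short sketch: each frame of the vibration waveform is mapped to a discrete unit derived from its dominant frequency and energy. The function name, sampling rate, and bin counts below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def frequency_tokenize(signal, sr=8000, frame_len=256, hop=128,
                       n_freq_bins=32, n_amp_bins=8):
    """Illustrative frequency-based haptic tokeniser (a sketch, not the paper's code).

    Splits a 1-D vibration signal into overlapping frames and maps each frame
    to a discrete token id derived from its dominant frequency and RMS energy.
    """
    tokens = []
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(frame_len)))
        dom_freq = freqs[np.argmax(spectrum)]          # dominant frequency (Hz)
        rms = np.sqrt(np.mean(frame ** 2))             # frame energy
        # Quantise frequency (log scale up to Nyquist) and amplitude into bins.
        f_bin = min(int(np.log1p(dom_freq) / np.log1p(sr / 2) * n_freq_bins),
                    n_freq_bins - 1)
        a_bin = min(int(rms * n_amp_bins), n_amp_bins - 1)
        tokens.append(f_bin * n_amp_bins + a_bin)       # one discrete unit per frame
    return tokens

# Example: a one-second, 250 Hz vibration burst becomes a short token sequence
# that can be spliced into the language model's input.
t = np.linspace(0, 1, 8000, endpoint=False)
print(frequency_tokenize(0.5 * np.sin(2 * np.pi * 250 * t))[:10])
```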
Generating Haptic Descriptions with Large Language Models
This document details research exploring the use of Large Language Models (LLMs), specifically LLaMA, to generate descriptive captions for haptic (touch) sensations. The goal is to bridge the gap between haptic signals and human-understandable descriptions, enabling applications like haptic feedback design and accessibility. Researchers tackled the challenge of describing haptic sensations in a rich, nuanced way, noting that current methods often lack detail or fail to capture the emotional and associative aspects of touch. Their solution leverages the power of LLMs to automatically generate descriptive captions, extending beyond simple sensory descriptions to include emotional and associative qualities.
The researchers used the HapticCap dataset, which contains haptic signals paired with captions covering sensory, emotional, and associative aspects. They fine-tuned LLaMA (specifically LLaMA 3.2-3B) using a two-stage process: first, generative training to learn the relationship between haptic signals and captions, and second, Direct Preference Optimization (DPO) to align the model’s output with human preferences for caption quality. They explored different methods for encoding haptic signals as discrete token sequences, including a frequency-based tokenisation of the raw signal and EnCodec (a neural audio codec), which yields a more compact and potentially more informative representation.
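The DPO stage optimises the model to prefer captions that humans rated higher. Below is a minimal sketch of the DPO objective in PyTorch, assuming summed log-probabilities from HapticLLaMA after stage one (the policy) and a frozen copy of that model (the reference); the beta value is a placeholder, not a setting taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is a tensor of summed log-probabilities of the chosen or
    rejected caption under the policy or the frozen reference model; beta
    controls how far the policy may drift from the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximise the margin between preferred and dispreferred captions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with made-up log-probabilities for one preference pair.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```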
Using EnCodec led to better performance than the frequency-based representation, and the DPO training stage significantly improved caption quality and alignment with human preferences. The researchers also designed specific prompts for general-purpose LLMs (including GPT-4.5) to guide the generation of relevant haptic descriptions, focusing on sensory, emotional, and associative aspects. The results demonstrate that LLMs can generate rich haptic descriptions, with the fine-tuned LLaMA models (HapticLLaMA) able to capture these qualities. GPT-4.5, when prompted effectively, served as a strong baseline for comparison, demonstrating the potential of larger, pre-trained LLMs for this task. The models were evaluated using metrics that measure the similarity between generated captions and human-written reference captions, opening possibilities for more intuitive and expressive haptic feedback systems.
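The exact prompt wording is not reproduced here; the snippet below is a hypothetical reconstruction of how a category-conditioned prompt for such a prompting-based baseline might be assembled. The hint texts, field names, and function name are assumptions for illustration only.

```python
# Hypothetical prompt builder for a prompting-based baseline; the actual
# wording used in the paper may differ.
CATEGORY_HINTS = {
    "sensory": "physical qualities such as roughness, sharpness, rhythm, or intensity",
    "emotional": "feelings the vibration evokes, such as calm, urgency, or excitement",
    "associative": "real-world events or objects the vibration brings to mind",
}

def build_prompt(signal_summary: str, category: str) -> str:
    """Compose a caption request for one haptic signal and one description category."""
    hint = CATEGORY_HINTS[category]
    return (
        "You are describing a vibrotactile signal felt on the skin.\n"
        f"Signal features: {signal_summary}\n"
        f"Write one short caption focusing on its {category} aspects ({hint})."
    )

print(build_prompt("200 Hz carrier, three short pulses, decreasing amplitude",
                   "emotional"))
```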
Vibrations Translated into Descriptive Language by HapticLLaMA
Researchers have developed HapticLLaMA, a model capable of interpreting vibrations and translating them into descriptive captions, opening possibilities for applications in virtual reality, accessibility tools, and rehabilitation therapies. This work addresses a gap in multimodal research, which has historically focused on vision and audio while largely overlooking the sense of touch. The system functions by converting vibration signals into a sequence of discrete units, allowing a language model to understand and describe the tactile experience. The team explored two methods for converting these vibrations into usable data: one based on frequency analysis and another utilising a more advanced neural encoding technique, EnCodec, illustrated in the sketch that follows.
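For the EnCodec-based route, one plausible way to obtain such discrete units is to run the vibration waveform through EnCodec's residual vector quantiser as if it were audio. The sketch below uses the Hugging Face Transformers port of EnCodec; treating a resampled haptic signal as 24 kHz mono audio is an assumption about preprocessing, not a detail taken from the paper.

```python
import torch
from transformers import AutoProcessor, EncodecModel

# Load the public 24 kHz EnCodec checkpoint; the haptic signal is treated as
# mono "audio" here, which is an illustrative assumption about preprocessing.
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
model = EncodecModel.from_pretrained("facebook/encodec_24khz")

# A one-second synthetic vibration (a 250 Hz sine) resampled to 24 kHz.
waveform = torch.sin(2 * torch.pi * 250 * torch.linspace(0, 1, 24000)).numpy()

inputs = processor(raw_audio=waveform, sampling_rate=24000, return_tensors="pt")
with torch.no_grad():
    encoded = model.encode(inputs["input_values"], inputs["padding_mask"])

# encoded.audio_codes holds the discrete codebook indices, which could be
# mapped to special tokens in the language model's vocabulary.
print(encoded.audio_codes.shape)
```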
Both tokenisation approaches significantly improved the system’s ability to generate accurate captions, with the EnCodec-based method consistently achieving slightly better results. Further refinement came from reinforcement learning from human feedback, which aligned the system’s output more closely with human perception and boosted overall performance. Quantitative evaluation demonstrates substantial gains over existing language models, and human evaluation confirms these improvements: over 61% of generated captions received ratings above 3.5 on a 7-point scale, a 10% improvement over initial results.
This indicates that the system not only generates technically accurate descriptions but also produces captions that align with how humans perceive and understand tactile sensations. Further analysis reveals that the system excels at describing emotional qualities conveyed through vibrations, potentially due to greater consistency in human annotation of these sensations. This work represents a significant step towards creating more immersive and accessible experiences through the integration of haptic technology and artificial intelligence.
Haptic Vibrations Interpreted and Described by AI
HapticLLaMA represents a significant step forward in multimodal artificial intelligence, successfully interpreting vibration signals and translating them into descriptive captions. Researchers developed this system by combining large language models with novel methods for processing haptic data, effectively bridging the gap between the sense of touch and language. The model utilizes two distinct approaches to convert vibration signals into a format understandable by the language model, and then employs supervised learning and reinforcement learning from human feedback to refine its captioning abilities. Evaluation demonstrates strong performance, with automated metrics and human assessments confirming the model’s capacity to accurately perceive and describe haptic vibrations.
This work establishes a foundation for understanding and integrating sensory data into large language models, opening possibilities for applications in areas like virtual reality, accessibility tools, and rehabilitation technologies. While the current system focuses on vibration signals, the researchers acknowledge limitations in evaluating caption quality using standard metrics, as these do not fully capture the semantic alignment between the haptic signal and the generated text. Future work could address these limitations and expand the model’s capabilities to encompass a wider range of sensory inputs, ultimately enhancing its potential for real-world deployment.
👉 More information
🗞 HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning
🧠 ArXiv: https://arxiv.org/abs/2508.06475
