HoverAI Achieves 0.90 Command Recognition F1 with an Aerial Conversational Agent

Scientists are tackling the challenge of seamless communication between drones and people in shared spaces. Yuhua Jin, Nikita Kuzmin, and Georgii Demianchuk, together with Lezina, Mehboob, Tokmurziyev, and colleagues, introduce HoverAI, a novel embodied aerial agent designed to bridge this gap. This innovative system combines drone agility with infrastructure-free visual projection and real-time conversation, allowing it to perceive users via vision and voice and respond with personalised, lip-synced avatars. Demonstrating impressive accuracy in command recognition (F1: 0.90) and demographic estimation, HoverAI represents a significant step towards creating spatially-aware, socially responsive drones capable of intuitive guidance, assistance, and truly human-centred interaction.

Drone projects lip-synced avatars for natural interaction

Scientists have unveiled HoverAI, a groundbreaking embodied aerial agent poised to redefine human-drone interaction through seamless communication and spatial awareness. The research team built a fully integrated platform combining drone mobility, infrastructure-independent visual projection, and real-time conversational AI, addressing the critical lack of intuitive communication mechanisms in drones operating within human spaces. HoverAI perceives users via both vision and voice, responding with dynamically adapting, lip-synced avatars projected directly from the drone itself, eliminating the need for external screens or augmented reality headsets. This innovative system employs a sophisticated multimodal pipeline, integrating voice activity detection (VAD), automatic speech recognition (ASR) utilising Whisper, large language model (LLM)-based intent classification, retrieval-augmented generation (RAG) for nuanced dialogue, face analysis for personalised avatar appearance, and advanced voice synthesis with XTTS v2.
Experiments demonstrate exceptional performance across key metrics, with a command recognition F1 score of 0.90, a gender estimation F1 score of 0.89, and a remarkably low word error rate (WER) of 0.181 in speech transcription, validating the system’s accuracy and responsiveness. Furthermore, HoverAI accurately estimates user age with a mean absolute error (MAE) of only 5.14 years, showcasing its ability to personalise interactions based on demographic data. The core innovation lies in uniting aerial robotics with adaptive conversational capabilities and a self-contained visual output system, creating a spatially-aware, socially responsive agent unlike any previously developed. HoverAI’s hardware comprises a lightweight 1.2kg quadrotor equipped with an Orange Pi 5 single-board computer, a front-facing RGB camera capturing at 1080p and 30 frames per second, and a MEMS laser projector delivering 720p resolution at 30 frames per second, paired with a semi-rigid polycarbonate projection film weighing just 40 grams.
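
The flow of this multimodal pipeline is easiest to see end to end as code. The sketch below is purely illustrative and assumes nothing about the authors' unpublished implementation: every function is a named stub marking where the corresponding real module (VAD, Whisper ASR, the intent-classifying LLM, RAG, face analysis, XTTS v2) would plug in.

```python
from dataclasses import dataclass

# Illustrative stubs only: each function marks where a real module would sit.

@dataclass
class Demographics:
    age: int
    gender: str

def detect_voice_activity(frames: bytes) -> bool:
    return len(frames) > 0                        # stand-in for a real VAD

def transcribe(frames: bytes) -> str:
    return "please stay by the entrance"          # stand-in for Whisper ASR

def classify_intent(text: str) -> str:            # stand-in for the LLM stage
    commands = {"stay", "wait", "follow", "land", "stop"}
    return "command" if commands & set(text.split()) else "query"

def answer_query(text: str) -> str:
    return "Gate twelve is to your left."         # stand-in for RAG dialogue

def analyse_face(image: object) -> Demographics:
    return Demographics(age=30, gender="female")  # stand-in for face analysis

def handle_turn(frames: bytes, image: object):
    """One perception-to-response turn, in the pipeline's stated order."""
    if not detect_voice_activity(frames):         # 1. VAD gates the microphone
        return None
    text = transcribe(frames)                     # 2. ASR transcribes the utterance
    if classify_intent(text) == "command":        # 3. intent classification
        reply = f"Acknowledged: {text}."          #    commands route to flight control
    else:
        reply = answer_query(text)                # 4. RAG answers open questions
    user = analyse_face(image)                    # 5. demographics personalise the
    return reply, user                            #    avatar before voice synthesis

print(handle_turn(b"\x00\x01", image=None))
```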

This laser scanning projection (LSP) module utilises a 2D MEMS scanning mirror, minimising weight and power consumption while maintaining image clarity, and the film remains stable despite airflow and vibrations. By eliminating reliance on external infrastructure, HoverAI opens exciting possibilities for applications in guidance, assistance, and truly human-centred interaction within dynamic environments like airports, museums, or even domestic spaces. The research establishes a new class of embodied aerial agents, moving beyond passive aerial displays and limited drone interfaces towards genuinely interactive and socially intelligent systems. This breakthrough points towards drones that integrate seamlessly into daily life, offering intuitive and engaging experiences and setting the stage for future advances in spatial computing and human-robot collaboration. The work also opens avenues for drones that provide personalised assistance, deliver contextual information, and foster more natural and effective communication across a variety of real-world scenarios.

HoverAI’s Quadrotor Platform and Visual Pipeline enable robust interaction

Scientists developed HoverAI, an embodied aerial agent integrating drone mobility, infrastructure-independent visual projection, and real-time conversation to address limitations in human-occupied drone communication. The research team engineered a 1.2kg quadrotor platform equipped with an Orange Pi 5 single-board computer, a front-facing RGB camera capturing video at 1080p and 30 frames per second, and a lightweight 85g MEMS laser projector delivering 720p resolution at 30 frames per second. This innovative system projects onto a semi-rigid polycarbonate film, weighing only 40g and measuring 0.3mm thick, enabling visual display during flight without external infrastructure. The study pioneered a multimodal pipeline beginning with Voice Activity Detection (VAD) to identify speech, followed by Automatic Speech Recognition (ASR) utilising the Whisper model for transcription.
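
Whisper itself is open source, so the transcription stage can be reproduced directly. Below is a minimal sketch with the `openai-whisper` package and the medium.en checkpoint reported later in the evaluation; the file name and default decoding settings are placeholders, not the authors' configuration.

```python
# pip install -U openai-whisper
import whisper

model = whisper.load_model("medium.en")     # English-only medium checkpoint
result = model.transcribe("utterance.wav")  # path to a recorded utterance
print(result["text"])                       # plain-text transcription
```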

Researchers then employed a lightweight Large Language Model (LLM) for intent classification and Retrieval-Augmented Generation (RAG) to formulate contextual dialogue responses. To enhance social interaction, the team integrated real-time face analysis to personalise a lip-synced avatar projected by the MEMS laser, adapting its appearance to user demographics. This closed-loop design allows HoverAI to function as a spatially aware, socially responsive agent capable of natural interaction. Experiments demonstrated high accuracy in command recognition, achieving an F1 score of 0.90. Demographic estimation also performed well, with a gender F1 score of 0.89 and a Mean Absolute Error (MAE) of 5.14 years for age estimation.
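
The paper does not disclose the prompt or the specific lightweight LLM behind intent classification, but the general pattern is a constrained labelling prompt. The sketch below is hypothetical: `complete` stands in for any chat-completion callable, and the rule-based `fake_llm` exists only so the example runs without a model server.

```python
from typing import Callable

# Hypothetical labelling prompt; the actual model and prompt are unpublished.
PROMPT = (
    "Classify the utterance as exactly one label, COMMAND or QUERY.\n"
    "COMMAND = an instruction the drone should execute (e.g. 'follow me').\n"
    "QUERY = a question to be answered in dialogue (e.g. 'where is gate 12?').\n"
    "Utterance: {utterance}\nLabel:"
)

def classify_intent(utterance: str, complete: Callable[[str], str]) -> str:
    label = complete(PROMPT.format(utterance=utterance)).strip().upper()
    return "command" if label.startswith("COMMAND") else "query"

# Trivial stand-in "LLM" so the sketch is runnable end to end.
fake_llm = lambda p: "COMMAND" if ("stay" in p or "wait" in p) else "QUERY"
print(classify_intent("please stay by the entrance", fake_llm))  # command
```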

Furthermore, speech transcription achieved a Word Error Rate (WER) of 0.181, validating the effectiveness of the integrated speech processing pipeline. The Laser Scanning Projection (LSP) module, featuring a 2D MEMS scanning mirror, delivers 720p resolution while minimising weight and power consumption, a crucial innovation for aerial deployment. This work uniquely unites aerial robotics with adaptive conversational AI and self-contained visual output, introducing a new class of embodied agents for applications in guidance, assistance, and human-centred interaction. By harnessing onboard processing and eliminating reliance on external infrastructure, HoverAI overcomes limitations of prior drone-based interfaces and establishes a foundation for more natural and intuitive human-drone collaboration.
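
For reference, the word error rate quoted here follows the standard definition: word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. A self-contained implementation of that definition, not the authors' evaluation code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i, r in enumerate(ref, 1):
        for j, h in enumerate(hyp, 1):
            dp[i][j] = min(dp[i - 1][j] + 1,             # deletion
                           dp[i][j - 1] + 1,             # insertion
                           dp[i - 1][j - 1] + (r != h))  # substitution
    return dp[-1][-1] / len(ref)

print(wer("land near the red marker", "land near red market"))  # 0.4
```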

HoverAI demonstrates 90% accurate voice command response

Scientists have developed HoverAI, an embodied aerial agent integrating drone mobility, visual projection, and real-time conversation to address communication limitations in human-occupied spaces. Equipped with a MEMS laser projector, onboard semi-rigid screen, and RGB camera, HoverAI perceives users via vision and voice, responding with lip-synced avatars that adapt to user demographics. The system utilises a pipeline combining Voice Activity Detection (VAD), Automatic Speech Recognition (ASR) using Whisper, Large Language Model (LLM)-based intent classification, Retrieval-Augmented Generation (RAG) for dialogue, face analysis for personalisation, and XTTS v2 voice synthesis. Experiments revealed high accuracy in command recognition, achieving an F1 score of 0.90.
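
The XTTS v2 stage named above ships with Coqui's open-source TTS package, so a minimal synthesis call plausibly looks like the sketch below; the text, reference voice clip, and language are placeholders, and the authors' exact settings are not published.

```python
# pip install TTS
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Gate twelve is to your left.",
    speaker_wav="reference_voice.wav",  # short clip defining the avatar's voice
    language="en",
    file_path="reply.wav",
)
```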

Data shows the system accurately distinguished commands from queries in 90% of cases, with minor confusion between “stay” and “wait” statements. Furthermore, HoverAI demonstrated robust demographic estimation, with a gender F1 score of 0.89 and a Mean Absolute Error (MAE) of 5.14 years for age estimation. Speech transcription, utilising Whisper-medium.en, achieved a Word Error Rate (WER) of 0.181 despite ambient noise levels of 45–50 dB, indicating reliable audio processing. The team measured end-to-end pipeline latency, averaging 950ms (±120ms) from speech onset to avatar response, supporting natural conversational turn-taking.
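
The reported scores follow standard definitions and can be computed with scikit-learn; the labels and ages below are invented purely to make the example runnable, not data from the study.

```python
from sklearn.metrics import f1_score, mean_absolute_error

# Made-up command/query predictions, just to show the metric calls.
true_intent = ["command", "command", "query", "command", "query"]
pred_intent = ["command", "query",   "query", "command", "query"]
print(f1_score(true_intent, pred_intent, pos_label="command"))  # 0.8

# Made-up age estimates for the MAE calculation.
true_age = [24, 31, 45, 52]
pred_age = [27, 30, 41, 60]
print(mean_absolute_error(true_age, pred_age))                  # 4.0
```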

Tests confirmed that no system crashes occurred during a total of 60 minutes of interaction time, demonstrating system stability. HoverAI’s capabilities enable applications such as museum guidance, projecting contextual narratives and multilingual subtitles aligned with physical artefacts, and assistive communication for individuals with motor or speech impairments. Researchers recorded strong performance across all modalities, as illustrated in Figure 4, with speech transcription at WER: 0.181, command recognition at F1: 0.90, gender classification at F1: 0.89, and age estimation at MAE: 5.14 years.

Drone Projection and Conversational AI Integration offer immersive interaction

Scientists have developed HoverAI, an embodied aerial agent integrating drone mobility with visual projection and real-time conversation. This novel system utilises a drone equipped with a MEMS laser projector, a semi-rigid screen, and an RGB camera to perceive users through both vision and voice, responding with lip-synced avatars that adapt to user demographics. The core of HoverAI combines voice activity detection, automatic speech recognition utilising Whisper, large language model-based intent classification, retrieval-augmented generation for dialogue, face analysis for personalisation, and XTTS v2 for voice synthesis, creating a cohesive interactive experience. Evaluation involving twelve participants demonstrated high accuracy in several key areas: command recognition achieved an F1 score of 0.90, demographic estimation (gender) reached an F1 score of 0.89 with an age mean absolute error of 5.14 years, and speech transcription exhibited a word error rate of 0.181.
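
The paper does not name the voice activity detector that gates this pipeline; WebRTC's detector is one common lightweight choice for single-board computers, shown here as a plausible option rather than the authors' component.

```python
# pip install webrtcvad
import webrtcvad

vad = webrtcvad.Vad(2)          # aggressiveness from 0 (lenient) to 3 (strict)
sample_rate = 16000
frame_ms = 30                   # WebRTC accepts 10, 20, or 30 ms frames
n_bytes = sample_rate * frame_ms // 1000 * 2  # 16-bit mono PCM frame size

silence = b"\x00" * n_bytes
print(vad.is_speech(silence, sample_rate))    # False: no speech energy
```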

By uniting aerial robotics with adaptive conversational AI and self-contained visual output, HoverAI establishes a new class of spatially-aware, socially responsive embodied agents with potential applications in guidance, assistance, and human-centred interaction. The authors acknowledge limitations in the system’s speech perception under diverse acoustic conditions, suggesting future work to incorporate noise-robust ASR models. Further research directions include exploring multi-drone swarm coordination for large-scale displays, investigating 3D volumetric visualisation, and broadening applicability through outdoor deployment with enhanced projection and screen stabilisation.

👉 More information
🗞 HoverAI: An Embodied Aerial Agent for Natural Human-Drone Interaction
🧠 ArXiv: https://arxiv.org/abs/2601.13801

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Spade Demonstrates Superior Sub-Rayleigh Source Discrimination with Two Incoherent Points
January 22, 2026

AI System Achieves 89.4% Safe Content Generation from Emotional Signals
January 22, 2026

Aerosol Retrieval Efficiency Achieves Scalable JWST Analysis of Exoplanet Atmospheres
January 22, 2026