OpenAI’s New Model Translates Realtime Speech Across 70+ Languages

OpenAI has introduced GPT-Realtime-Translate, a new model that translates live speech from more than 70 input languages into 13 output languages, expanding beyond the capabilities of many existing real-time translation tools. The release accompanies GPT-Realtime-2, OpenAI’s first voice model built with GPT-5-class reasoning, which scores 15.2% higher on Big Bench Audio and 13.8% higher on Audio MultiChallenge than GPT-Realtime-1.5, a step toward truly conversational voice interfaces. According to OpenAI, developers can now build applications where voice agents understand what someone means, keep track of context, and recover when a request changes. For example, the model can respond to the prompt “My order number is Orbit-742Q” by repeating it back clearly for confirmation, and example prompts showcase its handling of nuanced requests, such as planning a dinner menu with specific dietary restrictions or delivering a team announcement with varying levels of enthusiasm.

Early adopters report concrete results. BolnaAI Co-founder & CTO Prateek Sachan cites a 12.5% lower Word Error Rate with the new models, and Priceline is working toward a future where travelers can manage entire trips by voice. Developers can also enable features such as “preambles” (short phrases signaling the agent is processing a request) and parallel tool calls, which let the model perform multiple actions simultaneously while providing audible feedback such as “checking your calendar” or “looking that up now.”

GPT-Realtime-2 Achieves 15.2% Gains in Audio Intelligence Benchmarks

A 15.2% gain on audio intelligence benchmarks points to a new era for conversational AI, as OpenAI’s GPT-Realtime-2 demonstrates improved reasoning capabilities within voice applications. This isn’t simply about clearer transcription; the model, built on GPT-5-class reasoning, actively understands and responds to complex prompts, moving beyond the limitations of earlier voice technologies. The gains are particularly evident in evaluations designed to mimic real-world voice agent interactions: GPT-Realtime-2 scores 15.2% higher than GPT-Realtime-1.5 on Big Bench Audio, which OpenAI says evaluates challenging reasoning capabilities in language models that support audio input. Beyond raw intelligence scores, the new models, GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper, are designed to address specific challenges in building truly useful voice interfaces.

OpenAI highlights three emerging patterns in voice AI: voice-to-action, systems-to-voice, and voice-to-voice, each requiring sophistication beyond simple speech recognition. The ability to handle nuanced requests is showcased in example prompts, such as planning a dinner menu with specific dietary restrictions or delivering a team announcement with varying levels of enthusiasm. Josh Weisberg, SVP and Head of AI at Zillow, says that “What stood out about GPT-Realtime-2 was the intelligence and tool-calling reliability it brings to complex voice interactions.” He notes that on Zillow’s hardest adversarial benchmark, this translates to a 26-point lift in call success rate after prompt optimization, reaching 95%. The impact extends beyond single-language interactions: GPT-Realtime-Translate supports over 70 input languages and 13 output languages, enabling live speech translation across a wide range of combinations and marking a significant improvement over many existing tools.
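To make the voice-to-action pattern concrete, the sketch below shows how an agent’s tools might be declared and how a model-emitted tool call gets routed to application code. The tool names (search_listings, schedule_tour), their parameters, and the simulated call are illustrative assumptions rather than Zillow’s implementation or OpenAI’s published schema for these models; only the general JSON-schema style of function tools is assumed.

```python
# Hypothetical sketch of the voice-to-action pattern: declare tool schemas the
# agent may call, then route a model-emitted tool call to local code.
# Tool names, arguments, and the example payload are illustrative assumptions.
import json

TOOLS = [
    {
        "type": "function",
        "name": "search_listings",
        "description": "Find homes matching budget and location filters.",
        "parameters": {
            "type": "object",
            "properties": {
                "max_price": {"type": "number"},
                "avoid_busy_streets": {"type": "boolean"},
            },
            "required": ["max_price"],
        },
    },
    {
        "type": "function",
        "name": "schedule_tour",
        "description": "Book a tour for a specific listing on a given date.",
        "parameters": {
            "type": "object",
            "properties": {
                "listing_id": {"type": "string"},
                "date": {"type": "string"},
            },
            "required": ["listing_id", "date"],
        },
    },
]

def search_listings(max_price, avoid_busy_streets=False):
    # Placeholder business logic; a real agent would query a listings service.
    return [{"listing_id": "L-123", "price": max_price * 0.9}]

def schedule_tour(listing_id, date):
    return {"confirmed": True, "listing_id": listing_id, "date": date}

HANDLERS = {"search_listings": search_listings, "schedule_tour": schedule_tour}

def dispatch(tool_call):
    """Route one tool call (name plus JSON-encoded arguments) to local code."""
    handler = HANDLERS[tool_call["name"]]
    return handler(**json.loads(tool_call["arguments"]))

# Simulated call the model might emit for "find me homes within my budget".
print(dispatch({"name": "search_listings", "arguments": '{"max_price": 450000}'}))
```

In production the tool call would arrive as an event from a live voice session rather than a hard-coded dictionary, and its result would be passed back for the agent to speak aloud.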

Deutsche Telekom is currently testing the model for multilingual voice interactions, seeking to create more natural cross-language conversations. Prateek Sachan, Co-founder & CTO at BolnaAI, notes that “Building voice AI for India means handling diverse regional phonetics.” In BolnaAI’s evaluations across Hindi, Tamil, and Telugu, GPT-Realtime-Translate delivered a 12.5% lower Word Error Rate than any other model they tested, along with lower fallback rates, higher task completion, and latency that sustained natural conversation. The context window has also grown from 32K to 128K tokens, allowing longer, more coherent conversations and supporting more complex, agentic task flows.
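For context on the figure BolnaAI cites, Word Error Rate is the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the model’s output, divided by the number of words in the reference. The short function below is a generic illustration of that metric, not BolnaAI’s evaluation harness.

```python
# Word Error Rate: word-level Levenshtein distance divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted word out of five reference words gives a WER of 0.2 (20%).
print(wer("please confirm my order number", "please confirm my order numbers"))
```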

GPT-Realtime-Translate Enables 70+ Language Realtime Voice Translation

The demand for seamless, real-time communication across linguistic barriers is rapidly reshaping how individuals and businesses interact, and while automated translation tools have existed for years, truly natural and responsive voice translation has remained a significant challenge. Existing systems often struggle with nuanced speech patterns, contextual understanding, and conversational flow, producing clunky or inaccurate translations that hinder rather than facilitate communication. Recent advancements are beginning to address these limitations, moving beyond simple transcription to systems capable of genuine comprehension and adaptive response. GPT-Realtime-Translate doesn’t merely convert words from one language to another; it actively processes meaning to maintain context and deliver fluent translations. This is particularly evident in its ability to handle complex requests and adapt to conversational shifts, something earlier-generation models struggled to do.

The company highlights that GPT-Realtime-2 is the first voice offering built with GPT-5-class reasoning, enabling intelligence beyond simple speech recognition. Deutsche Telekom is currently leveraging the model to explore multilingual voice interactions, aiming for a more natural feel in cross-language conversations and demonstrating a practical application beyond simple translation exercises. BolnaAI, which focuses on the Indian market, has found the model particularly effective at handling the country’s diverse regional languages and phonetics, according to Co-founder & CTO Prateek Sachan. The implications extend beyond individual conversations, with potential applications in global events, education, and content creation. Vimeo, for example, is demonstrating how the model can translate product education videos live, giving a global audience access to information without the delays of traditional dubbing or subtitling. These developments suggest a future where language is no longer a barrier to communication, fostering greater collaboration and understanding across cultures and communities.
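The live-translation use cases Deutsche Telekom and Vimeo describe reduce to a streaming loop: capture speech in one language, forward it to a translation session, and play the translated audio as it arrives. The sketch below shows only that loop shape; the MockTranslationSession class and its send_audio/receive_audio methods are stand-ins invented for illustration, not a documented client for GPT-Realtime-Translate.

```python
# Structural sketch of a live speech-translation loop. The session class is a
# mock so the example runs end to end; a real client would stream audio over a
# network connection to the translation model and receive synthesized speech.
import asyncio

class MockTranslationSession:
    def __init__(self, source_lang: str, target_lang: str):
        self.source_lang, self.target_lang = source_lang, target_lang
        self._out = asyncio.Queue()

    async def send_audio(self, chunk: bytes) -> None:
        # Echo a fake "translated" chunk so the loop below has output to play.
        await self._out.put(b"translated:" + chunk)

    async def receive_audio(self) -> bytes:
        return await self._out.get()

async def translate_stream(chunks, source_lang="de", target_lang="en"):
    session = MockTranslationSession(source_lang, target_lang)
    for chunk in chunks:
        await session.send_audio(chunk)              # stream the input speech
        translated = await session.receive_audio()   # translated audio arrives
        print(f"play {len(translated)} bytes of {target_lang} audio")

# Three fake audio chunks stand in for a microphone feed.
asyncio.run(translate_stream([b"chunk-1", b"chunk-2", b"chunk-3"]))
```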

Building voice AI for India means handling diverse regional phonetics. In our evals across Hindi, Tamil, and Telugu, GPT-Realtime-Translate delivered 12.5% lower Word Error Rates than any other model we tested, along with lower fallback rates, higher task completion, and latency that sustained natural conversation.

Prateek Sachan, Co-founder & CTO at BolnaAI

Voice-to-Action, Systems-to-Voice, and Voice-to-Voice AI Patterns Emerge

Prateek Sachan, Co-founder & CTO of BolnaAI, a company focused on Indian language AI, reports that the new models achieved a 12.5% lower Word Error Rate. This emphasis on nuanced linguistic support signals a shift toward more inclusive and globally accessible voice technologies, moving beyond the limitations of models primarily trained on standard English. The emergence of these advanced voice models is driving three distinct patterns in how developers approach voice AI: voice-to-action, systems-to-voice, and voice-to-voice interactions. Voice-to-action applications let users describe a need and have the system reason through the request, use tools, and complete the task autonomously; Zillow, for example, is developing an assistant capable of responding to requests like “find me homes within my BuyAbility, avoid busy streets, and schedule a tour for Saturday.” This level of agency represents a significant leap beyond simple voice commands, enabling more complex and personalized experiences.

Systems-to-voice, conversely, focuses on software proactively delivering spoken guidance based on contextual awareness; a travel app, for instance, could alert a traveler to a delayed flight and immediately map the fastest route to their connecting gate. These patterns aren’t mutually exclusive, and their combined potential is exemplified by Priceline, which is working toward a future where travelers can manage entire trips by voice, encompassing flight and hotel searches, reservation adjustments, and real-time updates. The model’s ability to handle longer context windows, up to 128K, further supports more coherent and complex task flows, while adjustable reasoning levels let developers balance latency against the need for deliberate processing. GPT-Realtime-2 also scores 15.2% higher on Big Bench Audio and 13.8% higher on Audio MultiChallenge than GPT-Realtime-1.5. This granular control over the model’s behavior is essential for deploying robust and reliable voice agents in production environments.
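The latency-versus-deliberation trade-off can be pictured as a per-session configuration choice. The field names below (reasoning_level, max_context_tokens, parallel_tool_calls) are assumptions made purely to illustrate the idea of adjustable reasoning levels and the 128K context window; they are not OpenAI’s published parameters.

```python
# Illustrative session presets: a fast profile for snappy turn-taking and a
# deliberate profile for long agentic task flows. All field names are assumed.
LOW_LATENCY_SESSION = {
    "model": "gpt-realtime-2",       # model name as given in the article
    "reasoning_level": "low",        # assumed knob: favor quick responses
    "max_context_tokens": 128_000,   # the article cites a 128K context window
    "parallel_tool_calls": True,
}

DELIBERATE_SESSION = {
    **LOW_LATENCY_SESSION,
    "reasoning_level": "high",       # assumed knob: favor deliberate processing
}

def pick_session(task_steps: int) -> dict:
    """Use the deliberate profile for multi-step tasks, the fast one otherwise."""
    return DELIBERATE_SESSION if task_steps > 3 else LOW_LATENCY_SESSION

print(pick_session(task_steps=5)["reasoning_level"])  # -> "high"
```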

Enhanced Reasoning with 128K Context Window in GPT-Realtime-2

The arrival of GPT-Realtime-2 signals a substantial leap forward in voice assistant capabilities, moving beyond simple transcription to genuine conversational understanding. OpenAI’s newest model isn’t merely reacting to spoken commands; it’s actively reasoning through requests, utilizing tools, and maintaining context over significantly extended interactions. This enhanced functionality stems, in part, from a dramatic expansion of the model’s context window, now reaching 128K tokens. This allows GPT-Realtime-2 to sustain more coherent sessions and manage increasingly complex task flows, crucial for applications demanding nuanced understanding and prolonged engagement. The implications for real-world applications are considerable: OpenAI’s example prompts require intricate planning, such as devising a dinner menu that accommodates dietary restrictions and kitchen limitations, or crafting team announcements with varying degrees of confidence.

A particularly illustrative example involves telling the model “My order number is Orbit-742Q,” highlighting its ability to process and recall critical details in real time, a feature vital for customer service and task management systems. GPT-Realtime-2 scores 15.2% higher on Big Bench Audio, a benchmark that assesses challenging reasoning capabilities in language models that accept audio input, and 13.8% higher on Audio MultiChallenge than GPT-Realtime-1.5. The model’s architecture incorporates features developers can enable to enhance the user experience, such as “preambles” (short phrases signaling the agent is processing a request) and parallel tool calls, which let the model perform multiple actions simultaneously while providing audible feedback such as “checking your calendar” or “looking that up now.” These features contribute to a more responsive and transparent interaction, and the combination of competence and safety features, according to Weisberg, makes the model viable for production voice applications.
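The preamble-plus-parallel-tool-calls behavior maps naturally onto concurrent execution: acknowledge the user immediately, run the slow lookups at the same time, then summarize the results. The sketch below simulates that pattern with stand-in speak() and tool functions; it is not a published SDK interface.

```python
# Preamble + parallel tool calls: speak a short acknowledgement right away,
# run both tools concurrently, then report back. All functions are stand-ins.
import asyncio

async def speak(text: str) -> None:
    print(f"[agent says] {text}")    # a real agent would synthesize audio here

async def check_calendar() -> str:
    await asyncio.sleep(0.5)         # simulate a slow backend call
    return "Saturday at 2pm is free"

async def look_up_order(order_id: str) -> str:
    await asyncio.sleep(0.8)
    return f"order {order_id} ships tomorrow"

async def handle_request(order_id: str) -> None:
    # Preamble: immediate audible feedback while the work runs in the background.
    await speak("Checking your calendar and looking that up now.")
    calendar, order = await asyncio.gather(check_calendar(), look_up_order(order_id))
    await speak(f"{calendar}, and {order}.")

asyncio.run(handle_request("Orbit-742Q"))
```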

What stood out about GPT-Realtime-2 was the intelligence and tool-calling reliability it brings to complex voice interactions. On our hardest adversarial benchmark, this translates to a 26-point lift in call success rate after prompt optimization (95% vs.

Josh Weisberg, SVP and Head of AI at Zillow

Rusty Flint

Rusty is a quantum science nerd. He's been into academic science all his life, but spent his formative years doing less academic things. Now he turns his attention to writing about his passion, the quantum realm. He especially loves all things Quantum Physics, and likes the more esoteric side of Quantum Computing and the Quantum world, everything from Quantum Entanglement on up. Rusty thinks we are at the 1950s quantum equivalent of the classical computing world. While other quantum journalists focus on IBM's latest chip or which startup just raised $50 million, Rusty's over here writing 3,000-word deep dives on whether quantum entanglement might explain why you sometimes think about someone right before they text you. (Spoiler: it doesn't, but the exploration is fascinating.)
