Berlin-based Parloa is tackling a central challenge in voice AI: consistent performance in real-world customer interactions. Co-founder Stefan Ostwald discovered the need firsthand after spending a day immersed in an insurance call center, observing repetitive requests such as password resets and policy questions. That experience drove the development of Parloa’s Agent Management Platform (AMP), which lets non-technical business experts build and deploy AI agents without writing code. “The models only matter if they work in production,” says Ciaran O’Reilly Ibañez, Engineering Manager at Parloa. The company works closely with OpenAI to optimize models for speed and reliability, and runs simulations with models like GPT‑5.4 to rigorously test agents before they handle live customer conversations.
Rule-Based Origins & Transition to Natural Language Agents
Early voice AI relied on painstakingly crafted rules, but a shift toward natural language understanding is enabling more adaptable and scalable customer service. Parloa’s origins illustrate this evolution: after co-founder Stefan Ostwald observed repetitive tasks in an insurance call center, the company began by building rule-based voice agents to automate routine, high-volume interactions. That approach worked but proved limiting, prompting a move toward systems that could interpret and respond to customer needs with greater nuance. Instead of rigidly mapping out conversational flows, Parloa’s Agent Management Platform (AMP) lets teams define agent behavior in natural language and connect agents to internal systems for rapid iteration and testing. To validate agents against realistic customer scenarios before live deployment, the platform runs simulations with models like GPT‑5.4, one model acting as the caller and another running the configured agent.
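The caller/agent simulation loop described above could be sketched as follows. This is a minimal illustration, not Parloa’s implementation: the `complete()` helper is a hypothetical stand-in for a chat-model call, stubbed here with canned replies so the loop runs end to end.

```python
# Sketch of a caller/agent simulation loop. `complete()` is a hypothetical
# wrapper around any chat-completion API; here it is stubbed with
# deterministic replies so the example is runnable.

def complete(system_prompt: str, history: list[str]) -> str:
    # Stub: a real implementation would send `system_prompt` plus the
    # turn history to a chat model and return its reply.
    if "caller" in system_prompt:
        return "I forgot my password." if not history else "Thanks, goodbye."
    return "I can help reset it. Anything else?" if history else "Hello!"

def simulate(caller_prompt: str, agent_prompt: str, max_turns: int = 4) -> list[tuple[str, str]]:
    """Alternate turns between a simulated caller and the configured agent."""
    transcript: list[tuple[str, str]] = []
    history: list[str] = []
    for _ in range(max_turns):
        caller_msg = complete(caller_prompt, history)
        history.append(caller_msg)
        agent_msg = complete(agent_prompt, history)
        history.append(agent_msg)
        transcript.append((caller_msg, agent_msg))
    return transcript

transcript = simulate("You are the caller: you need a password reset.",
                      "You are the support agent.")
```

The transcript produced by such a loop is what downstream evaluations would score for instruction-following and task completion.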
Parloa prioritizes consistent performance in production, recognizing that theoretical benchmarks don’t always translate to real-world success. “We work closely with OpenAI on how to make the models fast and reliable enough for real-time conversations,” Ibañez adds. This collaboration focuses on optimizing models for speed and reliability, a key challenge for voice AI operating under real-world conditions. Parloa’s modular approach, which separates tasks into distinct sub-agents, further improves instruction-following and simplifies system evolution, keeping even complex agents manageable and predictable at scale.
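The sub-agent pattern can be sketched as a simple router that dispatches each task to a narrowly scoped handler, so every sub-agent carries only the instructions it needs. The intents and handlers below are purely illustrative, not Parloa’s actual agents.

```python
# Illustrative sketch of modular sub-agents: a router picks a specialized
# handler per intent, keeping each sub-agent's instruction set small.
# All names here are invented for illustration.

def password_reset_agent(request: str) -> str:
    return "Sending a password-reset link."

def policy_questions_agent(request: str) -> str:
    return "Here is a summary of your policy."

SUB_AGENTS = {
    "password_reset": password_reset_agent,
    "policy_question": policy_questions_agent,
}

def route(intent: str, request: str) -> str:
    # In production the intent might come from a classifier or the LLM
    # itself; unknown intents fall back to a human handoff.
    handler = SUB_AGENTS.get(intent)
    return handler(request) if handler else "Transferring you to a human agent."
```

Because each handler is independent, adding or revising one task does not risk regressions in the others, which is the manageability benefit the text describes.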
Agent Management Platform (AMP) Enables No-Code AI Building
Beyond its early rule-based systems, Parloa has developed the Agent Management Platform (AMP) to democratize AI agent creation, shifting power to subject matter experts rather than relying solely on developers. The platform addresses a critical gap in the current AI landscape, where many solutions require substantial coding expertise and thus limit broader adoption within organizations. AMP manages the entire AI agent lifecycle, letting teams define an agent’s role, instructions, tools, and boundaries in a streamlined way, and supports rapid iteration through built-in simulations and evaluations rather than static intent mapping. Before deployment, Parloa tests agents rigorously, using models like GPT‑4.1 and GPT‑5‑mini to simulate customer interactions. These simulations aren’t merely theoretical; they are designed to mirror real-world scenarios, evaluating instruction-following, tool usage, and task completion.
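As a rough illustration of what such an agent definition captures, a role/instructions/tools/boundaries spec might look like the hypothetical structure below. AMP itself is a no-code platform; all field names and values here are invented for illustration, along with a deterministic completeness check of the kind a platform might run before deployment.

```python
# Hypothetical agent spec mirroring the fields the text describes
# (role, instructions, tools, boundaries). Purely illustrative.

agent_spec = {
    "role": "Insurance support agent",
    "instructions": [
        "Verify the caller's identity before discussing any policy.",
        "Offer a password reset when the caller cannot log in.",
    ],
    "tools": ["lookup_policy", "send_reset_link"],  # illustrative tool names
    "boundaries": ["Never quote prices; hand off to sales instead."],
}

def validate_spec(spec: dict) -> bool:
    """Pre-deployment check: every required field is present and non-empty."""
    required = ("role", "instructions", "tools", "boundaries")
    return all(spec.get(field) for field in required)
```

A declarative definition like this is what makes it feasible for non-developers to author agents: the behavior lives in readable fields rather than code.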
Matthäus Deutsch, Senior Applied Scientist, highlights the importance of practical validation, stating, “It’s very important for us that things do not only work in theoretical benchmarks but in actual real use cases.” This evaluation-first methodology has demonstrably improved performance, with one global travel company achieving an 80 percent reduction in requests for a human agent. Parloa also collaborates directly with OpenAI, focusing on optimizing models for the unique demands of real-time voice conversations, ensuring speed and reliability at scale.
The models only matter if they work in production. We work closely with OpenAI on how to make the models fast and reliable enough for real-time conversations.
Ciaran O’Reilly Ibañez, Engineering Manager at Parloa
GPT-5.4 Simulations & LLM-as-a-Judge Evaluation Pipelines
That early call-center exposure prompted Parloa to move beyond rule-based agents toward a more dynamic approach, leveraging advanced language models to handle complex customer interactions. The company now runs simulations powered by models such as GPT‑5.4, with one model acting as the caller and another running the configured agent, to rigorously test agent performance before deployment. These simulations aren’t theoretical exercises: Parloa mirrors real production agents, subjecting them to tests that measure instruction-following, API consistency, and latency under realistic conditions. A core component of this testing is “LLM-as-a-judge” scoring, used alongside deterministic checks to assess whether agents correctly followed instructions, utilized tools effectively, and successfully completed tasks. This evaluation-first approach is particularly vital for enterprise clients, where stability is paramount. “When a new model comes out, we run our benchmarking suite against it,” says Senior Applied Scientist Matthäus Deutsch.
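An evaluation step combining deterministic checks with LLM-as-a-judge scoring could be sketched like this. The judge is stubbed with a keyword check so the example runs; a real judge would prompt a model with a scoring rubric. All function and field names are assumptions, not Parloa’s pipeline.

```python
# Sketch of an evaluation step: cheap deterministic checks run first,
# then an LLM judge scores the fuzzier criteria. The judge is stubbed.

def deterministic_checks(transcript: str, expected_tool: str, tool_calls: list[str]) -> dict:
    return {
        "used_required_tool": expected_tool in tool_calls,
        "no_price_quoted": "$" not in transcript,  # illustrative boundary check
    }

def llm_judge(transcript: str, rubric: str) -> float:
    # Stub: a real judge would send `transcript` and `rubric` to a model
    # and parse a numeric score from its reply.
    return 1.0 if "reset link" in transcript else 0.0

def evaluate(transcript: str, expected_tool: str, tool_calls: list[str]) -> dict:
    result = deterministic_checks(transcript, expected_tool, tool_calls)
    result["task_completion"] = llm_judge(transcript, "Did the agent resolve the request?")
    result["passed"] = (result["used_required_tool"]
                        and result["no_price_quoted"]
                        and result["task_completion"] >= 0.5)
    return result

report = evaluate("Agent: I sent you a reset link.",
                  "send_reset_link", ["send_reset_link"])
```

Running checks like these over every simulated transcript is what lets a new model be benchmarked against the whole suite as soon as it ships.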
When a new model comes out, we run our benchmarking suite against it.
Matthäus Deutsch, Senior Applied Scientist
Low-Latency Voice Pipeline & Multilingual Global Deployment
The demand for seamless voice AI experiences is driving innovation beyond raw model capability, with intense focus on real-world performance. Parloa addresses this challenge with a system engineered for minimal delay, recognizing that even slight pauses in a voice interaction can significantly impact customer perception. Every component, from speech-to-text conversion to text-to-speech synthesis, undergoes rigorous independent testing; speech-to-text systems, for example, are assessed for word error rates, particularly when processing critical information like account numbers. These evaluations extend beyond isolated metrics: Parloa mirrors live production environments to simulate realistic conditions. This hands-on understanding of the customer service landscape informed a platform that prioritizes consistent performance, not just theoretical benchmarks, and the company collaborates closely with OpenAI to optimize models for both speed and dependability in real-time conversations.
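Word error rate, the speech-to-text metric mentioned above, is the word-level edit distance between a reference transcript and the recognizer’s hypothesis, divided by the reference length. A standard dynamic-programming sketch, independent of any particular vendor’s tooling:

```python
# Word error rate (WER): (substitutions + deletions + insertions) / reference length.
# Classic edit-distance computation over word sequences.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[-1][-1] / len(ref)

# One misheard digit-word out of five gives WER 0.2 (20%).
score = wer("account number four two seven", "account number four to seven")
```

For account numbers a single substitution like “two” heard as “to” is the failure mode such tests are designed to catch, which is why the metric is tracked per component rather than only end to end.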
Parloa’s Agent Management Platform (AMP) is designed to facilitate global deployments, with benchmarks conducted across multiple languages to ensure consistent performance regardless of region. This multilingual focus reflects the needs of their enterprise clients, who expect uniform service quality worldwide. Currently, Parloa’s agents manage millions of conversations across sectors like retail, travel, and insurance, handling everything from automated support to revenue-generating teleshopping interactions. The platform’s architecture allows for unified customer journeys, seamlessly transitioning between phone, chat, and interactive elements.
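A multilingual benchmark of this kind reduces, at its simplest, to running the same scenario suite per language and comparing pass rates. The sketch below stubs the scenario runner; the scenarios, languages, and the deliberately failing combination are all illustrative.

```python
# Sketch of a per-language benchmark: run the same scenario suite in each
# language and aggregate pass rates so regional regressions surface early.
# `run_scenario` is a stub; in practice it would drive a full simulation.

SCENARIOS = ["password_reset", "policy_question"]
LANGUAGES = ["en", "de", "es"]

def run_scenario(scenario: str, language: str) -> bool:
    # Stub: pretend everything passes except one known-hard combination,
    # to show how a regression would appear in the report.
    return not (scenario == "policy_question" and language == "es")

def benchmark() -> dict[str, float]:
    """Per-language pass rate across the scenario suite."""
    return {
        lang: sum(run_scenario(s, lang) for s in SCENARIOS) / len(SCENARIOS)
        for lang in LANGUAGES
    }

rates = benchmark()
```

Reporting one pass rate per language makes uneven quality visible at a glance, which is the guarantee enterprise clients expecting uniform service worldwide care about.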
With AMP, we can have subject matter experts from different business units actually build the agents and connect the APIs in a much leaner and simpler way.
Ciaran O’Reilly Ibañez, Engineering Manager at Parloa
Source: https://openai.com/index/parloa/
