How Vodafone discovered 33% of TOBi's answers were wrong

67% → 85% response accuracy

+NPS: from negative to positive

Vodafone's virtual assistant, TOBi, handles millions of customer interactions across multiple markets. On paper, it was performing well. Intent accuracy sat around 96%, fallback rates hovered near 5%, and the dashboards pointed to a well-trained, stable AI.

But customers told a different story. NPS stayed negative. Feedback kept repeating the same theme: the assistant wasn't helping. TOBi wasn't failing to respond; it was failing while appearing to succeed.

Why high intent accuracy didn't mean high quality

When Vodafone partnered with Inquio, the focus shifted from model metrics to actual conversations. Instead of asking whether the assistant answered, the analysis asked a harder question: was the answer correct and useful?

Reviewing real interactions made the problem obvious. Users weren't hitting dead ends. They were being led into them. TOBi responded confidently, often fluently, but with answers that didn't solve the problem. Phrases like "This doesn't help" and "You're just repeating the same thing" weren't edge cases. They were patterns.
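The difference between "did the assistant answer?" and "was the answer correct and useful?" can be made concrete with a small sketch. Everything in it is hypothetical: the `ReviewedTurn` record, its field names, and the four sample turns are invented for illustration and are not Vodafone's or Inquio's data or tooling. It only shows how one set of reviewed conversations can score well on intent accuracy and poorly on response accuracy.

```python
from dataclasses import dataclass


@dataclass
class ReviewedTurn:
    """One reviewed bot turn (hypothetical schema, for illustration only)."""
    predicted_intent: str
    true_intent: str
    answer_resolved_issue: bool  # a human reviewer's judgment of the answer


def intent_accuracy(turns):
    """Share of turns where the classifier picked the right intent."""
    return sum(t.predicted_intent == t.true_intent for t in turns) / len(turns)


def response_accuracy(turns):
    """Share of turns where the answer actually helped the user."""
    return sum(t.answer_resolved_issue for t in turns) / len(turns)


# Made-up sample: every intent is recognized, but only half the answers help.
turns = [
    ReviewedTurn("billing", "billing", True),
    ReviewedTurn("billing", "billing", False),
    ReviewedTurn("roaming", "roaming", True),
    ReviewedTurn("roaming", "roaming", False),
]

print(f"Intent accuracy:   {intent_accuracy(turns):.0%}")
print(f"Response accuracy: {response_accuracy(turns):.0%}")
```

A dashboard built only on the first metric would look healthy for this sample, while the second shows that half of the users left without a solution.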

What Inquio's analysis uncovered

Only 67% of TOBi's answers were actually correct. The remaining 33% were false positives, responses that looked valid but weren't.

Nearly 70% of those were hallucinations, where the AI generated plausible but incorrect answers.
The rest came from gaps in conversation design or unclear guidance.
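Expressed as shares of all answers rather than shares of false positives, the breakdown above works out roughly as follows. This is a back-of-the-envelope sketch that treats "nearly 70%" as exactly 70%, so the results are approximations and not figures reported in the analysis.

```python
# Figures quoted in the case study.
correct_share = 0.67
false_positive_share = 1 - correct_share  # the remaining 33%

# Assumption: "nearly 70%" is rounded to 0.70 for illustration.
hallucination_fraction = 0.70

# Convert fractions of false positives into fractions of all answers.
hallucination_share = false_positive_share * hallucination_fraction
design_gap_share = false_positive_share - hallucination_share

print(f"Hallucinations: about {hallucination_share:.0%} of all answers")
print(f"Design or guidance gaps: about {design_gap_share:.0%} of all answers")
```

Under that assumption, roughly one answer in four was a hallucination and about one in ten traced back to conversation design or unclear guidance.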

The system had been optimized to minimize silence, not to maximize correctness. In practice, that meant users were getting answers even when the assistant didn't understand or couldn't help.

"Inquio allows us to better understand conversations with our users, which helps us make sure our bot responds more accurately"

Lukas Spurny, Product Owner

How fixing bot accuracy changed the customer experience

By shifting the focus from answering everything to answering correctly, Vodafone changed the experience fundamentally. Response accuracy improved from 67% to around 85%, false positives dropped, and NPS moved from negative into positive territory.

More importantly, the perception of TOBi changed. It went from something that blocked users to something that could actually help. In conversational AI, a wrong answer doesn't just fail. It erodes trust.