Frontier ASR Models Struggle with Bilingual Conversations
According to a detailed benchmark published by HuggingFace and ServiceNow AI, today's most advanced automatic speech recognition (ASR) systems—including OpenAI's Whisper large-v3, Google's Chirp, and Meta's MMS—show significant accuracy drops when processing code-switched speech, where speakers alternate between two languages mid-conversation. The study found that accuracy plummets by as much as 35% on common language pairs like English-Spanish and English-Mandarin compared to monolingual performance, exposing a critical blind spot in voice agent deployments for global markets.
What the Benchmark Measures
The team at ServiceNow AI constructed a custom dataset of real-world bilingual customer service recordings, spanning three language pairs: English-Spanish, English-Mandarin, and English-Hindi. Each sample contained natural code-switching—where a speaker might say, “Can you check the status of mi pedido?” within the same utterance. The benchmark evaluated word error rate (WER), speaker diarization accuracy, and language identification precision for models including Whisper large-v3 (OpenAI), Chirp (Google), MMS (Meta), and two proprietary models from ServiceNow.
Key findings include:
- Whisper large-v3 achieved a WER of 14.2% on English-Spanish code-switched speech, compared to 6.8% on monolingual English.
- Google's Chirp dropped from 7.5% WER (monolingual) to 18.9% on English-Mandarin code-switching.
- Language identification accuracy fell below 70% for all models on English-Hindi mixtures.
- None of the models correctly handled intra-sentential switching (switching languages within a single sentence) in more than 40% of cases.
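For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The sketch below is a minimal illustration of that standard definition, not the benchmark's own scoring code. The hypothesis transcript is invented for illustration, and the helper `relative_increase` is a hypothetical name used only to show the arithmetic behind the figures above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Naive normalization: lowercase and split on whitespace (no punctuation handling).
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_increase(baseline_wer: float, degraded_wer: float) -> float:
    """Hypothetical helper: how many times larger the degraded WER is."""
    return degraded_wer / baseline_wer


# Reference utterance from the article; the hypothesis is an invented ASR output.
reference = "can you check the status of mi pedido"
hypothesis = "can you check the status of me pedido"
wer = word_error_rate(reference, hypothesis)  # one substitution over eight words

whisper_ratio = relative_increase(6.8, 14.2)  # figures reported above
chirp_ratio = relative_increase(7.5, 18.9)    # figures reported above
```

With the reported figures, the Whisper ratio comes out near 2.1 and the Chirp ratio near 2.5. In other words, both models' error rates more than double on code-switched speech.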
Why This Matters for AI Developers and Businesses
For developers building voice assistants, call center automation, or multilingual customer support bots, this benchmark carries a sobering message: deploying a single ASR model across bilingual regions without code-switching testing is likely to result in poor user experiences and inflated error rates. The benchmark notes that many production pipelines rely on language-specific models tuned with monolingual data, which introduces cascading failures. If the language identifier incorrectly tags a Spanish phrase as English, downstream natural language understanding (NLU) systems that rely on language-specific pipelines can misinterpret user intent entirely.
Businesses with global customer bases—especially in markets like the US Southwest, India, Singapore, and parts of Europe—face the highest risk. For example, a Spanish-English bilingual customer calling a telecom helpline might be incorrectly routed to an English-only agent if the ASR fails to detect the code-switched query, leading to frustration and higher abandonment rates.
Where the Current Leaders Fall Short
Whisper large-v3, widely considered the gold standard for open-source ASR, performed best overall but still exhibited a 2x WER increase on code-switched segments. Google's Chirp, which relies on acoustic and language models separately, showed better monolingual accuracy but worse code-switch detection—likely because its language identification module was trained on clean, single-language utterances. Meta's MMS, built on a massive multilingual dataset, fared better on African language pairs but lagged on high-resource pairs like English-Spanish, suggesting uneven training data coverage.
Importantly, the benchmark also revealed that current end-to-end models (Whisper, Chirp) outperformed cascade systems (separate acoustic model + language ID + decoder) by 12% on average, but still failed on switching patterns common in real conversations, such as clause-level switching (“I think, pero no estoy seguro”) and tag-switching (“Yes, está bien”).
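Teams that want to check how their own test sets cover these patterns can start by locating where each utterance changes language. The sketch below assumes each token already carries a language tag. The tags here are hand-labeled for illustration, and the function name `switch_points` is hypothetical rather than anything taken from the benchmark.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, language code), hand-labeled in this sketch


def switch_points(tokens: List[Token]) -> List[int]:
    """Return the indices of tokens whose language differs from the previous token's."""
    return [
        i for i in range(1, len(tokens))
        if tokens[i][1] != tokens[i - 1][1]
    ]


# Clause-level example from the article: "I think, pero no estoy seguro"
clause_level = [("I", "en"), ("think", "en"), ("pero", "es"),
                ("no", "es"), ("estoy", "es"), ("seguro", "es")]

# Tag-switching example from the article: "Yes, está bien"
tag_switch = [("Yes", "en"), ("está", "es"), ("bien", "es")]

clause_switches = switch_points(clause_level)  # [2]
tag_switches = switch_points(tag_switch)       # [1]
```

Knowing the switch index makes it possible to report error rates separately for the words around a switch, which is where the article says models fail most often.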
What Developers Can Do Now
The HuggingFace blog post provides actionable recommendations for teams deploying voice agents in bilingual contexts:
- Test with code-switched datasets before production. ServiceNow has open-sourced a subset of their benchmark for community use.
- Consider using ensemble models that combine a primary ASR with a separate language identification system trained on mixed-language data.
- Implement post-processing correction layers using large language models (LLMs) that can infer the correct language from context when ASR output is ambiguous.
- For high-stakes applications like financial services or healthcare, maintain human-in-the-loop fallback for utterances with low confidence scores in code-switched segments.
Early adopters of these recommendations, including ServiceNow's own production pipeline, have reported a 20% reduction in misrouted calls in Spanish-English service centers.
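The human-in-the-loop recommendation could be sketched as a simple routing rule that applies a stricter confidence threshold to segments containing more than one language. Everything below is an assumption made for illustration: the `Segment` fields, the 0-to-1 confidence scale, and both threshold values are hypothetical and would need tuning against real traffic.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Segment:
    text: str
    confidence: float          # ASR confidence, assumed to lie in [0, 1]
    languages: FrozenSet[str]  # language codes detected within the segment


def needs_human_review(segment: Segment,
                       mono_threshold: float = 0.6,
                       mixed_threshold: float = 0.8) -> bool:
    """Flag a segment for review, with a higher bar for code-switched segments.

    Both thresholds are illustrative defaults, not values from the benchmark.
    """
    is_code_switched = len(segment.languages) > 1
    threshold = mixed_threshold if is_code_switched else mono_threshold
    return segment.confidence < threshold


mixed = Segment("Can you check the status of mi pedido?", 0.72,
                frozenset({"en", "es"}))
mono = Segment("Can you check the status of my order?", 0.72,
               frozenset({"en"}))

mixed_flagged = needs_human_review(mixed)  # True: 0.72 is below the mixed threshold
mono_flagged = needs_human_review(mono)    # False: 0.72 clears the monolingual threshold
```

Using two thresholds reflects the article's finding that transcripts of mixed-language speech are less reliable, so the same confidence score warrants more caution when a switch is detected.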
The Road Ahead for Multimodal Speech AI
This benchmark arrives as the industry races toward multimodal agents that combine speech, text, and vision. The code-switching blind spot poses an even larger challenge for these systems—if a voice agent cannot reliably transcribe a bilingual query, it cannot feed accurate data to a vision-language model or a reasoning engine. The research community is already responding: new training strategies such as language-adversarial training and contrastive pretraining on mixed-language audio are showing promise in internal experiments.
For now, the message to AI teams is clear: if your voice agent serves bilingual users, you cannot assume your ASR system handles code-switching well. Independent benchmarks like this one serve as an essential reality check, revealing that the next frontier for speech AI isn't just more languages—it's the seamless blending of languages that real humans use every day.
Source: HuggingFace Blog. This article was produced with AI assistance and reviewed for accuracy.