Jun 10, 2026

New Benchmark Reveals Which Voice AI Models Fail at Code-Switched Speech

A HuggingFace and ServiceNow benchmark shows that OpenAI's Whisper and Google's Chirp struggle with bilingual code-switched speech. Developers need new strategies.

Frontier ASR Models Struggle with Bilingual Conversations

According to a detailed benchmark published by HuggingFace and ServiceNow AI, today's most advanced automatic speech recognition (ASR) systems—including OpenAI's Whisper large-v3, Google's Chirp, and Meta's MMS—show significant accuracy drops when processing code-switched speech, where speakers alternate between two languages mid-conversation. The study found that accuracy plummets by as much as 35% on common language pairs like English-Spanish and English-Mandarin compared to monolingual performance, exposing a critical blind spot in voice agent deployments for global markets.

What the Benchmark Measures

The team at ServiceNow AI constructed a custom dataset of real-world bilingual customer service recordings, spanning three language pairs: English-Spanish, English-Mandarin, and English-Hindi. Each sample contained natural code-switching—where a speaker might say, “Can you check the status of mi pedido?” within the same utterance. The benchmark evaluated word error rate (WER), speaker diarization accuracy, and language identification precision for models including Whisper large-v3 (OpenAI), Chirp (Google), MMS (Meta), and two proprietary models from ServiceNow.

Key findings include:

  • Whisper large-v3 achieved a WER of 14.2% on English-Spanish code-switched speech, compared to 6.8% on monolingual English.
  • Google's Chirp dropped from 7.5% WER (monolingual) to 18.9% on English-Mandarin code-switching.
  • Language identification accuracy fell below 70% for all models on English-Hindi mixtures.
  • None of the models correctly handled intra-sentential switching (switching languages within a single sentence) in more than 40% of cases.
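To make the WER figures above concrete, here is a minimal sketch of how word error rate is commonly computed: the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The function names and the garbled hypothesis transcript are illustrative assumptions, not taken from the benchmark, and the benchmark's own text normalization is not described in the article.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_increase(baseline: float, degraded: float) -> float:
    """Relative change from a baseline error rate to a degraded one."""
    return (degraded - baseline) / baseline


# Made-up example: the Spanish words are mis-transcribed as English-sounding ones.
reference = "can you check the status of mi pedido"
hypothesis = "can you check the status of me pay though"
print(word_error_rate(reference, hypothesis))   # 0.375 (3 errors / 8 reference words)

# Whisper large-v3 figures quoted above: 6.8% monolingual vs 14.2% code-switched.
print(round(relative_increase(6.8, 14.2), 2))   # 1.09, i.e. WER roughly doubles
```

Reporting the relative increase alongside the absolute WER makes it easier to compare models that start from different monolingual baselines.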

Why This Matters for AI Developers and Businesses

For developers building voice assistants, call center automation, or multilingual customer support bots, this benchmark carries a sobering message: deploying a single ASR model across bilingual regions without code-switching testing will result in poor user experiences and inflated error rates. The benchmark paper notes that many production pipelines rely on language-specific models tuned with monolingual data, which introduces cascading failures—if the language identifier incorrectly tags a Spanish phrase as English, downstream natural language understanding (NLU) systems that rely on language-specific pipelines will misinterpret user intent entirely.
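The cascading failure described above can be illustrated with a toy sketch. Everything here is hypothetical: the keyword tables, intent names, and function names are invented for illustration and stand in for real language-specific NLU pipelines.

```python
from typing import Dict, List

# Hypothetical per-language keyword tables standing in for language-specific NLU.
INTENT_KEYWORDS: Dict[str, Dict[str, str]] = {
    "en": {"order": "check_order_status", "bill": "billing_question"},
    "es": {"pedido": "check_order_status", "factura": "billing_question"},
}


def detect_intent(text: str, lang: str) -> str:
    """Look up an intent using only the keyword table for one language."""
    for word in text.lower().split():
        intent = INTENT_KEYWORDS[lang].get(word.strip("?.,!¿¡"))
        if intent:
            return intent
    return "unknown"


def detect_intent_multi(text: str, langs: List[str]) -> str:
    """Try every candidate language instead of trusting a single language tag."""
    for lang in langs:
        intent = detect_intent(text, lang)
        if intent != "unknown":
            return intent
    return "unknown"


utterance = "Can you check the status of mi pedido?"
print(detect_intent(utterance, "en"))                # unknown: the Spanish keyword is invisible
print(detect_intent(utterance, "es"))                # check_order_status
print(detect_intent_multi(utterance, ["en", "es"]))  # check_order_status
```

The point of the sketch is only that a single wrong language tag upstream silently removes the signal the downstream component needs; a real NLU system would fail in less obvious ways.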

Businesses with global customer bases—especially in markets like the US Southwest, India, Singapore, and parts of Europe—face the highest risk. For example, a Spanish-English bilingual customer calling a telecom helpline might be incorrectly routed to an English-only agent if the ASR fails to detect the code-switched query, leading to frustration and higher abandonment rates.

Where the Current Leaders Fall Short

Whisper large-v3, widely considered the gold standard for open-source ASR, performed best overall but still showed a roughly 2x WER increase on code-switched segments (from 6.8% to 14.2% on English-Spanish). Google's Chirp, which relies on acoustic and language models separately, showed better monolingual accuracy but worse code-switch detection, likely because its language identification module was trained on clean, single-language utterances. Meta's MMS, built on a massive multilingual dataset, fared better on African language pairs but lagged on high-resource pairs like English-Spanish, suggesting uneven training data coverage.

Importantly, the benchmark also revealed that current end-to-end models (Whisper, Chirp) outperformed cascade systems (separate acoustic model + language ID + decoder) by 12% on average, but still failed on switching patterns common in real conversations, such as clause-level switching (“I think, pero no estoy seguro”) and tag-switching (“Yes, está bien”).
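When building a test set around these patterns, it can help to quantify where and how much each utterance switches. The sketch below assumes each word has already been hand-labeled with a language code; the labels, type alias, and function names are illustrative assumptions, and the article does not describe the benchmark's own annotation scheme.

```python
from collections import Counter
from typing import List, Tuple

Token = Tuple[str, str]  # (word, language code), labels assumed to be hand-assigned


def switch_points(tokens: List[Token]) -> List[int]:
    """Indices where a token's language differs from the previous token's."""
    return [i for i in range(1, len(tokens)) if tokens[i][1] != tokens[i - 1][1]]


def minority_fraction(tokens: List[Token]) -> float:
    """Share of tokens that are not in the utterance's dominant language."""
    counts = Counter(lang for _, lang in tokens)
    return 1 - max(counts.values()) / len(tokens)


clause_switch = [("I", "en"), ("think", "en"), ("pero", "es"),
                 ("no", "es"), ("estoy", "es"), ("seguro", "es")]
tag_switch = [("Yes", "en"), ("está", "es"), ("bien", "es")]

print(switch_points(clause_switch))   # [2]
print(switch_points(tag_switch))      # [1]
print(round(minority_fraction(clause_switch), 2))  # 0.33
```

Bucketing test utterances by number of switch points or by minority-language share is one simple way to check whether error rates rise with the amount of mixing.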

What Developers Can Do Now

The HuggingFace blog post provides actionable recommendations for teams deploying voice agents in bilingual contexts:

  • Test with code-switched datasets before production. ServiceNow has open-sourced a subset of their benchmark for community use.
  • Consider using ensemble models that combine a primary ASR with a separate language identification system trained on mixed-language data.
  • Implement post-processing correction layers using large language models (LLMs) that can infer the correct language from context when ASR output is ambiguous.
  • For high-stakes applications like financial services or healthcare, maintain human-in-the-loop fallback for utterances with low confidence scores in code-switched segments.

Early adopters of this approach, including ServiceNow's own production pipeline, have reported a 20% reduction in misrouted calls in Spanish-English service centers.
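The human-in-the-loop recommendation can be sketched as a simple routing rule. This is a minimal illustration under stated assumptions, not ServiceNow's pipeline: the Segment fields, the function name, and the 0.8 threshold are all hypothetical, and real ASR systems expose confidence in different ways.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    text: str
    confidence: float       # ASR confidence in [0, 1] (hypothetical field)
    is_code_switched: bool  # set by a separate language-ID pass (hypothetical field)


def needs_human_review(segments: List[Segment], threshold: float = 0.8) -> bool:
    """Flag a call when any code-switched segment falls below the confidence threshold."""
    return any(s.is_code_switched and s.confidence < threshold for s in segments)


call = [
    Segment("I need help with my account", confidence=0.95, is_code_switched=False),
    Segment("can you check the status of mi pedido", confidence=0.62, is_code_switched=True),
]
print(needs_human_review(call))  # True: the mixed-language segment is low confidence
```

The threshold would need tuning against real traffic, since setting it too high sends most bilingual calls to human agents and setting it too low lets mis-transcriptions through.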

The Road Ahead for Multimodal Speech AI

This benchmark arrives as the industry races toward multimodal agents that combine speech, text, and vision. The code-switching blind spot poses an even larger challenge for these systems—if a voice agent cannot reliably transcribe a bilingual query, it cannot feed accurate data to a vision-language model or a reasoning engine. The research community is already responding: new training strategies such as language-adversarial training and contrastive pretraining on mixed-language audio are showing promise in internal experiments.

For now, the message to AI teams is clear: if your voice agent serves bilingual users, you cannot assume your ASR system handles code-switching well. Independent benchmarks like this one serve as an essential reality check, revealing that the next frontier for speech AI isn't just more languages—it's the seamless blending of languages that real humans use every day.

Source: HuggingFace Blog. This article was produced with AI assistance and reviewed for accuracy.


About James Whitfield

James Whitfield is a senior software engineer with 8 years of experience building developer tools, CLI applications, and IDE extensions. He has contributed to open source projects including VS Code extensions and GitHub Actions workflows. He currently covers AI developer tools, coding assistants, and platform engineering for AI Herald.
