When AI Finally Speaks Indian
India has 22 officially recognized languages. Millions of people think and dream in more than one language at once. They code-switch constantly, mid-sentence, mid-thought. The major AI platforms handle this, well, sometimes. On a good day. Sarvam.AI was built specifically to fix this.
I get a lot of emails pitching AI platforms. Most of them are variations on the same theme - faster inference, cheaper tokens, better reasoning. I skim them and move on.
Then this morning I got an email forwarded from Sneha at Sarvam.AI. The subject line was "Meet the Sarvam models." I almost skimmed that one too. I'm glad I didn't.
What Sarvam Has Actually Built
The platform is a suite of specialized models, each doing one job exceptionally well.
Saarika is their speech-to-text engine. It transcribes audio in 11 Indian languages - and critically, it handles code-switching naturally. If your speaker starts a sentence in Hindi and finishes it in English, Saarika doesn't get confused. It just transcribes.
Saaras goes a step further. It takes speech in any Indian language and converts it directly to English text. Not transcribe-then-translate. Directly. That might sound like a small efficiency gain. For teams that need instant English documentation from regional language conversations - think customer calls, field interviews, regional media - it's a significant capability difference.
Bulbul is their text-to-speech model. Eleven languages, natural-sounding voices, correct pronunciation and tone. The kind of output that doesn't make a native speaker wince.
Sarvam Translate handles text translation across all 22 official Indian languages. What sets it apart is that it's trained to handle both casual conversation and formal register - government documents, legal text, everyday chat - without needing separate model configurations.
Sarvam-M is their 24-billion-parameter language model. Built on Mistral-Small and post-trained extensively on Indian languages and cultural context. It supports a hybrid reasoning mode - you can run it in fast conversational mode or engage its chain-of-thought reasoning for more demanding tasks. On Indian language benchmarks, it reportedly shows a 20%+ improvement over comparable models, and significantly stronger results on tasks that mix Indian languages with mathematics or structured reasoning.
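In practice, toggling that hybrid mode amounts to choosing between two request payloads. The sketch below shows what I mean; the "sarvam-m" model name matches what I saw, but the reasoning-toggle field name is my assumption, so check the current API reference before relying on it:

```python
def build_chat_payload(prompt: str, think: bool = False) -> dict:
    """Build a chat request body for Sarvam-M.

    The reasoning toggle field below is an assumed name for
    illustration -- confirm the exact parameter in Sarvam's docs.
    """
    payload = {
        "model": "sarvam-m",
        "messages": [{"role": "user", "content": prompt}],
    }
    if think:
        # Engage chain-of-thought mode for harder tasks (assumed field name).
        payload["reasoning_effort"] = "high"
    return payload
```

Fast conversational mode is just the same payload without the toggle, which is what makes it easy to expose as a single checkbox in a UI.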
The thing that struck me reading all of this: these aren't features bolted onto a general-purpose model. They're purpose-built tools for a specific, underserved linguistic context.
From Email to Deployed App in a Day
I decided the only honest way to evaluate this was to build something with it.
The concept was simple: take an audio clip from Indian news media, run it through the Sarvam pipeline, and see what came out the other end. Transcription, analysis, translation, text-to-speech. The full chain.
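That chain is easy to sketch as a composition of four stages, each a plain callable so the real Sarvam calls can be slotted in later. The function names here are my own placeholders, not the SDK's:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    transcript: str = ""
    analysis: str = ""
    translation: str = ""
    audio_path: str = ""


def run_pipeline(audio_path: str,
                 transcribe: Callable[[str], str],
                 analyze: Callable[[str], str],
                 translate: Callable[[str], str],
                 synthesize: Callable[[str], str]) -> PipelineResult:
    """Run the full chain: STT -> analysis -> translation -> TTS.

    Each callable stands in for one Sarvam model
    (Saarika, Sarvam-M, Sarvam Translate, Bulbul).
    """
    result = PipelineResult()
    result.transcript = transcribe(audio_path)
    result.analysis = analyze(result.transcript)
    result.translation = translate(result.transcript)
    result.audio_path = synthesize(result.translation)
    return result


# Stubbed stages for a dry run; swap in real API calls when wiring up.
demo = run_pipeline(
    "clip.mp3",
    transcribe=lambda p: f"transcript of {p}",
    analyze=lambda t: f"analysis of ({t})",
    translate=lambda t: f"translation of ({t})",
    synthesize=lambda t: "out.mp3",
)
```

Keeping the stages as injectable callables also made it trivial to test the plumbing before spending any API credits.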
I grabbed a Hindi financial news clip - about 24 seconds of a presenter discussing market conditions, FII selling pressure, and global indices. Genuine broadcast-quality Hindi, the kind of rapid-fire delivery you'd hear on Zee Business or NDTV Profit.
The first thing I learned is that the Sarvam Playground UI was having a moment - blank page after upload. So I went straight to their Python SDK. Within a few lines of code I had a working Saarika call. The transcript came back in clean Devanagari:
बात सुबह की टॉप हेडलाइंस के साथ। भारतीय बाजारों के लिए आज भी कमजोर संकेत... ("...with the morning's top headlines. Weak cues again today for Indian markets...")
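The call itself was roughly the sketch below, hitting the REST endpoint directly. Treat the endpoint path, header name, and form-field names as assumptions to verify against Sarvam's current docs; only the transcribe helper touches the network:

```python
import os

import requests

# Assumed endpoint and field names -- verify against Sarvam's API docs.
SARVAM_STT_URL = "https://api.sarvam.ai/speech-to-text"


def build_transcribe_request(language_code: str = "hi-IN"):
    """Assemble headers and form fields for a Saarika transcription call."""
    headers = {"api-subscription-key": os.environ.get("SARVAM_API_KEY", "")}
    data = {"model": "saarika:v2", "language_code": language_code}
    return headers, data


def transcribe(audio_path: str, language_code: str = "hi-IN") -> str:
    """Send an audio file to Saarika and return the transcript text."""
    headers, data = build_transcribe_request(language_code)
    with open(audio_path, "rb") as f:
        resp = requests.post(SARVAM_STT_URL, headers=headers, data=data,
                             files={"file": f})
    resp.raise_for_status()
    return resp.json().get("transcript", "")
```

A few lines like this, pointed at the Hindi clip with `language_code="hi-IN"`, is all it took to get that Devanagari transcript back.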
Feeding that into Sarvam-M with a structured analysis prompt produced something that impressed me more than the transcription itself:
Language detected: Hindi
Tone: Negative
Topics: Market Trends, Global Indices
Key Entities: FII, GIFT Nifty, Dow Jones, Nasdaq, S&P 500
Summary: Indian markets opened weak with FII selling of approximately ₹3,500 crore. GIFT Nifty faced pressure alongside declining Asian markets and US indices including the Dow Jones (down ~300 points), Nasdaq, and S&P 500.
The entity extraction was accurate. The tone classification was correct. The summary tracked the source closely. And it had handled the financial terminology - ₹3,500 crore, FII, GIFT Nifty - without hallucinating or fumbling.
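A small helper along these lines turns that `Label: value` block into a dict before rendering - a sketch, with field labels following the output above:

```python
def parse_analysis(text: str) -> dict:
    """Parse 'Label: value' lines from the model's structured analysis.

    Comma-separated fields like Topics and Key Entities become lists;
    everything else stays a plain string.
    """
    list_fields = {"Topics", "Key Entities"}
    result = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        label, _, value = line.partition(":")
        label, value = label.strip(), value.strip()
        if label in list_fields:
            result[label] = [item.strip() for item in value.split(",")]
        else:
            result[label] = value
    return result


sample = """Language detected: Hindi
Tone: Negative
Topics: Market Trends, Global Indices
Key Entities: FII, GIFT Nifty, Dow Jones, Nasdaq, S&P 500"""
parsed = parse_analysis(sample)
# parsed["Tone"] -> "Negative"
# parsed["Key Entities"] -> 5 items, ending with "S&P 500"
```

Asking the model for a fixed label format, then parsing it deterministically, proved far more reliable than asking for freeform JSON.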
The Project: Sarvam Explorer
Once the pipeline proved out, I built a proper web interface around it. The goal was a tool my team could use to explore the full Sarvam capability stack without touching any code.
The flow is deliberately two-step.
In the first step, you upload an MP3 and choose a transcription language. English is the default, but you can change it to any of the 11 supported languages - and the transcript will come back in that language, in the correct script. Select Hindi and you get Devanagari. Select Tamil and you get Tamil script. The analysis panel - language, tone, topics, entities, summary - appears alongside it.
In the second step, you choose a target language for translation and text-to-speech. This dropdown automatically excludes whichever language you used for transcription (no point translating English to English). Select your target, hit Submit, and Bulbul generates the audio. The output renders as an interactive waveform - you can click to seek, play, pause, and download the MP3.
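The exclusion logic is trivial but worth showing, since it's what keeps the two steps consistent. The language codes below are illustrative placeholders, not Sarvam's official list:

```python
# Illustrative language codes for the 11 supported languages.
SUPPORTED = ["en-IN", "hi-IN", "ta-IN", "te-IN", "kn-IN", "ml-IN",
             "mr-IN", "bn-IN", "gu-IN", "pa-IN", "od-IN"]


def target_languages(source: str, supported=SUPPORTED) -> list:
    """Return valid translation/TTS targets, excluding the
    language already used for transcription."""
    return [code for code in supported if code != source]
```

The frontend just calls this with the step-one selection and rebuilds the step-two dropdown, so English-to-English never appears as an option.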
The interface takes visual inspiration from the painter Bhupen Khakhar - vivid Indian palette, crimson header, cobalt and gold accents. It felt right for a tool built around Indian languages.
The whole application - FastAPI backend, single-page frontend, SQLite session history, Docker configuration - is simple and lightweight, and runs locally or on a server.
What This Actually Demonstrates
Running a Hindi financial news clip through this pipeline and watching it emerge as a Telugu audio file - accurately translated, naturally spoken - is genuinely striking the first time you see it.
But what I think it demonstrates more than anything is a gap in how the global AI industry has thought about language support.
Supporting a language isn't the same as understanding it. Sarvam-M doesn't just process Indian languages - it's been trained with Indian cultural context, Indian linguistic patterns, the way Indians actually communicate. That includes code-switching, informal register, regional idioms, and the specific vocabulary of Indian institutions and markets.
For anyone building AI products for Indian users - and the scale of that opportunity is hard to overstate - this is a meaningful distinction. A model that understands FII selling pressure or GIFT Nifty in context is a different thing from a model that can tokenize Hindi.
That said, it isn't flawless. Sanskrit-derived vocabulary and formal Devanagari text showed rough edges in translation. Some regional Telugu and Kannada words were either mispronounced by Bulbul or came out garbled entirely. These are real limitations worth knowing before you build on it. But for everyday spoken Indian languages, especially Hindi, the results were genuinely strong, and the overall direction is unmistakably right.
What's Next
This project was built to explore. The next step is to think about where this pipeline fits into larger AI-native workflows.
The combination that interests me most for the work I've been doing: voice input in any Indian language → English reasoning via Sarvam-M → translated voice output via Bulbul. That's a complete regional language loop, built entirely on purpose-trained Indian language models. No English-centric general-purpose model in the middle trying to approximate cultural context it wasn't trained on.
Sarvam offers API access and their documentation is genuinely good. If you're building anything in this space, it's worth a look.
The Sarvam Explorer source code is on GitHub at github.com/knightsri/sarvam-explorer.