How Bhashini Actually Works: The Architecture, Explained Simply

Bhashini lets you speak in one Indian language and be understood in another. People assume it is one giant AI. It is not. This is a simple, fact-checked look at the real models behind it, the base model for translation, and how they were trained.

10 June 2026 · 6 min read

Bhashini is the Government of India’s big bet on language. The idea is simple and genuinely moving: a citizen should be able to speak or type in their own language and be understood by any service, in any other Indian language. I often work alongside this kind of stack, so people keep asking me how it actually works under the hood.

The most common assumption is that Bhashini is one giant AI, a single brain like ChatGPT that knows all the languages. That is not what it is. So before anything else, let me clear that up, because the real design is more interesting and far more sensible.

Everything below is from the public repositories and documentation. I have put the sources at the bottom so you can check every claim yourself.

Bhashini is a platform, not a model

The first fact to get straight: Bhashini is a collection of many specialised models stitched together behind one doorway, not a single model that does everything.

It is run by the Digital India Bhashini Division as part of the National Language Translation Mission, and it covers India’s 22 scheduled languages. The effort has produced over 380 open-source AI models across four jobs: speech recognition, machine translation, text-to-speech, and OCR.

Most of the headline models come from one place: AI4Bharat, the research lab at IIT Madras. Their models are the engines inside Bhashini, including IndicTrans2 for translation, IndicConformer for speech recognition, IndicXlit for transliteration, and IndicTTS for the spoken voice.

So when you use Bhashini, you are not talking to one mega-brain. You are using a well-organised team of specialists, each trained for one narrow job.

The three building blocks, and how they chain

Almost everything Bhashini does is built from three basic skills:

ASR (speech to text). It listens to your voice and writes down the words.
NMT (machine translation). It takes text in one language and rewrites it in another.
TTS (text to speech). It takes text and speaks it out loud.

The clever part is the pipeline. Bhashini lets you run one of these alone, or chain them together. You can ask for just ASR, or ASR plus translation, or the full chain of ASR plus translation plus TTS.

That full chain is what powers the magic demo everyone remembers. You speak a sentence in Hindi. ASR turns your speech into Hindi text. Translation rewrites it as Tamil text. TTS reads that Tamil text out loud. Speech goes in, speech comes out, in a different language, and three different models did the work in sequence.
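To make the chaining concrete, here is a minimal sketch in Python. Everything in it is a stand-in I made up for illustration: fake_asr, fake_nmt, fake_tts, and run_pipeline are hypothetical names, not Bhashini functions, and the "models" are lookup tables. The only point is the shape: each stage hands its output to the next.

```python
from typing import Callable, List

# A stage takes one payload (audio bytes or text) and returns the next payload.
Stage = Callable[[object], object]

def fake_asr(audio: bytes) -> str:
    """Stand-in for speech recognition: pretends the audio says one Hindi word."""
    return "नमस्ते"

def fake_nmt(text: str) -> str:
    """Stand-in for Hindi to Tamil translation, using a tiny lookup table."""
    return {"नमस्ते": "வணக்கம்"}.get(text, text)

def fake_tts(text: str) -> bytes:
    """Stand-in for speech synthesis: returns the text as bytes instead of audio."""
    return text.encode("utf-8")

def run_pipeline(stages: List[Stage], payload: object) -> object:
    """Feed the output of each stage into the next one, in order."""
    for stage in stages:
        payload = stage(payload)
    return payload

# ASR alone, ASR plus translation, or the full speech-to-speech chain.
hindi_text = run_pipeline([fake_asr], b"<hindi audio>")
tamil_text = run_pipeline([fake_asr, fake_nmt], b"<hindi audio>")
tamil_audio = run_pipeline([fake_asr, fake_nmt, fake_tts], b"<hindi audio>")
```

Because the pipeline is just an ordered list of stages, asking for less (ASR only) or more (the full chain) is a matter of passing a shorter or longer list.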

Underneath, this is held together by ULCA, the Universal Language Contribution API. ULCA is an open data and model platform that standardises how datasets, models, and benchmarks are described, so all these pieces can plug into one another. In plain terms, it is the common language the models use to talk to the system. To use a service, an app makes three calls in order. First, a pipeline search finds an available model chain and returns a pipeline ID. Second, a config call says which tasks it wants and in which order. Third, a compute call sends the actual audio or text to be processed.
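Here is a rough sketch of what the config and processing requests could look like as plain Python dictionaries. Treat it as an illustration of the idea only: the field names (pipelineId, tasks, taskType, input, audioBase64), the helper functions, and the placeholder pipeline ID are all my own assumptions, not the real Bhashini request schema. For the actual fields and endpoints, check the API documentation listed in the sources.

```python
import base64
import json

def build_config_request(pipeline_id, tasks):
    """Config step: name the tasks we want, in order (hypothetical shape)."""
    return {
        "pipelineId": pipeline_id,
        "tasks": [{"taskType": name, "language": lang} for name, lang in tasks],
    }

def build_compute_request(pipeline_id, tasks, audio_bytes):
    """Processing step: the same task list plus the input audio (hypothetical shape)."""
    request = build_config_request(pipeline_id, tasks)
    # Binary audio is commonly sent as base64 text inside JSON.
    request["input"] = {"audioBase64": base64.b64encode(audio_bytes).decode("ascii")}
    return request

# Hindi speech in, Tamil speech out: three tasks in sequence.
tasks = [
    ("asr", {"source": "hi"}),
    ("translation", {"source": "hi", "target": "ta"}),
    ("tts", {"source": "ta"}),
]

# The ID is a placeholder; a real one would come back from the pipeline search.
config_request = build_config_request("PIPELINE_ID_FROM_SEARCH", tasks)
compute_request = build_compute_request("PIPELINE_ID_FROM_SEARCH", tasks, b"\x00\x01")

print(json.dumps(config_request, indent=2))
```

The detail worth noticing is that the order of the task list is the order of the chain, so the same request shape covers ASR alone and the full speech-to-speech flow.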

The translation engine up close: IndicTrans2

Since translation is the heart of Bhashini, let me open up that one model, because this is where the “base model” question gets a clear answer.

The base for translation is IndicTrans2, built by AI4Bharat. It is described as the first open-source transformer-based multilingual translation model that covers all 22 scheduled Indic languages. The word “transformer” matters: it is the same fundamental design that sits under most modern AI, used here in its encoder-decoder form, where the encoder reads the source sentence and the decoder writes the translated one.

A few concrete, checkable facts about it:

It comes in sizes. The full base models are around 1 billion parameters (separate ones for English to Indic, Indic to English, and Indic to Indic). There are smaller distilled versions of roughly 200 to 320 million parameters for when you need speed and lower cost.
It handles 22 languages across five different scripts, and uses script unification and shared lexical features so that low-resource languages like Kashmiri, Manipuri, and Sindhi benefit from the others.
It uses SentencePiece to break text into tokens, with separate tokenizers for the English side and the Indic side.
The model checkpoints are released under the MIT licence, which is about as open as it gets.
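To see why the distilled sizes matter, here is some back-of-envelope arithmetic in Python. It assumes the rounded parameter counts quoted above and counts only the memory needed to hold the weights, at 4 bytes per parameter for 32-bit floats and 2 bytes for 16-bit floats. Real usage is higher because of activations and runtime overhead, so read these as lower bounds. The function name weight_memory_gb is mine.

```python
def weight_memory_gb(num_parameters: float, bytes_per_parameter: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

# Rounded parameter counts taken from the figures quoted in this article.
sizes = {
    "base (~1B)": 1e9,
    "distilled (~320M)": 320e6,
    "distilled (~200M)": 200e6,
}

for name, params in sizes.items():
    fp32 = weight_memory_gb(params, 4)  # 32-bit floats
    fp16 = weight_memory_gb(params, 2)  # 16-bit floats
    print(f"{name}: ~{fp32:.1f} GB in 32-bit floats, ~{fp16:.1f} GB in 16-bit floats")
```

On these assumptions, the smallest distilled model needs roughly a fifth of the weight memory of the full one, which is the difference between needing a serious GPU and fitting on modest hardware.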

That last point is the quietly radical bit. A national translation engine, fully open, that anyone can download and run.

How the translator was actually trained

A model is only as good as what it learns from, so here is the training story, again from the public record.

IndicTrans2 was trained on the Bharat Parallel Corpus Collection, or BPCC, which holds roughly 230 million bitext pairs. A bitext pair is just one sentence and its translation sitting side by side, which is exactly what a translation model needs to learn from.

That 230 million splits into two very different kinds of data:

BPCC-Mined, around 228 million pairs, gathered automatically by scanning huge amounts of text and matching sentences that mean the same thing. Cheap and enormous, but a bit noisy.
BPCC-Human, around 2.2 million pairs, translated by people. Small, but gold standard, and worth far more per sentence than the mined data.
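The imbalance between the two piles is easier to feel with the arithmetic written out. This short Python snippet uses the approximate counts above, so the results are rough proportions, not exact statistics.

```python
mined_pairs = 228_000_000   # BPCC-Mined, approximate
human_pairs = 2_200_000     # BPCC-Human, approximate

total_pairs = mined_pairs + human_pairs
human_share = human_pairs / total_pairs
mined_per_human = mined_pairs / human_pairs

print(f"total: about {total_pairs / 1e6:.0f} million pairs")
print(f"human-translated share: {human_share:.2%}")
print(f"mined pairs per human pair: about {mined_per_human:.0f}")
```

In other words, human translations make up about one percent of the corpus, with roughly a hundred mined pairs for every human one.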

On top of that, they used back-translation, a neat trick where you take an early version of the model, run it over extra single-language text to produce machine translations, and feed those machine-made pairs back in as more training data. It sounds circular, but it reliably makes translation models stronger when human data is scarce.
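Here is the back-translation idea as a toy Python sketch. It follows the usual textbook formulation, where a reverse-direction model translates genuine target-language text back into the source language, so the target side of each synthetic pair is real human writing. I am not claiming this is the exact IndicTrans2 training code. The names back_translate and toy_tamil_to_hindi are hypothetical, and the "model" is a two-entry lookup table.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)

def back_translate(
    target_monolingual: List[str],
    reverse_model: Callable[[str], str],
) -> List[Pair]:
    """Create synthetic pairs: machine-made source, genuine target."""
    return [(reverse_model(sentence), sentence) for sentence in target_monolingual]

def toy_tamil_to_hindi(sentence: str) -> str:
    """Hypothetical stand-in for an early Tamil to Hindi model."""
    lookup = {"வணக்கம்": "नमस्ते", "நன்றி": "धन्यवाद"}
    return lookup.get(sentence, "<unk>")

# One human-translated Hindi to Tamil pair, plus Tamil text with no translation.
human_pairs: List[Pair] = [("नमस्ते", "வணக்கம்")]
tamil_only_text = ["நன்றி"]

# The reverse model supplies the missing Hindi side.
synthetic_pairs = back_translate(tamil_only_text, toy_tamil_to_hindi)

# The Hindi to Tamil model is then trained on both kinds of pairs together.
training_data = human_pairs + synthetic_pairs
```

The model being trained still learns to produce clean, human-written output; only the input side is machine-made, which is why the trick is less circular than it first sounds.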

So the recipe, in one line: a huge pile of automatically mined pairs, a smaller pile of human-perfect pairs, and a dose of self-generated back-translation to fill the gaps.

The ear and the voice

Two quick notes on the speech side, to complete the picture.

The listening is done by IndicConformer, a speech-recognition model with only about 30 million parameters, which is small and fast on purpose so it can run in real time. The “Conformer” name refers to its design, which blends convolution (good at local sound patterns) with attention (good at the bigger context). It was trained on Indian speech datasets including Kathbath, Shrutilipi, and MUCS.
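The local-versus-global distinction can be shown with a toy in plain Python. This is not the Conformer block itself, which also has feed-forward layers, learned projections, multiple attention heads, and normalisation. It is only a sketch of the two mixing patterns on a made-up sequence of six numbers, with function names of my own choosing.

```python
import math
from typing import List

def local_convolution(signal: List[float], kernel: List[float]) -> List[float]:
    """Each output mixes only a small window of neighbours (zero-padded edges)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                total += weight * signal[j]
        out.append(total)
    return out

def self_attention(signal: List[float]) -> List[float]:
    """Each output is a softmax-weighted average over every position."""
    out = []
    for query in signal:
        scores = [query * key for key in signal]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, signal)) / norm)
    return out

# A single spike at position 2, standing in for one distinctive sound frame.
frames = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
conv_out = local_convolution(frames, [0.25, 0.5, 0.25])
attn_out = self_attention(frames)
```

With the convolution, the spike only reaches its immediate neighbours, so the last position stays at zero. With attention, every position picks up some of it, including the far end of the sequence. A Conformer uses both so it can catch fine local detail and long-range context in the same layer.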

The speaking is done by IndicTTS, AI4Bharat’s text-to-speech models, which turn the translated text back into a natural-sounding voice in the target language.

Why the design is smart

Step back and the architecture makes a lot of sense. Instead of one impossible-to-train giant that must do everything, Bhashini uses small, focused models that each do one job well, chained together as needed, described through one common standard so they are swappable. A better translation model can be dropped in without touching the speech models. And because it is open source, the whole country can build on it rather than rent it.

It is not magic, and it is not one brain. It is a team of well-trained specialists standing behind a single doorway, organised cleanly enough that you never see the seams.

That, to me, is the more impressive achievement.

Sources

AI4Bharat, IndicTrans2 repository: github.com/AI4Bharat/IndicTrans2
Bhashini DIBD, ULCA repository: github.com/bhashini-dibd/ulca
Bhashini API documentation (pipelines): bhashini.gitbook.io/bhashini-apis
AI4Bharat models: models.ai4bharat.org
IndicConformer model collection: huggingface.co/collections/ai4bharat/indicconformer