Jan 29, 2025

Ten Trillion Tokens: Making AI Work for Every Indian Language

Written by:

Sagar Sarkale

To serve India's diverse population, AI needs to understand Indian languages as well as it understands English. We're collecting ten trillion tokens of language data –– from everyday conversations to technical documents –– across India's major languages. This data will power the next generation of AI services that work for all Indians.

Building AI that works for India requires something different than what works in English. While English data is everywhere on the internet, making it easy to train AI models, the story changes completely when we look at India's needs.

Our country has 22 major languages, each with its own script, grammar rules, and cultural context –– and the current AI approaches simply don't work well enough for this diversity.

When someone in Tamil Nadu asks an AI for help with a government form, or a farmer in Maharashtra needs crop advice, they should be able to do it in their own language, naturally and easily. But right now, that's not possible. The AI models we have today stumble with Indian languages because they were built mainly for English.

That's why we started the "Ten Trillion Tokens" project. We're building the foundation for AI that can properly understand and work with Indian languages - from formal government documents to casual conversations at the local tea shop. Our goal is to collect and organize the massive amount of data needed to make AI work well for everyone in India, no matter what language they speak.

Quick note: Tokens are the basic building blocks that AI uses to understand language –– they're usually parts of words or sometimes whole words. When we say we need ten trillion of them, we mean we need examples of Indian languages being used in all sorts of real situations.
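
For the curious, here's a minimal sketch of what tokenization looks like in practice. It assumes the Hugging Face transformers library and uses a multilingual model (google/muril-base-cased) purely as an illustration; any tokenizer that covers Indian scripts behaves similarly:

```python
# Minimal tokenization sketch (assumes `transformers` is installed; the model
# name is illustrative, not the one used in this project).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")

sentence = "मुझे कल की ट्रेन का टिकट चाहिए"  # "I need a train ticket for tomorrow"
tokens = tokenizer.tokenize(sentence)

print(tokens)       # subword pieces; the exact split depends on the tokenizer
print(len(tokens))  # how many tokens this one sentence contributes to a corpus
```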

So, why do we need that many tokens? Ten trillion is a huge number.

Let's break it down with a real example. Meta's latest AI model, Llama 3, used 15 trillion tokens during training. But here's the key point: 95% of that data was in English, with the remaining 5% split across 30 different languages.

Now think about India's needs. We have 22 major languages, and each one needs enough data for AI to understand everything from formal government documents to local slang. Each language has its own set of rules, cultural references, and ways of expressing ideas.

When you look at it that way, ten trillion tokens starts to make more sense –– we need enough data to help AI understand each language deeply, not just on the surface.
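
A quick back-of-envelope calculation, using only the numbers mentioned above, makes the scale concrete (Python is just acting as a calculator here, not project code):

```python
# Rough arithmetic with the figures cited in this post (not official per-language quotas).
TARGET_TOKENS = 10 * 10**12   # ten trillion
NUM_LANGUAGES = 22

per_language = TARGET_TOKENS / NUM_LANGUAGES
print(f"~{per_language / 10**9:.0f} billion tokens per language on average")  # ~455 billion

# For comparison: Llama 3's non-English share, as cited above
llama3_total = 15 * 10**12
non_english = llama3_total * 0.05  # 5% of 15 trillion
print(f"~{non_english / 10**9:.0f} billion non-English tokens, split across ~30 languages")  # ~750 billion
print(f"~{non_english / 30 / 10**9:.0f} billion per language on average")  # ~25 billion
```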

But what does this mean in practice? Each language needs:

  • Enough examples to learn proper grammar and structure

  • Data that shows how people actually talk in real situations

  • Content that covers everything from news articles to casual conversations

  • Examples of regional variations and dialects

Workstreams of Ten Trillion Tokens

So how do we get to ten trillion tokens? The journey involves three main workstreams, each serving a unique purpose in building India's AI data foundation. Let's break down our approach and understand what type of data we need.

Web Scraping: The Foundation Layer

First, let's talk about the easiest place to start –– web scraping. While this requires minimal effort, it's not as simple as downloading everything we find. We need knowledge-rich content in Indian languages, and the go-to sources for this are well-written articles, academic papers, and formal documents.

Think of it as finding the "Wikipedia quality" content, but for Indian languages.

But here's the interesting part –– we can supplement this with machine translation. By taking high-quality content about India written in English and translating it into Indian languages, we could potentially generate up to 75% of our required data.

As far as formal language usage is concerned, about 10-100 billion tokens per language should give us enough examples of proper grammar and structure for a model to learn from.
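
As a rough illustration of the translation-augmentation idea, here is a sketch of the loop involved. The translate_en_to_indic helper below is a hypothetical placeholder for whichever machine-translation system ends up being used; it is not our actual pipeline or any specific library's API:

```python
# Sketch of the translation-augmentation workstream (illustrative only).

def translate_en_to_indic(text: str, target_lang: str) -> str:
    """Placeholder: call an English-to-Indic machine-translation model here."""
    raise NotImplementedError

def augment_corpus(english_docs: list[str], target_langs: list[str]) -> dict[str, list[str]]:
    """Translate knowledge-rich English documents into each target language."""
    corpus: dict[str, list[str]] = {lang: [] for lang in target_langs}
    for doc in english_docs:
        for lang in target_langs:
            corpus[lang].append(translate_en_to_indic(doc, lang))
    return corpus

# Example use: augment_corpus(scraped_articles, ["hi", "ta", "bn", "mr"])
```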

On-Ground Collection: The Cultural Layer

Web scraping alone won't capture the richness of Indian culture. That's where our moderate-effort strategy comes in –– going directly to the source.

Imagine collecting 10,000 questions from each village or district, asking things like "What happens during Diwali from 6 AM to 12 PM?", "How does your community celebrate the first harvest festival?" or "What does a wedding look like in your family, and what are the specific rituals on each day of the wedding?"

This work is about capturing cultural context. When we ask about tourist places, we're documenting not just locations but the stories, beliefs, and traditions that make them significant in Indian life.

This data is irreplaceable and cannot be generated synthetically. The diversity of these responses will help our AI models understand the unique aspects of life across India.

Use Cases: The Application Layer

Our most effort-intensive work focuses on real-world applications. We're studying systems like IRCTC (Indian Railways' online ticketing service), helpline services, and other practical applications that serve Indians in their native languages.

These use cases will generate between 1 and 10 million tokens each, but more importantly, they show us how Indians actually interact with technology in their daily lives.

To build applications that truly help people in India, we first need to understand the domains that shape their lives.

For example, everyone needs medical care at some point. If we're building a medical assistant, we need to know: Where will the assistant get its knowledge? How will it help people speaking different languages? How should it respond to queries in various dialects?

This is why collecting use case data goes beyond just recording conversations. We're capturing the full spectrum of how Indians interact with essential services. When someone books a train ticket, asks about government schemes, or seeks medical advice, they bring their entire linguistic and cultural context to that interaction.

Sometimes they use formal language, sometimes it's mixed with English, and sometimes it's purely colloquial –– all these patterns matter.

The data gathered in this layer is crucial, as it forms a major part of the end user's experience.

Our work so far

We've already started building a framework that maps out exactly what kind of data we need and where to get it. It's similar to planning a massive digital library –– you need the right balance of different types of content in each language, from academic papers to everyday conversations.

Here's what we need for each language:

  • Between 1 and 10 trillion tokens of general language data - this covers everything from news articles to blog posts

  • 1 million pairs of questions and answers - showing how people actually interact

  • Between 10,000 and 100,000 hours of recorded speech - capturing how people really talk
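
Written down as a simple checklist, the per-language targets above look something like this (the structure is purely illustrative; the ranges are the ones stated in this post):

```python
# Per-language data targets from the list above.
language_data_plan = {
    "general_text_tokens": (1 * 10**12, 10 * 10**12),  # news articles, blog posts, and other general text
    "qa_pairs": 1_000_000,                              # real question-and-answer interactions
    "speech_hours": (10_000, 100_000),                  # recorded speech across accents and dialects
}
```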

Voice data plays a special role in India. Most Indians prefer to speak rather than type, and our oral traditions run deep. Every few kilometers, you'll hear different accents and dialects –– each carrying its own cultural significance.

We're recording voices from across India: street conversations, village meetings, customer service calls, and local discussions. Each recording helps AI understand the subtle differences in how people speak.

For example, when someone calls a helpline about a train ticket:

  • They might switch between languages mid-sentence

  • Use local terms that mean different things in different regions

  • Have an accent that's unique to their area

These speech patterns tell us as much about how people communicate as their choice of words. By capturing these voices, we're helping AI understand not just what people say, but how they say it.

Shifting perspective to use cases

At People+ai, we take a practical approach to data collection. We build and test AI systems that help people right now, while gathering data that improves future systems. Here's how it works:

Our approach to use cases

We start by finding areas where AI can help the most people. Take train bookings - millions of Indians could book tickets more easily if they could do it in their own language. Or healthcare –– imagine speaking to a medical helpline in your local dialect.

We build test systems for these scenarios, and each conversation helps us understand what works and what doesn't.

The feedback from real users guides our development. When a farmer in Karnataka asks about crop diseases in Kannada, we learn exactly how agricultural terms are used locally. This helps us build better systems for other farmers in the future.

Choosing the right projects

We pick our projects carefully, looking at three main factors: population impact, data quality, and practical implementation.

Population impact: We focus on services that help millions of people. Government service access in local languages, for example, could help people across India get information about benefits, documents, and programs more easily.

Data quality: Some interactions give us better data than others. A detailed conversation about health symptoms teaches us more than a simple yes/no question. We look for use cases that generate rich, natural language data.

Practical implementation

Before we launch any project, we carefully evaluate whether we can implement it effectively. This means checking three key boxes: First, can we deploy the system safely while protecting everyone's privacy? We need robust security measures and clear data protection policies. 

Second, do we have –– or can we build –– the partnerships needed to make it work? Success often depends on working with local organizations, government bodies, or other institutions who understand the community's needs.

Finally, we look at scalability. If a project succeeds in one region or language, we need a clear path to expand it to others. These practical checks help us avoid starting projects that might hit roadblocks later, ensuring we focus our efforts where they'll have the most impact.

Our best projects share four key features:

  • They generate 1-10 million high-quality language examples

  • They solve real problems that people face today

  • They can work in multiple Indian languages

  • They create ongoing opportunities to collect more data

This practical approach helps us build useful services while gathering the data we need. Each successful project leads to better AI systems, which then help us create even better services. It's a continuous cycle of improvement, all driven by real people using these systems in their daily lives.

Making ten trillion tokens a reality

A project this size needs more than just technical solutions - it needs the right policies and partnerships. Right now, valuable data sits unused in government offices, educational institutions, and private companies. Here's the challenge: how do we unlock this data while protecting privacy and maintaining quality?

Opening up data: The policy challenge

Right now, there is a lot of valuable Indian language data locked away in different systems. Government departments have records in multiple languages. Educational institutions have teaching materials. Private companies have customer interactions. But there is no clear way to share this data safely and legally.

This is where policy comes in. We need frameworks that make it easy and safe to share data. Think about it like this –– when a citizen calls a government helpline in Tamil, that conversation could help build better AI for millions of Tamil speakers. But first, we need policies that say:

  1. How this data can be shared

  2. How to protect people's privacy

  3. Who can use the data and for what

  4. How to maintain quality standards

Get involved in the Ten Trillion Tokens project

Whether you're an institution, an individual, or a content owner, there's a role for you in this project.

Institutions can make a big impact by:

  • Running workshops that generate high-quality language content

  • Converting their existing materials into digital formats

  • Adding data collection to their current services

If you're an individual who wants to help:

  • Join our translation projects –– we need speakers of all Indian languages

  • Share your voice recordings to help capture different accents and dialects

  • Add your expertise in specific fields like medicine, law, or technology

For organizations that own content:

  • Share your language resources with proper licensing

  • Partner with us on specific data collection projects

  • Help us build better tools for data gathering

Ways to get involved

Every great solution starts with someone seeing a possibility. That's how the Ten Trillion Tokens project began, and it's how all our projects begin. Good ideas can come from anywhere, which is why we need your help.

Maybe you've noticed gaps in how public services reach people, or seen opportunities to make healthcare more accessible in local languages. Your understanding of real challenges could help create AI solutions that work for all of India. Add your idea to the Use Case Garden –– this could be the starting point for something transformative.

Submit your idea here.

Join the Community

People+ai is a non-profit, housed within the EkStep Foundation. Our work is designed around the belief that technology, especially AI, will cause paradigm shifts that can help India & its people reach their potential.
