Google DeepMind’s India unit is making giant strides in its project to develop AI models capable of understanding and preserving 125 Indian languages and dialects. The initiative, known as Morni (Multimodal Representation for India), is working to address the linguistic diversity of the country by building inclusive and equitable AI systems.
The project reportedly targets over 100 Indian languages, with a focus on 60 languages spoken by over a billion people and 125 languages that each have more than a lakh speakers, Manish Gupta, Director of Google DeepMind in India, revealed at the Global Fintech Fest in Mumbai on Thursday.
Gupta spoke about the immense challenge posed by the lack of digital data for many Indian languages. He pointed out that 73 of the 125 targeted languages have no existing digital corpus, making them vulnerable to digital extinction. Even Hindi, spoken by nearly 10 per cent of the global population, has a negligible online presence, comprising just 0.1 per cent of internet text.
To tackle these challenges, Google launched Project Vaani, a collaborative effort with the Indian Institute of Science (IISc) and ARTPARK (Artificial Intelligence & Robotics Technology Park). The project has completed its first phase, creating an open-source database of over 14,000 hours of speech data across 58 languages, collected from 80,000 speakers in 80 districts.
Originally announced in December 2022, Project Vaani aims to collect and transcribe 1,54,000 hours of anonymised speech data from all 773 districts of India. The project is currently in its second phase, focusing on expanding data collection to cover 160 districts across the country.