In the last few months, large language models have become as ubiquitous in our workdays as the Excel spreadsheet. OpenAI's ChatGPT made waves upon its release, triggering a frantic race among technology giants to outdo each other with ever more advanced iterations of these models. These language models seem almost magical in the way they engage with users, exhibiting remarkable language understanding with an impressive command of grammar, semantics, and all the subtleties that make language versatile across contexts. They can also comprehend a wide array of expressions and tones in the languages they are trained on.
But how do they achieve such prowess? How can an online machine converse so fluently? The secret lies in their training. These models undergo extensive training on vast datasets of language, learning from millions and even billions of examples to grasp the intricacies of language and its nuances.
Looking back to when I began incorporating BERT, one of the earliest large language models, into my machine learning pipelines, I was frankly awed to learn that the base model had 110 million parameters. In stark comparison, GPT-3.5, the model behind ChatGPT, has a staggering 175 billion parameters! To put this into perspective, it was trained on a dataset amounting to tens of terabytes of text. An astonishing feat indeed!
Today's LLMs derive their knowledge from a wide array of digital sources accessible on the internet: public websites such as Wikipedia, digital books, news articles, code repositories, and other reservoirs of textual data available online.
Given the substantial amount of content required to achieve a comprehensive understanding of language, most LLMs are at their most proficient in English, yielding their most accurate responses in this language. GPT-4 itself acknowledges as much when asked about its proficiency in English.
It is worth noting that a greater volume of data is available and generated online in English than in any other language. According to W3Techs, as of April 2024, approximately 50.5 per cent of the content available online is in English. Nevertheless, efforts are underway to collect more data in other languages and enable LLMs to give accurate and linguistically correct results in those languages.
In India, these endeavours are evolving into community-wide initiatives with participation from the government, and it is encouraging to witness this remarkable progress! We are on the brink of reaping the rewards of these crowdsourced data collection efforts in the form of more accurate, language-specific LLMs.
However, let's first discuss an intriguing new approach under the umbrella of LLMs, one that promises enhanced security and tailored solutions for niche business applications, since it allows proprietary data and models to be confined to dedicated machines. We are talking about the concept of 'local LLMs', which has garnered significant attention recently. Presently, large language models like ChatGPT, Claude AI, and Google's Bard are considered 'hosted' LLMs. This means that users access the technology via the internet, utilising remote servers provided by major cloud computing platforms. For instance, ChatGPT relies on Microsoft Azure as its cloud computing platform, which hosts and operates the models required for internet accessibility. Conversely, a local LLM runs on local hardware, bypassing the need for remote cloud services.
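To make the distinction concrete, here is a minimal sketch of the hosted mode of access, assuming Python, the openai client library, and an API key; the model name and prompt are illustrative rather than prescriptive.

```python
# Hosted LLM: the model runs on the provider's cloud servers.
# We only send a request over the internet and read back the response.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()  # picks up the API key from the environment

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model name
    messages=[
        {"role": "user",
         "content": "Summarise this contract clause in plain English."},
    ],
)
print(response.choices[0].message.content)
```

No special hardware is needed on the user's side; all the heavy lifting happens in the provider's data centre.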
At first glance, there are convincing motives for running an LLM locally. These include safeguarding data privacy and minimising the sharing of personal data online, tailoring LLMs to business-specific requirements, and providing uninterrupted offline functionality for users, particularly in areas with inconsistent internet access.
The advantages may only run skin deep, though. One of the foremost challenges of running LLMs locally lies in the demanding computational resources required. Running a model anywhere near the scale of GPT-3 or GPT-4 demands powerful GPUs (graphics processing units) with ample video random access memory (VRAM). On the operational front, the machine must also possess substantial RAM, preferably exceeding 64 GB, a configuration not commonly found in typical personal computers unless one specifically upgrades their system for this purpose. It is safe to assert that the hardware costs alone are humongous when maintaining a machine capable of running a local LLM.
This brings us to the software side of things. Local LLMs require specialised deep learning frameworks and libraries to be installed on the machine to execute AI models. On top of this, the engineers and technical staff needed to run and maintain such LLMs are also costly. As the cherry on top, businesses implementing local LLMs must also ensure robust security measures are in place to safeguard the data on local machines, which must in turn comply with the ethical and legal requirements of the data laws of the country the business operates in. All in all, it sounds like baking quite an expensive cake!
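For a sense of what those frameworks look like in practice, here is a minimal sketch of running an open-weight model locally, assuming the Hugging Face transformers and PyTorch libraries (plus accelerate for device placement) and an already-downloaded model; the path and model size are illustrative. Even a modest 7-billion-parameter model loaded in half precision occupies roughly 14 GB of GPU memory, which is why the hardware bill climbs so quickly.

```python
# Local LLM: the weights sit on our own disk and the computation runs
# on our own GPU/CPU, so no data leaves the machine.
# Assumes `transformers`, `torch` and `accelerate` are installed and an
# open-weight 7B chat model has been downloaded to the path below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/models/open-7b-chat"  # illustrative local path

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # ~2 bytes per parameter: about 14 GB for 7B weights
    device_map="auto",          # spreads layers across available GPUs and CPU RAM
)

prompt = "Summarise this contract clause in plain English."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```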
One of the main reasons why one might hesitate to adopt local LLMs is their lack of continuous access to online updates. With hosted services, newer versions of retrained models are delivered seamlessly as the provider's computational infrastructure evolves. A small business or startup that has integrated its solutions with OpenAI models might be working with version 4 of a language model today, while tomorrow the provider could release an even more capable version through its hosted servers. From a business perspective, this means not having to worry about constantly downloading and reconfiguring newer versions on one's own machines. It also prevents the use of outdated models, which could result in suboptimal outcomes and potential security risks.
Lastly, let's discuss the most fundamental building block of LLMs: the collection of relevant data in sufficiently large volumes. As mentioned earlier, English stands as one of the few, if not the only, languages with an ample supply of the reliable and trustworthy data required to train an NLP model. For other languages, particularly regional Indian languages, data availability is severely limited and insufficient for generating high-quality results from a large language model. According to a study by Kantar in 2023, 1 in 5 Indian language users said that they encounter misinformation while consuming content online.
Currently, nationwide initiatives are underway to ensure we get accurate data for training language models capable of responding in various regional Indian languages, and to facilitate their widespread adoption. MeitY's Bhashini initiative unveiled 'Bhasha Daan', a groundbreaking crowdsourcing platform dedicated to curating a vast open repository of data aimed at enriching Indian languages. Sarvam AI's OpenHathi model, based on Meta AI's Llama2-7B architecture, prioritises advancements in Hindi and Hinglish language processing. Krutrim Si Designs' Krutrim LLM stands out for its ability to comprehend and interact fluently in 20 Indian languages. Businesses can comfortably rely on such broad national or international data collection initiatives to acquire good data and linguistic expertise through hosted and cloud-based LLMs, rather than attempting to gather data locally and train subpar models on small datasets.
Local LLMs may offer enhanced privacy and control, but they require careful assessment given challenges like infrastructure costs, expertise needs and scalability. Given their simplicity and ease of access, cloud-based LLMs can be the more practical choice for nascent businesses and organisations looking to enter the world of GenAI.
(The author is a data scientist and AI researcher based out of New York)