Speed
An attribute of microprocessors is their speed. Speed refers to the rate at which a microprocessor executes instructions, often measured in MIPS (millions of instructions per second). It is a function of clock rate, cache memory and architecture. Current microprocessors can execute hundreds of thousands of MIPS.
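As a rough illustration of how such figures arise, the minimal sketch below converts clock rate and architecture into MIPS; the clock rate, instructions-per-cycle and core count are assumed values for illustration, not figures from this text:

# Back-of-the-envelope MIPS estimate from assumed chip parameters.
clock_hz = 3.5e9             # assumed 3.5 GHz clock
instructions_per_cycle = 4   # assumed average IPC of a modern superscalar core
cores = 8                    # assumed core count

mips_per_core = clock_hz * instructions_per_cycle / 1e6
total_mips = mips_per_core * cores
print(f"Per core:   {mips_per_core:,.0f} MIPS")   # 14,000 MIPS
print(f"Whole chip: {total_mips:,.0f} MIPS")      # 112,000 MIPS, i.e. hundreds of thousands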
Applications continue to emerge that demand ever-increasing speeds. In one such domain, Artificial Intelligence (AI), demand for computing power keeps rising as its models, especially deep learning models, grow in complexity. This increasing complexity in turn requires even larger datasets, and hence more computing power to process the data.
Challenges in increasing speed
The speed at which a microprocessor can operate is inversely proportional to the size of its transistors. Smaller transistors are packed more closely, so electrical signals take less time to travel between them. Smaller transistors also mean that more of them can fit on a chip, adding functionality that can be executed in parallel. Furthermore, smaller transistors consume less power and switch at faster speeds.
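To make the travel-time argument concrete, here is a small sketch of signal propagation delay; the signal speed and path lengths are assumptions for illustration only:

# Shorter interconnect means shorter signal travel time.
signal_speed_m_per_s = 1.5e8   # assume signals propagate at roughly half the speed of light
for path_length_mm in (20.0, 10.0, 5.0):
    delay_ns = (path_length_mm / 1000) / signal_speed_m_per_s * 1e9
    print(f"{path_length_mm:4.1f} mm path -> ~{delay_ns:.3f} ns propagation delay")
# At multi-GHz clock rates the cycle time is well under 1 ns, so shaving fractions of a
# nanosecond off signal paths matters; packing transistors more densely does exactly that.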
It might seem that performance could be increased by putting more transistors on a bigger microprocessor, but there are limits to enlarging the chip itself. A bigger chip is more likely to contain fabrication defects and wastes more of the wafer from which it is cut. Separately, commercially available microprocessors already use transistors created by sub-10 nm processes and pack billions of transistors onto a chip. Such small transistors become susceptible to unwanted quantum effects, and in such large numbers they dissipate excessive heat.
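The yield penalty of bigger chips can be sketched with the standard Poisson defect model, Y = exp(-D * A); the model and the defect density below are illustrative assumptions, not figures from this text:

# Why bigger dies hurt yield: the expected fraction of defect-free chips falls with area.
import math

defect_density = 0.1   # assumed defects per cm^2
for die_area_cm2 in (1.0, 2.0, 4.0, 8.0):
    yield_fraction = math.exp(-defect_density * die_area_cm2)
    print(f"die area {die_area_cm2:4.1f} cm^2 -> expected yield {yield_fraction:.1%}")
# Each doubling of die area cuts the share of usable chips, which is the wafer
# wastage referred to above.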
AI chips
An AI chip, also called an AI accelerator or AI hardware, is a chip optimized for the needs of AI. These chips exploit certain features of AI models. AI workloads execute the same instructions repeatedly, but the precision needed is relatively low; using low-precision arithmetic saves power, reduces memory bandwidth demand and cuts the number of transistors needed per calculation. An entire AI algorithm can be placed on a single chip, and the chip can be optimized to work with an AI-specific language. The approaches described below use these features to create AI-optimized chips.
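As a concrete sketch of the low-precision idea, the snippet below quantizes a float32 weight matrix to int8 using simple symmetric linear quantization; this is one common scheme, shown for illustration rather than as how any particular chip implements it:

# Quantize float32 weights to int8 and compare storage needs.
import numpy as np

weights = np.random.randn(1024, 1024).astype(np.float32)   # toy weight matrix

scale = np.abs(weights).max() / 127.0                 # map the float range onto int8
q_weights = np.round(weights / scale).astype(np.int8)
dequantized = q_weights.astype(np.float32) * scale

print(f"float32 size: {weights.nbytes / 1e6:.1f} MB")     # ~4.2 MB
print(f"int8 size:    {q_weights.nbytes / 1e6:.1f} MB")   # ~1.0 MB, a 4x saving in memory and bandwidth
print(f"max rounding error: {np.abs(weights - dequantized).max():.4f}")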
Neuromorphic computing
Neuromorphic computing tries to mimic the way the human brain works. The human brain has around 100 billion neurons, each of which can have up to 10,000 connections, or synapses, with its neighbours, giving up to 100 trillion connections in total. Only a subset of these neurons is active at any time, sending signal pulses to some of their neighbours.
One way to implement this is through Spiking Neural Networks (SNNs). Here each "neuron" sends independent signals to other neurons, and the connections between neurons carry weights. As signals, or spikes, travel from a neuron to its destinations, the pattern of their timings combined with the connection weights conveys information, e.g. a distinct pattern for each animal in an image recognition problem.
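A minimal leaky integrate-and-fire neuron, one common way to model an SNN neuron in software, is sketched below; all weights, thresholds and input spike patterns are assumed values chosen for illustration:

# A single leaky integrate-and-fire neuron receiving spikes from three inputs.
import numpy as np

threshold, leak = 1.0, 0.9             # spike threshold and per-step leak factor
weights = np.array([0.4, 0.3, 0.6])    # synaptic weights of the three incoming connections

input_spikes = np.array([              # each row: which inputs spiked at that time step
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
])

membrane = 0.0
for t, spikes in enumerate(input_spikes):
    membrane = membrane * leak + np.dot(weights, spikes)   # integrate weighted spikes, with leak
    if membrane >= threshold:                              # fire and reset once the threshold is crossed
        print(f"t={t}: output spike (potential {membrane:.2f})")
        membrane = 0.0
    else:
        print(f"t={t}: no spike    (potential {membrane:.2f})")
# A different input pattern or set of weights produces different output spike timings,
# which is how such networks encode information.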
Intel has created the Loihi chip, which has 130,000 neurons and 130 million synapses and can self-learn. Its Pohoiki Springs system combines 768 chips to provide 100 million neurons. Similarly, IBM demonstrated 16 million neurons and 4 billion synapses using 16 TrueNorth chips. Loihi chips can reach a given recognition accuracy using 3,000 times fewer samples than conventional Deep Neural Networks.
Graphics Processing Units (GPUs)
GPUs were initially introduced for graphics, which they render using parallel processing. They have since been customized for AI, which offers large scope for parallelism. Unsurprisingly, Nvidia, the original pioneer of GPUs, dominates the market for AI-specific chips. Its flagship chip, the A100, has 54 billion transistors, more than any CPU. AMD is another important player. These chips are used in data centres to train AI models, which requires heavy memory and compute resources.
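The reason AI maps so well onto this kind of hardware can be seen in a small sketch: a matrix multiplication, the core operation of deep learning, decomposes into many independent dot products that parallel hardware can compute concurrently (the matrix sizes below are arbitrary illustrative choices):

# A matrix multiply is just many independent dot products.
import numpy as np

A = np.random.randn(256, 512)   # e.g. a batch of activations
W = np.random.randn(512, 128)   # e.g. a layer's weights

# Element (i, j) of the output depends only on row i of A and column j of W,
# so all 256 * 128 dot products below could run at the same time on parallel hardware.
out = np.empty((A.shape[0], W.shape[1]))
for i in range(A.shape[0]):
    for j in range(W.shape[1]):
        out[i, j] = np.dot(A[i, :], W[:, j])

assert np.allclose(out, A @ W)   # matches the vectorized result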
Application-specific integrated circuits (ASICs)
Lighter-weight chips, however, are needed for "inference", i.e. running a trained AI model to produce predictions. These chips need to sit closer to the application: in a driverless vehicle, for example, the chip must be on the vehicle itself to make timely decisions about speed, braking and so on. While training a model can use cloud-based hardware, inference has to happen at the "edge" using smaller chips. Besides driverless vehicles, applications of inference include cameras, smartphones and smartwatches. Because AI use cases are so diverse, customized chips called ASICs are used for inference; higher-end ASICs are also used for training AI models.
One important player in edge computing is the British chip designer ARM, whose architecture underpins 95% of smartphones. Google's TPUs (Tensor Processing Units) are ASICs customized for its TensorFlow software. A TPU v4 pod, which combines 4,096 TPU v4 chips, can reach a performance of 1.1 exaflops (10^18 floating-point operations per second), comparable to high-end supercomputers. Other companies, e.g. Amazon, Facebook and Apple, have their own ASICs.
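A quick back-of-the-envelope check on the pod figure quoted above gives the implied per-chip throughput:

# Implied per-chip throughput of a TPU v4 pod, using the numbers in the text.
pod_flops = 1.1e18   # 1.1 exaflops for the full pod
chips = 4096         # TPU v4 chips combined in one pod

per_chip_tflops = pod_flops / chips / 1e12
print(f"Implied per-chip throughput: ~{per_chip_tflops:.0f} TFLOPS")   # roughly 270 TFLOPS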
Field-programmable gate arrays (FPGAs)
FPGAs can be used for both training and inference, though they are mostly used for inference. They can be reprogrammed to suit a specific situation. They are less efficient than ASICs, but because AI models and their use cases continue to evolve, the ability to reconfigure FPGAs reduces costs. Their share of the AI chip market is smaller than that of ASICs, as deploying them at scale remains a challenge. Intel and Xilinx are important players in this space.
Other approaches
Cerebras’ Wafer Scale Engine (WSE) is 46,225 mm² in size, contains 1.2 trillion transistors and is optimized for AI models. It is built on a 16 nm process; the planned next-generation chip will use a 7 nm process and contain 2.6 trillion transistors.
IBM has demonstrated a new approach that performs computation within memory, avoiding the need for data to travel between memory and processor. It used 8-bit precision, preserving accuracy while saving time and energy.
Multiple smaller companies are working on novel approaches. For example, Graphcore’s Colossus MK2 is massively parallel, with multiple instructions working on multiple data; Kneron provides an architecture that allows AI models to be reconfigured in real time; and Tenstorrent’s Grayskull processor dynamically eliminates unnecessary computation.
Other technologies
The computing needs of AI also contribute to research into quantum computers, which are based on the probabilistic models of quantum mechanics; supercomputers, which use multiple processors; and DNA computers, which have the advantage of millions of molecules working in parallel.
Summary
The insatiable demand for processing power in AI is fuelling innovation in chip design, which faces both technological and commercial challenges. There are multiple approaches, players and technologies in play. It is an emerging market with no timeline for announcing winners.