GPUs, FPGAs, TPUs for Accelerating Intelligent Applications

Intelligent applications are part of our everyday life. There is a constant flow of new algorithms, models and machine learning applications. Some need to ingest a lot of data, some need a lot of compute resources and some address real-time learning. Dedicated hardware capabilities can support some of those, but not all. Many mobile and cloud devices already have hardware-accelerated support for intelligent applications and offer intelligent services out of the box.

Generally, the following categories of hardware support for intelligent applications can be distinguished:

Graphics Processing Units (GPU)
Field Programmable Gate Arrays (FPGA)
Tensor Processing Units (TPU)

GPUs were among the first specialized hardware to support intelligent applications. That may not have been obvious, because their primary focus has been accelerating the 3D rendering of games. In fact, originally only a small part of the GPU's rendering capabilities was needed for machine learning, but a computation still had to go through all rendering stages. That changed with new GPU architectures that allow more flexible pipelines. Compared to CPUs or FPGAs they are good at only a few specific tasks, and you need to buy new hardware to evolve with them. GPUs can be seen as a variant of ASICs (see TPUs below).

FPGAs are highly specialized chipsets. Their advantage is that much of the programming is encoded in the hardware in the form of reusable blocks that can be arbitrarily combined in software. While most of those reusable blocks are configurable to represent logic functions, some are specialized hardware providing memory, network interconnection and so on. Despite this flexibility they can offer very good performance, but they are more difficult to produce and consume a lot of energy. They are suitable for machine learning algorithms that currently cannot be accelerated by GPUs/TPUs, and for innovative algorithms where the full potential of hardware acceleration is not yet known. FPGAs allow you to upgrade the hardware to accelerate those algorithms by doing a “simple” software upgrade.

TPUs were initially proposed by Google and are based on application-specific integrated circuits (ASICs); see here for an overview. Hence, they have similarities with GPUs. However, in contrast to GPUs, all functionality not relevant for specific machine learning algorithms has been discarded. For example, they use lower-precision, more compact data types.

A more detailed description of the differences between ASICs and FPGAs can be found here.

Nowadays I see more and more specialized hardware being combined into clusters (e.g. “GPU cluster”) to offer even more compute power.

Hardware support can significantly increase the performance of intelligent applications, but usually only for a small part of them. Loading the data, representing it efficiently in memory, searching within it, selecting the right subset and transforming it cannot be accelerated in most cases and require the careful design of proper distributed systems using NoSQL databases and Big Data platforms. Hence, one should see those specialized solutions in the context of a larger architecture rather than in isolation.

What is the secret of specialized hardware?

Specialized hardware for intelligent applications speeds up a compute-intensive machine learning algorithm by implementing part of the algorithm, such as matrix multiplication, directly in hardware and by calculating with more efficient non-standard data types. For instance, TPUs use non-standard narrow integers instead of floating point numbers. These require less space and can be calculated faster in hardware, but are less precise. However, for most intelligent applications the extra precision of other data types, which are much less efficient to calculate, does not matter. FPGAs work in a similar way, but provide some flexibility to reprogram the data type or to choose which part of the algorithm is optimized in hardware. GPUs are more generic and were originally designed for other purposes. Hence, they are not as fast, but they offer more flexibility, are more precise and are sufficient for most machine learning problems.
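To make the trade-off between narrow integers and floating point numbers more tangible, here is a minimal sketch in plain Python. It assumes a simple symmetric 8-bit quantization scheme with one shared scale per vector; this is an illustration only and not a description of how a particular TPU works. The function names and the sample values are made up for the example.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers that share one scale (symmetric quantization)."""
    max_int = 2 ** (num_bits - 1) - 1             # 127 for 8 bits
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale for all-zero input
    scale = max_abs / max_int
    return [round(v / scale) for v in values], scale


def int_dot(a_q, a_scale, b_q, b_scale):
    """Dot product in integer arithmetic, converted back to a float once at the end."""
    acc = sum(x * y for x, y in zip(a_q, b_q))    # integer multiply-accumulate
    return acc * a_scale * b_scale


# Illustrative values only.
weights = [0.12, -0.53, 0.98, 0.07]
inputs = [1.5, 0.25, -0.75, 2.0]

exact = sum(w * x for w, x in zip(weights, inputs))
w_q, w_scale = quantize(weights)
x_q, x_scale = quantize(inputs)
approx = int_dot(w_q, w_scale, x_q, x_scale)

print(f"float result:     {exact:.4f}")
print(f"quantized result: {approx:.4f}")
print(f"absolute error:   {abs(exact - approx):.4f}")
```

The multiply-accumulate loop only touches small integers, which is the part that dedicated hardware can implement compactly. The result differs slightly from the floating point one, and whether that difference matters depends on the application.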

Specialized hardware can optimize the training part of intelligent applications, but, much more importantly, it can optimize the prediction part, which is usually more compute-intensive because it has to be applied in production to the whole dataset.

Who uses this specialized hardware?

Until some years ago specialized hardware was used by large Internet companies whose business model was to process large amounts of their users' data in an intelligent way to generate revenue. Nowadays many more companies leverage specialized hardware, be it in the cloud or on-premise. Most of them work with video/image/audio streams and try to address use cases such as autonomous cars or image and voice recognition.

Can it be made available in your local data center?

Specialized hardware can be deployed in your data center. Most data center hardware vendors include a GPU in their offering that you can procure as an extra component and make available in virtualized form to your applications. TPUs and FPGAs are less common: TPUs are usually custom-made and not many are currently available for normal enterprise data centers. FPGAs are more readily available for the data center, but require a lot of custom programming/adaptation, which most enterprises won't do. Furthermore, they might not be supported by popular machine learning libraries.

However, if you want to provide them in your data center, then you also need to think about backup & restore and, more importantly, disaster recovery scenarios covering several data centers. This is not a trivial task, especially given contemporary machine learning libraries and how they are used when developing intelligent applications.

What is available in the cloud?

All popular cloud providers offer at least GPU support. Some go beyond that and offer TPU or FPGA support:

Amazon AWS currently supports general GPU virtual instances (P3/P2), GPUs for graphics processing (G3) and FPGA instances (F1). Some of the instance types support GPU clusters for very compute-intensive intelligent applications. You can also dynamically assign GPUs to instances only for the time you need them to save costs. The AWS Deep Learning AMIs (virtual images) come with common machine learning toolkits preconfigured and installed that integrate with the offered hardware accelerators. Many other services offer specific functionality (image recognition, translation, voice recognition, time series forecasting) with the acceleration built in, without the consumer of those services even noticing.
Microsoft Azure has a similar offering. There you also find dedicated instances for accelerating machine learning and graphics processing. Special images come with the operating system and popular machine learning libraries preconfigured. Brainwave offers the integration of FPGAs into your intelligent application. Cognitive Services offer intelligent functionality such as image and voice recognition, search etc.
Google Cloud Platform has a similar offering to the ones above.

As you can see, the cloud can even free you from extremely costly tasks such as developing machine learning models, training them, preparing training data, evaluating them and, most costly of all, making them production-ready. You can simply consume intelligent services without a heavy upfront investment. For many enterprises this model makes the most sense.

What about software support?

Many libraries for intelligent applications support GPUs, and cloud providers or hardware vendors even offer specialized versions that integrate more “exotic” support (e.g. TPU or FPGA). As a result, the libraries are also starting to support clustering of GPUs and other specialized hardware, i.e. the combination of several GPUs to have more compute power for an intelligent application. However, this is complex, and one should be careful not to work at too low a level, because you will spend more time optimizing the code than focusing on the business problem. A normal enterprise is also unlikely to optimize better than the existing libraries provided by cloud providers and hardware vendors. This means that for many applications enterprises do not use low-level libraries such as Nvidia CUDA or, to some extent, TensorFlow. Those are for very specialized companies or very specialized problems that require new types of models or data handling way beyond the existing models provided by higher-level frameworks. Instead, most enterprises will rely on higher-level libraries, such as Keras, MXNet, DeepLearning4j, or on cloud services as described before. The reason is that leveraging the specialized hardware and combining it with an algorithm requires a lot of know-how to run it properly and efficiently. For instance, you can also see from the TPU description above that some characteristics (e.g. numeric precision) are restricted in the hardware and the algorithms have to be adapted to this.

How to deal with costs?

One side note on costs for on-premise specialized hardware. Specialized hardware has very fast iterations: every few months a new, better version is offered. That does not mean that you need to buy new hardware every few months. Instead you can think about renting hardware for your data center, combining several specialized hardware items into one (“GPU Cluster”) and/or hybrid cloud deployments. Furthermore, if your intelligent application is very “hardware hungry”, in a lot of cases a better design of the application can bring you more intelligence and higher performance.

The cloud offers various cost models; the most prominent one is hourly billing of GPU time.
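A rough way to compare buying hardware with hourly billing is a break-even calculation. The following sketch assumes a deliberately simplified cost model (one purchase price, a flat hourly operating cost on-premise, a flat hourly cloud price). All figures are hypothetical placeholders, not real vendor or cloud prices, and the function name is made up for this example.

```python
import math


def break_even_hours(purchase_price, on_prem_cost_per_hour, cloud_price_per_hour):
    """Hours of use after which buying becomes cheaper than renting by the hour."""
    saving_per_hour = cloud_price_per_hour - on_prem_cost_per_hour
    if saving_per_hour <= 0:
        return math.inf  # renting is never more expensive per hour, so buying never pays off
    return purchase_price / saving_per_hour


# Hypothetical placeholder figures, chosen only to make the arithmetic easy to follow.
PURCHASE_PRICE = 10_000.0      # one-off cost of the accelerator
ON_PREM_COST_PER_HOUR = 0.5    # power, cooling, operations
CLOUD_PRICE_PER_HOUR = 3.0     # hourly billing of GPU time
EXPECTED_HOURS = 100 * 24      # 100 GPU hours per month until the hardware is replaced after 24 months

threshold = break_even_hours(PURCHASE_PRICE, ON_PREM_COST_PER_HOUR, CLOUD_PRICE_PER_HOUR)
decision = "buy" if EXPECTED_HOURS > threshold else "rent"
print(f"break-even after {threshold:.0f} GPU hours, expected use {EXPECTED_HOURS} hours: {decision}")
```

With fast hardware iterations the expected usage window is short, which in this simplified model pushes the decision towards renting unless utilization is high.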

What can I not do?

Not all machine learning algorithms benefit from specialized hardware. Specialized hardware is basically very good at everything that relies on matrix operations. Those are the foundation of many machine learning algorithms, but not all. This applies to training, but more importantly also to prediction. The other category (e.g. variants of decision trees or certain graph algorithms) is nonetheless needed and usually performs well even without hardware acceleration. Furthermore, specialized hardware will help you little with some of the most important steps, e.g. many types of feature extraction, cleaning of the data etc.
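The difference between the two categories can be sketched in a few lines of plain Python. The example below contrasts the prediction step of a single dense neural network layer, which is a matrix-vector product, with the prediction step of a small decision tree, which is a chain of data-dependent branches. The weights, the tree and the function names are invented for illustration.

```python
def dense_layer(weights, bias, x):
    """One dense layer: y = W x + b, i.e. a plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def tree_predict(node, x):
    """Decision tree: follow data-dependent branches until a leaf is reached."""
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]


# Illustrative model parameters only.
W = [[0.5, -1.0], [2.0, 0.25]]
b = [0.1, -0.2]
tree = {
    "feature": 0, "threshold": 1.0,
    "left": {"leaf": "A"},
    "right": {"feature": 1, "threshold": 3.0,
              "left": {"leaf": "B"}, "right": {"leaf": "C"}},
}

sample = [2.0, 4.0]
print(dense_layer(W, b, sample))   # same multiply-add pattern for every row
print(tree_predict(tree, sample))  # path depends on the values in the sample
```

The dense layer repeats the same multiply-add pattern for every row, which is the kind of regular work that matrix hardware is built for. The tree does only a handful of comparisons per sample, and which ones it does depends on the data, so there is little matrix arithmetic to accelerate.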

Conclusion

Keep in mind that there is a lot to optimize in an intelligent application. The key is to understand the underlying business problems and not to “fall in love” with specific hardware or algorithms. Not all intelligent algorithms can be sped up with specialized hardware, but such algorithms are nevertheless useful in many scenarios.

In some cases a different application design, for instance one that leverages NoSQL databases, can bring a significant increase in performance and accuracy without relying on specialized hardware.

In other cases, a better understanding of the underlying business problem to solve will deliver insights on how to improve an intelligent application.

However, there are use cases for specialized hardware: for training certain intelligent applications, but more importantly for prediction, which needs to be applied to all items of a dataset in production and is usually very compute-intensive.

Enterprises should carefully evaluate whether they want to enter the adventure of machine learning and even integrate it with specialized hardware. Nowadays cloud providers offer intelligent services that can simply be consumed and that have been trained by experts. Replicating this in your own enterprise requires a heavy up-front investment. Especially if you are just starting the journey towards intelligent applications, it is recommended that you first check the cloud offerings and experiment with them. The intelligent services offered by cloud providers should serve as benchmarks for internal endeavors to determine whether these can compete and are worth the investment.

In the future I see a multichip world where an intelligent application leverages many different chips (GPUs, FPGAs, TPUs) at the same time in an intelligent cluster. Each dataset and algorithm could even have a custom-developed chip. Those ensembles will be the next step towards more sophisticated intelligent applications and especially cloud services ready for consumption.
