NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110

With all the hype surrounding the GeForce Titan, I think it would do everyone well to look at the more technical side of the technology behind this product and to evaluate it. Described by Nvidia as the fastest and most efficient architecture ever built, the Kepler GK110 finally takes the flagship spot that the GK104 had to cover in its absence. It is widely believed that the GK110 core was meant to be the flagship for Nvidia’s GeForce and Tesla lines from the start, but that apparently didn’t go to plan because of low yields, so Nvidia had to look for something else to fill the gap.

The GK104 just so happened to fit the bill, but at the price of sub-par compute performance compared to the GCN-based HD 7970. In some tests the chip was even beaten by the GF110 used in the GTX 580, which is surprising to say the least, especially after Nvidia made such a big deal of compute performance in the 500 series. There were, however, clear signs of deliberate handicapping on Nvidia’s part in an effort to protect its professional line: for example, Nvidia limits 64-bit double-precision math to 1/24 of the single-precision rate, which sharply reduces GPGPU performance in double-precision workloads.

Compute performance matters because these are the same graphics cores used in Nvidia’s professional line, which means that if something performs poorly in a consumer-oriented product, chances are the same thing will happen in professional-oriented products. It’s not really a major issue here, though, since the professional line makes use of the GK110 core. Still, Nvidia is not going to let itself be slaughtered by AMD in compute performance without fighting back, and this is where the GK110 core comes in.

GK110: Overview

According to Nvidia, the Kepler GK110 comprises 7.1 billion transistors, is the most architecturally complex microprocessor ever built, and was originally designed to be a compute powerhouse for Tesla and the HPC market. The GK110 will also provide over 1 teraflop of double-precision throughput with greater than 80% DGEMM efficiency, versus 60-65% on the prior Fermi architecture, and as we all know, the power efficiency of Kepler is outstanding. The Kepler GK110 introduces new features aimed at increasing GPU utilization and simplifying parallel program design, something which should be considered a healthy asset by many developers. A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers, although not every product will use all of the SMX units; some will ship with 13 or 14.
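To put the "over 1 teraflop" figure in context, here is a rough back-of-the-envelope sketch in CUDA host code. The 64 double-precision units per SMX and the 14-SMX, 0.732 GHz configuration are my own assumptions for illustration and are not figures from this article; the helper names are made up as well. The six 64-bit memory controllers do come from the text above and add up to a 384-bit interface.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Theoretical peak double-precision rate in GFLOPS:
// SMX count x DP units per SMX x 2 FLOPs per fused multiply-add x clock (GHz).
// (Hypothetical helper, named for this example only.)
double peak_dp_gflops(int smx_count, int dp_units_per_smx, double clock_ghz) {
    return smx_count * dp_units_per_smx * 2.0 * clock_ghz;
}

// Total memory interface width in bits (hypothetical helper).
int bus_width_bits(int controllers, int bits_per_controller) {
    return controllers * bits_per_controller;
}

int main() {
    // ASSUMPTIONS (not from the article): 64 DP units per SMX and a
    // 14-SMX part running at 0.732 GHz.
    double peak = peak_dp_gflops(14, 64, 0.732);
    printf("theoretical peak DP : %.1f GFLOPS\n", peak);
    printf("at 80%% DGEMM eff.   : %.1f GFLOPS\n", 0.80 * peak);

    // From the article: six 64-bit memory controllers.
    printf("memory interface    : %d bits\n", bus_width_bits(6, 64));

    // If a CUDA device is present, print what it reports for comparison.
    int sms = 0, bus = 0;
    if (cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0) == cudaSuccess &&
        cudaDeviceGetAttribute(&bus, cudaDevAttrGlobalMemoryBusWidth, 0) == cudaSuccess) {
        printf("device 0 reports    : %d multiprocessors, %d-bit bus\n", sms, bus);
    }
    return 0;
}
```

Under those assumptions the theoretical peak works out to roughly 1.3 teraflops, so an 80% DGEMM efficiency would still leave sustained throughput just above the 1 teraflop mark.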

The key features of the architecture include…

  • The new SMX processor architecture 
  • An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy and a fully redesigned and substantially faster DRAM I/O implementation (expect to see an ARM processor handling this in Maxwell)
  • Hardware support throughout the design to enable new programming model capabilities.

The benefits of Dynamic Parallelism

One such feature is Dynamic Parallelism, which adds the capability for the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work without involving the CPU. Programmers can now take advantage of more varied kinds of parallel work and make the most efficient use of the GPU as the computation “evolves” and advances. This benefits the system by offloading work from the CPU, and programs become easier to write.
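As a minimal sketch of what "the GPU generating work for itself" looks like in code, the example below has a parent kernel launch child kernels directly from the device. The kernel names, array sizes, and the `run_demo` helper are hypothetical and chosen only for illustration. It assumes a GPU of compute capability 3.5 or newer and a build with relocatable device code, typically something like `nvcc -arch=sm_35 -rdc=true demo.cu -lcudadevrt`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Child kernel: doubles every element of one segment.
__global__ void child(int *seg, int len) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) seg[i] *= 2;
}

// Parent kernel: each thread owns one segment and launches a child grid
// for it from the GPU itself, with no round trip to the CPU.
__global__ void parent(int *data, int segments, int seg_len) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < segments) {
        child<<<(seg_len + 127) / 128, 128>>>(data + s * seg_len, seg_len);
    }
}

// Returns the number of wrong elements, or -1 if a CUDA call failed.
int run_demo() {
    const int segments = 4, seg_len = 256, n = segments * seg_len;
    std::vector<int> host(n);
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(int)) != cudaSuccess) return -1;
    cudaMemcpy(dev, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // The host launches only the parent; the children are created on the GPU.
    parent<<<1, segments>>>(dev, segments, seg_len);

    // A parent grid is not complete until its child grids have finished,
    // so one host-side synchronize covers everything.
    cudaError_t err = cudaDeviceSynchronize();
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        if (host[i] != 2 * i) ++wrong;
    }
    return wrong;
}

int main() {
    int wrong = run_demo();
    if (wrong < 0) printf("CUDA error (is a compute capability 3.5+ GPU present?)\n");
    else printf("%d wrong elements\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

In a real program the parent would decide at run time how much child work to launch, for example refining only the regions of a grid that need it. That is the "evolving" computation the paragraph above refers to.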

The benefits of Hyper-Q

Hyper-Q allows multiple CPU cores to launch work on a single GPU simultaneously, which in turn increases GPU utilization and significantly reduces CPU idle time. What Hyper-Q basically does is increase the total number of active connections (work queues) between the host and the GK110 by allowing up to 32 simultaneous, hardware-managed connections, compared to the single connection available with Fermi. Applications that previously encountered false serialization across tasks, which limited the achievable GPU utilization, can see a dramatic increase in performance without any changes to existing CUDA code.
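The kind of code that stands to benefit is anything that already submits independent work through separate CUDA streams (or from separate host threads or MPI processes). Below is a small sketch, assuming eight independent streams stand in for eight separate host-side tasks; the kernel, the buffer sizes, and the `run_streams_demo` helper are hypothetical. Whether the kernels actually overlap on the GPU depends on the hardware, and the code itself is the same either way.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial independent task: add 1 to every element of a buffer.
__global__ void add_one(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Returns the number of streams with a wrong result, or -1 on a CUDA error.
int run_streams_demo() {
    const int kStreams = 8;          // well under the 32 hardware queues
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(float);

    cudaStream_t streams[kStreams];
    float *bufs[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        if (cudaStreamCreate(&streams[s]) != cudaSuccess) return -1;
        if (cudaMalloc(&bufs[s], bytes) != cudaSuccess) return -1;
        // All-zero bytes are 0.0f; ordered before the kernel in the same stream.
        cudaMemsetAsync(bufs[s], 0, bytes, streams[s]);
    }

    // Each stream gets its own kernel. Nothing here depends on another
    // stream, so the hardware is free to run these side by side.
    for (int s = 0; s < kStreams; ++s) {
        add_one<<<(n + 255) / 256, 256, 0, streams[s]>>>(bufs[s], n);
    }
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    int wrong = 0;
    for (int s = 0; s < kStreams; ++s) {
        float first = 0.0f;
        cudaMemcpy(&first, bufs[s], sizeof(float), cudaMemcpyDeviceToHost);
        if (first != 1.0f) ++wrong;
        cudaFree(bufs[s]);
        cudaStreamDestroy(streams[s]);
    }
    return wrong;
}

int main() {
    int wrong = run_streams_demo();
    if (wrong < 0) printf("CUDA error (is a GPU present?)\n");
    else printf("%d streams produced a wrong result\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

On hardware with a single work queue, independent streams like these can end up falsely serialized behind one another; with multiple hardware-managed queues the same unchanged program has more room to overlap.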

NVIDIA GPUDirect

This is a capability that enables GPUs within a single computer, or GPUs in different servers, to exchange data directly without needing to go through the CPU or system memory. It also reduces the demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. The GK110 also supports other GPUDirect features, including peer-to-peer and GPUDirect for Video.
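For the peer-to-peer case inside a single machine, a direct GPU-to-GPU copy can be sketched with the CUDA runtime's peer access calls. This is only an illustration, assuming a system with at least two GPUs (devices 0 and 1) that support peer access to each other; the `run_p2p_demo` helper and the buffer contents are hypothetical. It does not cover the server-to-server (network) case.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Returns 0 on success, 1 if the demo was skipped (fewer than two
// peer-capable GPUs), and -1 if the copied data did not match.
int run_p2p_demo() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) return 1;

    // Check that devices 0 and 1 can address each other's memory.
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) return 1;

    const int n = 1024;
    const size_t bytes = n * sizeof(int);
    std::vector<int> host(n, 7), back(n, 0);
    int *src = nullptr, *dst = nullptr;

    // Source buffer on device 0.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&src, bytes);
    cudaMemcpy(src, host.data(), bytes, cudaMemcpyHostToDevice);

    // Destination buffer on device 1.
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&dst, bytes);

    // Copy device 0 -> device 1. With peer access enabled, the data does
    // not need to be staged through system memory.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);

    cudaMemcpy(back.data(), dst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);

    return back == host ? 0 : -1;
}

int main() {
    int rc = run_p2p_demo();
    if (rc == 1) printf("skipped: need two peer-capable GPUs\n");
    else printf(rc == 0 ? "peer copy OK\n" : "peer copy mismatch\n");
    return rc < 0 ? 1 : 0;
}
```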

————————————————————————-

Conclusion

The documentation available for the GK110 goes far beyond what has been said in this article, and it would not do the product justice to try to list everything here, so the rest will follow in regular installments. That the GK110 is a compute monster has already been proven, with achievements appearing in recent news such as the sale of a cluster of K20X cards to China for a supercomputer. If the release of the GeForce Titan does come to pass, we will be able to see the gaming performance of this card and whether it annihilates the competition or not.

Author: Kingsmin1994

Someone who likes to obtain various information across many different fields.
