GK110 GeForce Titan Finally Unveiled!

The beast has finally been unleashed! NVIDIA has at last decided to launch the super-performing GeForce Titan for the consumer market, aimed squarely at those with deep pockets. The GK110 carries 15 SMXes, each composed of a number of functional units: 192 FP32 CUDA cores, 64 FP64 units, 64KB of L1 cache/shared memory, 65,536 32-bit registers, and 16 texture units. Alongside those SMXes, the GK110 deploys 6 ROP partitions, each with 8 ROPs and 256KB of L2 cache, and each connected to its own 64-bit memory controller.

Coming in at a massive 7.1 billion transistors, the chip takes up 551mm² on TSMC's 28nm process. The GK110 was originally meant to be the flagship of the 600 series, but NVIDIA had other plans: it held the GPU at bay and promptly substituted the GK104 core instead. After winning the super-computing bid for Oak Ridge National Laboratory's Titan supercomputer, NVIDIA likely emptied its pockets of GK110 cores supplying the Tesla K20X GPUs it had sold, but it is now ready to release the second device based on the GK110 (technically the third), which it calls its greatest accomplishment for the consumer market.

The GeForce GTX Titan uses a restricted GK110 core with all 6 ROP partitions and the full 384-bit memory bus enabled, but only 14 of the 15 SMXes, which means the Titan will be swinging with 2688 FP32 CUDA cores and 896 FP64 CUDA cores (14 SMXes × 192 FP32 cores, and 14 × 64 FP64 units). Fear not, for this configuration is actually identical to the K20X NVIDIA is currently shipping; no GK110 currently ships with all 15 SMXes enabled, and no one really knows why. The GTX 680 was able to clock as high as 1006MHz, but this is not the case for the GeForce GTX Titan: being a much bigger GPU, it has to be downclocked to a reasonable 837MHz with a boost clock of 876MHz, which isn't much of a boost to be quite frank. However, NVIDIA was kind enough to slap 6GB of GDDR5 RAM on the card, making it very appealing for multi-monitor setups; it has more bandwidth than it knows what to do with and a whole lot more shading, compute, and texturing performance. Since Titan is based on a compute GPU, enthusiasts will also benefit from the extra horsepower, without any of the limitations that prevented other GeForce GPUs from reaching their full potential in an effort to protect NVIDIA's Tesla lineup.

GK110’s efficiency compared with other platforms

Titan makes use of the same style of cooler as its predecessors, the GTX 690 and 590, before it. Showing off its luxury status, there is no plastic to be found on the card; NVIDIA has tried to make it as clean as possible, and there is even a polycarbonate window letting you admire the heatsink to your heart's content. Titan moves on from the 4+2 power phase design that debuted on the GTX 680 to a 6+2 power phase design; a 6-pin and an 8-pin connector provide power, allowing for a total capacity of 300W, with Titan sitting comfortably at a 250W TDP and leaving a subtle overhead for extreme overclockers. In addition, Titan follows precedent with two DL-DVI ports, one HDMI port, and one full-size DisplayPort.

Front view of the GeForce GTX Titan

NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110

With all the hype surrounding the GeForce Titan, I think it would do everyone well to talk about the more technical aspects of the technology behind this product and to review and evaluate it. Quoted by NVIDIA to be the fastest and most efficient architecture ever built, the Kepler GK110 finally fills the flagship spot that the GK104 had been pressed into holding in its absence. Everyone knows that the GK110 core was meant to lead NVIDIA's GeForce and Tesla lines, but things apparently didn't go to plan due to low yields, so NVIDIA had to look for something else to cover up this tragedy.

The GK104 just so happened to fit the bill, but at the price of delivering sub-par compute performance compared to AMD's GCN-based HD 7970. The chip even managed to be defeated in some tests by the GF110 used in the GTX 580, which is surprising to say the least, especially after NVIDIA made such a big deal of compute performance in the 500 series. There were also clear examples of sabotage on NVIDIA's part in an effort to protect its professional line: for example, NVIDIA limits 64-bit double-precision math to 1/24 the single-precision rate, effectively crippling GPGPU performance (the GTX 680's roughly 3.1 TFLOPS of single precision drops to around 128 GFLOPS of double precision).

The reason compute performance is so important is that these are the same graphics cores used inside NVIDIA's professional line, which means that if something performs poorly in a consumer-oriented product, chances are the same thing will happen in professional-oriented products. Admittedly, that is less of an issue here since the professional line makes use of the GK110 core. Still, everyone knows that NVIDIA will not allow itself to be willingly slaughtered by AMD when it comes to compute performance without fighting back, and this is where the role and responsibility of the GK110 core comes in.

GK110: Overview

According to NVIDIA, the Kepler GK110 comprises 7.1 billion transistors, making it the most architecturally complex microprocessor ever built; it was originally designed to be a compute powerhouse for Tesla and the HPC market. The GK110 provides over 1 teraflop of double-precision throughput with greater than 80% DGEMM efficiency, versus 60-65% on the prior Fermi architecture, and as we all know, Kepler's power efficiency is outstanding. The Kepler GK110 also introduces features aimed at increasing GPU utilization and simplifying parallel program design, which should be considered a healthy asset to many developers. A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers, although not all products will use all of the SMX units; some will ship with 13 or 14 SMX units.

The key features of the architecture include:

  • The new SMX processor architecture 
  • An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy and a fully redesigned and substantially faster DRAM I/O implementation (expect to see an ARM processor handling this in Maxwell)
  • Hardware support throughout the design to enable new programming model capabilities.

The benefits of Dynamic Parallelism

One such feature is Dynamic Parallelism, which adds the capability for the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work without involving the CPU. Programmers can now take advantage of more varied kinds of parallel work and make the most efficient use of the GPU as the computation "evolves" and advances. This offloads work from the CPU and makes programs easier to create, as the sketch below illustrates.
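To make this concrete, here is a minimal sketch of Dynamic Parallelism in CUDA; the kernel names and sizes are my own illustration, not from NVIDIA's documentation. A parent kernel launches a child grid directly from the device, with no CPU round trip. Code like this targets compute capability 3.5 and would be built with something like nvcc -arch=sm_35 -rdc=true -lcudadevrt.

#include <cuda_runtime.h>

// Illustrative child kernel: doubles each element of the array.
__global__ void childKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Parent kernel: generates new work for the GPU from the GPU itself.
// On pre-GK110 hardware, only the host could launch kernels.
__global__ void parentKernel(float *data, int n) {
    if (threadIdx.x == 0) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;  // grid size decided at run time
        childKernel<<<blocks, threads>>>(data, n); // device-side launch
        cudaDeviceSynchronize();                   // wait for the child grid
    }
}

The parent decides the child grid's dimensions at run time, which is exactly the "work evolves as the computation advances" pattern NVIDIA describes.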

The benefits of Hyper-Q

Hyper-Q allows multiple CPU cores to issue work to a single GPU simultaneously, which in turn increases GPU utilization and significantly reduces CPU idle time. What Hyper-Q basically does is increase the total number of active connections between the host and the GK110, allowing up to 32 simultaneous, hardware-managed connections. Applications that previously encountered false serialization across tasks, limiting achievable GPU utilization, can see a dramatic increase in performance without any changes to their CUDA code. A small sketch of the pattern follows.
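As a rough illustration (the busy-loop kernel below is made up for demonstration, not an NVIDIA benchmark), the host issues work on 32 CUDA streams. On Fermi these launches funnel through a single hardware work queue and can be falsely serialized; on GK110 each stream can map to its own hardware queue.

#include <cuda_runtime.h>

__global__ void busyKernel(float *out, int idx) {
    // Deliberately tiny but long-running kernel, so that concurrency
    // between streams is actually observable in a profiler.
    float x = 0.0f;
    for (int i = 0; i < (1 << 16); ++i) x += sinf(i * 0.001f);
    out[idx] = x;
}

int main() {
    const int NSTREAMS = 32;  // GK110 offers up to 32 hardware work queues
    cudaStream_t streams[NSTREAMS];
    float *d_out;
    cudaMalloc(&d_out, NSTREAMS * sizeof(float));
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&streams[i]);

    // Independent launches on independent streams; Hyper-Q lets these
    // run concurrently instead of queueing behind one another.
    for (int i = 0; i < NSTREAMS; ++i)
        busyKernel<<<1, 1, 0, streams[i]>>>(d_out, i);
    cudaDeviceSynchronize();

    for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(streams[i]);
    cudaFree(d_out);
    return 0;
}

Note that the CUDA source does not change between Fermi and Kepler; the extra queues are purely a hardware improvement, which is why existing applications can speed up without modification.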

NVIDIA GPUDirect

This is a capability that enables GPUs within a single computer, or GPUs in different servers, to exchange data directly without needing to go through the CPU or system memory. It also reduces the demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. The GK110 supports the other GPUDirect features as well, including peer-to-peer transfers and GPUDirect for Video; a sketch of the peer-to-peer case follows.
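For the peer-to-peer case specifically, the CUDA runtime already exposes the necessary calls; the snippet below is a minimal sketch (the buffer size and device numbering are arbitrary) of copying data straight from one GPU's memory to another's, bypassing host memory.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    // Ask whether device 0 can directly address device 1's memory.
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) { printf("P2P not available between these GPUs\n"); return 1; }

    size_t bytes = 1 << 20;  // 1MB demo buffer
    float *buf0, *buf1;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // let device 0 reach device 1
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Direct device 1 -> device 0 copy; no staging through CPU memory.
    cudaMemcpyPeer(buf0, 0, buf1, 1, bytes);

    cudaSetDevice(0); cudaFree(buf0);
    cudaSetDevice(1); cudaFree(buf1);
    return 0;
}

The RDMA flavor of GPUDirect, where a network adapter reads GPU memory directly, additionally involves the NIC driver stack and is beyond a short sketch like this.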

————————————————————————-

Conclusion

The description and documentation available for the GK110 far surpasses what has been said in this article, and it would not do the product justice to list everything here, so the rest will come at regular intervals. That the GK110 is a compute monster has already been proven, with achievements appearing in recent news such as the sale of a cluster of K20X cards to China for a supercomputer. Once the GeForce Titan's release comes to pass, we will be able to see the gaming performance of this card and whether it annihilates the competition or not.