NVIDIA GK110: Tesla K20 Launch Preview

Product: NVIDIA GK110
Company: NVIDIA
Author: James Prior
Editor: Charles Oliver
Date: November 12th, 2012

NVIDIA Tesla K20 Launch Preview

The international conference for High Performance Computing (HPC), networking, storage and analysis known as Supercomputing 2012 (SC12) begins today with the release of the new Top 500 list of supercomputers. If you listen carefully, you can hear the sound of the Sequoia supercomputer, crowned in June 2012, being toppled by a Titan: the US Department of Energy's IBM BlueGene/Q system is displaced by Oak Ridge National Laboratory's Titan, an upgrade of its Jaguar system. The outgoing leader was measured at 16.32 PFLOPS against the newcomer's more than 20 PFLOPS. Jaguar's 2.3 PFLOPS grows by nearly an order of magnitude, with 90% of that compute power coming from the Tesla K20 add-in board (AIB) in each node: 18,688 boards in total.

The Tesla K20 is 'big Kepler' - NVIDIA's 7.1bn transistor compute monster using the Kepler architecture seen so far in their consumer, professional and compute boards in the GK104, GK106, and GK107 chips. GK110 is different, with a much higher double precision (DP) rate: instead of the 1/24th of single precision (SP) rate that the smaller chips have, GK110 runs DP at 1/3 rate. This translates to 3.95 TFLOPS SP and 1.31 TFLOPS DP in a 235W board, just shy of the headline 4 TFLOPS (SP) of AMD's single GPU FirePro W9000 but well ahead of that FirePro's 1 TFLOPS (DP) rate, and surpassing the direct competitor, the FirePro S9000. NVIDIA's own Tesla K10 features dual GK104 GPUs, each with 4GB of RAM - a GeForce GTX 690 for the server realm with lowered clocks - and boasts 4.57 TFLOPS SP and 190 GFLOPS DP in a 225W TDP board. K20 loses some SP FLOPS/watt, but gains tremendously on DP FLOPS/watt. So much for NVIDIA 'walking away from compute'. However, AMD has beaten NVIDIA to the 'most powerful' punch with the FirePro S10000, a dual Tahiti GPU card launched today featuring 5.91 TFLOPS SP and 1.48 TFLOPS DP performance from an 825MHz core engine clock and 3GB of 5Gbps GDDR5 per GPU. Now we know why New Zealand hasn't hit the consumer market.

AMD is now caught in the middle. Previously, if you wanted SP performance, the K10 was the card (provided your workload scaled over two GPUs and fit in 4GB chunks), and it fits within the PCI-Express specification's 225W limit too, but you'd need to be very special indeed to prefer K10 over W9000 for DP performance. K20 addresses that, offering a single GPU with a larger memory size, too. The K20/K20X comparison of performance/watt is interesting, showing where these cards lie:

Product    SP GFLOPS   DP GFLOPS   TDP (W)   Price
K20X       3950        1310        235       $3199
K20        3520        1170        225       $4000 - $5000
K10        4577        190         225       $3399
Xeon Phi   Unknown     1000        245       Unknown
S9000      3230        806         225       $2399
W9000      4000        1000        274       $3399
S10000     5900        1480        375       $3999

With the release of the Tesla K20 series, NVIDIA has a winning product in both single precision and double precision performance. Even with 7.1bn transistors, K20X offers better performance per watt than AMD's W9000 in both single and double precision, meaning any premium charged over the W9000/K10 price point is likely to be justified by raw performance alone.

Product    SP Perf/W (GFLOPS/W)   DP Perf/W (GFLOPS/W)
K20X       16.81                  5.57
K20        15.64                  5.2
K10        20.34                  0.84
Xeon Phi   Unknown                4.08
S9000      14.35                  3.58
W9000      14.59                  3.64
S10000     15.73                  3.95
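
The performance-per-watt figures above are simply each card's peak GFLOPS divided by its TDP. As a rough illustration, the following Python sketch recomputes them from the spec figures quoted earlier. The names `CARDS`, `perf_per_watt` and `efficiency_table` are made up for this example, and the Xeon Phi SP figure is left as unknown, as in the article.

```python
# Figures copied from the spec table in the article:
# (SP GFLOPS, DP GFLOPS, TDP in watts); None marks a value listed as unknown.
CARDS = {
    "K20X": (3950, 1310, 235),
    "K20": (3520, 1170, 225),
    "K10": (4577, 190, 225),
    "Xeon Phi": (None, 1000, 245),
    "S9000": (3230, 806, 225),
    "W9000": (4000, 1000, 274),
    "S10000": (5900, 1480, 375),
}


def perf_per_watt(gflops, tdp_watts):
    """Return GFLOPS per watt, or None when the throughput is unknown."""
    if gflops is None:
        return None
    return gflops / tdp_watts


def efficiency_table(cards):
    """Map each product name to its (SP GFLOPS/W, DP GFLOPS/W) pair."""
    return {
        name: (perf_per_watt(sp, tdp), perf_per_watt(dp, tdp))
        for name, (sp, dp, tdp) in cards.items()
    }


if __name__ == "__main__":
    for name, (sp_w, dp_w) in efficiency_table(CARDS).items():
        sp_text = "unknown" if sp_w is None else f"{sp_w:.2f}"
        print(f"{name:<9} SP {sp_text:>7} GFLOPS/W   DP {dp_w:.2f} GFLOPS/W")
```

A few recomputed entries may differ from the table in the last digit (the S9000 and W9000 rows, for instance), which suggests some of the published values were truncated rather than rounded.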

NVIDIA is not announcing pricing at this time, preferring instead to let their partners do that. In practice this means market forces will apply, with demand and customer standing dictating pricing and availability. There are two models for organizations and enterprises to choose from, the Tesla K20X and the Tesla K20. The difference in naming is small, and so are the differences between the cards: externally, there is just a 10W difference in TDP and an active fan on the K20.

The GK110 design features 15 SMXs, with Tesla K20X featuring 14 active SMXs and Tesla K20 13 active SMXs. The memory configuration changes from K20X to K20 as well, dropping from the full 384-bit bus and 6GB to 320-bit and 5GB. Clocks, boost capable but with limits not disclosed, also differ between the two: K20X runs the presumed 2688 cores at 732MHz and has a 235W TDP, while K20 runs 2496 cores at 706MHz and has a 225W TDP. ECC is supported, using 12.5% of the base memory capacity and reducing bandwidth by 2% to 15%, depending on workload. NVIDIA claims this is roughly half the impact that ECC had on Fermi based products, further increasing the performance benefit seen in upgrading to K20/K20X. Neither card features display outputs, and each presumably requires a 6-pin and an 8-pin PCI-e power input connector, perhaps just dual 6-pin on the lower-power card. We inquired about SLI and were told that the board doesn't support it, but there are clearly two SLI connectors on the PCB in the promotional images supplied to us... an oversight in the mockup, or an indication of GK110's future destination - the desktop?
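
The headline throughput figures can be roughly reproduced from the SMX counts and clocks quoted above. The sketch below is a back-of-the-envelope check, not NVIDIA's own method. It assumes 192 CUDA cores per SMX (consistent with 2688 cores across 14 SMXs), two floating point operations per core per clock (one fused multiply-add), and DP at 1/3 of the SP rate as stated earlier. The function names are hypothetical.

```python
CORES_PER_SMX = 192        # assumption: Kepler SMX width (2688 cores / 14 SMXs)
FLOPS_PER_CORE_CLOCK = 2   # assumption: one fused multiply-add per clock
DP_RATIO = 1 / 3           # GK110 DP rate relative to SP, per the article
ECC_OVERHEAD = 0.125       # share of memory reserved when ECC is on, per the article


def peak_gflops(active_smx, clock_mhz):
    """Return theoretical (SP, DP) peak throughput in GFLOPS."""
    cores = active_smx * CORES_PER_SMX
    sp = cores * FLOPS_PER_CORE_CLOCK * clock_mhz / 1000.0
    return sp, sp * DP_RATIO


def usable_memory_gb(total_gb, ecc_enabled=True):
    """Return the memory left for data once ECC overhead is subtracted."""
    return total_gb * (1 - ECC_OVERHEAD) if ecc_enabled else total_gb


if __name__ == "__main__":
    # (active SMXs, base clock in MHz, memory in GB) for the two configurations
    for smx, mhz, mem_gb in [(14, 732, 6), (13, 706, 5)]:
        sp, dp = peak_gflops(smx, mhz)
        print(f"{smx} SMXs @ {mhz}MHz: {sp:.0f} GFLOPS SP, {dp:.0f} GFLOPS DP, "
              f"{usable_memory_gb(mem_gb):.3f}GB usable with ECC")
```

Under these assumptions the 14-SMX, 732MHz configuration works out to about 3935 GFLOPS SP and 1312 GFLOPS DP, and the 13-SMX, 706MHz configuration to about 3524 GFLOPS SP and 1175 GFLOPS DP. These are close to the 3950/1310 and 3520/1170 figures quoted for the two boards, and the small gaps are presumably down to rounding in the published numbers.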

There are many purposes for Tesla K20; it's not restricted to customized HPC code and Linpack records. Industry, as opposed to education or research, is the biggest consumer of compute resources, and CUDA powered Tesla stands poised to accelerate many segments, with key segments already accelerated. This is not to say education is being neglected. On the contrary, NVIDIA claims CUDA is the world's most pervasive parallel programming model and also the easiest way to transition programmers into the world of data parallelism. AMD is much closer on perf/W in the GPU realm than in CPUs, yet remains further behind in market share, underscoring the importance of the software and developer ecosystem. NVIDIA has put the most money into this area, and it shows.

As SC12 approached, Cray - the maker of the ORNL Titan - announced a new supercomputer design that theoretically shames Titan, scaling out to 100 PFLOPS. Cray is switching from AMD's Opteron processors to Intel Xeon in their XC30 Cascade design. Cascade will feature up to one million cores, but it's not known what type of cores those are; the basis for Cascade is the Intel Xeon E5-2600 series of processors with Xeon Phi coprocessors or Kepler architecture Tesla boards. This is a big blow to AMD, dropped off the list for the HPC market won for them by the original Opteron design, although as long as they maintain socket compatibility we'll see existing supercomputers upgraded with successive generations of Opterons. Either way, NVIDIA has shown that big compute happens very well with GPGPU and CUDA. NVIDIA may not have made a GK110 with all 15 SMXs enabled, and the two very closely performing products are certainly an interesting byproduct of yields, but they hit the mark on total power consumption, performance per watt, and features to be the leading accelerated compute card.