AMD Radeon HD 7970 Launch Review

Product: AMD Radeon HD 7970 Video Card
Company: AMD
Author: James Prior
Editor: Charles Oliver
Date: December 24th, 2011

Southern Islands: Tahiti

The first details of the new Graphics Core Next (GCN) architecture were presented at AFDS '11. There, two of the main GPU architects, Mike Mantor and Mike Houston, sprayed the knowledge firehose about the compute capabilities and advantages GCN would offer. As a nice aside, we got some time with AMD's Graphics CTO, Eric Demers; you can read our interview with him here, plus our follow-up interview about Southern Islands here.

A lot of important numbers have changed on the 7970 vs. the 6970. Thanks to 28nm process technology, AMD can cram 4.3Bn transistors into a die smaller than the Cayman ASIC. That transistor budget was mainly used to implement 2048 stream processors and a huge, fast, 384-bit memory bus. There are also a bunch of bigger, more functional data stores and caches. The new GCN cores are clocked higher, there are more of them, and there are more texture units too. 3GB of frame buffer seems almost like an aside after that. The memory is clocked at 1375MHz for 5.5Gbps QDR operation, but the chips are actually rated for 6Gbps - the lower clock speed reduces power a little and improves thermals while leaving overclocking headroom (1500MHz, at least, should be possible).
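
The relationship between the memory clock and the quoted data rate is simple arithmetic. As a quick sketch in Python (the helper name is ours, purely for illustration), QDR operation moves four bits per pin per memory clock:

```python
# Illustrative arithmetic only; effective_gbps is a hypothetical helper name.
TRANSFERS_PER_CLOCK = 4  # QDR operation, as described above


def effective_gbps(memory_clock_mhz):
    """Per-pin data rate in Gbps for a given memory clock in MHz."""
    return memory_clock_mhz * TRANSFERS_PER_CLOCK / 1000


stock = effective_gbps(1375)  # shipping memory clock
rated = effective_gbps(1500)  # clock matching the 6Gbps rating
headroom_pct = (1500 - 1375) / 1375 * 100

print(f"stock: {stock} Gbps, rated: {rated} Gbps, headroom: {headroom_pct:.1f}%")
```

In other words, running the memory at its rated speed would be roughly a 9% overclock over stock, before any further headroom the chips may have.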

The basis of Graphics Core Next is power efficiency for both graphics performance and compute. This is achieved by removing the very long instruction word (VLIW) model used for the TeraScale series of architectures, and replacing it with vector units paired with a scalar co-processor. The result is the GCN Compute Unit (CU), designed for high utilization, throughput and multi-tasking - it is capable of executing instructions from multiple kernels at once.

A compute unit consists of four vector units, known together as a GCN Quad SIMD, each a SIMD-16 with the same capabilities as a VLIW-4 architecture SIMD. Each vector unit has its own 64KB register file, and the scalar unit gets a 4KB register file too. The scalar and vector units are fed by a dedicated scheduler, which has an accompanying vector and branch unit. For filtering, each GCN CU features four filter units and sixteen load/store units, with a 16KB read/write L1 cache and a 64KB local data share. Double precision Fused Multiply Add (FMA) is performed at 1/4 the rate of single precision (947 GFLOPS vs. 3.79 TFLOPS theoretical throughput).
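
The theoretical throughput figures can be reproduced from the unit counts. The sketch below is ours, not AMD's; it assumes the reference card's 925MHz engine clock (not stated above) and counts an FMA as two floating point operations:

```python
# Unit counts come from the article; CORE_CLOCK_MHZ is our assumption
# (the reference card's 925MHz engine clock).
COMPUTE_UNITS = 32
SIMDS_PER_CU = 4
LANES_PER_SIMD = 16
CORE_CLOCK_MHZ = 925
FLOPS_PER_FMA = 2  # one multiply plus one add
DP_RATE_DIVISOR = 4  # double precision runs at 1/4 rate

stream_processors = COMPUTE_UNITS * SIMDS_PER_CU * LANES_PER_SIMD
sp_gflops = stream_processors * FLOPS_PER_FMA * CORE_CLOCK_MHZ / 1000
dp_gflops = sp_gflops / DP_RATE_DIVISOR

# Four vector units per CU, each with its own 64KB register file.
vector_registers_per_cu_kb = SIMDS_PER_CU * 64

print(f"{stream_processors} stream processors")
print(f"single precision: {sp_gflops:.1f} GFLOPS")
print(f"double precision: {dp_gflops:.1f} GFLOPS")
print(f"vector registers per CU: {vector_registers_per_cu_kb}KB")
```

Under those assumptions the numbers line up with the 2048 stream processors, 3.79 TFLOPS and 947 GFLOPS quoted above.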

The CU's SIMDs are different from VLIW4 SIMDs in certain ways. GCN is designed to be easier to schedule and optimize for, and simpler to analyze and debug. This leads to simpler tools to work with it, hopefully making for more stable and predictable performance. These changes are primarily aimed at developers looking to leverage the compute functionality of the parts this architecture powers, rather than specifically at graphics performance, although the current trend of moving post processing and image quality effects into compute shaders will benefit as well. The key capabilities remain the same, but performance is now occupancy limited instead of dependency limited, with the aim of increasing throughput and efficiency.
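
To make "dependency limited" concrete, here is a toy model of a four-wide VLIW packer. It is our own simplification, not AMD's shader compiler: a greedy scheduler fills each bundle with up to four instructions whose inputs were produced in earlier bundles, and we measure how many slots end up doing useful work. All names are hypothetical.

```python
def vliw_bundles(deps, width=4):
    """Greedy packing. deps[i] is the set of instruction indices that
    instruction i depends on; returns a list of bundles (lists of indices)."""
    done = set()
    remaining = list(range(len(deps)))
    bundles = []
    while remaining:
        # Only instructions whose inputs finished in earlier bundles are ready.
        ready = [i for i in remaining if deps[i] <= done][:width]
        if not ready:
            raise ValueError("cyclic dependencies")
        bundles.append(ready)
        done.update(ready)
        remaining = [i for i in remaining if i not in ready]
    return bundles


def slot_utilization(deps, width=4):
    """Fraction of VLIW slots that hold a real instruction."""
    bundles = vliw_bundles(deps, width)
    return len(deps) / (width * len(bundles))


# Eight instructions where each one needs the previous result.
chain = [set()] + [{i - 1} for i in range(1, 8)]
# Eight instructions with no dependencies at all.
independent = [set() for _ in range(8)]

print(slot_utilization(chain))
print(slot_utilization(independent))
```

In this toy model the fully dependent chain fills only a quarter of the slots, while independent work fills all of them. A GCN SIMD instead issues one instruction at a time across its lanes, each lane working on a different item, so keeping it busy is a matter of having enough work resident rather than finding independent instructions within a single thread.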

The instruction cache and the scalar data cache (32KB and 16KB, respectively, with the latter serving the scalar processor) are shared between groups of four CUs and backed by L2. While each CU has its own registers and local data share, there is a global data share that is accessible by all CUs. The L1 transfers 64B/clock for each CU, and is used for read and write operations. Similarly, L2 is a read/write cache and connected at 64B/clock. This is a major change from Cayman and Cypress, where these operations had to go out to board RAM rather than on-chip cache. There are twelve L2 cache partitions, each 64KB, for a total of 768KB.
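
For a sense of scale, the on-chip storage can be totalled up from the per-unit figures. This is simple arithmetic on the numbers quoted in this article (reading the local data share as 64 kilobytes per CU), not an official breakdown:

```python
# Per-unit figures as quoted in the article.
COMPUTE_UNITS = 32
L1_KB_PER_CU = 16
LDS_KB_PER_CU = 64  # local data share, read as kilobytes
CUS_PER_SHARED_GROUP = 4  # CUs sharing instruction and scalar caches
L2_PARTITIONS = 12
L2_KB_PER_PARTITION = 64

l1_total_kb = COMPUTE_UNITS * L1_KB_PER_CU
lds_total_kb = COMPUTE_UNITS * LDS_KB_PER_CU
shared_cache_groups = COMPUTE_UNITS // CUS_PER_SHARED_GROUP
l2_total_kb = L2_PARTITIONS * L2_KB_PER_PARTITION

print(f"L1 total: {l1_total_kb}KB across {COMPUTE_UNITS} CUs")
print(f"LDS total: {lds_total_kb}KB")
print(f"groups sharing instruction/scalar caches: {shared_cache_groups}")
print(f"L2 total: {l2_total_kb}KB")
```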

The 32 Compute Units are fed by dual geometry engines, similar in concept to Cypress and Cayman, and like Cayman the chip is capable of a 2 triangles/clock setup rate. There are eight render back-ends (RBEs), which process 32 color ROPs per clock and 128 Z/Stencil ROPs per clock (a rate unchanged from Cypress or Cayman). The six 64-bit memory controllers combine to make a 384-bit memory bus offering up to 264GB/s of bandwidth; they are capable of write and read combining, plus independent use by different CUs. Combined with the new L1/L2 cache abilities, this enables Partially Resident Texture caching (PRT, part of DX11.1), which lets much higher quality textures be streamed from GPU memory to the cache on demand while being filtered or processed, reducing stuttering. Hopefully, future game engines will leverage this feature quickly, and do a better job of it than id's RAGE OpenGL implementation.
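
The 264GB/s figure follows directly from the bus width and the per-pin data rate. A minimal sketch of that calculation, with a hypothetical helper name of our own choosing:

```python
MEMORY_CONTROLLERS = 6
BITS_PER_CONTROLLER = 64

bus_width_bits = MEMORY_CONTROLLERS * BITS_PER_CONTROLLER


def bandwidth_gb_s(data_rate_gbps):
    """Peak memory bandwidth in GB/s for a given per-pin data rate in Gbps."""
    return bus_width_bits * data_rate_gbps / 8  # 8 bits per byte


stock_bandwidth = bandwidth_gb_s(5.5)  # shipping 5.5Gbps data rate
rated_bandwidth = bandwidth_gb_s(6.0)  # if the memory ran at its 6Gbps rating

print(f"{bus_width_bits}-bit bus: {stock_bandwidth} GB/s stock, "
      f"{rated_bandwidth} GB/s at 6Gbps")
```

These are peak theoretical numbers; real-world throughput will be lower.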

GCN introduces the 9th generation tessellation unit to the geometry engine so, like Cayman, there are two tessellation units in the front end. A combination of improvements gives the chip from 1.5 times to up to four times the throughput of the previous-generation Cayman chip, depending on tessellation level.

For use in compute applications, the Southern Islands GCN Tahiti design includes dual Asynchronous Compute Engines (ACE) for the independent scheduling and dispatch of work items, necessary for efficient multi-tasking. This allows compute workloads to operate in parallel with graphics workloads, and facilitates fast context switching so that workloads whose demands exceed the chip's concurrency abilities can still be given the resources they need. Although the card features PCI-Express 3.0, which doubles interface bandwidth from 8GB/s to 16GB/s, plus support for numerous data and protocol commands, the internal dual DMA engines can push enough data through the bus to saturate that bandwidth.
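
The 8GB/s and 16GB/s figures are rounded per-direction numbers for an x16 slot. A rough sketch of where they come from, using the standard transfer rates and line encodings for each PCI-Express generation (the helper name is ours):

```python
LANES = 16  # a full x16 graphics slot


def pcie_gb_s(gigatransfers_per_s, payload_bits, encoded_bits, lanes=LANES):
    """Per-direction bandwidth in GB/s after line-encoding overhead."""
    return gigatransfers_per_s * payload_bits / encoded_bits / 8 * lanes


# PCI-Express 2.0: 5 GT/s per lane with 8b/10b encoding.
gen2 = pcie_gb_s(5.0, 8, 10)
# PCI-Express 3.0: 8 GT/s per lane with 128b/130b encoding.
gen3 = pcie_gb_s(8.0, 128, 130)

print(f"PCIe 2.0 x16: {gen2:.2f} GB/s per direction")
print(f"PCIe 3.0 x16: {gen3:.2f} GB/s per direction")
```

The doubling comes mostly from the far more efficient encoding rather than from raw signalling speed, which is why the 3.0 figure lands just under 16GB/s.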

The chip supports ECC protection for the memory, both SRAM and DRAM. This is offered in addition to the CRC protection first discussed with Cypress, which is part of the GDDR5 specification. While memory level ECC won't be enabled for consumer products, you'll see it in FirePro and FireStream products where ECC is a selling point, and not just because of high alpha or neutron radiation levels. The chip also supports OpenCL 1.2, OpenGL 4.2, and DirectCompute 11.1, and the new C++ AMP ratified this year.

DirectX 11.1 extends the capabilities of DirectX 11, primarily adding stereoscopic 3D support. There are a lot of features added, technically described here. The interesting ones include the ability to use logical operations rather than blending in a render target, shader video processing and Direct3D device sharing. These new features are designed to give developers easier access to resources and simpler programming. Combined with OpenCL 1.2 and C++ AMP support, this is becoming a robust platform for vector compute execution.

Many features seen on Cypress and Cayman have been incremented and improved with Tahiti. In its reference design, the new card features two DisplayPort 1.2 outputs, an HDMI 1.4 output and a DVI-DL output. All four outputs can be used with Eyefinity, with the standard requirement that only two outputs can be 'legacy' DVI/HDMI without an adapter. In the box, AMD will be including an active DisplayPort to DVI adapter.