AMD Bulldozer - FX 8150 Performance Review

Product: AMD FX 8150 / Asus Crosshair V
Company: AMD
Author: James Prior
Editor: Charles Oliver
Date: October 11th, 2011

Bulldozer Architecture

The basic building block of the AMD FX processor is the Bulldozer module. The fundamental premise was to share latency-tolerant functionality between cores and to improve performance by using dedicated hardware where needed, an approach similar to Chip Multi-Processing (CMP). The resulting module comprises two x86 cores, with some neat shared, dynamically allocated resources. In the AMD FX processor, four modules are combined with the north bridge, memory controller, HyperTransport interface and miscellaneous I/O buses to make a single die which, in desktop product form, is codenamed Zambezi. The die contains around 2bn transistors and measures around 315mm², manufactured on GlobalFoundries' 32nm SOI process.

The front end, consisting of the decoupled instruction fetch and decode stages, is shared, but this doesn't necessarily make it a single-stage bottleneck. There are four complex x86 decoders, with a two-level instruction TLB and prefetcher. The front end can switch every cycle, and is decoupled into different pipelines to reduce timing and stall problems. Prediction-directed instruction prefetch into the 64KB L1 instruction cache is intended to enable very accurate and power-efficient operation. When an instruction cache miss occurs, the queue is checked for future misses, which are then fetched into L2, reducing latency.

The integer side of the module comprises two fully independent out-of-order x86 cores, supporting AMD64 natively. There is also support for moveless copy through PRF-based (physical register file) register renaming, an important power- and latency-saving measure. Each core has a unified scheduler, with an out-of-order load/store unit capable of two 128-bit loads per cycle or a single 128-bit store per cycle.

The floating point unit features a unified scheduler for the dual 128-bit fused multiply accumulate (FMAC) pipes and dual packed integer pipes (x87 and SIMD: MMX, SSE). In previous architectures we've seen the floating point unit as a 128-bit unit accompanying the integer core, and that's still the case - there are two 128-bit FMACs, one for each core. AMD has a very interesting approach to AVX: the scheduler breaks a 256-bit AVX instruction down into two 128-bit operations, which are processed by the two FMAC pipes. Execution and/or exception status is reported to the parent integer core, from whose instruction stream the FP instructions derived, and that core handles the reporting and instruction retirement. Further, the 128-bit FPU for each integer core can be subdivided for dual 64-bit or quad 32-bit concurrent operation, and each integer core can request one or both FP units if they are free. If one thread is integer heavy and the other float heavy, both FMACs can be dispatched from the 'float heavy' thread. If you're optimizing for the hardware, this could be a way to get lots of FP performance with decent integer performance from a Bulldozer architecture product.
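
The 256-bit split described above can be pictured with a small Python sketch. It models a 256-bit AVX add over eight 32-bit lanes as two independent 128-bit operations, one per FMAC pipe. The function names and the list-based lanes are illustrative assumptions, not a description of how AMD's scheduler is actually built.

```python
def op_128(a, b):
    """One 128-bit pipe: element-wise add over four 32-bit lanes."""
    assert len(a) == len(b) == 4
    return [x + y for x, y in zip(a, b)]


def avx_add_256(a, b):
    """Model a 256-bit add as two 128-bit halves, one per pipe."""
    assert len(a) == len(b) == 8
    low = op_128(a[:4], b[:4])    # lower 128 bits, first FMAC pipe
    high = op_128(a[4:], b[4:])   # upper 128 bits, second FMAC pipe
    return low + high             # recombine into one 256-bit result


result = avx_add_256([1.0] * 8, [2.0] * 8)
print(result)
```

The point of the sketch is that a single 256-bit instruction occupies both 128-bit pipes at once, which is why a module gains nothing from both of its cores issuing 256-bit work at the same moment.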

While each integer core features a dedicated L1 data cache and separate TLB, the integer and FP cores share a 2MB 16-way L2 cache, which is write-through from L1. The L2 is server optimized, with a 1024-entry 8-way page walker which completes both instruction- and data-side requests. There are in fact two walkers, which can be accessed concurrently, and can operate independently out to system RAM when addressing interleaved memory locations. The cache arrangement has changed because of how the cores are brought together in a module as the basic building block. L1 instruction cache is 64KB per module, with 16KB of L1 data cache per core. L2 is unified for both cores in the module, and is 2MB per module. L3 is shared across all modules but is non-contiguous on the die: each 2MB block is a 16-way cache, and together the blocks operate as a single 64-way 8MB cache. Windows 8 will be required to take full advantage of the dual-core module nature of Bulldozer for the most efficient thread scheduling, and we hear talk of the need for existing kernels (Windows, Linux) to be patched to reduce L1 cache thrashing caused by scheduling problems and thread migration, which may increase performance by ~5% for certain workloads.
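
To put the per-module and per-core figures above into whole-chip terms, here is a short arithmetic sketch in Python. It uses only the sizes quoted in the text; the count of four L3 blocks is an assumption inferred from 2MB blocks adding up to 8MB, and the names are made up for illustration.

```python
KB = 1024
MB = 1024 * KB

MODULES = 4            # four Bulldozer modules per Zambezi die
CORES_PER_MODULE = 2   # two integer cores per module

L3_BLOCKS = 4          # assumption: four 2MB blocks, implied by 8MB total
L3_BLOCK_WAYS = 16     # each block is a 16-way cache


def cache_totals():
    """Return total bytes per cache level for the whole die."""
    return {
        "L1I": MODULES * 64 * KB,                      # 64KB per module
        "L1D": MODULES * CORES_PER_MODULE * 16 * KB,   # 16KB per core
        "L2": MODULES * 2 * MB,                        # 2MB per module
        "L3": L3_BLOCKS * 2 * MB,                      # shared by all modules
    }


totals = cache_totals()
l3_ways = L3_BLOCKS * L3_BLOCK_WAYS  # blocks combine into one wider cache

for level, size in totals.items():
    print(f"{level}: {size // KB} KB")
print(f"L3 effective associativity: {l3_ways}-way")
```

Under these figures an eight-core part carries 256KB of L1 instruction cache, 128KB of L1 data cache, 8MB of L2 and 8MB of L3.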

New Instructions

In 2008, a new instruction set for hardware acceleration of AES was proposed. It was first adopted in Intel's products in the family codenamed Westmere, and most but not all of the Gulftown, Clarkdale, Arrandale and Sandy Bridge processors launched since then have AES support. Bulldozer brings AES support to AMD's processors and adds AVX, as well as SSE4.1 and SSE4.2. AMD's AVX implementation also includes its own additional instructions, known as XOP, originally destined to be part of SSE5. AMD also extends AVX with FMA4 support: instructions for fused multiply add (e.g. x + y*z), where the 4 denotes the number of operands used. Intel's competing FMA3, which is specified for a future architecture rather than shipping in Sandy Bridge, uses 3 operands - meaning one of the source registers is overwritten by the result. Bulldozer's FMA4 doesn't trash the source data, instead placing the result in a fourth register. FMA is largely irrelevant to desktop applications at the moment but very notable for server and high performance compute - having both the original source and resultant output data available for immediate reuse can maintain throughput in sequential and repetitive code loops, where FMA is most likely to be used. Companies optimizing for FMA4 should see gains over FMA3 wherever the source values are reused, thanks to its non-destructive nature. The next Bulldozer core revision, Piledriver (samples of which are already taped out and working), will support both 3- and 4-operand variants.
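
The destructive versus non-destructive difference is easier to see with a toy register file. The Python sketch below is a simplified model only: the register names, the dictionary-based register file and the operand ordering are assumptions for illustration, and the real instructions come in several operand-order variants that are not modeled here.

```python
def fma3(regs, dst_src, b, c):
    """3-operand form: the first source doubles as the destination."""
    regs[dst_src] = regs[dst_src] * regs[b] + regs[c]


def fma4(regs, dst, a, b, c):
    """4-operand form: the result goes to a separate destination."""
    regs[dst] = regs[a] * regs[b] + regs[c]


def initial_regs():
    """Hypothetical register contents used for both runs."""
    return {"r0": 2.0, "r1": 3.0, "r2": 4.0, "r3": 0.0}


regs3 = initial_regs()
fma3(regs3, "r0", "r1", "r2")        # r0 = r0*r1 + r2, old r0 is lost

regs4 = initial_regs()
fma4(regs4, "r3", "r0", "r1", "r2")  # r3 = r0*r1 + r2, r0 is preserved

print("FMA3:", regs3)
print("FMA4:", regs4)
```

In the 3-operand run the original value in r0 is gone after the operation, so code that needs it again must first copy it elsewhere. In the 4-operand run all three sources survive.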

Power and Clock Speed Adjustments

For the desktop, AMD has three interesting technologies for managing power and clocks, only one of which you will interact with - AMD Turbo Core. Turbo Core allows the processor to dynamically increase the frequency of the running cores, on a per-core basis, within the TDP limit of the processor. By analyzing the currently executing workload, the processor can increase performance without using additional, unbudgeted power. This self-overclocking mechanism is an evolution of what we've seen before in the Phenom II X6 and A-series APUs. The Turbo Core system on Bulldozer can operate on all cores or half of them, depending on workload. The CPU's rated base frequency has a natural TDP buffer built in, which allows either a Max Turbo mode on half the cores or an all-core Turbo mode. For thermally insignificant periods of time, the processor can exceed its TDP rating to boost performance.

The behind-the-scenes parts are clock gating and power gating, mechanisms by which areas of silicon that aren't needed for the current operation have their power use reduced or are turned off. Like Llano, Bulldozer supports the C6 sleep state, which allows modules to go 'dark' - no power to them, even while the processor is in use. Clock gating is performed inside the module, to keep cores from using power when they are unused, and when both cores are unused and the module enters C6 sleep, the L2 cache is flushed to L3.

AMD has made some changes to the floor clock speed, moving from a clock multi of x4 and 800MHz to x7 and 1400MHz. This jump allows the processor to be more responsive and offer better performance at lower utilization rates. Intermediate clock states between the floor clock speed and normal full clock speed are used at 0.5 multiplier increments, even inside modules - one core may run at x9 multi, the next at x9.5 - or the clock rates can be quite different, one at x9 and the other at x16.5. The Turbo Core increase can be up to 600MHz over the base clock frequency, on 4 cores in an 8-core chip.
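
The multiplier figures above map to clock speeds by simple multiplication. The Python sketch below assumes a 200MHz reference clock, which is what x4 giving 800MHz and x7 giving 1400MHz imply; the function name and the validation of half-step multipliers are illustrative only.

```python
REF_CLOCK_MHZ = 200  # assumption: implied by x4 -> 800MHz, x7 -> 1400MHz


def core_clock_mhz(multiplier):
    """Return the core clock in MHz for a multiplier in 0.5 steps."""
    if (multiplier * 2) != int(multiplier * 2):
        raise ValueError("multipliers move in 0.5 increments")
    return multiplier * REF_CLOCK_MHZ


for multi in (4, 7, 9, 9.5, 16.5):
    print(f"x{multi}: {core_clock_mhz(multi):.0f}MHz")

# A 600MHz Turbo Core boost expressed in whole multiplier steps
turbo_steps = 600 / REF_CLOCK_MHZ
print(f"600MHz boost = +{turbo_steps:.0f} on the multiplier")
```

On this assumption, the x9 and x9.5 example corresponds to neighbouring cores at 1800MHz and 1900MHz, and the x16.5 example to 3300MHz.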

The memory controller and Northbridge are on a separate clock and power domain from the modules, and feature DRAM power management and APM support. Like Llano, Bulldozer supports up to PC3-14900 (DDR3-1866) in a dual-channel configuration and PC3-12800 (DDR3-1600) for four-stick configurations.

Each desktop Zambezi CPU is equipped with four HyperTransport links, but only one is connected, running at 5.2GT/s (2.6GHz). The Valencia and Interlagos brethren for the Opteron product line have more enabled: 3 HT links on the Opteron 4200 series (Valencia) and 4 HT links on the Opteron 6200 series (Interlagos - which uses some of the HT links on each die for interchip communication, as it is a multi-chip module consisting of two dies in a single package). Back in desktop land, the Northbridge sees a little clock speed jump, now running at 2.2GHz, up from 2.0GHz in the STARS-based processors.
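
The relationship between the 2.6GHz link clock and the 5.2GT/s figure comes from transferring data on both clock edges. The Python sketch below works that through and also estimates per-direction bandwidth. The 16-bit link width is an assumption not stated in the text, and the function names are hypothetical.

```python
def ht_transfer_rate_gt(clock_ghz, transfers_per_clock=2):
    """Transfers per second (GT/s) for a double-data-rate link."""
    return clock_ghz * transfers_per_clock


def ht_bandwidth_gb_s(rate_gt, link_width_bits=16):
    """Per-direction bandwidth in GB/s; the 16-bit width is assumed."""
    return rate_gt * link_width_bits / 8


rate = ht_transfer_rate_gt(2.6)
bandwidth = ht_bandwidth_gb_s(rate)

print(f"{rate:.1f} GT/s")
print(f"{bandwidth:.1f} GB/s per direction (assuming a 16-bit link)")
```

If the link is narrower or wider than assumed, the bandwidth scales in proportion while the GT/s figure stays the same.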