AMD Bulldozer FX-8150
When evaluating design ideas for the next-generation x86 processor core, AMD engineers looked at ways of optimizing core power and area. Analyzing the bursty nature of today’s PC applications led them to look for a way to maximize peak bandwidth across the different cores, and to make the most of silicon area by building the chip from shared modules.
Below we see the Bulldozer FX-8150 chip, which looks much like past chips. The die measures 315 mm², with a transistor count of about 2 billion.
The result was a design built from dual-core building blocks that optimize the use of resources within the processor. Functions with high utilization (such as the integer pipelines and Level 1 data caches) are dedicated to each core. The other units are shared between the two cores: fetch, decode, the floating-point pipelines, and the Level 2 cache. This design lets each of the two cores use a larger, higher-performance functional unit (e.g. the floating-point unit) when it needs it, with less total die area than giving each core its own copy.
“Bulldozer” was designed to balance performance, cost and power consumption on multi-threaded applications. The architecture focuses on high frequency and resource sharing to achieve high throughput and speed in next-generation applications. AMD FX processors offer up to eight high-performance, power-efficient cores. These represent the first generation of a new execution-core family from AMD (Family 15h). Other specs include:
- 128 KB of Level 1 data cache, 16 KB/core, 64-byte cacheline, 4-way associative, write-through
- 8 MB of Level 2 cache, 2 MB/“Bulldozer” module, 64-byte cacheline, 16-way associative
Integrated Northbridge which controls:
- 8 MB of Level 3 cache, 64-byte cacheline, 16-way associative, MOESI
- Two 72-bit wide DDR3 memory channels
- Four 16-bit receive/16-bit transmit HyperTransport™ links
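The cache parameters listed above fix the number of sets in each cache, since sets = capacity / (line size × associativity). The following is a minimal sketch of that arithmetic, not a description of the physical layout: the function and table names are made up for illustration, and it assumes each cache behaves as a single set-associative structure (the L3 in particular may be split into sub-caches on the actual die).

```python
KIB = 1024
MIB = 1024 * KIB


def cache_sets(size_bytes: int, line_bytes: int, ways: int) -> int:
    """Number of sets in a set-associative cache of the given geometry."""
    lines, remainder = divmod(size_bytes, line_bytes)
    if remainder or lines % ways:
        raise ValueError("size must be a multiple of line_bytes * ways")
    return lines // ways


# (name, capacity of one cache instance, line size, associativity),
# taken from the spec list in the article.
CACHES = [
    ("L1 data (per core)", 16 * KIB, 64, 4),
    ("L2 (per module)", 2 * MIB, 64, 16),
    ("L3 (shared)", 8 * MIB, 64, 16),
]


def geometry() -> dict:
    """Map each cache name to its number of sets."""
    return {name: cache_sets(size, line, ways) for name, size, line, ways in CACHES}


if __name__ == "__main__":
    for name, sets in geometry().items():
        print(f"{name}: {sets} sets")
```

The totals in the list follow the same way: eight cores at 16 KB each give 128 KB of L1 data cache, and four modules at 2 MB each give 8 MB of L2.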
The floating-point unit in Bulldozer has also undergone a complete redesign. It supports many new instructions and has been reworked so that its resources can be shared between cores. There are two 128-bit FMACs shared per module, allowing for two 128-bit instructions per core or one 256-bit instruction per dual-core module.
The new floating-point unit is at its best on newer benchmarks, where it can execute 128-bit instructions quickly and accelerate FMA and XOP operations. Applications using older floating-point instructions are typically unable to take advantage of the full performance of the unit, which is optimized for the newer FMAC instructions.
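To put the FMAC arrangement in numbers, here is a rough sketch of the peak floating-point throughput it implies. The helper names are hypothetical, the four-module count follows from eight cores in dual-core modules, and the 3.6 GHz clock is an assumption that is not stated in the text above. It counts a fused multiply-add as two operations per vector lane and ignores everything that keeps real code below peak.

```python
def lanes(vector_bits: int, element_bits: int) -> int:
    """How many elements fit in one vector register."""
    return vector_bits // element_bits


def peak_flops_per_cycle(fmacs: int = 2, vector_bits: int = 128,
                         element_bits: int = 32, modules: int = 1) -> int:
    """Peak FLOPs per cycle, counting each FMA as a multiply plus an add."""
    return fmacs * lanes(vector_bits, element_bits) * 2 * modules


def peak_gflops(clock_ghz: float, **kwargs) -> float:
    """Peak GFLOPS at an assumed clock (the clock is a caller-supplied guess)."""
    return peak_flops_per_cycle(**kwargs) * clock_ghz


if __name__ == "__main__":
    # Two 128-bit FMACs working independently on single precision.
    print("per module, 2 x 128-bit:", peak_flops_per_cycle())
    # The same hardware ganged together for one 256-bit instruction.
    print("per module, 1 x 256-bit:", peak_flops_per_cycle(fmacs=1, vector_bits=256))
    # Whole chip: 4 modules, assumed 3.6 GHz clock.
    print("chip, single precision GFLOPS:", peak_gflops(3.6, modules=4))
    print("chip, double precision GFLOPS:", peak_gflops(3.6, modules=4, element_bits=64))
```

Under these assumptions the two modes have the same per-module peak, which is the point of the shared design: one 256-bit instruction simply occupies both 128-bit FMACs at once.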
Bulldozer “Front End”
The front-end unit is responsible for driving the processing pipeline, and was designed to make sure that the cores are constantly fed with instructions. It serves each dual-core module as a whole and steers each thread's instructions to its own core. AMD has made substantial changes here, including decoupled predict and fetch pipelines and prediction-directed instruction prefetchers. A prediction queue manages direct and indirect branches, fed by L1 and L2 Branch Target Buffers that store branch destination addresses.
“Bulldozer” modules can decode up to 4 instructions per cycle (vs. 3 on AMD Phenom™ II processors). The prediction pipeline produces a sequence of fetch addresses. The fetch pipeline looks these up in the instruction cache and pulls 32 bytes per cycle into the fetch queue, which feeds the decoders.
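The two front-end figures above (32 bytes fetched and up to 4 instructions decoded per cycle) can be related with a little arithmetic. The sketch below is a back-of-the-envelope model with made-up function names: it assumes a steady average instruction length and ignores alignment, taken branches, cache misses, and the fact that two cores share the front end.

```python
FETCH_BYTES_PER_CYCLE = 32  # from the article
DECODE_WIDTH = 4            # instructions per cycle, from the article


def fetch_supply(avg_instr_bytes: float) -> float:
    """Instructions per cycle that fetch can deliver at a given average length."""
    if avg_instr_bytes <= 0:
        raise ValueError("average instruction length must be positive")
    return FETCH_BYTES_PER_CYCLE / avg_instr_bytes


def front_end_rate(avg_instr_bytes: float) -> float:
    """Sustained instructions per cycle: the slower of fetch and decode."""
    return min(DECODE_WIDTH, fetch_supply(avg_instr_bytes))


def break_even_instr_bytes() -> float:
    """Average length above which fetch, not decode, becomes the limit."""
    return FETCH_BYTES_PER_CYCLE / DECODE_WIDTH


if __name__ == "__main__":
    print("break-even average length:", break_even_instr_bytes(), "bytes")
    for length in (3, 4, 8, 10):  # hypothetical average lengths
        print(length, "bytes ->", front_end_rate(length), "instr/cycle")
```

Since x86 instructions range from 1 to 15 bytes, this suggests fetch bandwidth only becomes the limit for code dominated by unusually long encodings.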
“Bulldozer” uses a physical register file (PRF), a single location that holds the register results of executed instructions. This reduces power by eliminating unnecessary data movement and data replication (it keeps one copy instead of broadcasting the data).
Each core is equipped with a 16 KB Level 1 data cache, a 32-entry fully associative data TLB, and a fully out-of-order load/store unit capable of two 128-bit loads per cycle or one 128-bit store per cycle. Each dual-core module includes a 2 MB 16-way unified L2 cache and a 1024-entry, 8-way L2 TLB that services both instruction and data requests. Bulldozer supports up to 23 outstanding L2 cache misses for memory system concurrency.
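The idea behind a physical register file can be illustrated with a toy model. This is only a sketch of the general technique, not AMD's implementation: the class, its method names, and the register counts are all hypothetical. Each result is written exactly once into a physical register, and both speculative execution and retirement just update small maps that point at it.

```python
from collections import deque


class ToyPRF:
    """Toy physical register file with speculative and retired rename maps."""

    def __init__(self, arch_regs, num_phys):
        if num_phys <= len(arch_regs):
            raise ValueError("need more physical than architectural registers")
        self.values = [0] * num_phys
        # Both maps translate an architectural name to a physical index.
        self.spec_map = {reg: i for i, reg in enumerate(arch_regs)}
        self.retired_map = dict(self.spec_map)
        self.free = deque(range(len(arch_regs), num_phys))
        self.value_writes = 0  # counts how often result data is actually stored

    def execute(self, dest, value):
        """Write a result once into a fresh physical register and return its tag."""
        if not self.free:
            raise RuntimeError("no free physical register (a real core would stall)")
        tag = self.free.popleft()
        self.values[tag] = value
        self.value_writes += 1
        self.spec_map[dest] = tag
        return tag

    def read(self, reg):
        """Read the newest (speculative) value of an architectural register."""
        return self.values[self.spec_map[reg]]

    def retire(self, dest, tag):
        """Commit a result: only a pointer changes, no data is copied."""
        previous = self.retired_map[dest]
        self.retired_map[dest] = tag
        self.free.append(previous)  # the older committed copy can be reused


if __name__ == "__main__":
    prf = ToyPRF(["rax", "rbx"], num_phys=6)
    first = prf.execute("rax", 10)
    second = prf.execute("rax", 20)
    prf.retire("rax", first)
    prf.retire("rax", second)
    print("rax =", prf.read("rax"), "| data writes:", prf.value_writes)
```

In designs that copy results from a reorder buffer into a separate architectural register file, each value is stored at least twice. Here retirement moves no data at all, which is the saving the paragraph above describes.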
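Two of the figures above lend themselves to quick estimates. By Little's law, the number of outstanding misses bounds the bandwidth that can be sustained at a given memory latency, and the number of TLB entries bounds how much memory can be touched without a TLB miss. The sketch below assumes 64-byte lines (from the cache specs), 4 KB pages, and purely hypothetical latencies; the function names are illustrative, and none of the results are measurements.

```python
LINE_BYTES = 64                  # cache line size from the spec list
MAX_OUTSTANDING_L2_MISSES = 23   # from the article
L1_DTLB_ENTRIES = 32             # from the article


def miss_limited_bandwidth_gb_s(latency_ns: float,
                                outstanding: int = MAX_OUTSTANDING_L2_MISSES,
                                line_bytes: int = LINE_BYTES) -> float:
    """Little's law: bytes in flight divided by latency (bytes/ns equals GB/s)."""
    if latency_ns <= 0:
        raise ValueError("latency must be positive")
    return outstanding * line_bytes / latency_ns


def tlb_reach_bytes(entries: int, page_bytes: int = 4096) -> int:
    """Memory covered by a TLB when every entry maps one page."""
    return entries * page_bytes


if __name__ == "__main__":
    for latency in (50, 60, 80):  # hypothetical memory latencies in ns
        bw = miss_limited_bandwidth_gb_s(latency)
        print(f"{latency} ns latency -> at most {bw:.1f} GB/s from miss concurrency")
    reach = tlb_reach_bytes(L1_DTLB_ENTRIES)
    print("L1 data TLB reach with 4 KB pages:", reach // 1024, "KB")
```

The takeaway is directional rather than exact: more outstanding misses raise the ceiling on achievable bandwidth, which is why the text singles out memory system concurrency.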
Let’s take a look at the improved Turbo Core technology.