AMD Bulldozer is the latest generation of AMD processors.

AMD took a completely different approach with the new Bulldozer architecture. It builds the processor from dual-core modules whose cores share some resources (the L2 cache and the floating-point unit) and are therefore not completely independent of each other (see picture below).
According to AMD, this was done to optimize the processor and, at the same time, to reduce its price. The optimization rests on the observation that some units in conventional multi-core processors often sit idle; in the Bulldozer architecture such units are shared between two cores instead of being duplicated. Fewer units mean less silicon, which in turn has a positive effect on cost, power consumption and heat output.
Therefore, although AMD will call these Bulldozer modules dual-core, they are not truly dual-core, because the cores are not completely independent. The name "dual-core processor" will be used for marketing purposes.

To create "quad-core" processors, AMD uses two of these modules, so the processor actually contains two "processors" (the two building blocks shown in the image below) rather than four. AMD will nevertheless continue to call such processors quad-core.


Eight-core processor based on Bulldozer architecture.

Now let's take a closer look at the Fetch and Decode modules used in the Bulldozer architecture.

Fetch and Decode modules

The Fetch module is responsible for fetching instructions for decoding from the cache or random access memory.

Fetch and Decode modules.

As already noted, the fetch unit is shared by two "cores" at once. The L1 instruction cache is likewise shared by the two cores, but each core has its own L1 data cache.
AMD has already announced that the L1 instruction cache in the Bulldozer architecture is a 64 KB two-way set-associative cache. The same configuration is used in processors with the AMD64 architecture; the difference is that AMD64 processors have one L1 instruction cache per core, while Bulldozer processors will have one per pair of cores. The L1 data cache, however, will be only 16 KB, significantly less than the 64 KB per core in AMD64-based processors.

TLBs (translation lookaside buffers). The sizes of the TLBs have also been revealed. These are small, very fast buffers that cache translations of virtual memory addresses into physical addresses.
Virtual memory, which many users know through the page file, is a technique in which each program works with its own virtual addresses and the apparent amount of RAM can be "increased" by a special file on the hard drive.
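
To make the role of a TLB more concrete, here is a minimal sketch of a translation cache placed in front of a page table. Everything in it is an assumption made for illustration: the 4 KB page size, the four-entry capacity, the least-recently-used eviction and the name TinyTLB have nothing to do with the actual Bulldozer TLB sizes, which are not given here.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # bytes; a common x86 page size, assumed for this sketch


class TinyTLB:
    """Toy fully associative TLB with least-recently-used eviction."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table   # virtual page number -> physical frame number
        self.capacity = capacity       # hypothetical number of TLB entries
        self.entries = OrderedDict()   # cached translations, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)              # mark as recently used
        else:
            self.misses += 1
            self.entries[vpn] = self.page_table[vpn]   # stands in for a slow page-table walk
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)       # evict the least recently used entry
        return self.entries[vpn] * PAGE_SIZE + offset


tlb = TinyTLB({0: 7, 1: 3})
print(hex(tlb.translate(0x0010)), hex(tlb.translate(0x1004)))
```

The second access to an already translated page is served from the buffer without touching the page table, which is the whole point of a TLB.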

Programs are written using x86 instructions, but modern processors internally execute only their own RISC-like microinstructions. The decode module is responsible for converting x86 instructions into these microinstructions. The Bulldozer architecture has four decoders, but at the moment AMD has not disclosed which instructions each decoder handles. Typically, one of the decoders handles complex instructions using microcode stored in ROM ("µcode"). Decoding a complex instruction takes several clock cycles and produces several microinstructions. Manufacturers typically optimize their processors so that the most common instructions are decoded in a single clock cycle.

Introduction

There is no doubt that AMD's new processors, based on the Bulldozer microarchitecture, are among the most anticipated products not only of this year, but of at least the past five years. There are several reasons for this, as there are for the existence of a huge army of AMD fans. Some have fresh memories of the times when this company's processors were better than Intel's in every respect. Some love AMD products for their balanced combination of price and performance. And some were impressed by AMD's emotional stories about the advantages of the microarchitecture under development. All of this added up to many years of tedious waiting for Bulldozer-generation processors, and here is the result: you are reading this article with great attention and undisguised interest.

However, it is clearly worth it. The situation on the processor market over the next few years depends on how successful the Bulldozer microarchitecture turns out to be. After all, only Intel has the engineering and production resources to roll out new microarchitectures every two to three years. AMD is forced to keep to a much more measured pace. It is scary to recall, but the microarchitecture used in today's Phenom II and Athlon II processors dates back to 1999, and since then AMD has made only cosmetic changes to it. We therefore have no illusions that the development cycle will suddenly speed up with the release of Bulldozer. It is obvious that Bulldozer will be at the core of AMD's high-performance offerings for the next few years.

In their current version, the company's plans for the development of this microarchitecture run until 2014, but it will almost certainly live on beyond that.

The fact that AMD promises a 10-15 percent increase in performance every year is more of an alarming symptom than an encouraging one. Most likely, such an increase will be provided primarily by an increase in clock frequencies, and only then by some new microarchitectural improvements.

In other words, the success of the Bulldozer microarchitecture in its current form will have a decisive impact on the future position of AMD, on the competitiveness of its products, and ultimately on the overall situation in the processor market.

Of course, Bulldozer is not AMD's only key product. This microarchitecture is currently aimed at the high-performance desktop and server segments, and AMD has other offerings for other market segments. For example, the cheap, economical processors with the Bobcat microarchitecture and the APUs of the Llano family, released earlier this year, are no less important for the company. As our tests have shown, these are successful products that serve well both in netbooks and nettops and as the basis for integrated mid-range platforms.

However, the success or failure of Bulldozer has much more significant implications. First, this microarchitecture targets market segments with much higher profit margins: servers and high-performance desktop systems. It can therefore have a much stronger impact on AMD's financial condition. Second, the success of AMD's C-, E- and A-series processors is, frankly, not at all the merit of the engineers who design the processor cores. The market success of these CPUs (or APUs, in AMD's terminology) stems from their Radeon HD graphics cores, which found their way into AMD processors thanks to the timely purchase of ATI. Bulldozer is a kind of qualifying exam for the engineering team working specifically on the microarchitecture of the computing cores. And third, Bulldozer will eventually become the basis of AMD's entire processor line, with the exception of solutions for energy-efficient platforms. Ultimately, this microarchitecture will move down into lower market segments, displacing K10 almost everywhere, including in the Llano processors.



In short, it is hard to overestimate the importance of a successful launch of processors with the Bulldozer microarchitecture. This is a landmark product on both an emotional and a material level. We would therefore very much like to see, figuratively speaking, a new K7 or K8.

But even before testing, we can say that the chances of such a repeat are small. Last time, Intel itself helped AMD take the lead by pushing the far from ideal NetBurst microarchitecture. Intel's engineers then focused on raising clock speeds, which eventually ran into the obstacle of enormous leakage currents, while AMD offered a more balanced microarchitecture aimed at executing more instructions per clock cycle. But after Intel revised its doctrine and introduced the new Core microarchitecture, also aimed at executing the maximum number of instructions per clock cycle, AMD fell back into the position of a laggard, where it has remained until now.

It is obvious that modern Intel processors are very hard to surpass in the number of instructions executed per clock cycle. Today's Sandy Bridge microarchitecture is the result of at least three optimization cycles of an inherently efficient design, so we cannot expect even higher per-core efficiency from AMD. Moreover, AMD's engineers did not even set themselves such a goal.

The main idea of Bulldozer lies elsewhere. According to the developers, processors built on this microarchitecture should perform well thanks to high clock speeds and a greater number of computing cores than their competitors and predecessors. At the same time, they should remain economical to manufacture, that is, they should not have too large a die or too high a heat output per core.

AMD Multi-Core Design Secrets

It is quite clear that increasing the number of processor cores inevitably increases the die area. As a result, both the complexity of production and the cost of the final products go up. That is why processors with the maximum number of cores are used today only in the server segment: corporate customers are much more willing to pay than individual users. The course chosen by AMD, increasing the number of cores while keeping the cost of the resulting processors acceptable, would normally have to be paired with a simplification of the cores themselves. Simplifying the cores, however, has an undesirable side effect: a drop in performance in applications with poorly parallelized workloads, of which there are still plenty.

Therefore, AMD's engineers went their own way. The microarchitecture of the individual cores has become more complex, increasing the number of instructions executed per clock where possible.



However, some of the resources that are usually present in every core, but are more powerful than a single core needs, were made shared between pairs of cores.



The resulting dual-core assembly became the basic building block of Bulldozer processors. Such a unit, called a module in AMD's terminology, has two complete sets of integer execution units. The floating-point unit, the instruction fetch and decode logic and the second-level cache, however, exist in a single copy per pair of cores and divide their resources between them. According to the developers' estimates, the capacity of these elements is quite sufficient for two cores, since in real workloads they are often idle when serving a single core. In addition, delays in their operation do not seriously affect the resulting performance.

According to AMD, one dual-core module designed in this way can deliver up to 80% of the performance of a full-fledged dual-core processor, while the savings in the transistor budget (and, accordingly, in die area) reach 44%.

Thanks to this ingenious compaction of cores, AMD was able to make an eight-core (four-module) configuration the basic design of the Bulldozer die.



Moreover, a fairly significant part of the die is given over to cache memory. The second-level caches, shared between the pair of cores within each module, have a capacity of 2 MB each, and the L3 cache shared by the entire processor is 8 MB. Thus, given AMD's traditional exclusive cache organization, the total capacity is 16 MB per eight-core processor. At the same time, the area of the Bulldozer die remains within acceptable limits, so AMD's developers have fully achieved their goal.
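
The 16 MB figure follows from simple addition, because with an exclusive organization the L2 and L3 caches hold different data and their capacities add up. A minimal sketch of the arithmetic (the function name is ours, chosen for illustration):

```python
def total_cache_mb(modules, l2_per_module_mb, l3_mb):
    """Total L2 + L3 capacity, assuming exclusive caches whose contents do not overlap."""
    return modules * l2_per_module_mb + l3_mb


# Four modules with 2 MB of L2 each, plus 8 MB of shared L3
print(total_cache_mb(modules=4, l2_per_module_mb=2, l3_mb=8))  # 16
```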



In absolute terms, this means that eight-core Bulldozers will have a smaller die than, for example, six-core Thuban processors (Phenom II X6) built on the K10 microarchitecture. It should be borne in mind, however, that Bulldozer will be manufactured on a more advanced 32 nm process. Compared with modern quad-core Intel Sandy Bridge processors, AMD's new eight-core processors will have only 45% more die area.

However, quad-core Sandy Bridge processors, thanks to Hyper-Threading, can also present themselves to the operating system as eight-core processors, just like Bulldozer. This will certainly give rise to disputes about whether it is legitimate to call Bulldozer a full-fledged eight-core processor. It should be understood, though, that AMD and Intel arrived at the simultaneous execution of eight threads in different ways. Intel's developers bolted additional features onto their microarchitecture that allow two threads to run inside one core, on a single set of execution units. AMD, on the contrary, cut the "extra" parts out of two full-fledged cores, but left two sets of execution units inside each module.



As a result, Intel's Hyper-Threading technology increases multi-threaded performance by only 15-20%, while AMD's solution gives an 80% increase in performance when moving from 4 to 8 threads.
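
To put these percentages side by side, the sketch below expresses eight-thread throughput in units of one single-threaded core (for Intel) or one single-threaded module (for AMD). It uses only the gains quoted above and deliberately says nothing about absolute performance, since the two baselines are different processors; the function name is our own.

```python
def relative_throughput(units, second_thread_gain):
    """Throughput of `units` cores or modules running two threads each,
    measured in multiples of one such unit running a single thread."""
    return units * (1 + second_thread_gain)


ht_low = relative_throughput(4, 0.15)      # Hyper-Threading, lower estimate
ht_high = relative_throughput(4, 0.20)     # Hyper-Threading, upper estimate
bulldozer = relative_throughput(4, 0.80)   # second core of each module

print(f"Hyper-Threading: {ht_low:.1f}-{ht_high:.1f}x, Bulldozer modules: {bulldozer:.1f}x")
```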

Although, of course, the die of the eight-core Bulldozer, because of its modular structure, really does look very much like a quad-core one.


More instructions per cycle?

Increasing the number of processor cores alone will not get you far. This became clear after the release of the six-core Phenom II X6 processors, which are generally slower than quad-core Sandy Bridge. AMD's developers therefore did not limit themselves to purely quantitative changes. Compared with K10, the basic microarchitecture of Bulldozer has been redesigned almost completely, which gives hope for faster AMD-based systems not only in multi-threaded tasks but also in applications with little parallelism. Moreover, these hopes rest on entirely objective grounds. While previous AMD microarchitectures were designed to execute three instructions per clock (per core), the Bulldozer microarchitecture is designed for four instructions per clock, which brings it closer in this respect to competing processors with the Core microarchitecture.

Qualitative changes can be seen from the very first stages of the execution pipeline: instruction fetch and decode. These stages are shared by the pair of cores within a module, so AMD took special care that they do not become a microarchitectural bottleneck. Instructions are fetched from the L1I cache for decoding in blocks of 32 bytes, twice as large as in processors with the second-generation Core microarchitecture. The first-level instruction cache itself has a capacity of 64 KB and is two-way set-associative. Instructions to be decoded are loaded into it from the second-level cache in advance.

The branch prediction unit, which is directly involved in the fetch process, contains two sets of buffers that independently track the activity of the two cores. Thus, when predicting branch outcomes, Bulldozer does not confuse the threads. Since the new microarchitecture is designed to operate at high clock speeds, the quality of the branch prediction unit is of the utmost importance. Its algorithms have therefore been completely redesigned, and AMD hopes that the efficiency of Bulldozer's branch prediction will improve.



Bulldozer's x86 instruction decoder also divides its resources between the two cores and can decode up to four incoming instructions per clock cycle. However, its output is limited to four macro-instructions per clock (the result of decoding, in AMD's terms), while a single x86 instruction can be split into one, two or even more macro-instructions. Thus, although the decoder is a third faster than in the previous generation of the microarchitecture, its speed may not be enough, given that it has to feed two integer clusters and one floating-point cluster.

It should be noted that Bulldozer also uses an analogue of macro-op fusion. Some groups of x86 instructions can be combined and passed through the decoder as a single instruction; AMD calls this Branch Fusion.

The decoded macro-instructions are distributed among three computing clusters: two are what remains of the full-fledged computing cores, and one is the floating-point cluster shared between the cores. Each cluster has its own instruction reordering logic and its own scheduler. This evidently means that AMD retains the ability to replace or supplement some of these clusters in future products.

Instruction reordering in each cluster is based on a physical register file, which stores references to the register contents and eliminates the need to constantly move data within the processor when the order of instructions is changed. This approach has replaced the reorder buffer, since the physical register file is not only more power-efficient but also more amenable to higher clock speeds.
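
The idea of working with references instead of copying values can be illustrated with a toy register renamer. This is only a sketch under our own assumptions: the class name RenameTable, the register names and the tiny register file are invented for the example, and the real Bulldozer logic is far more involved (in particular, real hardware frees an old physical register only when the overwriting instruction retires).

```python
class RenameTable:
    """Toy renamer: architectural register names point at slots of a physical register file."""

    def __init__(self, arch_regs, num_physical):
        self.prf = [0] * num_physical                           # values are written once and never moved
        self.mapping = {r: i for i, r in enumerate(arch_regs)}  # architectural name -> physical slot
        self.free = list(range(len(arch_regs), num_physical))   # unallocated physical slots

    def read(self, reg):
        return self.prf[self.mapping[reg]]

    def write(self, reg, value):
        new_slot = self.free.pop(0)    # allocate a fresh slot (real hardware would stall if none is free)
        old_slot = self.mapping[reg]
        self.prf[new_slot] = value
        self.mapping[reg] = new_slot   # only the reference changes, no data is copied
        self.free.append(old_slot)     # simplification: freed immediately instead of at retirement
        return new_slot


table = RenameTable(["rax", "rbx"], num_physical=4)
table.write("rax", 5)
table.write("rax", 9)
print(table.read("rax"), table.mapping["rax"])
```

Each write to the same architectural register lands in a different physical slot, so two in-flight versions of a register can coexist and instructions can be reordered without shuffling values around.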

Each integer cluster contains two arithmetic logic units (ALUs) and two address generation units (AGUs). Compared with the K10 microarchitecture, that is one ALU and one AGU fewer, but AMD assures us that this will not noticeably reduce performance while saving significant core area. We readily believe that more than two ALUs and AGUs per integer cluster really make no practical sense, because the decoder can deliver no more than four macro-instructions per clock cycle for both clusters combined.



At the same time, the execution units have become more universal; they hardly differ from one another in function.

The organization of the cache subsystem has changed considerably. The L1D cache has been reduced from 64 to 16 KB and has become write-through and inclusive. At the same time, its associativity has been increased to four ways, and a way predictor has been added. The smaller first-level data cache is compensated for by a significant increase in its throughput: it can now service up to three 128-bit operations simultaneously, two reads and one write.
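
The geometry of such a cache is easy to work out. The sketch below computes the number of sets in a 16 KB four-way cache, the set an address falls into, and the peak number of bytes moved per clock by three 128-bit operations. The 64-byte line size is our assumption; it is not stated in the text, and the function names are illustrative.

```python
LINE_BYTES = 64  # assumed cache line size; not given in the article


def cache_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


def set_index(address, size_bytes, ways, line_bytes=LINE_BYTES):
    """Set that a given address maps to."""
    return (address // line_bytes) % cache_sets(size_bytes, ways, line_bytes)


def peak_bytes_per_clock(operations, width_bits):
    """Bytes transferred per clock if every port is busy."""
    return operations * width_bits // 8


print(cache_sets(16 * 1024, ways=4))          # 64
print(set_index(0x1040, 16 * 1024, ways=4))   # 1
print(peak_bytes_per_clock(3, 128))           # 48
```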

Obviously, the changes in L1D bandwidth are largely related to the need to implement 256-bit AVX instructions, support for which has appeared in the FPU shared between the cores. However, this does not mean that the floating-point execution units have become 256-bit. In fact, the Bulldozer module has two 128-bit units, and AVX instructions are decoded as linked pairs of 128-bit instructions. To execute them, the two FMAC (floating-point multiply-accumulate) units are combined, and the throughput of the floating-point cluster drops to one AVX instruction per module per clock cycle.
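
Conceptually, executing a 256-bit operation as a linked pair of 128-bit halves looks like the sketch below. It models a vector as a list of eight single-precision values and uses a simple element-wise addition as the operation; the function names are ours, and nothing here reflects the actual internal encoding of the halves.

```python
def split_256(vec8):
    """Split a 256-bit vector (eight 32-bit lanes) into two 128-bit halves."""
    if len(vec8) != 8:
        raise ValueError("expected eight 32-bit lanes")
    return vec8[:4], vec8[4:]


def add_128(a, b):
    """One 128-bit operation: element-wise addition of four lanes."""
    return [x + y for x, y in zip(a, b)]


def avx_add_256(a, b):
    """A 256-bit addition carried out as two 128-bit operations, one per 128-bit unit."""
    a_lo, a_hi = split_256(a)
    b_lo, b_hi = split_256(b)
    return add_128(a_lo, b_lo) + add_128(a_hi, b_hi)


print(avx_add_256(list(range(8)), [10] * 8))
```

Since both 128-bit units are occupied by the two halves, only one such 256-bit operation can complete per module per clock, which matches the limitation described above.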



The FPU does not have its own first-level cache, so this cluster works with data through the integer clusters.

Since AMD's engineers had already taken on support for the AVX instructions proposed by Intel, other current instruction sets were added to Bulldozer as well: SSE4.2 and the AES-NI instructions aimed at accelerating encryption. In addition, AMD introduced some instructions of its own: FMA4, a fused multiply-add with three source operands, and XOP, its own vision of how AVX should develop further.



The L2 cache in Bulldozer is shared between the two cores of a module. Its capacity is an impressive 2 MB, and it is 16-way set-associative. However, the latency of this cache has grown to 18-20 cycles, while the bus width has remained the same as before, 128 bits. This means that Bulldozer's L2 cache, although large, is not very fast; competing and previous processors offer L2 caches with roughly half the latency. Coupled with a small L1D cache with a latency of 4 cycles (which is also more than in the K10 microarchitecture), none of this looks very encouraging. AMD claims, however, that cache latency was increased solely to allow Bulldozer to operate at high clock speeds.
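
How much the L2 latency matters depends on how often the L1D cache misses. The sketch below uses the textbook average-access-time formula with the latencies quoted above; the hit rates and the 200-cycle memory latency are purely hypothetical values chosen for illustration, and the L3 cache is ignored for simplicity.

```python
def average_access_cycles(l1_latency, l2_latency, memory_latency, l1_hit_rate, l2_hit_rate):
    """Average memory access time for two cache levels in front of memory (L3 ignored)."""
    l1_miss_cost = l2_latency + (1 - l2_hit_rate) * memory_latency
    return l1_latency + (1 - l1_hit_rate) * l1_miss_cost


# Hypothetical workload: 90% L1D hits, 80% L2 hits, 200 cycles to memory
slow_l2 = average_access_cycles(4, 20, 200, l1_hit_rate=0.9, l2_hit_rate=0.8)
fast_l2 = average_access_cycles(4, 10, 200, l1_hit_rate=0.9, l2_hit_rate=0.8)
print(round(slow_l2, 2), round(fast_l2, 2))
```

Under these made-up assumptions, halving the L2 latency shortens the average access by about one cycle; with a lower L1D hit rate, which is plausible for a 16 KB cache, the penalty of a slow L2 grows proportionally.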



In addition, AMD's engineers have implemented efficient prefetch units designed to load the necessary data into the first- and second-level caches ahead of time. Their performance is said to have been improved, and they are now even able to recognize irregular data structures.

In theory, Bulldozer makes a good impression. AMD has completely revised its old approach to processor microarchitecture and implemented a thoroughly redesigned core. At first glance this looks very promising, because the new microarchitecture is optimized for executing four, rather than three, instructions per clock cycle on one core. In addition, it supports fusion of instructions during decoding, which further increases per-clock performance.

But everything looks this good only as long as we look at a single core and forget that in reality such cores are combined in pairs. The dual-core Bulldozer module has too many parts shared by its pair of cores. In particular, because a module has only one instruction fetch unit and one decoder, the maximum number of instructions executed per clock cycle remains four for the entire dual-core assembly. This means that, in terms of theoretical performance, the logical equivalent of a single Sandy Bridge core is the Bulldozer module, not the Bulldozer core. The module's ability to execute two threads then looks like AMD's entirely logical response to Hyper-Threading.
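
The argument reduces to counting front ends rather than cores. In the sketch below, the width of four instructions per clock is the figure from the text; treating a Sandy Bridge core as having the same width follows the comparison made in this article and is an assumption of the example, as are the function names.

```python
FRONT_END_WIDTH = 4  # instructions per clock per fetch/decode front end (figure from the text)


def peak_ipc(front_ends, width=FRONT_END_WIDTH):
    """Theoretical peak instructions per clock for the whole processor."""
    return front_ends * width


def per_thread_share(width=FRONT_END_WIDTH, threads_per_front_end=2):
    """Front-end bandwidth left for each thread when both threads are busy."""
    return width / threads_per_front_end


bulldozer = peak_ipc(front_ends=4)      # four modules, eight "cores"
sandy_bridge = peak_ipc(front_ends=4)   # four cores, assumed equally wide
print(bulldozer, sandy_bridge, per_thread_share())
```

Both processors come out at the same theoretical peak, and with both cores of a module loaded each thread gets on average half of the module's front end.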

Of course, our testing of real processors will put everything in its place, but even at the stage of examining the microarchitecture we are inclined to think that positioning Bulldozer as a full-fledged eight-core processor is a marketing ploy. A more reliable assessment of the computing capabilities of these processors should be based on the number of modules, which in terms of theoretical performance are entirely comparable to cores built on the second-generation Intel Core microarchitecture.

This raises an entirely logical question: why did AMD bother implementing dual-threaded processing within a single module at all? Why not combine the execution units distributed across the two cores into a single cluster? There are several reasons.

First, keeping a large number of execution units busy at the same time generally requires advanced logic inside the processor. AMD was evidently unable to implement sufficiently efficient branch prediction and instruction and data prefetch units in the Bulldozer microarchitecture. The task of parallelizing work and making better use of the execution units is therefore shifted to software developers, who must supply Bulldozer with multi-threaded products.

Second, increasing the number of concurrently executed threads is no bad thing. While eight fairly simple Bulldozer cores promise no particular advantages for desktop users, and especially gamers, such a microarchitecture should be received very favorably in server applications. So it is quite possible that the main goal in developing Bulldozer was not to satisfy the aspirations of enthusiasts but to restore AMD's position in the server market.

Turbo Core even more Turbo

Energy efficiency is one of the most important characteristics of modern processors. Intel, for example, puts reducing power consumption almost first in its future microarchitectures. AMD has not yet reached this point; its engineers are fighting primarily for performance. But this does not mean that the developers paid no attention to the thermal and power characteristics of Bulldozer. On the contrary, following Llano, fundamentally new approaches to energy efficiency have found their way into Bulldozer processors. In this case, however, the engineers used the freed-up headroom not so much to save power as to squeeze out additional performance through higher clock frequencies.

Of course, the new manufacturing technology has brought certain improvements in power consumption and heat dissipation. Bulldozer uses a 32 nm process with high-k dielectrics, metal-gate transistors and SOI technology. In other words, it is the same GlobalFoundries process used to manufacture Llano processors. Thanks to the new 32 nm technology, the operating supply voltages of production eight-core Bulldozer processors do not exceed 1.4 V.

However, the main innovation carried over from Llano to Bulldozer is power-gating transistors, designed to cut power to certain parts of the processor. In Bulldozer, they make it possible to remove voltage independently from individual dual-core modules and from the cache memory.



When both cores in a module enter the C6 power-saving state, the module is powered down. Unfortunately, this technique cannot be applied to individual cores, since there are simply no fully separate cores inside Bulldozer: they share some resources with their neighbor in the module.

In Bulldozer, Turbo Core technology is also tied to the C6 power-saving states of the cores. Whenever at least half of the processor's modules are powered down, the processor raises its supply voltage and clock frequency. This boosted operating mode is called Max Turbo Boost.

However, Max Turbo Boost is nothing new; AMD introduced this kind of automatic overclocking in the Thuban processors built on the K10 microarchitecture. What is really new is the All Core Boost mode, in which the clock speed can rise above the nominal value even when all processor cores are active. The improved version of Turbo Core introduced in Bulldozer allows the processor to estimate its actual power consumption and heat dissipation with good accuracy, based on information about the load on individual blocks. Accordingly, if this estimate shows that current heat dissipation and power consumption are significantly below the limit, the processor can raise its supply voltage and clock frequency even when no core is idle.
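
The decision logic described here can be summarized in a few lines. The sketch below is our own reading of the text, not AMD's algorithm: the function names, the representation of core states and the 10 W headroom that stands in for "significantly below the limit" are all hypothetical.

```python
def gated_modules(core_in_c6):
    """Count modules that can be powered down.

    core_in_c6 holds one (core0, core1) pair of booleans per module;
    a module is powered down only when both of its cores are in C6.
    """
    return sum(1 for core0, core1 in core_in_c6 if core0 and core1)


def turbo_mode(core_in_c6, estimated_power_w, power_limit_w, headroom_w=10.0):
    """Pick an operating mode from the sleep states and the estimated power draw."""
    if gated_modules(core_in_c6) * 2 >= len(core_in_c6):
        return "Max Turbo Boost"   # at least half of the modules are powered down
    if power_limit_w - estimated_power_w >= headroom_w:
        return "All Core Boost"    # cores are busy, but there is power headroom
    return "nominal"


half_asleep = [(True, True), (True, True), (False, False), (True, False)]
all_busy = [(False, False)] * 4
print(turbo_mode(half_asleep, 60, 125))
print(turbo_mode(all_busy, 100, 125))
print(turbo_mode(all_busy, 120, 125))
```

Note that the fourth module in the first example stays powered because only one of its two cores is asleep, which is exactly the limitation of module-level power gating.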



Thus, the operating frequency of processors with the Bulldozer microarchitecture is a highly variable quantity. Depending on how demanding the running code is and how many cores are in use, it can change dynamically over a very wide range, by as much as 900 MHz.

Updated desktop platform

With the introduction of the new microarchitecture, AMD not only left the platform design unchanged but even kept Bulldozer processors compatible with the existing infrastructure. Accordingly, just like their predecessors, the new processors contain an integrated northbridge comprising the third-level cache, the memory controller and the HyperTransport bus controller. And although all other recently released AMD and Intel processors also have a built-in PCI Express graphics bus controller, Bulldozer does not.



Just as in processors built on the K10 microarchitecture, the integrated northbridge in Bulldozer runs at its own clock frequency, which is set to 2.0-2.2 GHz depending on the model. Note that this frequency has a certain impact on performance, since it directly affects the speed of the L3 cache, which in the current version of the processors has been enlarged to 8 MB and is 64-way set-associative. To meet the wishes of enterprise users, the data stored in this cache is protected by an ECC error-correcting code.

The memory controller built into Bulldozer has no fundamentally new capabilities. As before, it supports DDR3 SDRAM, uses a dual-channel design and in fact consists of two independent single-channel controllers that can operate in either ganged or unganged mode. AMD has only added support for faster memory, declaring compatibility with DDR3-1867, and taken care of compatibility with energy-efficient modules operating at 1.25 and 1.35 V.
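
For reference, the theoretical peak bandwidth of such a configuration is straightforward to estimate. The sketch relies on the fact that a DDR3 channel is 64 bits (8 bytes) wide; the function name is ours, and real-world throughput is always lower than this peak.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, channels=2, channel_bytes=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return mega_transfers_per_s * channel_bytes * channels / 1000.0


# Dual-channel DDR3-1867, as declared for Bulldozer
print(f"{peak_bandwidth_gb_s(1867):.1f} GB/s")
```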

As for the desktop version of Bulldozer, code-named Zambezi, it should be noted that it targets the new Socket AM3+ platform, also known under the code name Scorpius. The AM3+ processor socket has 942 pins, one more than Socket AM3. Despite this, Zambezi remains compatible with older Socket AM3 boards. When new processors are installed in old motherboards, only certain power management functions are lost: frequency switching under Turbo Core and Cool'n'Quiet becomes slower, and Vdrop does not work.

However, by the time Zambezi was released, AMD and the motherboard manufacturers had prepared a whole range of new products based on the new 900-series chipsets. The structure of a typical system based on the Zambezi processor and the new chipset is shown in the block diagram below.


The differences between the new AMD 990FX chipset (and its simplified versions, AMD 990X and AMD 970) and its predecessors lie solely in support for the specific electrical properties of Socket AM3+; no new interfaces are introduced. As with the 800-series chipsets, the new southbridge offers six SATA 6 Gbps ports and fourteen USB 2.0 ports. However much we would like to see support for the PCI Express 3.0 specification or, at the very least, USB 3.0 ports in the new chipsets, there is nothing of the kind this time either. This, by the way, is very strange, because USB 3.0 support was introduced in the chipsets for the lower-end Socket FM1 platform.

The differences between the members of the new chipset series consist solely in the multi-GPU configurations they support.


Zambezi processor range

The release of the Zambezi processors completes the renewal of AMD's product range. Desktop processors based on the Bulldozer microarchitecture will become this manufacturer's new flagship offering and will quickly displace the various Phenom II models from the market.

Emphasizing the innovation of the new microarchitecture, AMD will use a new marketing name for Zambezi desktop processors - FX. On the one hand, it fits perfectly into the new nomenclature, which involves marking processors with letters, and on the other hand, it is a reference to the legendary Athlon 64 FX processors, which six or seven years ago were the fastest desktop CPUs. However, those days are irrevocably gone, so let's see what AMD is ready to offer us now.

In the near future, the range of FX series processors will include four models.



Despite the fact that the difference between Zambezi processor models is not only in clock speeds, but also in the number of active computing cores, they will all be based on the same unified semiconductor chip. Here it is:



To obtain processors with fewer than eight cores, AMD will disable some of them on the die. Whether they can be unlocked again, as was possible with K10 processors, is still an open question. However, the BIOSes of the 900-series motherboards that have passed through our laboratory do contain the corresponding options, so there is hope for a favorable outcome.

Disabling cores to obtain six-core and quad-core processor modifications will occur “module by module”. That is, it will be the entire dual-core modules that will be blocked, and not the “second” cores inside them, although such a tactic would be much more beneficial in terms of performance. However, the release of six-core and quad-core processors built on the Bulldozer microarchitecture is explained not so much by marketing considerations as by the need to implement rejection, which, given the rather large dimensions of the chip and the new technological process, will be quite a lot.

Despite the fact that AMD has been sharpening the new microarchitecture to operate at high clock frequencies, we cannot yet call the achieved values ​​an impressive breakthrough. The four-gigahertz barrier remains unconquered, and the nominal frequency of the older FX processor is even lower than, for example, the Phenom II X4 980. We would like to hope that with the improvement of production technology, Zambezi frequencies will quickly go up. Although, if you believe the current version of AMD’s plans, the line will be accelerated no earlier than the first quarter of 2012.

There is no breakthrough in terms of heat release and energy consumption. AMD has long talked about how the Bulldozer microarchitecture will be more energy efficient, but in fact the older eight-core models have the same TDP level as the older Phenom II. True, after some time the company should add to its offerings a 95-watt version of the FX-8120 and an FX-8100 processor with the same calculated heat dissipation.

But the prices of the new FX series processors look more than attractive. AMD does not want to deviate from its course of offering platforms at a more favorable price than its competitors, so the top eight-core Zambezi models are positioned against the top Intel Core i5 processors. In general, AMD plans to adhere to the following positioning scheme for its products:



In other words, AMD does not intend to compete with Intel's six-core processors and the upcoming LGA2011 platform, but wants to focus on conquering the mid-price segment.

The good news for enthusiasts is that no multipliers are locked on any FX series processor. Every Zambezi can be overclocked simply by changing the base multiplier, and the Turbo Core multipliers can be reconfigured in the same way. Overclocking of the memory subsystem and of the north bridge built into the processor is also available.

Test processor: AMD FX-8150

AMD sent our editors the top processor of the Zambezi family, the FX-8150.



It has a nominal clock speed of 3.6 GHz; more detailed information about its characteristics can be found in the accompanying CPU-Z screenshot.



Note that the processor is based on the B2 stepping, which is not the first revision. Earlier revisions of the die were rejected by the manufacturer because they could not operate at the originally planned clock frequencies. This is what delayed the announcement, which was initially planned for the spring, then for the summer, but actually took place in mid-October.

Even so, the 3.6 GHz frequency achieved today does not look too impressive. Both AMD itself and Intel have products that run at higher clock speeds. However, the FX-8150 has the very promising Turbo Core technology, which, under low load, can automatically raise the processor frequency to as much as 4.2 GHz.



Notably, a frequency of 3.9 GHz can be reached even when all computing cores are loaded, provided the load leaves room for automatic overclocking within the power consumption and heat dissipation limits.
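
The Turbo Core behavior described above can be illustrated with a small toy model. This is only a sketch built from the three frequencies quoted in the text (3.6, 3.9 and 4.2 GHz); the function names, the `has_headroom` flag and the rule that "low load" means at most half of the cores are busy are assumptions made for illustration, not AMD's actual control algorithm.

```python
# Frequencies quoted in the review for the FX-8150.
BASE_GHZ = 3.6            # nominal clock
ALL_CORE_TURBO_GHZ = 3.9  # reachable even with all cores loaded
MAX_TURBO_GHZ = 4.2       # reachable under low load
TOTAL_CORES = 8


def turbo_frequency_ghz(active_cores, has_headroom):
    """Pick a clock for a hypothetical Turbo Core-like policy.

    has_headroom: True if the chip is below its power and thermal limits.
    """
    if not 1 <= active_cores <= TOTAL_CORES:
        raise ValueError("active_cores must be between 1 and 8")
    if not has_headroom:
        return BASE_GHZ
    # Assumption: "low load" means at most half of the cores are busy.
    if active_cores <= TOTAL_CORES // 2:
        return MAX_TURBO_GHZ
    return ALL_CORE_TURBO_GHZ


def boost_percent(freq_ghz):
    """Frequency gain over the nominal clock, in percent."""
    return (freq_ghz / BASE_GHZ - 1) * 100


if __name__ == "__main__":
    for cores, headroom in [(8, False), (8, True), (2, True)]:
        f = turbo_frequency_ghz(cores, headroom)
        print(f"{cores} active cores, headroom={headroom}: "
              f"{f} GHz (+{boost_percent(f):.1f}%)")
```

In this model the all-core turbo state is worth roughly 8% over the nominal clock and the maximum turbo state roughly 17%.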



When idle, Cool'n'Quiet technology reduces the FX-8150's frequency to 1.4 GHz. The supply voltage drops to 0.85 V.


How we tested

We compared the new eight-core AMD FX-8150 processor, built on the Bulldozer microarchitecture, with one of its predecessors, the six-core Phenom II X6, and with Intel offerings that compete with it on price: the quad-core Core i5-2500 and Core i7-2600. In addition, for a fuller picture, results for the six-core Core i7-990X processor have been added.

As a result, the test systems included the following software and hardware components:

Processors:

AMD FX-8150 (Zambezi, 8 cores, 3.6 GHz, 8 MB L2 + 8 MB L3);
AMD Phenom II X6 1100T (Thuban, 6 cores, 3.3 GHz, 3 MB L2 + 6 MB L3);
Intel Core i7-2600K (Sandy Bridge, 4 cores, 3.4 GHz, 1 MB L2 + 8 MB L3);
Intel Core i5-2500K (Sandy Bridge, 4 cores, 3.3 GHz, 1 MB L2 + 6 MB L3);
Intel Core i7-990X Extreme Edition (Gulftown, 6 cores, 3.46 GHz, 1.5 MB L2 + 12 MB L3).

CPU cooler: NZXT Havik 140;
Motherboards:

Gigabyte 990FXA-UD5 (Socket AM3+, AMD 990FX + SB950);
ASUS P8Z68-V PRO (LGA1155, Intel Z68 Express);
Gigabyte X58A-UD5 (LGA1366, Intel X58 Express).

Memory:

2 x 2 GB, DDR3-1600 SDRAM, 9-9-9-27 (Kingston KHX1600C8D3K2/4GX);
3 x 2 GB, DDR3-1600 SDRAM, 9-9-9-27 (Crucial BL3KIT25664TG1608).

Graphics card: AMD Radeon HD 6970.
Hard drive: Kingston SNVP325-S2/128GB.
Power supply: Tagan TG880-U33II (880 W).
Operating system: Microsoft Windows 7 SP1 Ultimate x64.
Drivers:

Intel Chipset Driver 9.2.0.1030;
Intel Management Engine Driver 7.1.10.1065;
Intel Rapid Storage Technology 10.6.0.1022;
AMD Catalyst 11.10 Display Driver.

Please note that testing was carried out under the current version of the Windows 7 operating system, but AMD points out that this OS's scheduler does not distribute computing threads in the most optimal way. Windows 7 prefers to send threads first to cores located in different modules. This really does provide higher per-core performance, since it reduces the load on the units shared inside a module. However, this strategy prevents the turbo modes from engaging, which the processor could use if some of the dual-core modules were in power-saving states.

The upcoming Windows 8 operating system will follow a different tactic: there, threads will be assigned first to cores within the same module. As a result, AMD promises that in a number of applications the performance of Zambezi-based systems can increase by up to 10%.
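
The difference between the two placement strategies can be shown with a toy model. The sketch below is not the real Windows scheduler; the function names and the "spread"/"pack" policy labels are made up for illustration. It only shows which (module, core) slots four threads would occupy on a four-module chip and how many modules stay completely idle, which is what matters for power-saving states and turbo modes.

```python
def assign_threads(n_threads, policy, n_modules=4, cores_per_module=2):
    """Return the (module, core) slot used by each thread.

    policy="spread": fill one core in every module first (Windows 7-like).
    policy="pack":   fill both cores of a module before moving on
                     (the behavior described for Windows 8).
    """
    if policy == "spread":
        order = [(m, c) for c in range(cores_per_module)
                 for m in range(n_modules)]
    elif policy == "pack":
        order = [(m, c) for m in range(n_modules)
                 for c in range(cores_per_module)]
    else:
        raise ValueError(f"unknown policy: {policy}")
    if n_threads > len(order):
        raise ValueError("more threads than cores")
    return order[:n_threads]


def idle_modules(assignment, n_modules=4):
    """Count modules that have no thread at all and could be power-gated."""
    busy = {module for module, _core in assignment}
    return n_modules - len(busy)


if __name__ == "__main__":
    for policy in ("spread", "pack"):
        slots = assign_threads(4, policy)
        print(policy, slots, "idle modules:", idle_modules(slots))
```

With four threads, the "spread" policy touches all four modules, so none can sleep, while the "pack" policy leaves two modules idle at the cost of sharing the front end and FPU inside the busy ones.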

Performance

Preliminary evaluation of the effectiveness of the Bulldozer microarchitecture

Before we started “real” testing of processors, we decided to figure out what we could expect from the Bulldozer microarchitecture in principle. To do this, we conducted a small comparison of a processor with this microarchitecture with other CPUs with K10 and Sandy Bridge microarchitectures under artificially created equal conditions: at the same clock frequency and with the same number of activated cores.

More specifically, we compared the AMD FX-8150, Phenom II X6 1100T and Core i7-2600 at 3.6 GHz with only two processing cores enabled. To keep the experiment clean, all power-saving and automatic overclocking technologies were, naturally, deactivated. As the testing tool we chose the set of simple synthetic benchmarks included in SiSoft Sandra 2011, in which we forcibly disabled all instruction sets newer than SSE3, since the K10 microarchitecture does not support them.



The numbers in the table speak louder than any words. The per-core performance of the Bulldozer microarchitecture is much lower than that of previous processors. Combining pairs of cores into one module with shared resources, together with the accompanying simplification of the microarchitecture, means that at the same frequency Bulldozer's per-core performance has dropped by 25-40% compared to the previous-generation AMD microarchitecture. As a result, Bulldozer cores are almost half as fast as Sandy Bridge cores. Moreover, the performance of a Bulldozer module, which includes two cores, is even lower than that of a single Sandy Bridge core with Hyper-Threading enabled. Should we expect performance records from a processor built on such a microarchitecture? The question is rhetorical.

Along the way, let's look at the practical characteristics of the caches and the memory subsystem. To evaluate the speed of these functional units, we ran tests in the Cachemem utility from the Aida64 package. In all cases, DDR3-1600 memory was used with 9-9-9-27-1T latencies. As in the previous case, processor frequencies were equalized at 3.6 GHz.



In Zambezi, compared to Phenom II processors, the practical latencies of all the caches and of the memory subsystem have increased. We talked about this when considering the Bulldozer microarchitecture. However, thanks to the changed logical organization of the cache memory, its throughput has increased in almost all cases.

At the same time, the fastest dual-channel memory controller and the fastest cache subsystem are those of Sandy Bridge. Although, of course, in terms of cache capacity the Intel processor is somewhat inferior to processors based on the Bulldozer microarchitecture.

Overall Performance

To evaluate processor performance in common tasks, we traditionally use the BAPCo SYSmark 2012 test, which simulates user work in common modern office programs and applications for creating and processing digital content. The idea of the test is very simple: it produces a single metric characterizing the weighted average speed of the computer in common applications.

Recall that some time ago AMD tried to discredit SYSmark, claiming that it was biased because it used the "wrong" set of real applications. In our opinion, however, such a judgment is not justified, since performance is evaluated with common and genuinely popular programs, the contribution of each of which to the final result is shown in the following diagram:



Therefore, we have not abandoned SYSmark 2012 and continue to use its metrics to evaluate overall performance.



The very first test brings disappointment. The result of the eight-core FX-8150 is only 10% better than that of the six-core Phenom II X6 1100T and, naturally, falls well short of the quad-core Intel processors. So the tactic chosen by AMD, implementing a large number of cores with low per-core performance instead of a moderate number of complex ones, does not, on the whole, give a positive result.

The scores obtained in the individual usage scenarios give a deeper understanding of the SYSmark 2012 results.

The Office Productivity scenario simulates typical office work: preparing text, processing spreadsheets, working with email and visiting Internet sites. The scenario uses the following set of applications: ABBYY FineReader Pro 10.0, Adobe Acrobat Pro 9, Adobe Flash Player 10.1, Microsoft Excel 2010, Microsoft Internet Explorer 9, Microsoft Outlook 2010, Microsoft PowerPoint 2010, Microsoft Word 2010 and WinZip Pro 14.5.



The Media Creation scenario simulates the creation of a commercial using pre-shot digital images and videos. For this purpose, popular Adobe packages are used: Photoshop CS5 Extended, Premiere Pro CS5 and After Effects CS5.



The Web Development scenario models the creation of a website. Applications used: Adobe Photoshop CS5 Extended, Adobe Premiere Pro CS5, Adobe Dreamweaver CS5, Mozilla Firefox 3.6.8 and Microsoft Internet Explorer 9.



The Data/Financial Analysis scenario is dedicated to statistical analysis and forecasting of market trends, which is performed in Microsoft Excel 2010.



The 3D Modeling scenario is entirely devoted to creating three-dimensional objects and rendering static and dynamic scenes using Adobe Photoshop CS5 Extended, Autodesk 3ds Max 2011, Autodesk AutoCAD 2011 and Google SketchUp Pro 8.



The last scenario, System Management, involves creating backups and installing software and updates. Several different versions of Mozilla Firefox Installer and WinZip Pro 14.5 are involved here.



Under different usage models, the processor with the Bulldozer microarchitecture shows fundamentally different results. In some cases it turns out to be even slower than the Phenom II X6, but the opposite also happens. The general rule is this: the advantage of the FX-8150 becomes especially noticeable where the workload is multi-threaded and well parallelized, but not computationally complex.

However, even in the most favorable situations, the FX-8150 lags behind the Core i5-2500. The only scenario where these processors are comparable in speed is 3D rendering. On average, Intel's offering is ahead of AMD's new product by an impressive 25%. Sad.

Gaming Performance

As you know, in the vast majority of modern games the performance of platforms equipped with high-performance processors is determined by the power of the graphics subsystem. That is why, when testing processors, we try to take as much load off the video card as possible: we select the most processor-dependent games and run the tests without anti-aliasing and at far from the highest resolutions. The results therefore show not so much the fps achievable in systems with modern video cards as how well the processors cope with a gaming load in principle. Based on these results, it is quite possible to judge how the processors will behave in the future, when faster graphics accelerators appear on the market.


Games are not among the tasks that generate a well-parallelized multi-threaded load. Therefore, today's gaming applications are better served by processors with four cores than by the multi-core monsters that AMD offers. The diagrams presented here clearly illustrate this. The new eight-core FX-8150 is no faster than its six-core predecessor, the Phenom II X6.

As for the gaming performance of Zambezi relative to Sandy Bridge, the picture is even more pessimistic for AMD's new product. The current Intel microarchitecture handles the typical workload generated by 3D games much better, and there is no reason to hope that AMD will catch up with its competitor's processors in this category of tasks. In other words, using Bulldozer in a gaming system makes sense only when you are confident that the performance of the specific processor is sufficient for a specific video subsystem in a specific set of games. Even then, you need to realize that after the next graphics card upgrade you may find yourself at a serious disadvantage compared to users who chose a modern Intel platform and processor from the start.

In addition to the gaming tests, we will also present the results of the synthetic benchmark Futuremark 3DMark 11, launched with the Extreme profile.



The purpose of adding these results was to show the ideal situation for the FX-8150, in which the video subsystem does not allow the processor's power to be fully realized. Here the main load falls on the video card, and the processor plays only a supporting role. In such cases, we can speak of equal performance for Bulldozer and Sandy Bridge processors, although, of course, this is not entirely true.



However, the FX-8150 also looks good (compared to the previous results) in the 3DMark 11 physics test. AMD's new eight-core processor computes the physics model at a speed comparable to that of the quad-core Core i5-2500.

Tests in applications

Overall, Bulldozer's weighted average and gaming performance on the desktop was well below our expectations. However, let's not despair and try to find those cases when the new AMD microarchitecture is able to show its strengths.

To measure processor speed when compressing data, we use the WinRAR archiver, with which we archive a folder of various files totaling 1.4 GB at the maximum compression level.



The result of the FX-8150 is close to the Core i5-2500. WinRAR is not one of the applications that can parallelize its calculations across all eight Bulldozer cores, but the gigantic cache memory seems to save the day.

The second similar test for archiving speed is carried out in the 7-zip program, using the LZMA2 compression algorithm.



In 7-zip, the FX-8150's performance is commendable. This eight-core processor manages to approach the speed of the quad-core Core i7-2600, which includes support for Hyper-Threading and which, like Bulldozer, can execute eight threads simultaneously.

The encryption performance of processors is measured by the built-in benchmark of the popular cryptographic utility TrueCrypt. It should be noted that it is not only capable of efficiently loading any number of cores with work, but also supports a specialized set of AES instructions.



Well-parallelized, simple integer algorithms are exactly what the Bulldozer microarchitecture needs. In such cases, as we can see, truly outstanding performance can be obtained. In particular, when it comes to encryption, the FX-8150 lags behind only the six-core Core i7-990X and is ahead of all processors for the LGA1155 platform.

To test audio transcoding speed, we use Apple iTunes, which converts the contents of a CD into AAC format. Note that a characteristic feature of this program is that it can use only a pair of processor cores.



It is better to keep programs that generate a small number of computational threads away from Bulldozer. The individual cores of this CPU are too weak to show any decent results in such cases.

We measure performance in Adobe Photoshop using our own test, which is a creatively reworked Retouch Artists Photoshop Speed Test involving typical processing of four 10-megapixel images taken with a digital camera.



In Photoshop, the FX-8150's performance is not as disastrous as that of processors with the K10 microarchitecture, but it still falls far short of the Core i5-2500. Obviously, a large cache memory is a good help for the Bulldozer microarchitecture in this case, but this alone will not get you far. The efficiency and specific performance of computing cores is still of paramount importance.

We also carried out testing in Adobe Photoshop Lightroom 3. The test scenario includes post-processing and JPEG export of one hundred 12-megapixel images in RAW format.



Lightroom can parallelize photo processing across any number of cores, and therefore the eight-core FX-8150 shows fairly good results here. However, "fairly good" is a relative concept in this case; in fact, its performance is merely comparable to that of the Core i5-2500. This means that two Bulldozer cores are equal to one Sandy Bridge core without Hyper-Threading support.

Performance in Adobe Premiere Pro is tested by measuring the rendering time in H.264 Blu-Ray format of a project containing HDV 1080p25 video with various effects applied.



Previous generation AMD processors also handled video transcoding well. The Bulldozer microarchitecture allowed for a slight increase in performance in applications of this nature and, as a result, the FX-8150 is even faster than the Core i5-2500.

The speed of video editing using Adobe After Effects was assessed by measuring the running time of a predefined set of filters and effects, including blur, bump creation, frame blending, glow creation, adding motion defocus, shading, 2D and 3D manipulation, inversion, etc.



Despite the fact that the load is well parallelized, the FX-8150 lags behind Intel competitors in After Effects.

To measure the speed of video transcoding into the H.264 format, the x264 HD test is used, based on measuring the processing time of source video in MPEG-2 format, recorded in 720p resolution with a stream of 4 Mbit/sec. It should be noted that the results of this test are of great practical importance, since the x264 codec used in it underlies numerous popular transcoding utilities, for example, HandBrake, MeGUI, VirtualDub, etc.






AMD processors have always performed well when transcoding video with the x264 codec. With the release of the eight-core processor on the new microarchitecture, their results have improved further, and now the FX-8150 even outperforms the Core i7-2600 in the second, most resource-intensive encoding pass. So, with considerable difficulty, we have finally found a second application, in addition to TrueCrypt, where the performance of a processor with the Bulldozer microarchitecture deserves praise.

We measure computing performance and rendering speed in Autodesk 3ds Max 2011 using the specialized SPECapc test. Starting with this review, we are using the new professional version of SPECapc for 3ds Max 2011.






Rendering is also one of the tasks that lend themselves to optimization for multi-core microarchitectures. Despite this, the FX-8150 is still slower than the Core i5-2500 and Core i7-2600, not to mention the Core i7-990X. On the other hand, at least there is no embarrassing situation here in which the new AMD processor loses to its predecessor.

Averaging results across individual applications, the FX-8150 was about 14% faster than the Phenom II X6 1100T on our set of applications. And this allowed it to perform no worse than the Core i5-2500 in slightly less than half of the cases. However, the gap with the next Sandy Bridge model, Core i7-2600, remains significant and amounts to more than 10%.

Energy consumption

Although we managed to find a set of tasks in which Bulldozer's performance can be called acceptable, processors based on the new microarchitecture do not look at all revolutionary. The only remaining hope is power consumption, because AMD processors used to be significantly inferior to their competitors in this respect. Now, if the developers' promises are to be believed, the microarchitecture has become more focused on energy efficiency, and the new 32 nm process technology should have helped improve the electrical characteristics. So let's look at the FX-8150 through the lens of performance per watt.

The following graphs, unless otherwise noted, show the total system consumption (without the monitor), measured "after" the power supply and representing the sum of the power consumption of all components in the system. The efficiency of the power supply itself is not taken into account. During measurements, the load on the processors was created by the 64-bit version of the LinX 0.6.4 utility. In addition, to estimate idle power consumption correctly, we activated all available power-saving technologies: C1E, C6, AMD Cool'n'Quiet and Enhanced Intel SpeedStep.



When idle, the consumption of systems with processors built on the Bulldozer microarchitecture became lower than that of similar systems with Phenom II family CPUs. However, modern Intel LGA1155 systems consume significantly less in idle mode.



In the case when the computing load is single-threaded, the consumption of Socket AM3+ systems increases sharply, obviously due to the high aggressiveness of Turbo Core technology. With systems built on Intel processors, this is not observed, and they can again boast of significantly higher energy efficiency.



With a full multi-threaded load, the situation is not much different. The only exception is the system with the LGA1366 Core i7-990X processor, which has "pulled ahead". Otherwise, everything is as before. In terms of power consumption, the FX-8150 cannot boast any particular success. It now consumes a little less than the Phenom II X6 1100T, but Sandy Bridge processors are at least one and a half times more economical.

AMD spent all the energy efficiency gained from the new microarchitecture on raising clock frequencies. As a result, we see neither a new level of efficiency nor fundamentally improved performance. Accordingly, in terms of performance per watt, Bulldozer, like its predecessors, is seriously inferior to competing microarchitectures from Intel.

For reference, we present the consumption at full load, measured separately in the power supply circuits of the processor and motherboard.






The "net" consumption of the eight-core FX-8150 is roughly twice that of Sandy Bridge processors. Considering that both processors are manufactured using the same process node and have similar core voltages, it becomes incredibly interesting what AMD meant when it talked about the energy efficiency of its Bulldozer microarchitecture.

Overclocking

The Socket AM3+ platform and FX series processors are positioned as overclocker-friendly from the outset. This is evidenced both by the complete unlocking of all multipliers and by the experiments conducted under AMD's auspices, in which a world overclocking record was set using one of the FX-8150 processors. The company's statements that the new microarchitecture is optimized for high clock frequencies also look promising. Are we really going to get a new overclocking miracle from AMD? Let's check.

Overclocking any FX processor is very simple; it is not for nothing that "Unlocked" is written directly on the logo. The processor frequency can be changed via the multiplier either through BIOS Setup or through specialized utilities provided both by AMD itself (Overdrive Utility) and by motherboard manufacturers. In the same way, in Socket AM3+ systems you can overclock the north bridge built into the processor and the memory.

During testing, we were able to achieve stable operation of our FX-8150 at 4.6 GHz. To ensure stability in this state, the processor supply voltage had to be increased to 1.475 V and, in addition, the Load-Line Calibration function had to be enabled. During stability tests, the temperature of the processor operating at this frequency did not exceed 85 degrees according to the socket sensor or 75 degrees according to the sensor built into the processor. As a reminder, heat was removed by an efficient air cooler, the NZXT Havik 140.



Please note that at the same time we tried to overclock the north bridge built into the CPU, because increasing its frequency has a positive effect on the speed of the third level cache and memory controller. However, unfortunately, significant overclocking of this processor node encountered an invisible barrier, and it could not reach a frequency above 2.4 GHz, even though we simultaneously tried to increase its supply voltage.

In any case, overclocking the FX-8150 to 4.6 GHz is a good result, especially considering that AMD processors of the Phenom II family rarely overclocked beyond 4.0 GHz on air cooling. In other words, the Bulldozer microarchitecture really has made it possible to push the frequency limit somewhat higher.

However, the overclocking of FX processors should be compared first of all not with the old Phenom II, but with the competing Core i5 and Core i7 processors for LGA1155 systems. And those clearly overclock no worse. For example, a quite typical overclock for the Core i5-2500K with a voltage increase of 0.15 V above nominal and an air cooler is 4.7 GHz. Against this background, the result of the FX-8150 no longer seems so brilliant.

The impression of overclocking Zambezi deteriorates even more if we compare the performance of the overclocked FX-8150 and the overclocked Core i5-2500K (the increase in performance relative to the nominal mode is indicated in parentheses):



In general, overclocking does not qualitatively change the picture. Where the FX-8150 was faster at nominal settings, the gap narrowed; where the Core i5-2500 was in the lead, it consolidated its advantage. This is not surprising: the frequency of the overclocked FX-8150 increased by 28%, while the frequency of the Core i5-2500K increased by 42%. Moreover, judging by the size of the performance gain from overclocking, the Sandy Bridge microarchitecture responds more sensitively to increasing frequency. In other words, even with overclocking taken into account, processors with the Bulldozer microarchitecture, although they overclock quite well, do not look stronger than Intel's competing products.
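
The frequency gains quoted above follow directly from the nominal and overclocked clocks given in the article (3.6 to 4.6 GHz for the FX-8150, 3.3 to 4.7 GHz for the Core i5-2500K). A minimal sketch of the arithmetic, with a helper name chosen here purely for illustration:

```python
def frequency_gain_percent(nominal_ghz, overclocked_ghz):
    """Relative clock increase from overclocking, in percent."""
    return (overclocked_ghz / nominal_ghz - 1) * 100


if __name__ == "__main__":
    # Clocks taken from the test configuration and overclocking sections.
    chips = {
        "AMD FX-8150": (3.6, 4.6),
        "Intel Core i5-2500K": (3.3, 4.7),
    }
    for name, (nominal, overclocked) in chips.items():
        gain = frequency_gain_percent(nominal, overclocked)
        print(f"{name}: {nominal} -> {overclocked} GHz, +{gain:.0f}%")
```

The Intel processor gains about one and a half times as much frequency in relative terms, which by itself explains much of the widening gap.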

Conclusions

Success or failure? Surely many of you want to see a clear verdict at the end of the article. However, in this case, everything is very ambiguous, and AMD has put reviewers in a very difficult position with its Bulldozer.

The fact is that AMD has demonstrated a completely non-standard approach to microarchitecture development. Given that processor performance is made up of three components (the number of instructions a core executes per clock cycle, the clock frequency and the number of cores), the developers shifted their priorities toward the number of cores. The per-core performance was reduced in the process, but the resulting design opened the way to inexpensive processors with eight or even more cores. This is a very strong move for the server market, where multi-threaded loads are typical and processors with a large number of cores are in serious demand. So it is very likely that the new Bulldozer microarchitecture will allow AMD to significantly improve its position in the high-performance server market.
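
The three-component view of performance described above can be written as a simple product. The sketch below is a deliberately crude model: the function name is invented, the IPC values are hypothetical round numbers rather than measurements, and it ignores shared resources, turbo modes and memory effects. It only shows why trading per-core speed for core count pays off solely when the workload has enough threads.

```python
def relative_throughput(ipc, freq_ghz, cores, threads):
    """Crude throughput estimate: IPC x frequency x usable cores.

    A workload with fewer threads than cores leaves the extra cores idle.
    """
    usable_cores = min(cores, threads)
    return ipc * freq_ghz * usable_cores


if __name__ == "__main__":
    # Hypothetical designs: many slow cores vs. half as many cores
    # that are twice as fast per clock.
    many_simple = dict(ipc=1.0, freq_ghz=3.6, cores=8)
    few_complex = dict(ipc=2.0, freq_ghz=3.6, cores=4)

    for threads in (2, 4, 8):
        a = relative_throughput(threads=threads, **many_simple)
        b = relative_throughput(threads=threads, **few_complex)
        print(f"{threads} threads: many-simple={a:.1f}, few-complex={b:.1f}")
```

In this model the two designs tie only when eight threads are available; with two or four threads the design with fewer, faster cores is twice as fast, which matches the pattern of the desktop results above.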

However, today we got acquainted with the FX processor, built on this microarchitecture, but aimed at desktop computers. And this is where the discrepancy between Bulldozer’s hardware capabilities and typical desktop workloads became fully apparent. It's especially disappointing that the marketing campaign was structured in such a way that many believed in Bulldozer as a rising star in the desktop market. However, these hopes were not destined to come true.


FX processors, which are based on the Bulldozer microarchitecture, were able to demonstrate their strengths only in a small subset of tasks solved by ordinary users. Among typical common applications, there are not many examples that generate a simple integer multi-threaded workload, and Bulldozer's high performance is revealed only in this case. As a result, in some cases Bulldozer turned out to be not only slower than competing solutions from Intel, but even worse than the Phenom II X6 processor, built on the previous generation microarchitecture. And this means that AMD failed to produce a revolutionary desktop processor.

In fact, FX is just the next Phenom, which is quite good in itself, especially compared to its predecessors. FX processors are generally faster than Phenom II, overclock significantly better and consume slightly less power, so they can be considered a good replacement for processors based on the outdated K10 microarchitecture.

However, let us remind you that AMD is at war not only with itself, but also with Intel. Therefore, we are still forced to voice the disappointing conclusion that FX processors make real sense only in desktops focused on video processing and transcoding. In other cases, their performance rarely looks encouraging next to Sandy Bridge processors. The same can be said about power consumption and overclocking. Separately, it should be added that AMD FX processors, as expected, turned out to be a poor option for gaming systems, since modern 3D games make practically no use of truly multi-threaded algorithms. However, fans of AMD products will probably be able to put up with this, given that the number of frames per second in games is often limited by the graphics card, not the processor.

In other words, the market prospects for FX processors will depend on two factors: how large the army of AMD adherents is; and on how skillfully the manufacturer will manage the price lever. However, desktop processors with the Bulldozer microarchitecture are clearly not expected to become widely popular.

AMD rarely treats us to fresh processor architectures. While Intel updates its design every two years, its competitor last did so in 2007, when it released the K10, a redesigned version of the old K8. So the appearance of the new Bulldozer is a significant event. For the next few years, this architecture will be the basis for all AMD chips, and it is the company's first chance in a long time to compete with Intel in the race for performance.

We go as a couple

In creating Bulldozer, AMD engineers abandoned the proven strategy of improving and partially copying old designs. The structure of the chips is fundamentally different from what we are used to seeing in x86 systems.

The first and most important innovation is the original layout. All top versions of Bulldozer officially have eight cores. In reality, however, there are four full-fledged modules, each with two computing units. It looks like this: two integer arithmetic clusters (they are called cores and are directly responsible for calculations) share a Front-End, a floating-point cluster (FPU) and a second-level cache increased to 2 MB.

The benefit of such a tandem is savings in die area, power consumption and production costs. The disadvantage is that sharing the same units hurts final performance. Under heavy load, one Front-End may not be able to keep up with two cores. AMD does not deny the loss of performance: according to the company, the pair is about 20% weaker than a full-fledged dual-core processor.

Communication difficulties

To eliminate the bottleneck, the Front-End had to learn to share resources efficiently between the two cores. To achieve this, the branch prediction unit and the instruction decoder were redesigned; the decoder received a fourth instruction-processing channel (as in Sandy Bridge) and Branch Fusion technology. The latter merges certain instructions into a single operation. All this should speed up the Front-End and keep the chip from sitting idle.

As for the cores themselves, each is a set consisting of an Out-of-Order unit, a load/store unit, an L1 cache and two computing clusters. The out-of-order execution unit now has a physical register file. As in Sandy Bridge, it holds the addresses where working data is stored, which takes load off the main Out-of-Order pipeline. The load/store unit received a larger buffer, doubled capacity and the ability to work with virtual addresses, which in theory should speed up work with the L1 data cache. The latter became four times smaller in Bulldozer: 16 KB versus 64 KB in K10. The loss was compensated for by speed: L1 associativity increased from two to four ways, which means roughly twice the search efficiency.

There are three computing clusters in one module: two integer clusters and one for floating-point data. Compared with K10, each integer cluster lost one ALU (which performs calculations) and one AGU (which handles memory addresses). In theory, this means reduced peak performance. In practice, the change will be barely noticeable: it is difficult to load the integer clusters fully.

The main changes affected the FPU, which is responsible for complex floating-point calculations. Compared with K10, it became much more powerful: it received a pair of MMX units and a pair of 128-bit FMAC units for performing addition and multiplication. Unlike in K10, the FMACs are universal: they can stand in for each other, which has a positive effect on calculation speed. In addition, they can combine a multiplication and an addition into a single operation, which increases the accuracy of calculations.

In addition, the FPU received an updated set of instructions. First, the processor now works with AVX, which supports 256-bit registers. To process them, two FMACs are combined, as in Sandy Bridge. Secondly, Bulldozer can work with SSE 4.2, AES-NI, FMA4 and XOP instructions. The last two sets are unique to AMD. For you and me, all these changes mean only one thing: operations that previously took several clock cycles will now be computed in one, and this directly affects performance. True, to feel the increase in speed, the software must support these instructions.
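The accuracy gain from combining a multiplication and an addition comes from rounding the result once instead of twice. The sketch below illustrates that effect in Python, assuming IEEE 754 double precision and emulating the fused result with exact rational arithmetic. The function names and sample values are illustrative and do not come from AMD documentation.

```python
from fractions import Fraction


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Multiply and round to double, then add and round again (two roundings)."""
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a fused operation: compute a*b+c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Illustrative inputs chosen so that the intermediate product needs
# more bits than a double can hold.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused = separate_multiply_add(a, b, c)
fused = fused_multiply_add(a, b, c)
print("two roundings:", unfused)
print("one rounding: ", fused)
```

With two roundings the small term in the product is lost before the addition, while the single-rounding version keeps it.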

Glue and scissors

As a result, each Bulldozer module consists of one Front-End, L2 and L1 data caches, two integer clusters and a floating-point block. One chip can contain up to four such modules. Each of them has access to a number of shared elements. The first is a dual-channel memory controller with support for DDR3-1866. The second is the L3 cache, whose size has grown from 6 to 8 MB compared with K10, and whose associativity has grown from 48 to 64 ways. Note that, unlike in Sandy Bridge, the L3 cache does not run at core frequency: while the top model operates at 3.6 GHz, the last-level cache runs at 2.2 GHz. This leads to noticeable latency that hurts performance. According to AMD, this sacrifice was made for the sake of stable operation at high frequencies.

Tadam!

Despite the architectural tricks and the 32 nm process technology, Bulldozer occupies an impressive 315 square millimeters. This is about one and a half times more than the quad-core Sandy Bridge and the top Llano. Fortunately, power consumption was kept within reasonable limits: 125 W.

In addition to eight-core models, there are versions with six and four computing units. The younger brothers are based on the same eight-core design, but they have one or two modules disabled.

The base frequency varies from 3.1 to 3.6 GHz. Like Sandy Bridge, Bulldozer has automatic overclocking technology. A dedicated block responsible for Turbo Core 2.0 monitors the current core load and TDP level and, as soon as the opportunity arises, raises the processor frequency. On the top chip, the frequency can rise by 300 MHz when all modules are in use, and by 600 MHz if some of the resources are idle. At low loads, Bulldozer switches to an energy-saving mode, which is handled by Cool'n'Quiet technology.

Manual overclocking is simple. Firstly, the entire line has an unlocked multiplier. Secondly, the new chips climb well: under liquid nitrogen, the top Bulldozer set a new world record of 8429 MHz.
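The Turbo Core behaviour described above can be sketched as a small decision function. This is a simplified model built only from the figures in the text (3.6 GHz base, +300 MHz, +600 MHz). The function name, the `tdp_headroom` flag and the rule that "some resources are idle" means fewer active modules than the total are assumptions made for illustration, not AMD's actual algorithm.

```python
BASE_MHZ = 3600                # base frequency of the top model
ALL_MODULES_BOOST_MHZ = 300    # boost with every module in use
PARTIAL_LOAD_BOOST_MHZ = 600   # boost when some resources are idle


def turbo_frequency_mhz(active_modules: int, total_modules: int = 4,
                        tdp_headroom: bool = True) -> int:
    """Return the modelled clock for a given load (hypothetical policy)."""
    if not tdp_headroom:
        # No thermal or power budget left: stay at the base frequency.
        return BASE_MHZ
    if active_modules < total_modules:
        # Part of the chip is idle, so the larger boost is allowed.
        return BASE_MHZ + PARTIAL_LOAD_BOOST_MHZ
    return BASE_MHZ + ALL_MODULES_BOOST_MHZ


print(turbo_frequency_mhz(active_modules=4))  # all modules busy
print(turbo_frequency_mhz(active_modules=2))  # half of the modules idle
```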

Companions

Bulldozer runs on Socket AM3+. In essence, this is a slightly improved AM3 with one additional pin. The chipsets for the new socket are called 990FX, 990X and 970. They differ in their PCIe 2.0 controllers: the top model provides 32 lanes, the lower ones 16. In addition, the 990FX and 990X support CrossFireX. Among the chipsets' features, we note six SATA Rev. 3 ports and 14 USB 2.0 ports. There is no USB 3.0 controller.

Note that Bulldozer can also work on older boards. All you need is an updated BIOS. The limitations: Turbo Core and Cool'n'Quiet respond more slowly, and some energy-saving functions are not available.

The Bulldozer processor architecture turned out to be interesting. AMD has finally stopped copying itself and come up with something truly new. Unfortunately, it has few clear advantages over the competition. The advertised eight cores are not really there: strictly speaking, these are quad-core models with an increased number of computing units, something like Intel Hyper-Threading, but implemented in hardware. The idea is good, but performance will depend on how fast the Front-End is. Bulldozer's real advantages are limited to a powerful FPU for floating-point calculations and higher operating frequencies than K10.

Let's roll it out! Let's bury it!

AMD has announced plans to release subsequent lines of processors. The company expects to update the architecture annually, achieving a gain of approximately 15 percent in performance per watt each time. If AMD sticks to its plan, in 2012 we will see the Piledriver architecture, a year later Steamroller, and 2014 will be remembered for the announcement of Excavator. That is construction work for you.
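If the 15 percent figure held for every generation, the gains would compound. The short calculation below is only an arithmetic illustration of that roadmap, assuming exactly 15 percent per step; it is not a forecast of real products.

```python
GAIN_PER_GENERATION = 0.15  # "approximately 15 percent" from the roadmap
GENERATIONS = ["Piledriver", "Steamroller", "Excavator"]

projection = {}
relative = 1.0  # Bulldozer performance per watt taken as the baseline
for name in GENERATIONS:
    relative *= 1 + GAIN_PER_GENERATION
    projection[name] = relative
    print(f"{name}: {relative:.2f}x Bulldozer performance per watt")
```

Under this assumption the third generation would deliver roughly one and a half times the performance per watt of the original Bulldozer.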

Wrong windows

According to AMD, Windows 7 is unable to unleash the full potential of the new design: the OS scheduler does not take Bulldozer's features into account. For example, on the new processors it is important that related threads be assigned to the same module; otherwise the cores will exchange data not through the fast L2 cache but through the L3 cache. Some separate threads are also better grouped this way to improve the efficiency of Turbo Core 2.0. At the same time, certain tasks put a heavier load on the Front-End, and it is better to spread them across different modules. Thanks to cooperation with Microsoft, these nuances will be taken into account in the Windows 8 scheduler. However, you should not expect a significant increase in performance.
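The two placement rules just described (related threads on one module, Front-End-heavy threads on separate modules) can be sketched as a toy placement function. This is not the Windows scheduler. The core numbering (cores 2m and 2m+1 belong to module m), the function names and the thread names are all assumptions made for illustration.

```python
CORES_PER_MODULE = 2


def module_of(core: int) -> int:
    """Module index for a core, assuming cores 2m and 2m+1 share module m."""
    return core // CORES_PER_MODULE


def place_threads(related_pairs, front_end_heavy, modules=4):
    """Return a {thread_name: core_index} placement for a Bulldozer-like chip."""
    needed = len(related_pairs) + len(front_end_heavy)
    if needed > modules:
        raise ValueError("not enough modules for this simple policy")

    placement = {}
    module = 0
    # Threads that exchange data share one module, and therefore its L2 cache.
    for first, second in related_pairs:
        placement[first] = module * CORES_PER_MODULE
        placement[second] = module * CORES_PER_MODULE + 1
        module += 1
    # Front-End-heavy threads each get a module of their own.
    for thread in front_end_heavy:
        placement[thread] = module * CORES_PER_MODULE
        module += 1
    return placement


placement = place_threads(
    related_pairs=[("producer", "consumer")],
    front_end_heavy=["encoder_a", "encoder_b"],
)
print(placement)
```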

Dictionary

Integer computing cluster - handles operations on integers (1, 2, 10).

Front-End - the fetch and decode block. Receives instructions from the program and translates them into a form the processor understands.

FPU - the floating-point cluster. Performs calculations with fractional numbers (1.2345) and with values expressed with exponents (1.2345E-10).

Branch prediction unit - predicts in advance which data and operations the program may need next. Keeps the processor from sitting idle.

Instruction decoder - breaks the program down into micro-operations, which are then used by the computing clusters.

Out-of-Order - the out-of-order execution unit. Handles the distribution of operations among the execution units. Sends for calculation only those instructions whose data is available.

Load/store unit (LSU) - manages the movement of data between the pipeline output and the L1 data cache.

Cache associativity - the way cache lines are grouped into sets. The higher the associativity, the slower the search, but the more effective it is.

MMX - a set of units for working with numbers up to 8 bytes wide.

Instruction sets - allow a single instruction to perform an operation on several data items at once.

Table 1. Specifications of AMD Bulldozer processors

Number of computing cores | Base frequency | Turbo Core frequency | Memory support | Power consumption | Process technology | Price as of November 2011

unknown

What makes up processor performance? Previously, there was a formula in use that described performance as the product of the number of instructions executed per clock cycle and the frequency at which this processor operates. Now a third factor has appeared in this formula - the number of computing cores. Therefore, a processor developer who wants to release a fast product has several options to do this.
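The formula in the paragraph above can be written out directly. The sketch below compares two hypothetical designs: the function name and all numbers are invented for illustration and are not measurements of any real processor. It also shows why the third factor helps only when software actually uses every core.

```python
def peak_throughput(ipc: float, frequency_ghz: float, cores: int) -> float:
    """Peak billions of instructions per second: IPC x frequency x cores."""
    return ipc * frequency_ghz * cores


# Hypothetical designs for illustration only; these are not measured figures.
wide_design = {"ipc": 2.0, "frequency_ghz": 3.0, "cores": 4}
many_core_design = {"ipc": 1.5, "frequency_ghz": 3.5, "cores": 8}

for name, design in (("wide", wide_design), ("many-core", many_core_design)):
    all_cores = peak_throughput(**design)
    one_thread = peak_throughput(design["ipc"], design["frequency_ghz"], cores=1)
    print(f"{name}: all cores {all_cores:.2f}, single thread {one_thread:.2f}")
```

In this made-up example the many-core design wins when every core is busy and loses on a single thread, which is the trade-off the rest of the article discusses.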

However, it is not all that simple. Increasing the number of instructions a core executes per clock cycle is a rather difficult task. Classic x86 program code implies sequential execution of instructions, so to process them in parallel the processor must be equipped with highly efficient branch prediction and instruction reordering units, whose implementation requires considerable engineering effort. At the same time, a more complex microarchitecture increases the physical size of the die and limits how many cores can be added. So if a manufacturer is going to make a processor with a large number of cores, the microarchitecture should, on the contrary, be simplified. Clock frequency is not easy either: betting on its growth again requires changes to the processor's internal blocks and a longer execution pipeline. The upshot is that for a processor to win a medal for performance, its developers must work hard to optimize several parameters at once.

A further problem is that any chosen way of improving processor performance may pay off only in special cases. Not all programs can work effectively with a large number of cores. Some algorithms do not allow branches to be predicted correctly or instructions to be reordered. And in some cases, performance does not increase even with a higher clock frequency, because there are other bottlenecks in the system.

Finding the optimal balance is not easy, and what should count as the criterion of optimality anyway? We can only compare the performance of processors in a finite number of programs and pick the fastest one for a given case. However, this does not at all guarantee that a different set of tests will not produce completely opposite results. Such a lengthy introduction is given here because today we are going to get acquainted with the new series of AMD FX processors, the company's flagship product, widely known under the code name Zambezi. This processor is based on the very controversial Bulldozer microarchitecture, which has already collected a considerable number of unflattering reviews. But the point is not that this microarchitecture is bad through and through. When choosing the best balance of characteristics, the developers misjudged the needs of the majority of users and placed the main emphasis on the wrong factor in the “basic formula”. As a result, the initial plan to release a high-performance next-generation solution went wrong, and AMD adherents, intrigued by promises of a breakthrough, received something quite different from what they expected. However, is this a serious and objective reason for disappointment? That is what we will discuss in this article.

Counting cores: eight or four?

While working on a new design for performance processors, AMD decided to prioritize the number of processing cores. This is a completely logical choice: multi-threaded software becomes more common every year, and a microarchitecture meant to last for many years should take into account observed trends rather than the current state of the market. With the eight cores offered in the basic version of the new processor, AMD intended to conquer a market in which, until now, no chip had more than six cores. (Here we are talking only about desktop computers. Ed.)

At the same time, the developers did not want to reuse the cores of the old K10 microarchitecture. Not only are they physically too large, but, judging by Llano, they also do not take well to high clock frequencies even after being moved to the modern 32 nm process. In addition, they do not support many modern features, such as AVX instructions. Therefore, to assemble eight-core processors, AMD created a new microarchitecture, Bulldozer. Company representatives prefer to say that it was developed from scratch, but in fact the Bulldozer cores contain many echoes of another microarchitecture presented this year: Bobcat, aimed at compact and energy-efficient devices. However, the kinship between Bulldozer and Bobcat is quite distant, and we mention it only to make the general idea clear: Bulldozer combines many relatively simple cores.

At the same time, this is not a primitive combination of eight simple cores on one semiconductor die. Such a processor would have very low single-threaded performance, and this would be a rather serious problem, since quite a few programs do not split their load into several computational threads. Therefore, firstly, the cores were optimized for operation at high clock speeds. And secondly, they were paired into dual-core modules capable of sharing their resources to serve a single thread. The result is a rather interesting design: the front part of the execution pipeline of such a dual-core module is shared, and further instruction processing is divided between two sets of execution units.

The basis of the Bulldozer design is what is conventionally called a dual-core module

Recall that data processing in a modern processor includes several stages: fetching x86 instructions from cache memory, decoding them (translating them into internal macro-operations), executing them and writing back the results. The first two stages in the Bulldozer module are performed jointly for a pair of cores; after that, integer instructions are distributed between the two integer clusters (the cores), while floating-point arithmetic is carried out in a floating-point unit shared by both cores.

Bulldozer modules are designed to process four instructions per clock cycle, and, thanks to macro-op fusion, some pairs of x86 instructions can be treated by the processor as a single operation. This means that, in general, the dual-core Bulldozer module is similar in power to a single core of modern Intel processors, which can also process four instructions per clock cycle and also supports macro-op fusion.

However, there are significant differences between the Bulldozer module and the Sandy Bridge core that call their roughly equal theoretical speed into question. Because the module of the new AMD processors contains what remains of two equal cores, it can demonstrate maximum performance only when processing a pair of threads. Under a single-threaded load, its speed is limited by the number of execution units within one such cluster. And there are not many of them, given AMD's desire to simplify individual cores: two arithmetic ALUs and two address AGUs, which is one and a half times fewer than in processors with the Sandy Bridge or K10 microarchitecture.

This is what the functional structure of a module built on the Bulldozer microarchitecture looks like. All that remains of the two cores is two sets of integer execution units

The floating-point unit shared across the processor module is also relatively simple. It includes two 128-bit FMAC execution units, which can be combined into a single unit to process 256-bit instructions. It would seem that there are not many execution units here, especially considering that they are shared between a pair of cores. But they are more universal than in previous and competing microarchitectures, which use separate multipliers and adders. Thanks to this, in certain cases when working with floating-point numbers, the dual-core Bulldozer module can deliver comparable or even higher performance than, for example, one Sandy Bridge core.

A similar idea of combining 128-bit devices to work with 256-bit instructions is used in Sandy Bridge

However, the Bulldozer module should show its greatest strengths under a dual-thread load. One Sandy Bridge core is also capable of processing two computational threads; for this it has Hyper-Threading technology. However, all instructions are sent to one set of execution units, which in practice causes numerous conflicts. The Bulldozer module contains two independent integer clusters that can execute threads in parallel, and the total number of execution units in them exceeds the number of such units in the Sandy Bridge core by one and a half times.

On the left is the Bulldozer module, on the right is some competing core with Hyper-Threading support. In fact, it doesn’t look much like Sandy Bridge, but the illustration conveys the essence of the problem

As a result, the Bulldozer module has higher peak performance than the Sandy Bridge core, but this performance is somewhat harder to unlock. The Sandy Bridge core loads its own resources intelligently thanks to advanced on-chip logic that independently parses single-threaded code and executes it in parallel on its full set of execution units. In Bulldozer, the task of using the execution units effectively is partly shifted to the programmer, who must split the code into two threads; only then can all of the module's capacity be fully utilized.

And here is what is telling. When considering the dual-core Bulldozer processor module, we constantly compared it with a single Sandy Bridge core, and the parallels we drew were quite fair. This makes us wonder: shouldn't the “eight-core” nature of the new microarchitecture be considered a product of the marketers' imagination? AMD says that cores should be counted by the number of integer clusters, arguing that the module can provide up to 80% of the performance of two independent cores. However, we should not forget that the cores on which Bulldozer is based are significantly simpler than the cores of other processors. Therefore, the number of dual-core modules is a characteristic that reflects the performance of Bulldozer much more adequately.
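AMD's 80% figure can be turned into a rough count of "independent core equivalents". The calculation below is a back-of-the-envelope sketch that takes the quoted upper bound at face value and ignores the fact, noted above, that the individual cores are simpler than those of other processors. The names are illustrative.

```python
CORES_PER_MODULE = 2
MODULE_EFFICIENCY = 0.8  # AMD's "up to 80%" of two independent cores


def independent_core_equivalents(modules: int,
                                 efficiency: float = MODULE_EFFICIENCY) -> float:
    """Advertised cores scaled by the quoted module efficiency."""
    return modules * CORES_PER_MODULE * efficiency


advertised = 4 * CORES_PER_MODULE
equivalent = independent_core_equivalents(4)
print(f"advertised cores: {advertised}, independent-core equivalents: {equivalent:.1f}")
```

Even by AMD's own best-case figure, the "eight-core" chip comes out closer to six and a half independent cores of the same design.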

Find the maximum number of processor cores and get a job in the AMD marketing department

Cache memory

The organization of cache memory in Bulldozer processors is also “tied” not so much to individual cores, but to dual-core modules. In fact, each core is allocated only its own first-level data cache; all other levels of cache memory relate either to the module as a whole or to the processor:

  • Each core has its own L1 data cache. Its size is 16 KB, and it is four-way set-associative. This cache uses a write-through policy, which means it is inclusive.
  • The first-level instruction cache is provided as a single copy for each dual-core module. Its size is 64 KB, and it is two-way set-associative.
  • The second-level cache is also implemented as a single instance per module. Its size is an impressive 2 MB, it is 16-way set-associative, and it operates as an exclusive cache.
  • In addition, the eight-core processor as a whole has an 8 MB L3 cache with 64-way associativity. The peculiarity of this cache is that it runs at about 2 GHz, significantly lower than the frequency of the processor itself.
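The sizes and associativity figures in the list above determine how many sets each cache has. The sketch below works this out under one assumption that is not stated in the text: a 64-byte cache line. The helper name `cache_sets` is illustrative.

```python
KB = 1024
MB = 1024 * KB
LINE_BYTES = 64  # assumed cache line size; the article does not state it


def cache_sets(size_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    return size_bytes // (ways * line_bytes)


# (size, associativity) as listed in the text
caches = {
    "L1D": (16 * KB, 4),
    "L1I": (64 * KB, 2),
    "L2": (2 * MB, 16),
    "L3": (8 * MB, 64),
}

sets = {name: cache_sets(size, ways) for name, (size, ways) in caches.items()}
for name, count in sets.items():
    print(f"{name}: {count} sets")
```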

The following table describes the ratio of cache memory volumes for eight-core Bulldozer, four-core Sandy Bridge and Thuban processors (six-core Phenom II X6, built on the K10 microarchitecture).

Cache type | Bulldozer (8 cores/4 modules) | Sandy Bridge (4 cores) | Thuban (6 cores)
L1I (instructions) | 4x64 KB | 4x32 KB | 6x64 KB
L1D (data) | 8x16 KB | 4x32 KB | 6x64 KB
L2 | 4x2 MB | 4x256 KB | 6x512 KB
L3 | 8 MB, 2.0-2.2 GHz | 8 MB, runs at processor speed | 6 MB, 2.0 GHz

As you can see from the table, AMD relied on capacious upper-level caches, which can be really useful in the case of a serious multi-threaded load. However, the cache memory in new processors is generally slower than that of previous and competing products. This is easily detected when measuring practical latency.
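To make the comparison concrete, the per-level figures from the table can be summed into a total on-chip cache size for each processor. The sketch below only adds up the numbers given in the table; the variable and function names are illustrative.

```python
KB = 1
MB = 1024 * KB

# (number of instances, size of each in KB), taken from the table
caches = {
    "Bulldozer": {"L1I": (4, 64 * KB), "L1D": (8, 16 * KB),
                  "L2": (4, 2 * MB), "L3": (1, 8 * MB)},
    "Sandy Bridge": {"L1I": (4, 32 * KB), "L1D": (4, 32 * KB),
                     "L2": (4, 256 * KB), "L3": (1, 8 * MB)},
    "Thuban": {"L1I": (6, 64 * KB), "L1D": (6, 64 * KB),
               "L2": (6, 512 * KB), "L3": (1, 6 * MB)},
}


def total_cache_kb(levels: dict) -> int:
    """Sum count * size over all cache levels of one processor."""
    return sum(count * size for count, size in levels.values())


totals = {name: total_cache_kb(levels) for name, levels in caches.items()}
for name, total in totals.items():
    print(f"{name}: {total} KB ({total / MB:.2f} MB)")
```

Most of Bulldozer's advantage in total capacity comes from the L2 level, where four 2 MB caches add up to as much as the entire L3.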

The long latency of data accesses in Bulldozer can be offset only by the high clock speed of these CPUs. That was, in fact, the original plan: in terms of frequency, the new eight-core processors were supposed to exceed the Phenom II by 30%. However, AMD never managed to design dies capable of operating stably at such high frequencies. As a result, high cache latency can seriously hurt the performance of Bulldozer-based systems.