AMD fights back in the battle of the Quad Cores

AMD described its forthcoming quad-core processor, codenamed Barcelona, in a session at today’s Microprocessor Forum. Details of the new microarchitecture on which the processor is based (codenamed K8L) have been known for some time now. Still, the event brought some new information, and here are the highlights that I’ve culled from the reporting on it.

As I’ve reported before, AMD’s new processor is a bona fide quad-core part, with all four cores integrated onto a single die and sharing a 2MB on-die L3. Each core has its own private 512KB L2 and 64KB L1 caches. In contrast, Intel’s quad-core Kentsfield part is a dual-chip module (DCM), where two Core 2 Duo dies are put into the same package without sharing a cache.

K8L’s floating-point and SSE units are now 128 bits wide, so AMD can offer single-cycle throughput for 128-bit computation.

In addition to the floating-point and SIMD improvements, K8L has the following enhancements in common with Core:

  • Improved branch prediction unit: Like Core, K8L sports larger tables and greater prediction accuracy in its BPU.
  • Dedicated stack engine: Intel first started using a dedicated stack engine to pull stack-related ESP updates out of the instruction stream with Pentium M. AMD finally appears to be going the same route with K8L. This choice frees up dispatch and execution bandwidth.
  • Memory disambiguation: I’m not 100% sure that K8L features the same kind of memory disambiguation technique as Core, but the new processor’s ability to reorder loads looks to be similar.
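
To make the stack engine bullet a bit more concrete, here is a toy Python model of the idea. It is only a sketch under stated assumptions, not a description of AMD’s or Intel’s actual hardware: I assume that without a stack engine every stack instruction costs one extra micro-op to update ESP, and that with one the front end tracks the ESP offset itself and inserts a single synchronization micro-op only when a later instruction reads ESP explicitly. The names `STACK_OPS` and `count_uops` and the instruction sequence are made up for illustration.

```python
# Toy model of a dedicated stack engine (illustrative assumptions only).
STACK_OPS = {"push", "pop", "call", "ret"}


def count_uops(instructions, stack_engine):
    """Count micro-ops sent to the execution core.

    instructions: list of (mnemonic, reads_esp_explicitly) tuples.
    stack_engine: True if ESP updates are handled in the front end.
    """
    uops = 0
    pending_delta = False  # True while the front end holds an unsynced ESP offset
    for op, reads_esp in instructions:
        if op in STACK_OPS:
            uops += 1  # the load, store, or branch part of the instruction
            if stack_engine:
                pending_delta = True  # offset is tracked outside the core
            else:
                uops += 1  # assumed extra micro-op for the ESP add/subtract
        else:
            if stack_engine and reads_esp and pending_delta:
                uops += 1  # assumed synchronization micro-op
                pending_delta = False
            uops += 1  # the instruction itself
    return uops


# Hypothetical function-call sequence.
sequence = [
    ("push", False),
    ("push", False),
    ("call", False),
    ("mov", True),   # reads ESP explicitly, e.g. to address a local
    ("pop", False),
    ("ret", False),
]

without_engine = count_uops(sequence, stack_engine=False)
with_engine = count_uops(sequence, stack_engine=True)
print(without_engine, with_engine)
```

In this model the micro-ops saved are the dispatch and execution bandwidth that the bullet above says gets freed up; the real savings depend on how often code touches ESP directly.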

So how will K8L stack up to Core?

Placing bets

People always want to know how these sorts of things affect the Intel/AMD performance horse race, so I’ll go ahead and stick my neck way out there with some predictions. But before I do, let me note that there are two factors to consider this time around:

  1. System architecture
  2. Core architecture

Let’s take core architecture first, because it’s the one that’s most familiar from past Intel-AMD comparisons. Note that I’m taking a preliminary stab at comparing these two architectures, and that my knowledge of K8L isn’t nearly as thorough as my knowledge of Core. So I reserve the right to change my opinion once I get more details about AMD’s new architecture. (If anyone at AMD wants to send me an optimization manual or a whitepaper, feel free.)

K8L will be a major improvement over the Opteron, but it still won’t match Conroe in vector code. With macro-op and micro-op fusion, Conroe can do up to six 128-bit SSE operations per cycle, or twice K8L’s peak of three SSE operations per cycle (two SSE ops + one SSE MOV). However, it would be extremely simplistic to say that Conroe has double the potential SSE performance, because you need a very particular mix of instructions to attain that theoretical peak. So it’s likely that real-world averages will put the two cores closer to each other in terms of the overall number of SSE operations executed per cycle.

If you just compare floating-point addition and multiplication, both Core and K8L can do four (packed) double-precision operations per cycle (2 x fadd + 2 x fmul) or eight (packed) single-precision operations per cycle (4 x fadd + 4 x fmul). The fact that there’s FP/SSE MOV hardware on each of Core’s three main issue ports will give the Intel part an edge in handling memory traffic, though.
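
The per-cycle figures above are just lane arithmetic, which a few lines of Python can spell out. This is a minimal sketch, assuming a 128-bit datapath and one FADD pipe plus one FMUL pipe issuing every cycle; the function name `packed_ops_per_cycle` is my own and not anything from AMD’s or Intel’s documentation.

```python
def packed_ops_per_cycle(width_bits, element_bits, pipes):
    """Peak packed FP operations per cycle for a given datapath width."""
    lanes = width_bits // element_bits  # elements that fit in one register
    return lanes * pipes                # each pipe works on every lane


# Assumption: two pipes (one FADD, one FMUL), both 128 bits wide.
double_precision = packed_ops_per_cycle(128, 64, pipes=2)  # 2 fadd + 2 fmul
single_precision = packed_ops_per_cycle(128, 32, pipes=2)  # 4 fadd + 4 fmul
print(double_precision, single_precision)
```

These are theoretical peaks only; they say nothing about how often real code can keep both pipes busy.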

For scalar floating-point, the FPUs of the two cores are quite comparable (i.e. 128-bit FP/SIMD registers and datapaths, and split FADD/FMUL pipes). Both cores can do double-precision floating-point operations with a single-cycle throughput.

I imagine that the integer capabilities of the two cores will be comparable, though K8L’s on-die memory controller, with its lower memory access latencies, may give the AMD part a slight edge in integer performance. K8L’s cache hierarchy will also help here, but to what degree is impossible to say in advance of the benchmarks.

Grapefruit-to-oranges

Benchmark bakeoffs between systems based on AMD and Intel technology are increasingly becoming what I like to call “grapefruit-to-oranges” comparisons. The topologies of systems based on the two vendors’ hardware are fundamentally different in ways that I’ve discussed before and won’t recap here. So the kinds of system-level factors that reviewers, analysts, and users typically want to ignore will play a big role in how both types of systems perform on different categories of applications.

Of course, system architecture has always mattered for performance bakeoffs and system comparisons, but the importance of this factor increases as systems from AMD and Intel start to look more and more different.

For instance, a fast memory subsystem, like that afforded the Opteron by its on-die memory controller and glueless interconnect scheme (especially in the case of multi-socket systems), could fill some gaps in performance and total system power consumption that might’ve otherwise opened because of differences in core architecture. But realistically, I expect coherent HyperTransport and NUMA to have much more of an impact in two-socket systems, because in single-socket systems the two cores’ prefetching schemes and cache hierarchies will play a more prominent role in overall performance.

Indeed, AMD’s multisocket 4×4 systems will feature a NUMA memory architecture, a high aggregate bandwidth among all the sockets and memory, and a fairly high level of coprocessor integration that Intel’s Conroe-based systems cannot yet match. All of this solid integration, from the die level (where AMD’s four cores share an L3) out to the memory and GPUs, will help AMD make up for disparities that exist in the theoretical performance of the individual core microarchitectures.

To sum up in regular English, my preliminary conclusion is that Intel has a more powerful core microarchitecture than AMD, but AMD’s die- and system-level integration will be superior when K8L launches. This makes it much harder to predict how Intel- and AMD-based systems at similar price points will perform on different application types, and it makes it especially difficult for reviewers to interpret benchmark results intelligently after the fact.

Because systems from the two vendors are so different, the most worthwhile way to do a bakeoff will be to pick a price range and benchmark two similarly outfitted machines within that range against each other. What will be most useful for consumers of product reviews is to see how similarly priced systems stack up against each other in terms of both performance and power consumption (measured at the wall outlet). This may mean putting a 4×4 system with four cores (2 cores x 2 sockets) against a Kentsfield system (4 cores x 1 socket). Or, it may mean that systems with different numbers (and possibly even brands) of GPUs go head-to-head on common application suites. Whatever happens, the overall performance per watt per dollar of a fully configured system is now the most important measurement of success.
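
For reviewers who want to put a number on that last point, here is one way the metric could be computed. This is a rough sketch, assuming that "performance per watt per dollar" means a benchmark score divided by both wall-outlet power and system price; the class name, field names, and every figure below are hypothetical placeholders, not measurements of any real system.

```python
from dataclasses import dataclass


@dataclass
class SystemResult:
    name: str
    score: float       # benchmark score, higher is better
    wall_watts: float  # power draw measured at the wall outlet
    price_usd: float   # price of the fully configured system

    def perf_per_watt_per_dollar(self):
        return self.score / (self.wall_watts * self.price_usd)


def rank(systems):
    """Return systems ordered from best to worst on the combined metric."""
    return sorted(systems, key=lambda s: s.perf_per_watt_per_dollar(), reverse=True)


# Placeholder numbers for two similarly priced machines (not real data).
system_a = SystemResult("System A", score=100.0, wall_watts=200.0, price_usd=1000.0)
system_b = SystemResult("System B", score=120.0, wall_watts=300.0, price_usd=1000.0)

for system in rank([system_a, system_b]):
    print(system.name, system.perf_per_watt_per_dollar())
```

In this made-up pairing the machine with the lower raw score ranks first because it draws less power, which is exactly the kind of result a raw benchmark chart would hide.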

Source: Ars Technica
