Intel architecture - techintroduce

Introduction

(This entry mainly covers processor architecture. For process technology, see the "32nm" entry.)

Looking back at previous generations of processors, it is not hard to see that Intel has held the leading position in the industry most of the time. Whether it was the early P5 and P6 microarchitectures, the brilliant Core microarchitecture, or the Nehalem microarchitecture that is about to be rolled out across the market, each has driven, or is about to drive, a transformation of the entire industry.

P5 architecture

The Pentium adopted the P5 architecture, which proved to be a great innovation. The first-generation Pentium is a milestone product in Intel's history, and the brand is still in use today, more than ten years later. The overall performance of the first-generation Pentium 60 was unremarkable, not even much better than the 486DX2-66, but once the clock-frequency advantage came into play its power was striking. The Pentium 75, Pentium 100, and Pentium 133 were classic products that once dominated the industry.

P6 architecture

In the Pentium era Intel already led in processor microarchitecture, but it did not stop there. For the next generation of Pentium products, the Pentium II, Intel adopted the patent-protected P6 architecture. The biggest difference between P6 and the Pentium's P5 architecture is that the L2 cache, previously located on the motherboard, was moved into the processor, which greatly speeds up data access on cache hits and improves performance.

Common architecture

NetBurst architecture

The NetBurst microarchitecture is the successor to the P6 microarchitecture. The first core to use it was Willamette, launched in 2000. Willamette is the core of the first-generation Pentium 4 processor, and all Pentium 4 processors use the NetBurst microarchitecture. Foster (a Xeon processor), launched in 2001, also uses it, as do the Pentium 4-based Celeron and Celeron D and the dual-core Pentium D and Pentium Extreme Edition.

The Intel NetBurst microarchitecture was designed for performance and raised the frequency by more than 40%. Although its IPC (instructions per clock) is low, the higher frequency makes up for the deficiency (performance = frequency × IPC) and gives the end user higher overall performance. Like the P6 microarchitecture, NetBurst relies on out-of-order speculative execution. Although the branch prediction algorithm is quite accurate, it cannot be 100% correct.
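The relation performance = frequency × IPC can be made concrete with a small sketch. The 40% frequency gain comes from the text above; the 20% IPC loss and the function names are hypothetical values chosen only for illustration.

```python
def relative_performance(frequency: float, ipc: float) -> float:
    """Performance = frequency x IPC (both relative to a baseline of 1.0)."""
    return frequency * ipc


def break_even_ipc(frequency_gain: float) -> float:
    """Lowest relative IPC at which a design with the given fractional
    frequency gain still matches the baseline."""
    return 1.0 / (1.0 + frequency_gain)


baseline = relative_performance(1.0, 1.0)
# Hypothetical deep-pipeline design: +40% frequency, -20% IPC (assumed).
deep_pipeline = relative_performance(1.4, 0.8)
speedup = deep_pipeline / baseline
```

Under these assumed numbers the higher clock still wins, but the margin disappears once IPC falls to about 71% of the baseline (`break_even_ipc(0.4)`).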


To minimize the loss caused by branch mispredictions and maximize average IPC, the deeply pipelined Intel NetBurst microarchitecture greatly reduces the number of branch mispredictions and provides a fast way to recover from them. To this end, it implements an advanced dynamic execution engine and an execution trace cache.

It is worth mentioning, however, that the Intel NetBurst microarchitecture uses hyper-pipelining, which doubles the pipeline depth compared with the P6 microarchitecture, and later practice showed that lengthening the pipeline greatly reduces execution efficiency.

The only way to remedy this problem is to raise the clock frequency further and enlarge the L2 cache again.

However, because of the limits of process technology at the time, there was less and less room to raise the clock frequency. A huge cache is also a burden: it not only increases cost but also makes heat output rise sharply. This forced Intel to make new and fundamental changes to the processor microarchitecture in good time.

Core microarchitecture

Because the NetBurst architecture can no longer meet the needs of future processor development, Intel launched the innovative Core microarchitecture in 2006.

1. The pipeline efficiency has been greatly improved.

The frequency-above-all approach to processor design has clearly been abandoned. The Core microarchitecture shortens the pipeline to 14 stages, which greatly improves overall efficiency. In addition, the Core microarchitecture uses four instruction decoders, so four x86 instructions can be decoded in a single clock cycle. The four decoders consist of three simple decoders and one complex decoder. Only the complex decoder can handle complex x86 instructions made up of as many as four micro-operations. If an even more complex instruction is encountered, the complex decoder must call on the microcode sequencer to obtain the micro-operation sequence.

To feed this wide decode stage, the instruction fetch unit of the Core microarchitecture fetches six x86 instructions from the L1 instruction cache into the instruction queue, checks whether any pair qualifies for macro-op fusion, and then sends up to five x86 instructions to the four decoders. The decoders send four decoded micro-operations to the reservation station in each clock cycle, and the reservation station then dispatches the stored micro-operations to five execution units.
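The division of labor between one complex decoder and three simple decoders can be illustrated with a toy model. This is a simplified sketch, not a description of the real hardware: it assumes strict program-order allocation, places the complex decoder in the first slot of each cycle, lets a microcoded instruction occupy a cycle alone, and ignores macro-op fusion and the fetch limits. The names `decode_groups`, `SIMPLE_DECODERS`, and `COMPLEX_MAX_UOPS` are hypothetical.

```python
from typing import List

SIMPLE_DECODERS = 3    # each handles single-micro-op instructions only
COMPLEX_MAX_UOPS = 4   # the complex decoder handles up to four micro-ops


def decode_groups(uop_counts: List[int]) -> List[List[int]]:
    """Split a program-order stream of per-instruction micro-op counts
    into per-cycle decode groups (toy model, see assumptions above)."""
    groups: List[List[int]] = []
    i = 0
    while i < len(uop_counts):
        first = uop_counts[i]
        i += 1
        if first > COMPLEX_MAX_UOPS:
            # Too complex for any decoder: modeled as a cycle of its own
            # served by the microcode sequencer.
            groups.append([first])
            continue
        group = [first]  # slot 0 is the complex decoder
        # Remaining slots are simple decoders: single-micro-op only.
        while (i < len(uop_counts)
               and len(group) < 1 + SIMPLE_DECODERS
               and uop_counts[i] == 1):
            group.append(uop_counts[i])
            i += 1
        groups.append(group)
    return groups
```

For example, `decode_groups([1, 1, 1, 1, 2, 1, 1, 6, 1])` splits the stream into four cycles: a full group of simple instructions, a group led by a two-micro-op instruction, a microcoded instruction on its own, and a final single instruction.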

Because the instruction lengths, formats, and addressing modes of the x86 instruction set are highly irregular, designing an x86 instruction decoder is very difficult. But the situation has changed. On the one hand, high performance now depends heavily on the four-wide pipeline structure. On the other hand, auxiliary techniques can largely compensate for the irregular addressing modes. There is no doubt that this move by Intel will be a milestone in the design of processor core architectures.

2. Brand new integer and floating point units

The integer and floating-point units changed visibly from P6 to NetBurst, and the changes in the Core microarchitecture are also considerable, although some key techniques return to designs from the P6 era. Core has three 64-bit integer execution units, each of which can perform 64-bit integer arithmetic on its own.

This is the first time Intel x86 processors have been able to perform 64-bit integer operations independently, which also puts Core ahead of its competitors. In addition, the 64-bit integer units use independent data ports, so Core can complete three 64-bit integer operations in one cycle. These very strong integer units let Core perform well in games, server workloads, mobile computing, and other fields.

In the earlier NetBurst architecture the floating-point unit performed only moderately, and the Core architecture makes many improvements to address this. The Core architecture has two floating-point execution units that handle both vector and scalar floating-point operations. One unit performs simple operations such as addition and subtraction, while the other performs multiplication and division. Although the Core architecture cannot be said to have transformed floating-point performance, the improvement is still clear.

3. Data prefetching mechanism and cache structure

The prefetching mechanism of the Core microarchitecture has several new features. The data prefetch unit often needs to look up tags in the cache. To avoid the high latency these tag lookups could cause, the data prefetch unit uses the store port to perform them. Store operations are not critical to system performance in most cases, because once data starts to be written the processor can immediately move on to the following work without waiting for the write to complete. The cache/memory subsystem handles the whole process of writing the data to the cache and copying it to main memory.

In addition, the Core architecture uses Smart Memory Access, which helps the processor transfer data more efficiently between the front-side bus and memory.


The cache system of the Core architecture is also impressive. A dual-core Core processor has up to 4 MB of L2 cache shared by the two cores, with an access latency of only 12 to 14 clock cycles. Each core also has a 32 KB L1 instruction cache and a 32 KB L1 data cache, with an access latency of only 3 clock cycles. The trace cache introduced with the NetBurst architecture is gone from the Core architecture. The NetBurst trace cache played the role of a conventional instruction cache but stored instructions after decoding, as micro-operations, which was very useful for NetBurst's long pipeline. With the Core architecture returning to a relatively short pipeline, the trace cache disappears as well.

Nehalem micro-architecture

After the success of the Core microarchitecture, Intel kept up its efforts and launched the new Nehalem microarchitecture at the end of 2008. It is basically built on the skeleton of the Core microarchitecture, with the addition of SMT, a three-level cache, multi-level TLB and branch prediction, an IMC, QPI, and DDR3 support. Compared with the large changes from the Pentium 4's NetBurst architecture to the Core microarchitecture, the changes to the basic core from the Core microarchitecture to Nehalem are smaller.

1. QPI bus technology

The QPI bus used by the Nehalem architecture is a packet-based, high-bandwidth, low-latency point-to-point interconnect that runs at 6.4 GT/s (6.4 billion transfers per second). Each link is a 20-bit-wide interface that uses high-speed differential signaling and dedicated clock lanes, and the clock lanes support failover. A QPI packet is 80 bits long and takes 4 cycles to send. Although the packet is 80 bits, only 64 bits carry data, and the remaining bits are used for flow control, CRC, and other purposes. In effect, each link carries 16 bits (2 bytes) of data per transfer, with the remaining width used for CRC and other overhead. Because a QPI link transmits in both directions, its theoretical peak is 25.6 GB/s (2 × 2 B × 6.4 GT/s), or 12.8 GB/s in one direction. (For more detailed information, see the "QuickPath Interconnect (QPI)" entry.)
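The bandwidth arithmetic in the paragraph above can be checked with a few lines. This is only a restatement of the figures given in the text; the function and constant names are hypothetical.

```python
LINK_WIDTH_BITS = 20      # width of one QPI link direction
PACKET_BITS = 80          # one QPI packet
PAYLOAD_BITS = 64         # data bits per packet; the rest is CRC, flow control, etc.
TRANSFER_RATE_GT_S = 6.4  # giga-transfers per second


def transfers_per_packet() -> int:
    """Number of link transfers needed to send one 80-bit packet."""
    return PACKET_BITS // LINK_WIDTH_BITS


def payload_bytes_per_transfer() -> float:
    """Data bytes carried by each transfer (64 payload bits spread over 4 transfers)."""
    return PAYLOAD_BITS / transfers_per_packet() / 8


def qpi_bandwidth_gb_s(directions: int = 2) -> float:
    """Peak data bandwidth in GB/s for one link, counting `directions` directions."""
    return TRANSFER_RATE_GT_S * payload_bytes_per_transfer() * directions
```

Calling `qpi_bandwidth_gb_s(1)` gives the one-way figure and `qpi_bandwidth_gb_s()` the bidirectional total quoted in the text.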

2. IMC integrated memory controller

The Nehalem IMC (integrated memory controller) supports three channels of DDR3 memory running at 1.33 GT/s (DDR3-1333), for a total peak bandwidth of 32 GB/s. FB-DIMM is not yet supported, although Nehalem-EX (Beckton) may support FB-DIMM (Fully Buffered DIMM). Each memory channel can operate independently, and the controller services requests out of order to hide latency. (For more details, see the "Integrated Memory Controller" entry.)
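The 32 GB/s figure follows from the channel count and transfer rate. The sketch below assumes the standard 64-bit (8-byte) DDR3 channel width, which the text does not state explicitly; the function name is hypothetical.

```python
def peak_memory_bandwidth_gb_s(channels: int = 3,
                               transfer_rate_gt_s: float = 1.333,
                               bytes_per_transfer: int = 8) -> float:
    """Peak bandwidth = channels x transfers per second x bytes per transfer.

    bytes_per_transfer = 8 assumes a 64-bit DDR3 channel (an assumption,
    not a figure taken from the text).
    """
    return channels * transfer_rate_gt_s * bytes_per_transfer


total = peak_memory_bandwidth_gb_s()             # three channels of DDR3-1333
per_channel = peak_memory_bandwidth_gb_s(channels=1)
```

Three channels give roughly 32 GB/s in total, or a little under 10.7 GB/s per channel.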

3. SMT

Simultaneous Multi-Threading (SMT) returns in the Nehalem architecture. It first appeared on the 130-nanometer Pentium 4. A processor with SMT enabled suffers more cache misses and needs more bandwidth, so Nehalem is better suited to SMT than the Pentium 4.

Nehalem's SMT is 2-way: each core can execute two threads at the same time. For the execution engine, running multiple threads allows the latency of any single thread to be hidden. The advantage of SMT is that it costs only a small amount of core area yet provides a significant performance improvement under multitasking, which is far more cost-effective than adding a complete physical core. This is the same idea as the earlier Pentium 4 HT (Hyper-Threading) technology, but Nehalem has a larger cache and more memory bandwidth and can therefore use it more effectively. According to Intel, Nehalem's SMT can increase performance by 20-30% with little increase in energy consumption. (For more detailed information, see the "Simultaneous Multi-Threading" entry.)

4. Newly designed cache system

Each Nehalem core has a private unified L2 cache that is 8-way set-associative and 256 KB in size, with quite fast access. With respect to its L1D, Nehalem's L2 is neither inclusive nor exclusive. Data can be transferred between the two private caches of a core (L1D and L2), although not at full speed.

Compared with the Core microarchitecture, Nehalem adds a new L3 cache level to meet the need for multiple cores to share data (Nehalem-EX has 8 cores), so this L3 is very large. Architecturally, the 16-way set-associative, 8 MB L3 of a Nehalem processor is fully inclusive of the first two levels and shared by 4 cores. (For more details, see the entry on the new cache hierarchy.)
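The size and associativity figures above determine how many sets each cache has. The sketch below assumes a 64-byte cache line, which is not stated in the text; the function names are hypothetical.

```python
def cache_sets(size_bytes: int, ways: int, line_bytes: int = 64) -> int:
    """Number of sets = capacity / (associativity x line size).

    line_bytes = 64 is an assumption, not a figure from the text.
    """
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


def index_bits(sets: int) -> int:
    """Address bits needed to select a set (sets must be a power of two)."""
    if sets <= 0 or sets & (sets - 1):
        raise ValueError("sets must be a positive power of two")
    return sets.bit_length() - 1


KB = 1024
MB = 1024 * KB
l2_sets = cache_sets(256 * KB, ways=8)   # private L2 per core
l3_sets = cache_sets(8 * MB, ways=16)    # shared L3
```

Under the 64-byte line assumption, the 256 KB 8-way L2 has 512 sets (9 index bits) and the 8 MB 16-way L3 has 8192 sets (13 index bits).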

Development pace

Intel has followed Moore's Law for a long time and has been a pioneer in leading industry innovation. Continuous innovation in processor architecture at this remarkable speed not only improves processor performance but also provides new features and capabilities, and ultimately meets the growing needs of users. We care a great deal about sustaining this development. The industry needs platforms delivered at a faster and more predictable pace of innovation, platforms characterized by a faster, better-connected, trustworthy, personalized, and excellent computing experience. With industry-leading silicon expertise and architecture design capabilities that will provide strong growth drivers for the next decade and beyond, Intel has adopted a coordinated and steadily accelerating pace of architecture innovation.

What does it mean?

The pace of development refers to Intel’s strategy to introduce a new micro-architecture and a new generation of silicon process technology approximately every two years.

Intel's continuous innovation in silicon process technology has allowed transistor density to double approximately every two years, which gives processor designers great flexibility to design better products. In the past, this design flexibility has been used to provide better performance and features while reducing power consumption. Looking ahead, the ever-increasing needs of users will require faster performance improvements and the integration of various capabilities across blurring usage boundaries. This calls for a solution architecture that can scale across a wide range of usage areas, a goal that can only be achieved through industry-wide innovation. Intel's architecture and silicon development pace model provides a powerful drive for innovation: it not only promotes the development of new processor architectures and chipsets at a fast and coordinated pace, but also acts as a "catalyst" for platform-level industry innovation, delivering the advantages of energy-efficient performance.

Features

The pace of development is built on what Intel calls the "tick-tock" model of silicon and microarchitecture. This model will provide a general-purpose processor architecture spanning all high-volume markets. Each "tick" represents a shrink of the silicon process (one beat of the cadence), and each "tick" has a corresponding "tock", which represents the design of a new microarchitecture. Each is updated approximately every two years. Intel's global design methodology and extensive discipline are the cornerstones of its development pace, which supports innovation in processors and platforms beyond the capabilities of any individual product.

A good example is the huge leap Intel made in its processors by extending a notebook architecture to deliver outstanding server performance. The Intel Core(TM) 2 Duo processor is based on the Intel Core(TM) microarchitecture. Two complete execution cores are built into one physical processor and run at the same frequency, providing industry-leading energy-efficient performance for notebook and desktop computers. The computing element of the Intel Core(TM) microarchitecture is described as an integrated core that supports the optimization of architecture and technology to meet the demand for breakthrough performance and energy efficiency. Intel will continue to provide servers, desktops, and mobile products with a general-purpose scalable architecture based on multi-core processor technology. The eventual result of this continuous innovation is an architecture optimized for performance per watt and scalability, which will drive a series of "tick-tock" innovations in chipsets, interconnects, memory, and platforms on a common silicon foundation.

The core of fulfilling the promise of development pace is the ability of multiple design teams around the world to work together. This requires effective coordination between the teams to achieve mutual complementation of each other's methods and plans with the least overlap and redundancy. Intel also supports the software community and some universities to develop multi-threaded applications, and is committed to urging industry value chain vendors to fully appreciate the advantages brought by the pace of innovation. This includes the promotion of standards activities, the fit of the industry and the regulatory environment, and real efforts to meet the needs of users.

As early as the early 1990s, Intel won the industry's leading position with its 32-bit Intel architecture (IA-32), establishing an industry standard and taking the lead with the Intel Pentium processor. In 1993, the introduction of the Intel Pentium processor marked the arrival of the fifth generation of desktop processors. A series of innovations followed: the Intel Pentium Pro processor in 1995, the Intel Pentium II processor in 1997, the Intel Pentium III processor in 1999, and the Intel Pentium 4 processor, based on the Intel NetBurst architecture, in 2000. In the same year, Intel also introduced the Intel Xeon processor.

In 2003, the launch of the first Intel Pentium M processor, based on 90-nanometer process technology, marked a transition to energy-efficient performance, with performance per watt as the yardstick and with room to scale. These processors were introduced at the pace of silicon innovation, not necessarily in step with design processes and methods.

In 2006, Intel launched the new Intel Core(TM) microarchitecture, which laid a solid foundation for Intel-based multi-core processors for desktops, notebooks, and mainstream servers. Built on 65-nanometer process technology, it is the first microarchitecture on the development path to tie architecture design to silicon innovation as part of the development pace. Intel's architecture and silicon development pace differs from that of other manufacturers in the industry in several respects, including:

A micro-architecture optimized for performance, capability, and energy efficiency for all large-scale markets;

Pervasive design reuse principles, without waiting for silicon density to become available;

Joint design teams that work toward common goals and design targets;

Focus on the use of parallel chipsets and the development of industry value chain manufacturers to achieve platform-level innovation and rapidly enhance platform capabilities.

Therefore, Intel’s development pace model further consolidates the credible foundation of other chip innovations in order to provide industry-leading energy-efficient performance architecture at a pace that will drive industry-wide innovation and growth.

Intel’s "tick-tock" development strategy

The "tick-tock" development pace for multi-core processors has always been built on the integrated core, the basic computing element, which provides the target performance and capabilities along with outstanding energy efficiency.

Therefore, "Tick-tock" needs to synchronize the design process to achieve the following innovations, which are in line with user values across various market sectors:

Lower power consumption; multi-threaded performance; features and capabilities; enhanced modularity and flexibility.

The key to execution is to bring this pace of innovation to industry innovators in order to bring real benefits to users. Therefore, Intel has adjusted the pace of development related to the overall industry leadership:

Update the silicon process technology ("ticks") every two years, and at the same time, update the architecture ("tocks") every two years;

Multiple experienced design teams are committed to keeping design goals and important events in sync, and determine process touch points to maximize efficiency;

Frequent "training" of features also provides the ability to adapt to change.
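The cadence in the list above, with each of "tick" and "tock" recurring every two years, can be sketched as an alternating yearly schedule. The one-year offset between the two and the function name are assumptions made for illustration; the starting point (a "tock" in 2006, the Core microarchitecture) comes from the text.

```python
from typing import List, Tuple


def tick_tock_schedule(start_year: int, years: int,
                       first: str = "tock") -> List[Tuple[int, str]]:
    """Alternate "tick" (process shrink) and "tock" (new microarchitecture)
    year by year, so that each kind recurs every two years."""
    if first not in ("tick", "tock"):
        raise ValueError("first must be 'tick' or 'tock'")
    other = "tick" if first == "tock" else "tock"
    return [(start_year + k, first if k % 2 == 0 else other)
            for k in range(years)]


schedule = tick_tock_schedule(2006, 4)
```

With these assumptions the schedule places "tocks" in 2006 and 2008, which is consistent with the Core and Nehalem launch dates given earlier in this entry.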

Summary

By continuously following Moore's Law, Intel has doubled the number of transistors on a silicon die almost every two years. Doubling the number of transistors provides great design flexibility, which in turn can improve performance, scalability, and energy efficiency. In the latest generation of Intel Core(TM) microarchitecture products, the design flexibility gained by increasing the number of cores brings huge performance improvements, brings excellent features and capabilities to new and improved applications, and significantly reduces power consumption. Intel is focusing on a credible and accelerating pace of innovation, and it will deliver a new microarchitecture and a silicon process technology improvement approximately every two years.

Intel's "tick-tock" architecture and chip development pace model have brought huge advantages to the industry and users, and new capabilities and solutions will meet their growing needs. At Intel, our mission is to deliver architectural innovation at the innovation speed of Moore's Law.