Parallel processing computer system

Parallel processing computer

The term parallel processing computer mainly refers to two types of computer: ① a computer with a single central processor that can execute multiple instructions or process multiple data items at the same time; ② a multiprocessor system.

Structural characteristics of parallel processing computers

With the development of electronic devices, the processing power of computers has improved significantly. However, the speed gained from device progress alone falls far short of the high-speed computing needs of modern science, technology, engineering and many other fields. Designers therefore have to improve computer structure and adopt various parallel processing techniques in order to greatly increase processing speed and problem-solving capability.

The structural characteristics of parallel processing computers show up mainly in two ways: ① a variety of parallel techniques are widely used within a single processor; ② the single processor has developed into multiprocessor systems with different degrees of coupling. The main purpose of parallel processing is to increase the processing capacity of the system. Some types of parallel processing computer system (such as multiprocessor systems) can also improve system reliability. Thanks to advances in devices, parallel processing computer systems offer a good performance-to-price ratio, and this ratio tends to improve further.

Principle of structure

Parallel processing computers are mainly organized in one of five ways: pipelined, multiple functional units, array, multiprocessor and data flow.

Pipeline processor

The execution of an instruction is broken down into several stages, each of which performs part of the processing. An instruction passes through all the stages in sequence, and its result is available once it leaves the last stage. As soon as an instruction has finished in one stage and moved on to the next, the following instruction can enter the stage it has just left. Several instructions are therefore in progress on the pipeline at the same time. If each stage takes one clock cycle, then under normal circumstances one result is output per cycle, that is, one instruction completes per cycle. This speeds up the processor.
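The gain from overlapping instructions can be estimated with a little arithmetic. The sketch below assumes an ideal pipeline (every stage takes exactly one clock cycle and nothing ever stalls); the function names are illustrative and do not come from any particular machine.

```python
def sequential_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when each instruction finishes before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed on an ideal pipeline with one-cycle stages and no stalls."""
    if num_instructions < 1 or num_stages < 1:
        raise ValueError("need at least one instruction and one stage")
    # The first instruction takes num_stages cycles to pass through the pipeline;
    # every later instruction completes one cycle after its predecessor.
    return num_stages + (num_instructions - 1)


def speedup(num_instructions: int, num_stages: int) -> float:
    """Ratio of sequential time to pipelined time."""
    return sequential_cycles(num_instructions, num_stages) / pipelined_cycles(
        num_instructions, num_stages
    )
```

For a long instruction stream the speedup approaches the number of stages, which is why the one-result-per-cycle behaviour matters more than the latency of any single instruction.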

Dependences between adjacent instructions in a program reduce the efficiency of a pipeline processor. For example, the successor of a conditional branch instruction sometimes cannot be determined until the preceding instruction has finished executing; and if an instruction needs the result of the previous instruction as an operand, the pipeline is interrupted and efficiency falls.
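The operand case can be illustrated by scanning a program for instructions that read a register written by the instruction immediately before them. This is a minimal sketch under stated assumptions: instructions are hypothetical `(destination, source1, source2)` tuples, only adjacent pairs are checked, and the fixed `penalty` of two cycles is an arbitrary illustrative value, not a property of any real pipeline.

```python
def count_dependence_stalls(program, penalty=2):
    """Count stall cycles caused by an instruction reading its predecessor's result.

    program: list of (dest, src1, src2) register-name tuples.
    penalty: assumed number of stall cycles per dependent pair.
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        previous_dest = previous[0]
        current_sources = current[1:]
        if previous_dest in current_sources:
            stalls += penalty
    return stalls


example_program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r4", "r1", "r5"),  # reads r1 produced by the previous instruction
    ("r6", "r7", "r8"),  # independent of its predecessor
    ("r9", "r6", "r4"),  # reads r6 produced by the previous instruction
]
```

Reordering independent instructions between a producer and its consumer is one way such stalls are commonly reduced.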

Multiple functional units

A processor has multiple functional units. Each functional unit can process data in parallel with the others, so the processor can use different functional units to execute several instructions at once and so increase processing speed. For example, some computers have separate functional units for floating-point addition, fixed-point addition, floating-point multiplication, floating-point division, logical operations, shifting and so on. Some pipelined vector machines also contain multiple functional units. During program execution, demand for the various units is unbalanced, so it is impossible to keep all of them busy. Dependences between instructions also reduce the efficiency of the machine: for example, the functional unit an instruction needs may still be executing another instruction, or the operands it needs may be the results of instructions that have not yet finished.
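Both kinds of delay, a busy unit and a missing operand, can be seen in a small issue model. The sketch below is a simplification under explicit assumptions: instructions issue in program order, at most one per cycle; each functional unit handles one instruction at a time; and the unit names and latencies are made up for illustration.

```python
def issue_cycles(program, latency):
    """Return the cycle in which each instruction starts executing.

    program: list of (unit, dest, sources) tuples, in program order.
    latency: dict mapping unit name -> cycles the unit is occupied.
    """
    unit_free_at = {}   # unit name -> first cycle in which the unit is free
    ready_at = {}       # register -> cycle in which its value becomes available
    next_issue = 0      # earliest cycle the next instruction may issue
    starts = []
    for unit, dest, sources in program:
        operands_ready = max((ready_at.get(s, 0) for s in sources), default=0)
        start = max(next_issue, unit_free_at.get(unit, 0), operands_ready)
        finish = start + latency[unit]
        unit_free_at[unit] = finish
        ready_at[dest] = finish
        starts.append(start)
        next_issue = start + 1
    return starts


example_latency = {"fadd": 2, "fmul": 4}   # assumed latencies
example_program = [
    ("fmul", "r1", ("r2", "r3")),
    ("fadd", "r4", ("r5", "r6")),    # different unit: overlaps with the multiply
    ("fmul", "r7", ("r8", "r9")),    # waits for the busy multiplier
    ("fadd", "r10", ("r7", "r4")),   # waits for r7 to be produced
]
```

With these assumed latencies the four instructions start in cycles 0, 1, 4 and 8: the second overlaps the first, while the third and fourth show the two kinds of waiting described above.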

Array processor

An array processor consists of multiple identical processing elements and a single controller. The controller interprets instructions and sends operation commands to all processing elements. Under the controller's command, every processing element performs exactly the same operation at the same time. Array processors can be divided into floating-point array processors and bit-slice array processors.
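The division of labour between the controller and the processing elements can be sketched in a few lines. The class and method names here are hypothetical, and the loop only imitates what the hardware does for all elements in the same instant.

```python
import operator


class ProcessingElement:
    """One element of the array, holding its own local data item."""

    def __init__(self, value):
        self.local = value

    def execute(self, operation, operand):
        self.local = operation(self.local, operand)


class ControlUnit:
    """Interprets an instruction once and sends the same command to every element."""

    def __init__(self, elements):
        self.elements = elements

    def broadcast(self, operation, operand):
        # Conceptually simultaneous: every element performs the same operation.
        for element in self.elements:
            element.execute(operation, operand)


elements = [ProcessingElement(v) for v in range(4)]
controller = ControlUnit(elements)
controller.broadcast(operator.add, 10)
controller.broadcast(operator.mul, 2)
```

Each element ends up with a different result only because it started with different local data; the instruction stream is the same for all of them.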

The ILLIAC-IV is a floating-point array processor comprising 64 identical processing units (PU) and a common control unit (CU). Each processing unit contains a processing element (PE) capable of 64-bit floating-point operations and a memory (PM) with a capacity of 2K words. The 64 processing units are arranged in an 8×8 array, and each has a direct data path to its neighboring processing units.
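The neighbor connections can be expressed as index arithmetic. The sketch below assumes the commonly described wraparound scheme, in which unit i is linked to units i−1, i+1, i−8 and i+8, all taken modulo 64; the text above only states that neighbors are directly connected, so the wraparound detail is an assumption of this example.

```python
def neighbors(index, total=64, row_length=8):
    """Indices of the units directly connected to unit `index`.

    Assumes links to index-1, index+1, index-row_length and index+row_length,
    all modulo `total`.
    """
    if not 0 <= index < total:
        raise ValueError("index out of range")
    return sorted({
        (index - 1) % total,
        (index + 1) % total,
        (index - row_length) % total,
        (index + row_length) % total,
    })
```

Under this assumption every unit has exactly four neighbors, including those on the edges of the 8×8 arrangement.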

Multiprocessor system

A multiprocessor system can improve both the performance and the reliability of a system. It is a multiple instruction stream, multiple data stream processor. According to the degree of coupling between the processors, multiprocessor systems fall into two categories. ① Indirectly (loosely) coupled multiprocessor system: each processor has its own main memory and is managed by its own operating system, and the processors communicate through a shared input/output system. ② Directly (tightly) coupled multiprocessor system: the processors share main memory and are managed by a single operating system. The term multiprocessor system generally refers to this directly coupled category.

Interconnection network

In a directly coupled multiprocessor system, the interconnection network between processors and memory, and between the processors themselves, is very important. Interconnection networks take three main forms.

① Bus structure: the bus is the simplest network structure in a multiprocessor system. The interconnection networks of practical multiprocessor systems are often developed from the bus structure (Figure 3).

② Crossbar switch structure: a crossbar switch is an array of crosspoint switches that connects the processors on its rows to the memory modules on its columns (Figure 4).

③ Multiport memory structure: moving the switches at each crosspoint of the crossbar structure into the interface of the corresponding memory module yields a multiport memory structure.
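The practical difference between a single bus and a crossbar shows up when several processors access memory at once. The following sketch compares the two under simplifying assumptions: each processor has at most one outstanding request, a bus carries one transfer per cycle, and a memory module serves one request per cycle. The function names are illustrative.

```python
from collections import Counter


def bus_cycles(requests):
    """A single shared bus carries one transfer per cycle."""
    return len(requests)


def crossbar_cycles(requests):
    """A crossbar serves requests to different memory modules in parallel.

    Requests aimed at the same module still queue up, one per cycle.
    requests: list of (processor, memory_module) pairs, one per processor.
    """
    if not requests:
        return 0
    per_module = Counter(module for _processor, module in requests)
    return max(per_module.values())


example_requests = [(0, 0), (1, 1), (2, 2), (3, 0)]  # processors 0 and 3 collide
```

For the example requests the bus needs four cycles while the crossbar needs two, limited only by the conflict on module 0.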

Data flow processor

The data flow processor is a highly parallel processor that has attracted considerable attention. Although it retains the stored-program approach, its operating principle differs from the von Neumann computer structure. It does not execute a program in the instruction order indicated by a program counter. An instruction can execute as soon as all of its operands are available; in other words, program execution is driven by data flow rather than control flow.

The data flow processor is a language-based processor. It uses the data flow program graph as the interface between the user language and the computer structure. The graph is represented by activity templates. Each activity template has several fields, which hold the operation code, the operands and the destination addresses. A data flow program is stored in the activity store as a collection of activity templates. When an instruction is ready to execute, the address of its activity template is sent to the instruction queue. The fetch unit reads the activity template from the store at that address and forms an operation packet, which is sent to an operation unit for processing, producing a result packet. The update unit delivers the result data as an operand to the activity template named by the destination address of the result packet, and sends the address of any instruction that now has all its operands to the instruction queue. Every instruction in the instruction queue is ready to execute, so high-speed parallel execution only requires adding more units or deepening their pipelining. In addition, multiple instruction processing units can be connected into a data flow multiprocessor system to increase processing capability further.
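The cycle of queue, fetch, operate and update described above can be imitated by a short interpreter. This is a sketch, not a model of any real machine: the dictionary layout, the `run_dataflow` name and the three supported operations are assumptions, and each operand slot is assumed to receive exactly one value.

```python
from collections import deque
import operator

OPERATIONS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_dataflow(templates, inputs):
    """Execute a data flow program and return the result of every instruction.

    templates: id -> {"op": name, "operands": [None, ...], "targets": [(id, slot), ...]}
    inputs: list of (value, (id, slot)) giving the initial operand values.
    """
    store = {
        tid: {"op": t["op"], "operands": list(t["operands"]), "targets": list(t["targets"])}
        for tid, t in templates.items()
    }
    queue = deque()   # instruction queue: ids of instructions ready to run
    results = {}

    def update(value, target):
        tid, slot = target
        store[tid]["operands"][slot] = value
        if all(v is not None for v in store[tid]["operands"]):
            queue.append(tid)            # all operands present: ready to execute

    for value, target in inputs:
        update(value, target)

    while queue:
        tid = queue.popleft()
        entry = store[tid]                                   # fetch
        value = OPERATIONS[entry["op"]](*entry["operands"])  # operate
        results[tid] = value
        for target in entry["targets"]:                      # update
            update(value, target)
    return results


# (a + b) * (a - b) with a = 5 and b = 3
example_templates = {
    "sum": {"op": "add", "operands": [None, None], "targets": [("prod", 0)]},
    "diff": {"op": "sub", "operands": [None, None], "targets": [("prod", 1)]},
    "prod": {"op": "mul", "operands": [None, None], "targets": []},
}
example_inputs = [(5, ("sum", 0)), (3, ("sum", 1)), (5, ("diff", 0)), (3, ("diff", 1))]
```

Note that no program counter appears anywhere: "sum" and "diff" become ready independently and could run on separate units, while "prod" fires only once both results have arrived.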

Parallel algorithms and parallel languages

One key to improving the efficiency of parallel processing is the parallel algorithm. The algorithm must suit the structure of the computer. If the parallelism expressed by an algorithm closely matches the parallelism of the computer, the computer solves the problem more efficiently.

In vector computers, the main way to increase parallelism is to express operands that can be processed in parallel as vectors. Many common numerical methods, such as sequence summation, matrix multiplication, Gaussian elimination and the fast Fourier transform, have been implemented successfully in parallel on vector computers. The more popular parallel languages are essentially extensions of FORTRAN.
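Sequence summation is a convenient example of recasting a scalar loop in vector form. The sketch below uses pairwise halving (one common way to do it, chosen here for illustration): each step adds the two halves of the vector element by element, so n values are summed in about log2(n) vector operations instead of n−1 scalar additions. A list comprehension stands in for a single vector instruction.

```python
def vector_sum(values):
    """Sum a sequence by repeated halving; return (total, number_of_vector_steps)."""
    data = list(values)
    steps = 0
    while len(data) > 1:
        if len(data) % 2:
            data.append(0)               # pad to an even length
        half = len(data) // 2
        # One "vector add": all element pairs could be added at the same time.
        data = [data[i] + data[i + half] for i in range(half)]
        steps += 1
    total = data[0] if data else 0
    return total, steps
```

The order of additions differs from a left-to-right loop, which is harmless for integers but can change rounding slightly for floating-point data.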

In a multiprocessor system, the key to increasing program parallelism is to decompose a task into enough processes that can run at the same time. The programming language needs to be extended with statements that express process concurrency explicitly, so that the corresponding control mechanism can control and manage the program at run time, including the creation, communication and scheduling of parallel tasks. The Ada language provides the statements needed to describe the structure of parallel programs for multiprocessors. Several data flow languages developed for data flow computers, such as Id and VAL, have been used experimentally. Their important feature is that they treat arrays as values rather than objects. Programs written in a data flow language naturally express the maximum parallelism of their operations.
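Task decomposition can be illustrated with Python's standard `concurrent.futures` module, used here purely as a stand-in for the language extensions discussed above. The sketch splits a sum of squares into chunks and hands them to a pool of workers; the function names and the chunking strategy are illustrative choices. Threads are used to keep the example self-contained; in standard CPython they do not run CPU-bound code truly in parallel, and `ProcessPoolExecutor` offers the same interface with separate processes.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_of_squares(chunk):
    """The independent piece of work carried out by one worker."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, workers=4):
    """Split the input into chunks, process them concurrently, combine the results."""
    values = list(values)
    chunk_size = max(1, -(-len(values) // workers))   # ceiling division
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map creates the tasks, schedules them and collects the results.
        return sum(pool.map(partial_sum_of_squares, chunks))
```

The pool takes care of creating and scheduling the tasks, and the only communication is the returned partial sums, which keeps the processes independent of one another.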

Parallel processing system classification

The most widely used classification of parallel processing systems divides computer systems into four categories according to the multiplicity of their instruction streams and data streams:

(1) Single instruction, single data stream computer system

In a single instruction, single data (SISD) computer system, a single processor executes a single instruction stream and operates on data stored in a single memory. The typical SISD computer is a uniprocessor system, in which there is no parallel processing.

(2) Single instruction, multiple data stream computer system

In a single instruction, multiple data (SIMD) computer system, a single instruction synchronously controls multiple processing elements, each of which has its own associated data memory, so one instruction performs the same operation on different sets of data.

(3) Multiple instruction, single data stream computer system

In a multiple instruction, single data (MISD) computer system, a single sequence of data is operated on by multiple processors, each of which executes a different instruction sequence. No complete MISD computer exists in practice.

(4) Multiple instruction, multiple data stream computer system

In a multiple instruction, multiple data (MIMD) computer system, multiple processors execute different instruction sequences in parallel on different data. MIMD computers are in effect multiprocessor parallel systems. In the MIMD organization every processor is general-purpose: each can process all the data and execute the instructions needed for the corresponding data operations. MIMD systems can be further divided according to how the processors communicate. If the processors share a common memory, each processor accesses programs and data in the shared memory, and the processors communicate with one another through that memory. The most common form of such a system is the symmetric multiprocessor (SMP). In an SMP, multiple processors share a single memory or pool of memory through a shared bus or some other interconnection. Memory access performance is essentially the same for every processor, that is, the access time to shared memory is about the same for each. A group of independent processors or SMPs can be interconnected by communication links, fixed paths between computers, or network equipment to form a cluster. In a non-uniform memory access (NUMA) system, read and write speeds differ across memory, so the access time depends on which memory location a processor accesses.
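The four categories follow mechanically from two counts, as the small helper below shows. The function name is hypothetical; it simply encodes the rule that one stream maps to "S" and more than one to "M".

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Name the category for a given number of instruction and data streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a computer has at least one stream of each kind")
    instruction_part = "S" if instruction_streams == 1 else "M"
    data_part = "S" if data_streams == 1 else "M"
    return f"{instruction_part}I{data_part}D"
```

For example, a uniprocessor maps to SISD, a 64-element array processor driven by one controller to SIMD, and a four-processor SMP working on separate data to MIMD.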

Parallel computer system and its storage design

As computer technology has developed and hardware costs have fallen, designers have increasingly looked for ways to exploit parallelism. Multiprocessor parallel systems in particular can greatly improve the performance of a computer system. In a parallel processing system, the processing units execute in parallel and share the incoming workload. At present, the most common multiprocessor organizations are the symmetric multiprocessor (SMP), the cluster and non-uniform memory access (NUMA). An SMP uses multiple identical or similar processors, interconnected by a bus or a switch array, to form a multiprocessor computer system; it is the multiprocessor computer in the usual sense. The most important problem in an SMP is memory consistency. Each processor has its own private storage, namely a high-speed cache. When the same line of data is present in the caches of different processors and one copy is modified, the copies in the other caches and in main memory must be updated as well. This problem is handled by cache coherence protocols. A cluster is a group of complete computers interconnected to work together as a unified computing resource; seen from outside, the whole cluster can be regarded as a single machine. A NUMA system lets multiple processors work in parallel through shared memory. Its main feature is that the time a processor takes to access a memory word varies with the location of that word.
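The consistency problem and one way of handling it can be shown with a toy model. The sketch below assumes a write-through, write-invalidate policy: a write goes straight to main memory and removes any copies held in other caches, so the next read by another processor fetches the new value. This is one simple scheme chosen for illustration, not the protocol of any particular machine, and the class and method names are hypothetical.

```python
class SharedMemorySystem:
    """Toy SMP: one main memory plus a private cache per processor."""

    def __init__(self, num_processors):
        self.memory = {}                                   # address -> value
        self.caches = [dict() for _ in range(num_processors)]

    def read(self, cpu, address):
        cache = self.caches[cpu]
        if address not in cache:                           # miss: load from memory
            cache[address] = self.memory.get(address, 0)
        return cache[address]

    def write(self, cpu, address, value):
        for other, cache in enumerate(self.caches):
            if other != cpu:
                cache.pop(address, None)                   # invalidate stale copies
        self.caches[cpu][address] = value
        self.memory[address] = value                       # write through to memory


system = SharedMemorySystem(num_processors=2)
system.memory[0x10] = 1
first_read_cpu0 = system.read(0, 0x10)    # both processors now cache the line
first_read_cpu1 = system.read(1, 0x10)
system.write(0, 0x10, 7)                  # processor 1's copy is invalidated
second_read_cpu1 = system.read(1, 0x10)   # misses and reloads the new value
```

Without the invalidation loop, processor 1 would keep returning the old value from its own cache, which is exactly the inconsistency described above.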