Compilation principle

Basic concepts

Compilation principles is the science and technology of translating high-level programming languages into machine-executable form. Computer programs are written in programming languages, and in the early days these languages developed slowly: because both the data a computer stores and the programs it executes consist of 0 and 1 codes, early programmers had to know the machine's underlying instruction set well and build a program with a specific function by combining and arranging those instructions. This placed very high demands on the programmer, so people have long studied how to develop computer programs efficiently and lower the threshold of programming.


A C language compiler is a piece of modern software that runs with the help of the computer, and designing one is fairly specialized work: the designer must handle the intricate design process of computer programs while also considering the needs of computer users. Since the variety of computers keeps growing, a C language compiler must be designed for broad applicability. C itself is a structured language with strong processing power; it is widely used in computer system maintenance, and because of its high efficiency it is used on many different types of computers.

C language compiler front-end design

Compilation is generally carried out inside a computer system; it is the process of converting source code into a language the machine can execute, and the compiled output contains the entry point's address, name, and machine code. The compiler is a widely used tool in computer programming. When designing the compiler front end, the relevant influencing factors must be fully considered, along with lexical, syntax, and semantic analysis.

1 Lexical analysis

Lexical analysis is the first stage of compiler front-end design. At this stage, the compiler tokenizes the source program according to a set of lexical rules; each token represents one class of word, mainly identifiers, keywords, special symbols, and similar types. The lexical analyzer inside the compiler takes the source program as input and outputs the recognized tokens, converting raw character sequences into words the later stages can work with.
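The classification described above can be sketched in a few lines of C. This is a minimal illustration, not a real lexer: the token categories, the keyword table, and the function name `classify` are all assumptions made for the example.

```c
#include <ctype.h>
#include <string.h>

/* Illustrative token categories; a real compiler distinguishes many more. */
enum TokenKind { TK_KEYWORD, TK_IDENT, TK_NUMBER, TK_SYMBOL };

/* A tiny keyword table, enough for the sketch. */
static const char *keywords[] = { "int", "return", "if", "while" };

/* Classify one lexeme into one of the categories above. */
enum TokenKind classify(const char *lexeme) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return TK_KEYWORD;                 /* exact keyword match */
    if (isdigit((unsigned char)lexeme[0]))
        return TK_NUMBER;                      /* starts with a digit */
    if (isalpha((unsigned char)lexeme[0]) || lexeme[0] == '_')
        return TK_IDENT;                       /* identifier-like */
    return TK_SYMBOL;                          /* operators, delimiters, ... */
}
```

For instance, `classify("int")` yields `TK_KEYWORD` while `classify("m")` yields `TK_IDENT`, mirroring the distinction the paragraph draws between keywords and identifiers.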

2 Syntax analysis

Syntax analysis uses a set of grammar rules to recognize structures in the token stream, including statements, phrases, and so on; during recognition, a special structure, the syntax tree, can be formed. Syntax analysis strongly influences the compiler's functionality, so the accuracy of this recognition must be guaranteed during design.

3 Semantic analysis

Semantic analysis also relies on the grammar rules. When checking the static semantics of grammatical units, the accuracy of those rules must be ensured, and any transformation of lexical or syntactic forms must preserve the legality of the grammatical structure. If the grammatical structure is set up unreasonably, checking will raise compilation errors. Front-end design therefore demands high accuracy: careful proofreading by the designers directly affects the correctness of compilation, and errors in the front end will degrade the results of C language compilation.

The overall design of the C language compiler

The C language compiler is an advanced tool that converts tedious, complex programs into machine language, simplifying the complex. When partitioning the compiler's work, the designer needs to understand how the grammar is composed, flexibly apply knowledge of the language's syntax, and also apply C language code conversion methods. The overall design of a C language compiler should start from the following aspects.

1 Lexical analysis

C language programs have a certain complexity, and lexical analysis is the first stage of compilation. Because this stage mainly scans the text, the component that performs it is also called the scanner. Its task is to read the characters of the source program in sequence, recognize the words among them, convert each word into an internal code, and tag it for syntax analysis. During compilation, word symbols are generally converted into a binary form, and the word categories mainly include identifiers, delimiters, keywords, and so on. While the computer is running, the scanner completes the lexical scan automatically: it deletes comments, recognizes the words in the source program, and converts them to internal codes. In terms of working method, scanning and parsing interact through an interface between the two phases; in terms of function, the lexical analysis of the compiler mainly converts the character-level source program into a string of words.

2 Semantic analysis

After the compiled program is converted into an internal representation, we call that representation an intermediate language or intermediate code; it has clear meaning and simple structure and constitutes a notation system of its own. Some compilers have essentially no intermediate code, but to keep the compiler independent of the machine and to make target-code optimization easier, many compilers do define an intermediate language sitting between the machine language and the source programming language. Such designs can be relatively complex, but the C language compiler has improved this situation to a large extent: its semantic analysis and syntax analysis are relatively mature, with fairly complete theory and algorithms. The remaining problem is that there is no generally accepted formal system for semantics, and intermediate code is still only close to a formal method.

3 Syntax analysis

Syntax analysis takes the source program, in the form of a word string, as its input and object of analysis. Its fundamental task is to use the grammar rules of the language as the standard, analyze the grammatical structure of the source program, and recognize its grammatical components, such as functions, subscripted variables, the various program statements, and the various expressions. Once the syntax check passes, the intermediate code can be processed in stages. One thing to note is that reduction is carried out as needed, and grammatical errors must be handled: the erroneous input symbols can be discarded, the analysis pattern adjusted, and shift-reduce analysis resumed, continuing to look for a suitable strategy so as to form an effective syntax analysis method.
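The shift-reduce analysis mentioned above can be illustrated with a deliberately tiny recognizer. This sketch assumes a toy grammar E -> E + E | i (single-character tokens, ambiguity is harmless for mere recognition); the function name `parse` and the stack representation are inventions of the example, not part of any real parser generator.

```c
/* Shift-reduce recognition for the toy grammar  E -> E + E | i .
   Tokens are single characters ('i', '+'); 'E' on the stack marks a
   reduced nonterminal. Returns 1 if the input reduces to a single E. */
int parse(const char *input) {
    char st[64];
    int top = 0;
    for (const char *p = input; *p; p++) {
        st[top++] = *p;                     /* shift one token onto the stack */
        int reduced = 1;
        while (reduced) {                   /* reduce while a handle is on top */
            reduced = 0;
            if (top >= 1 && st[top - 1] == 'i') {
                st[top - 1] = 'E';          /* apply  E -> i  */
                reduced = 1;
            } else if (top >= 3 && st[top - 3] == 'E' &&
                       st[top - 2] == '+' && st[top - 1] == 'E') {
                top -= 2;                   /* apply  E -> E + E  */
                st[top - 1] = 'E';
                reduced = 1;
            }
        }
    }
    return top == 1 && st[0] == 'E';        /* accepted iff one E remains */
}
```

On input "i+i" the stack evolves E, E+, E+E, E: the recognizer shifts until a production's right side appears on top, then reduces, exactly the shift-reduce rhythm the paragraph describes.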

Compilation Principle Course

As an important professional course for computer majors, "Compilation Principles" is the foundation for later in-depth study of the field. The course combines knowledge from several disciplines, including discrete mathematics, data structures, operating systems, and principles of computer organization, making it both comprehensive and theoretical. Because of these characteristics, some problems remain in its experimental teaching: designing and building a complete compiler in the lab portion of the course takes a long time, involves many modules with complicated connections between them, and does not readily show an immediate overall effect.

Problems in the teaching of traditional compilation principles course

The first language course for undergraduates majoring in computer science is usually C. Because C is not type-safe, it is prone to elusive errors that students find difficult to locate and fix. If students can accurately determine the type and location of an error from the compiler's diagnostic messages, applying what they learned in compilation principles to actual C programming, this not only completes the course's teaching content but also improves their software programming and system analysis abilities.

In the actual teaching of compilation principles at ordinary colleges and universities, the curriculum is generally limited to the front end of the compiler: lexical analysis, syntax analysis, and syntax-directed translation. This covers a large number of abstract, logically complex theoretical topics, such as formal language theory, formal grammars, finite automata, context-free grammars, attribute grammars, and syntax-directed translation. Traditional teaching emphasizes instilling these knowledge points one by one, having students solve single isolated problems while neglecting the connections between topics. This "seeing only the trees, not the forest" style greatly weakens students' enthusiasm for learning and leads to poor overall results.

Practice-oriented teaching of compilation principles

(1) The connection between theory and practice

Theoretical knowledge generally arises from some problem background, and teaching theory in isolation from practical problems does little to improve students' practical ability. The large body of theory in the compilation principles course forms a cohesive, progressive whole: each topic is introduced and extended by retracing how a real problem was solved, so the whole teaching process should reproduce the evolution of those solutions. The key to achieving this is that the teacher changes from "standing at the podium" to "sitting among the students". This change does not simply mean leaving all questions to the students and switching from "teaching" to "answering questions"; it raises requirements for teachers in problem design, prompting and inspiring thought, guiding discussion, and managing the process. In particular, modern high-level languages develop rapidly and new questions emerge endlessly. When facing open, unknown problems, showing students how to approach a solution from a systematic, overall perspective, rather than handing out a specific answer to each question, is a new test of a teacher's ability.

(2) Compiler design and implementation plan

Ideally, students of compilation principles should be able to build a small compiler on their own. In actual teaching, students only need to understand the key principles, such as the determinization of an NFA, the construction of FIRST and FOLLOW sets for an LL(1) grammar, and the construction of the DFA that recognizes viable prefixes for an LR(1) grammar, to meet the course's examination requirements. However, theoretical learning alone is far from enough to implement even a basic compiler; compared with how well students absorb the theory, their inability to build a compilation system on their own is even more apparent. Working out a practical program of exercises applicable to all students is the key to making the course genuinely effective.
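As a bridge between the theory and such exercises, the FIRST-set construction mentioned above can be coded compactly. This is a sketch under strong simplifying assumptions: the toy grammar has no epsilon productions (real LL(1) construction must handle them), productions are encoded as strings "X=rhs" with uppercase nonterminals and lowercase terminals, and the names `compute_first` and `in_first` are made up for the example.

```c
#include <ctype.h>
#include <string.h>

#define MAX_NT 26

/* first[X - 'A'][c] != 0  <=>  terminal c is in FIRST(X). */
static int first[MAX_NT][128];

/* Iterate to a fixpoint over productions of the form "X=rhs"
   (epsilon-free, so only the first rhs symbol matters). */
void compute_first(const char **prods, int n) {
    memset(first, 0, sizeof first);
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i < n; i++) {
            int lhs = prods[i][0] - 'A';
            char head = prods[i][2];            /* first symbol of the rhs */
            if (islower((unsigned char)head)) { /* terminal: add directly */
                if (!first[lhs][(int)head]) { first[lhs][(int)head] = 1; changed = 1; }
            } else {                            /* nonterminal: copy its FIRST set */
                int rhs = head - 'A';
                for (int c = 0; c < 128; c++)
                    if (first[rhs][c] && !first[lhs][c]) { first[lhs][c] = 1; changed = 1; }
            }
        }
    }
}

int in_first(char nt, char t) { return first[nt - 'A'][(int)t]; }
```

For the grammar S -> A b, A -> a A | c, the fixpoint yields FIRST(A) = {a, c} and, because S's production starts with A, FIRST(S) = {a, c} as well.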

The development of compilation technology

In the early days of the von Neumann computer (the 1940s), programs were written in machine language, which is literally stored 0/1 code, and writing them was extremely tedious. Later, assembly language replaced machine language by giving operation instructions and addresses symbolic forms. But assembly language still has many shortcomings: it is difficult to read and understand, and it depends on a specific machine, so a program must be rewritten to run on another computer. In the 1950s, John Backus of IBM led a research team that developed the FORTRAN high-level language and its compiler. Automatic generation tools for compiler components then began to take shape, and today many such tools are widely used, for example the lexical analyzer generator LEX and the parser generator YACC. In the 1960s, people began using self-compilation (bootstrapping) to construct compilers, that is, implementing a language's compiler in the language itself, though the basic principles and structure remained roughly the same. After continuous development, modern compilation technology has matured, and the rapid development of many high-level languages is inseparable from its progress.

The basic process of compilation

Compilation can be divided into five basic steps: lexical analysis; syntax analysis; semantic analysis and intermediate code generation; optimization; and target code generation. These are the basic steps every compiler must perform, taking a high-level source program as input and producing target language code as output.

1 Lexical analysis

The lexical analyzer scans the character string constituting the source program from left to right, reading character by character and recognizing each word symbol. Recognized symbols are generally output as pairs consisting of a code for the symbol's type and the symbol's value. A lexical analyzer usually exists as a function called by the parser, although it can also exist as an independent program. The program that performs lexical analysis is called the lexical analysis program, lexical analyzer, or scanner.

2 Syntax analysis

Syntax analysis is the second stage of compilation. Building on lexical analysis, its task is to combine the recognized sequence of word symbols into grammatical phrases such as "statements" and "expressions". The main step of the syntax analyzer is to determine whether the source program's sentences conform to the rules of the defined grammar, that is, whether their grammatical structure is correct. A set of grammar rules is simply called a grammar. Based on the restrictions imposed on productions, Chomsky divides grammars into type 0, type 1, type 2, and type 3: type 0 is the phrase-structure grammar, type 1 the context-sensitive grammar, type 2 the context-free grammar, and type 3 the regular grammar, with the restrictions increasing in that order.

3 Semantic analysis

Lexical analysis focuses on whether each word is legal and which part of the language the word belongs to. Syntax analysis, via the context-free grammar, focuses on whether the input sentence can be matched against the productions. Semantic analysis then determines whether the relationships between the grammatical units are legal; in practice, it performs context-sensitive checks and type checks on source programs that are structurally correct.

4 Intermediate code generation and optimization

After the syntax and semantic analysis phases, some compilers turn the source program into an internal representation called an intermediate language, intermediate representation, or intermediate code. This "intermediate code" is a notation with simple structure and clear meaning whose complexity lies between the source programming language and machine language, making it easy to translate into object code. In addition, machine-independent optimization can be performed at the intermediate code level.
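A common intermediate form is three-address code. The following sketch generates it from a binary expression tree; the `struct Node` layout, the `tN` temporary naming, and the functions `gen` and `operand` are illustrative assumptions, not the scheme of any particular compiler.

```c
#include <stdio.h>
#include <string.h>

/* A binary expression-tree node; op == 0 marks a leaf holding a variable. */
struct Node {
    char op;                 /* '+', '*', ... ; 0 for a leaf */
    const char *name;        /* variable name when the node is a leaf */
    struct Node *left, *right;
};

static int temp_count;       /* numbers the generated temporaries t1, t2, ... */

/* Writes the textual operand form ("a" or "t1") into out. */
static void operand(struct Node *n, int tmp, char *out) {
    if (n->op == 0) strcpy(out, n->name);
    else sprintf(out, "t%d", tmp);
}

/* Emits three-address instructions for the subtree, appending to code;
   returns the number of the temporary holding the subtree's value. */
int gen(struct Node *n, char *code) {
    if (n->op == 0) return 0;                 /* leaves need no instruction */
    int lt = gen(n->left, code);
    int rt = gen(n->right, code);
    char l[32], r[32], line[96];
    operand(n->left, lt, l);
    operand(n->right, rt, r);
    int t = ++temp_count;
    sprintf(line, "t%d = %s %c %s\n", t, l, n->op, r);
    strcat(code, line);
    return t;
}
```

For the tree of `a + b * c`, `gen` first emits `t1 = b * c` for the inner multiplication and then `t2 = a + t1`, which is exactly the "simple structure, clear meaning" quality the paragraph attributes to intermediate code.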

5 Generation of target code

From the optimized intermediate code, effective target code can be generated. Usually the compiler translates it into assembly code, which must then be assembled into the target machine's machine language by an assembler.

6 Error handling

At each stage of compilation, errors in the source code may be found; the syntax analysis stage in particular may uncover a large number of them. The compiler therefore needs error handling that reports information such as the type of each error and its location.
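Reporting an error's location requires tracking line and column while scanning. The sketch below does only that, for a single deliberately tiny class of error (an illegal character); the accepted character set, the `ErrPos` struct, and the function name `find_illegal` are assumptions of the example.

```c
#include <ctype.h>
#include <string.h>

/* 1-based position of the first illegal character; line == 0 means none. */
struct ErrPos { int line, col; };

/* Scans src, maintaining a line/column counter, and reports the first
   character outside a toy "legal" alphabet (letters, digits, whitespace,
   and a few operators/delimiters). */
struct ErrPos find_illegal(const char *src) {
    struct ErrPos p = {1, 1};
    for (const char *s = src; *s; s++) {
        if (*s == '\n') { p.line++; p.col = 1; continue; }
        if (!(isalnum((unsigned char)*s) || strchr(" \t=+*();{},", *s)))
            return p;                     /* first offending character */
        p.col++;
    }
    p.line = 0;                           /* no error found */
    return p;
}
```

Given the source `"int a;\na = b @ c;"`, the scan passes line 1 cleanly and stops at the `@` on line 2, column 7, the kind of "type and location" diagnostic the paragraph calls for.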

Overview of the compilation process

A C language source program and the in-memory runtime structure of the corresponding executable program are very different things, and the most important process realizing the conversion between them is compilation.

The source program is for people to read. It is essentially a text file, which can be opened and edited with a text editor such as vi on Linux or Notepad on Windows, but the computer cannot execute it directly. The source program must be compiled into an executable program by a special program, and that special program is the compiler. The compilation process divides mainly into lexical analysis, syntax analysis, intermediate code generation, and target code generation (ignoring preprocessing, semantic analysis, optimization, and so on). Below we briefly explain these main phases in turn.

Lexical analysis

The source code seen in human eyes is this:

int fun(int a, int b);
int m = 10;
int main() { int i = 4; int j = 5; m = fun(i, j); return 0; }
int fun(int a, int b) { int c = 0; c = a + b; return c; }

And its stored form in the computer is shown in Figure 1-36.

Figure 1-36 The hexadecimal form of the case program

To the computer there are no keywords, operators, or identifiers here; it is not even clear which characters belong to the same symbol. The first stage of compilation is lexical analysis, whose purpose is to identify the symbols one by one within this stream of characters and, as far as possible, determine each symbol's attributes.

When people look at a C source program, with the help of spaces and brackets they can see the identifiers and keywords at a glance. Compared with humans, today's computers are still quite unintelligent: a computer cannot take in many characters at once the way we do, and can only recognize characters one by one according to a very mechanical "electronic version" of the lexical rules. That "electronic" lexicon is embodied by building the ideas of the state transition diagram into the code of the lexical analyzer, which recognizes tokens from the source program's character string and saves them in order.
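One way to build a state transition diagram "into the code" is as a literal transition table. The table, states, and character classes below are a minimal sketch (it distinguishes only identifiers from integer literals); `delta`, `char_class`, and `run_dfa` are names invented for the example.

```c
#include <ctype.h>

/* The state-transition diagram as a table:
   state 0 = start, 1 = in identifier, 2 = in number, 3 = reject.
   Columns are character classes: 0 = letter/underscore, 1 = digit, 2 = other. */
static const int delta[4][3] = {
    /* start  */ {1, 2, 3},
    /* ident  */ {1, 1, 3},   /* letters and digits may follow a letter */
    /* number */ {3, 2, 3},   /* a digit followed by a letter is rejected */
    /* reject */ {3, 3, 3},   /* dead state: stay rejected */
};

static int char_class(char c) {
    if (isalpha((unsigned char)c) || c == '_') return 0;
    if (isdigit((unsigned char)c)) return 1;
    return 2;
}

/* Runs the DFA over the lexeme; returns 1 = identifier, 2 = number, 0 = neither. */
int run_dfa(const char *lexeme) {
    int state = 0;
    for (const char *s = lexeme; *s; s++)
        state = delta[state][char_class(*s)];
    return (state == 1) ? 1 : (state == 2) ? 2 : 0;
}
```

The table is the diagram: each row is a state, each entry an arrow, and the mechanical character-by-character loop is exactly the "one character at a time" behavior described above.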

The result of lexical analysis is shown in Figure 1-37.

Figure 1-37 Results of lexical analysis

In the lexical analysis stage, the meaning of some symbols can already be recognized, including keywords, numbers, strings, and delimiters, whose meaning can be determined without the aid of surrounding symbols. For example, "int" must represent an integer type; it cannot be a variable name or an operator.

Other symbols need the symbols before and after them to determine their exact meaning. For example, "m" can only be judged an identifier in the lexical stage; without analyzing the sentence, it is impossible to know whether this identifier denotes a variable or a function. More detailed information can only be obtained by analyzing the sentence pattern in which the symbol occurs, and that work is done by syntax analysis.

Syntax analysis

If the function of lexical analysis is to identify identifiers, keywords, numbers, and operators from a stream of characters and store them as a stream of tokens, then the function of syntax analysis is to identify, from that token stream, the sentences that conform to the C language grammar.

Because a computer cannot look at several identifiers, keywords, numbers, and operators at once the way a human can, it cannot see at a glance that this is a function declaration or that is an if statement; it can only identify them, rather clumsily, symbol by symbol. Like the lexical analyzer, the syntax analyzer works from a computer representation of the grammar, recognizing sentences that conform to the C grammar one symbol at a time. The computer representation of the grammar is the production. In the syntax analyzer, the C grammar expressed by the productions is mapped into a set of templates, and this set of templates is built into the analyzer's program. The syntax analyzer's job is to match the tokens recognized by the lexical analyzer against these templates one by one; when some template is matched, a complete sentence has been identified and its syntactic form determined.
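The template matching described above can be sketched very directly: a template is a sequence of expected tokens, with a wildcard for identifier slots. This hand-written matcher merely illustrates the idea; the `"ID"` wildcard convention, the tiny keyword check, and the function names are assumptions of the example, standing in for the production-driven tables of a real parser.

```c
#include <ctype.h>
#include <string.h>

/* Good enough for the sketch: identifier-shaped and not the one keyword
   we care about here. */
static int is_ident(const char *tok) {
    if (strcmp(tok, "int") == 0) return 0;      /* keywords are not identifiers */
    return isalpha((unsigned char)tok[0]) || tok[0] == '_';
}

/* Matches a token stream against a template token by token;
   "ID" in the template matches any identifier. */
int matches(const char **tokens, int n, const char **pattern, int m) {
    if (n != m) return 0;
    for (int i = 0; i < n; i++) {
        if (strcmp(pattern[i], "ID") == 0) {
            if (!is_ident(tokens[i])) return 0; /* wildcard slot */
        } else if (strcmp(pattern[i], tokens[i]) != 0) {
            return 0;                           /* literal token mismatch */
        }
    }
    return 1;
}
```

Against the template int ID ( int ID , int ID ) ; the token stream of `int fun(int a, int b);` matches slot by slot, which is the scenario Figure 1-38 depicts.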

We take the function declaration statement "int fun(int a, int b);" in the case as an example to describe this process. The matching scenario with the statement template is shown in Figure 1-38.

Figure 1-38 The result of matching the fun function declaration statement with the template

This token stream finally matches the function-declaration template. After the match succeeds, the computer regards the statement as a function declaration and records it in memory in the form of a syntax tree, as shown in Figure 1-39.

Figure 1-39 The syntax tree generated by the fun function declaration statement

Recording the analyzed statement in a tree structure is a very ingenious and far-sighted choice, made with the overall picture in mind. On the one hand, the tree structure can "remember" all the "meaning" of the source program; on the other hand, the tree structure is easier to map onto the runtime structure.

From the syntax tree to the intermediate code to the target code

So far, the syntax tree carries all the information of the source program, and the subsequent conversion work no longer depends on the source text.

Converting from the syntax tree directly to target code in one step is feasible both in theory and in practice. However, computers run on multiple CPU hardware platforms. Considering the portability of programs between different CPUs, the tree is first converted into a general, abstract "CPU instruction" form; this was the original design idea of intermediate code. Then, for the specific CPU selected, the intermediate code is lowered into that CPU's target code.

The syntax tree is a two-dimensional structure, while the intermediate code is a quasi-one-dimensional structure; the conversion from syntax tree to intermediate code is essentially a conversion from the two-dimensional structure to the quasi-one-dimensional one. The intermediate code, especially RTL, can already correspond one-to-one with the runtime structure, as shown in Figure 1-40.

Figure 1-40 The scenario where the intermediate code corresponds to the target code

The runtime structure is also one-dimensional, as shown in the left part of Figure 1-40, so the result of this conversion is already close to the runtime structure.

After the specific CPU and operating system are selected, the intermediate code can be converted into the target code, that is, assembly code (we configure AT&T assembly); at this point the influence of the operating system is relatively small. The assembler then converts the .s file into an object file in the format of the selected operating system: an .o file on Linux, an .obj file on Windows. The object file already contains the machine instructions of the selected CPU.

Finally, the linker links one or more object files (library files are also object files in nature) into an executable file conforming to the format specified by the selected operating system.

Through the operating system, executable programs can be loaded into memory for execution, forming the runtime structure we saw in the previous two sections.

The follow-up content of this book will explain in detail the main process of compilation: lexical analysis, syntax analysis, intermediate code to target code, then assembly and linking, and finally explain preprocessing.
