the-super-tiny-compiler

2020-11-18 · 11 min read

原文

Most compilers break down into three primary stages: Parsing, Transformation,and Code Generation 大多数编译器的步骤分为三个部分:解析、转换和代码生成

  1. Parsing is taking raw code and turning it into a more abstract representation of the code.

解析将处理原始代码,并处理成一个抽象的表示

  1. Transformation takes this abstract representation and manipulates to do whatever the compiler wants it to.

转换用抽象的表示来做编译器需要做的操作

  1. Code Generation takes the transformed representation of the code and turns it into new code.

代码生成将转换好的抽象表示生成最终的代码

Parsing 解析

Parsing typically gets broken down into two phases: Lexical Analysis and Syntactic Analysis. 解析一般分为词法分析语法分析

  1. Lexical Analysis takes the raw code and splits it apart into these things called tokens by a thing called a tokenizer (or lexer).

词法分析提取原始代码,并通过称为标记器(或词法分析器)的东西将其拆分为标记(词法)。

Tokens are an array of tiny little objects that describe an isolated piece of the syntax. They could be numbers, labels, punctuation, operators, whatever. 词法就是一组描述一条语句的独立小片段。它们可能是数字、标识符、符号、运算符等。

  1. Syntactic Analysis takes the tokens and reformats them into a representation that describes each part of the syntax and their relation to one another. This is known as an intermediate representation or Abstract Syntax Tree.

语法分析用这次拆分的词法片段,将其重新格式化为一种格式来表示语句的每个部分和各个词法之间的关系,即抽象语法树。

An Abstract Syntax Tree, or AST for short, is a deeply nested object that represents code in a way that is both easy to work with and tells us a lot of information. 一个抽象语法树(AST),是一个用于表示代码的深度嵌套的对象,既能方便的操作,又可以提供很多关于代码的信息。

/ For the following syntax:
   (add 2 (subtract 4 2))
    Tokens might look something like this: 
 /
   [
     { type: 'paren',  value: '('        },
     { type: 'name',   value: 'add'      },
     { type: 'number', value: '2'        },
     { type: 'paren',  value: '('        },
     { type: 'name',   value: 'subtract' },
     { type: 'number', value: '4'        },
     { type: 'number', value: '2'        },
     { type: 'paren',  value: ')'        },
     { type: 'paren',  value: ')'        },
   ]

/  And an Abstract Syntax Tree (AST) might look like this: /
  {
    type: 'Program',
    body: [{
      type: 'CallExpression',
      name: 'add',
      params: [{
        type: 'NumberLiteral',
        value: '2',
      }, {
        type: 'CallExpression',
        name: 'subtract',
        params: [{
          type: 'NumberLiteral',
          value: '4',
        }, {
          type: 'NumberLiteral',
          value: '2',
        }]
      }]
    }]
  }

Transformation 转换

The next type of stage for a compiler is transformation. Again, this just takes the AST from the last step and makes changes to it. It can manipulate the AST in the same language or it can translate it into an entirely new language. 编译器的下一个阶段类型是转换。同样,这只是从最后一步获取AST并对其进行更改。它可以使用相同的语言来操作AST,也可以将其翻译成全新的语言。

Let’s look at how we would transform an AST. 让我们看一下如何转换AST。

You might notice that our AST has elements within it that look very similar. There are these objects with a type property. Each of these are known as an AST Node. These nodes have defined properties on them that describe one isolated part of the tree. 您可能会注意到,我们的AST中包含看起来非常相似的元素。这些对象具有type属性。这些都称为AST节点。这些节点在其上定义了描述树的一个孤立部分的属性。

  / We can have a node for a "NumberLiteral": /
    {
      type: 'NumberLiteral',
      value: '2',
    }
  / Or maybe a node for a "CallExpression": / 
    {
      type: 'CallExpression',
      name: 'subtract',
      params: [...nested nodes go here...],
    }

When transforming the AST we can manipulate nodes by adding/removing/replacing properties, we can add new nodes, remove nodes, or we could leave the existing AST alone and create an entirely new one based on it. 转换AST时,我们可以对结点属性进行添加/删除/替换等操作,我们可以添加新节点,删除节点,也可以不使用现有的AST,而基于它创建一个全新的AST。

Since we’re targeting a new language, we’re going to focus on creating an entirely new AST that is specific to the target language. 由于我们定位的是新语言,因此我们将专注于创建特定于目标语言的全新AST。

Traversal

In order to navigate through all of these nodes, we need to be able to traverse through them. This traversal process goes to each node in the AST depth-first. 为了浏览所有这些节点,我们需要遍历它们。此遍历过程将对AST的每个结点进行深度优先(DFS)遍历。

 {
      type: 'Program',
      body: [{
        type: 'CallExpression',
        name: 'add',
        params: [{
          type: 'NumberLiteral',
          value: '2'
        }, {
          type: 'CallExpression',
          name: 'subtract',
          params: [{
            type: 'NumberLiteral',
            value: '4'
          }, {
            type: 'NumberLiteral',
            value: '2'
          }]
        }]
      }]
}

/ So for the above AST we would go: /
 
1. Program - Starting at the top level of the AST
2. CallExpression (add) - Moving to the first element of the Program's body
3. NumberLiteral (2) - Moving to the first element of CallExpression's params
4. CallExpression (subtract) - Moving to the second element of CallExpression's params
5. NumberLiteral (4) - Moving to the first element of CallExpression's params
6. NumberLiteral (2) - Moving to the second element of CallExpression's params

If we were manipulating this AST directly, instead of creating a separate AST, we would likely introduce all sorts of abstractions here. But just visiting each node in the tree is enough for what we’re trying to do. 如果我们直接操作此AST,而不是创建单独的AST,则可能会在这里引入各种抽象。 但是只需访问树中的每个节点就足以完成我们要尝试的操作。

The reason I use the word “visiting” is because there is this pattern of how to represent operations on elements of an object structure. 我之所以使用“访问”一词,是因为存在这种模式来表示对象结构元素上的操作。

Visitors

The basic idea here is that we are going to create a “visitor” object that has methods that will accept different node types. 这里的基本思想是,我们将创建一个“Visitor”对象,该对象的方法将接受不同的节点类型。

    var visitor = {
      NumberLiteral() {},
      CallExpression() {},
    };

When we traverse our AST, we will call the methods on this visitor whenever we “enter” a node of a matching type. 当遍历AST时,只要我们“输入”匹配类型的节点,就会在此访问者上调用对应类型的方法。

In order to make this useful we will also pass the node and a reference to the parent node. 为了使它更实用,我们还将传递该节点和对父节点的引用。

    var visitor = {
      NumberLiteral(node, parent) {},
      CallExpression(node, parent) {},
    };

As we traverse down, we’re going to reach branches with dead ends. As we finish each branch of the tree we “exit” it. So going down the tree we “enter” each node, and going back up we “exit”. 当我们往下走时,我们将到达死胡同的分支。 当我们完成树的每个分支时,我们“退出”它。 因此,我们沿着树“进入”每个节点,然后回到树“退出”。(递归)

 *   -> Program (enter)
 *     -> CallExpression (enter)
 *       -> Number Literal (enter)
 *       <- Number Literal (exit)
 *       -> Call Expression (enter)
 *          -> Number Literal (enter)
 *          <- Number Literal (exit)
 *          -> Number Literal (enter)
 *          <- Number Literal (exit)
 *       <- CallExpression (exit)
 *     <- CallExpression (exit)
 *   <- Program (exit)

In order to support that, the final form of our visitor will look like this: 为了支持这一点,我们Visitor的最终形式将如下所示:

    var visitor = {
      NumberLiteral: {
        enter(node, parent) {},
        exit(node, parent) {},
      }
    };

Code Generation 代码生成

The final phase of a compiler is code generation. Sometimes compilers will do things that overlap with transformation, but for the most part code generation just means take our AST and string-ify code back out. 编译器的最后阶段是代码生成。 有时,编译器会执行与转换重叠的操作,但是在大多数情况下,代码生成只是意味着用AST生成代码字符串。

Code generators work several different ways, some compilers will reuse the tokens from earlier, others will have created a separate representation of the code so that they can print nodes linearly, but from what I can tell most will use the same AST we just created, which is what we’re going to focus on. 代码生成器以几种不同的方式工作,一些编译器将重用较早版本的词法,另一些将创建代码的单独表示,以便它们可以线性打印节点,但据我所知,大多数将使用我们刚创建的AST,这就是我们要重点关注的。

Effectively our code generator will know how to “print” all of the different node types of the AST, and it will recursively call itself to print nested nodes until everything is printed into one long string of code. 我们的代码生成器应该知道如何有效地“打印”AST的所有不同节点类型,并且它将递归调用自身以打印嵌套节点,直到将所有内容打印到一个长代码串中为止。


Profile picture

Blogs by Leo Yang who lives and works in Chengdu writing interesting things. You should follow me on Github 😁