Lex - 词汇分析器生成器外文翻译资料

2021-12-17 10:12

Lex - A Lexical Analyzer Generator

ABSTRACT

Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine.

Lex source is a table of regular expressions and corresponding program fragments. The table is translated to a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. As each such string is recognized the corresponding program fragment is executed. The recognition of the expressions is performed by a deterministic finite automaton generated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream.

The lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match possible at each input point. If necessary, substantial lookahead is performed on the input, but the input stream will be backed up to the end of the current partition, so that the user has general freedom to manipulate it.

Lex can generate analyzers in either C or Ratfor, a language which can be translated automatically to portable Fortran. It is available on the PDP-11 UNIX, Honeywell GCOS, and IBM OS systems. This manual, however, will only discuss generating analyzers in C on the UNIX system, which is the only supported form of Lex under UNIX Version 7. Lex is designed to simplify interfacing with Yacc, for those with access to this compiler-compiler system.

1. Introduction.

Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem oriented specification for character string matching, and produces a program in a general purpose language which recognizes regular expressions. The regular expressions are specified by the user in the source specifications given to Lex. The Lex written code recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. At the boundaries between strings program sections provided by the user are executed. The Lex source file associates the regular expressions and the program fragments. As each expression appears in the input to the program written by Lex, the corresponding fragment is executed.

The user supplies the additional code beyond expression matching needed to complete his tasks, possibly including code written by other generators. The program that recognizes the expressions is generated in the general purpose programming language employed for the user#39;s program fragments. Thus, a high level expression language is provided to write the string expressions to be matched while the user#39;s freedom to write actions is unimpaired. This avoids forcing the user who wishes to use a string manipulation language for input analysis to write processing programs in the same and often inappropriate string handling language.

Lex is not a complete language, but rather a generator representing a new language feature which can be added to different programming languages, called ``host languages.#39;#39; Just as general purpose languages can produce code to run on different computer hardware, Lex can write code in different host languages. The host language is used for the output code generated by Lex and also for the program fragments added by the user. Compatible run-time libraries for the different host languages are also provided. This makes Lex adaptable to different environments and different users. Each application may be directed to the combination of hardware and host language appropriate to the task, the user#39;s background, and the properties of local implementations. At present, the only supported host language is C, although Fortran (in the form of Ratfor [2] has been available in the past. Lex itself exists on UNIX, GCOS, and OS/370; but the code generated by Lex may be taken anywhere the appropriate compilers exist.

Lex turns the user#39;s expressions and actions (called source in this memo) into the host general-purpose language; the generated program is named yylex. The yylex program will recognize expressions in a stream (called input in this memo) and perform the specified actions for each expression as it is detected. See Figure 1.

-------

Source -gt; | Lex | -gt; yylex

-------

Input -gt; | yylex | -gt; Output

-------

An overview of Lex

Figure 1

For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of lines.

[ \t] $ ;

is all that is required. The program contains a %% delimiter to mark the beginning of the rules, and one rule. This rule contains a regular expression which matches one or more instances of the characters blank or tab (written \t for visibility, in accordance with the C language convention) just prior to the end of a line. The brackets indicate the character class made of blank and tab; the indicates ``one or more ...#39;#39;; and the $ indicates ``end of line,#39;#39; as in QED. No action is specified, so the program generated by Lex (yylex) will ignore these characters. Everything else will be copied. To change any remaining string of blanks or tabs to a single blank, add another rule:

[ \t] $ ;

[ \t] printf(' ');

The fi

Lex - 词汇分析器生成器
Lex帮助编写程序，其控制流由输入流中的正则表达式实例指导。它非常适合编辑器脚本类型转换和分段输入以准备解析例程。
Lex源是一个正则表达式和相应程序片段的表。该表被转换为一个程序，该程序读取输入流，将其复制到输出流并将输入分区为与给定表达式匹配的字符串。当识别出每个这样的字符串时，执行相应的程序片段。表达式的识别由Lex生成的确定性有限自动机执行。用户编写的程序片段按照输入流中出现相应正则表达式的顺序执行。
用Lex编写的词法分析程序接受模糊规范，并在每个输入点选择最长匹配。如有必要，对输入执行实质性前瞻，但输入流将备份到当前分区的末尾，以便用户可以自由地操作它。
Lex可以用C或Ratfor生成分析器，这种语言可以自动转换为便携式Fortran。它可在PDP-11 UNIX，Honeywell GCOS和IBM OS系统上使用。但是，本手册仅讨论在UNIX系统上以C语言生成分析器，这是UNIX版本7下唯一受支持的Lex形式.Lex旨在简化与Yacc的接口，以便访问此编译器 - 编译器系统。

1.简介。
Lex是一个程序生成器，专为字符输入流的词法处理而设计。它接受一个用于字符串匹配的高级，面向问题的规范，并以通用语言生成一个识别正则表达式的程序。正则表达式由用户在给Lex的源规范中指定。 Lex编写的代码在输入流中识别这些表达式，并将输入流分区为与表达式匹配的字符串。在字符串之间的边界处执行用户提供的程序部分。 Lex源文件关联正则表达式和程序片段。当每个表达式出现在Lex编写的程序的输入中时，执行相应的片段。
用户提供除完成其任务所需的表达式匹配之外的附加代码，可能包括由其他生成器编写的代码。识别表达式的程序是在用于用户程序片段的通用编程语言中生成的。因此，提供高级表达语言来写入要匹配的字符串表达式，同时用户的写入动作的自由度不受损害。这避免了强迫希望使用字符串操作语言进行输入分析的用户以相同且通常不适当的字符串处理语言编写处理程序。
Lex不是一种完整的语言，而是一种代表新语言特性的生成器，可以添加到不同的编程语言中，称为“主语言”。正如通用语言可以生成在不同计算机硬件上运行的代码一样，Lex可以用不同的主语言编写代码。主机语言用于Lex生成的输出代码以及用户添加的程序片段。还提供了用于不同主机语言的兼容运行时库。这使Lex适应不同的环境和不同的用户。每个应用程序可以针对适合于任务的硬件和主机语言的组合，用户的背景以及本地实现的属性。目前，唯一受支持的宿主语言是C，虽然Fortran（以Ratfor [2]的形式在过去已经可用.Lex本身存在于UNIX，GCOS和OS / 370;但Lex生成的代码可能是在适当的编译器存在的任何地方。
Lex将用户的表达和操作（在本备忘录中称为源代码）转换为主机通用语言;生成的程序名为yylex。 yylex程序将识别流中的表达式（在此备忘录中称为输入），并在检测到每个表达式时执行指定的操作。见图1。
-------
来源 - gt; | Lex | - gt; yylex
-------
-------
输入 - gt; | yylex | - gt;输出
-------
Lex的概述
图1
对于一个简单的例子，考虑一个程序从输入中删除行末端的所有空白或制表符。
%%
[\ t] $;
就是这一切。该程序包含一个%%分隔符，用于标记规则的开头和一个规则。此规则包含一个正则表达式，它匹配字符空格或制表符的一个或多个实例（在符合C语言约定的情况下，根据C语言约定编写\ t），就在行结束之前。括号表示由空白和制表符组成的字符类; 表示“一个或多个......”;并且$表示QED中的“行尾”。没有指定任何操作，因此Lex（yylex）生成的程序将忽略这些字符。其他一切都将被复制。要将任何剩余的空白或制表符串更改为单个空白，请添加另一个规则：
%%
[\ t] $;
[\ t] printf（“”）;
为此源生成的有限自动机将立即扫描两个规则，观察空白字符串或制表符的终止是否存在换行符，并执行所需的规则操作。第一个规则匹配行末尾的所有空格或制表符串，第二个规则匹配所有剩余的空格或制表符串。

Lex可以单独用于简单的转换，也可以用于词法层面的分析和统计数据收集。 Lex也可以使用标签使用解析器生成器执行词法分析阶段; Lex和Yacc接口特别容易[3]。 Lex程序只识别正则表达式; Yacc编写的解析器接受大量的上下文无关语法，但需要较低级别的分析器来识别输入令牌。因此，Lex和Yacc的组合通常是合适的。当用作后来的解析器生成器的预处理器时，Lex用于对输入流进行分区，解析器生成器将结构分配给生成的片段。在这种情况下（例如，可能是编译器的前半部分）的控制流程如图2所示。其他程序由其他生成器或手工编写，可以轻松添加到Lex编写的程序中。
词汇语法
规则规则
| |
v v
--------- ---------
| Lex | | Yacc |
--------- ---------
| |
--------- ---------
输入 - gt; | yylex | - gt; | yyparse | - gt;解析输入
--------- ---------
Lex和Yacc
图2
Yacc用户将意识到Yyc这个名称是Yacc期望其词法分析器被命名的名称，因此Lex使用这个名称简化了接口。
Lex从源[4]中的正则表达式生成确定性有限自动机。自动机被解释而不是编译，以节省空间。结果仍然是一个快速的分析仪。特别是，Lex程序识别和分区输入流所花费的时间与输入的长度成比例。除非包含前向上下文的规则需要大量重新扫描，否则Lex规则的数量或规则的复杂性在确定速度时并不重要。随着规则的数量和复杂性而增加的是有限自动机的大小，因此Lex生成的程序的大小。

在Lex编写的程序中，收集用户的片段（表示在找到每个正则表达式时要执行的操作）作为切换的情况。自动机解释器指示控制流程。为用户提供了在包含操作的例程中插入声明或附加语句的机会，或者在此操作例程之外添加子例程。
Lex不限于可以根据一个字符前瞻来解释的来源。例如，如果有两个规则，一个查找ab而另一个查找abcdefg，输入流是abcdefh，Lex将识别ab并在cd之前保留输入指针。。。这种备份比处理更简单的语言更昂贵

2. Lex来源。
Lex源的一般格式是：
{}定义
%%
{}规则
%%
{user subroutines}
其中定义和用户子程序经常被省略。第二个%%是可选的，但第一个必须标记规则的开头。因此，绝对最低限度的Lex计划
%%
（没有定义，没有规则）转换成一个程序，将输入复制到输出不变。
在上面显示的Lex程序概要中，规则代表用户的控制决策;它们是一个表，其中左列包含正则表达式（请参阅第3节），右列包含操作，识别表达式时要执行的程序片段。因此可能出现个别规则
整数printf（“找到关键字INT”）;
在输入流中查找字符串整数，并在出现时打印消息“找到关键字INT”。在此示例中，主机过程语言为C，C库函数printf用于打印字符串。表达式的结尾由第一个空格或制表符表示。如果动作只是单个C表达式，它可以只在行的右侧给出;如果它是复合的，或者超过一条线，它应该用括号括起来。作为一个稍微有用的例子，假设希望将一些单词从英语拼写更改为美国拼写。 Lex的规则如
color printf（“color”）;
mechanise printf（“mechanize”）;
汽油印刷（“汽油”）;
将是一个开始。这些规则还不够，因为石油这个词会成为gaseum;稍后将描述处理这种情况的方法。

3. Lex正则表达式。
正则表达式的定义与QED [5]中的定义非常相似。正则表达式指定要匹配的一组字符串。它包含文本字符（与要比较的字符串中的相应字符匹配）和操作符（指定重复，选项和其他功能）。字母和数字的字母总是文字字符; 因此正则表达式
整数
匹配字符串整数，无论它出现在何处和表达式
a57D
寻找字符串a57D。运营商。操作员角色是
“\ [] ^ - ？。* |（）$ / {}％lt;gt;
如果要将它们用作文本字符，则应使用转义。引号运算符（“）表示一对引号之间包含的内容将被视为文本字符
XYZ “ ”
它出现时匹配字符串xyz 。请注意，可以引用字符串的一部分。引用普通文本字符是无害的，但是没有必要;表达方式
“XYZ ”
与上面的相同。因此，通过引用用作文本字符的每个非字母数字字符，用户可以避免记住当前操作符的上面的列表，并且如果Lex的进一步扩展延长列表则是安全的。
操作符也可以通过在\前面加\来转换为文本字符
XYZ \ \
这是另一个，不太可读，等同于上述表达式。引用机制的另一个用途是在表达式中得到一个空白;通常，如上所述，空格或制表符结束规则。必须引用[]（见下文）中未包含的任何空白字符。识别出几个正常的C转义符\：\ n是换行符，\ t是制表符，\ b是退格键。要输入\本身，请使用\\。由于换行符在表达式中是非法的，因此必须使用\ n;它不需要转义制表符和退格键。每个字符，但空白，制表符，换行符和上面的列表始终是文本字符。
角色类。可以使用运算符对[]指定字符类。构造[abc]匹配单个字符，可以是a，b或c。在方括号内，大多数运算符含义被忽略。只有三个字符是特殊的：它们是\ - 和^。 - 字符表示范围。例如，
[A-Z0-9 lt;gt; _]
表示包含所有小写字母，数字，尖括号和下划线的字符类。范围可以按任何顺序给出。使用 - 在不是大写字母，小写字母或两个数字的任何字符对之间都是依赖于实现的，并且会收到警告消息。（例如，ASCII中的[0-z]比EBCDIC中的字符多得多）。如果希望包含字符 - 在字符类中，它应该是第一个或最后一个;从而
[ - 0-9]
匹配所有数字和两个符号。
在字符类中，^运算符必须显示为左括号后面的第一个字符;它表示结果字符串将相对于计算机字符集进行补充。从而
[^ ABC]
匹配除a

资料编号：[4651]

您需要先支付 30元 才能查看全部内容！立即支付

注册

找回密码