Recall from Chapter 11 that the concrete syntax of a programming language is a set of strings (i.e., sequences of characters in an alphabet). A grammar is an inductive definition for describing a inductive set of strings.
A grammar is ambiguous when there exists at least one sentence in the language that can be generated by the grammar in more than one way. What this means is that the string has multiple distinct parse trees or derivations, leading to different interpretations of the program’s tree structure.
The abstract syntax of a programming language makes explicit a program’s tree structure (sometimes also called terms).
A parser converts concrete syntax into abstract syntax, which has deal with ambiguity. A common (though not only) source of ambiguity are infix operators, which can be disambiguated by making explicit associativity and precedence.
12.1 Abstract Syntax
Consider again the grammar of expressions involving the \(\texttt{/}\) and \(\texttt{-}\) operators in Table 11.2, with subscripts to make explicit the instances of the symbols:
\[\begin{array}{rrrl}
\text{expressions} & e & \mathrel{::=}& n \mid e_1\;\texttt{/}\;e_2 \mid e_1\;\texttt{-}\;e_2
\end{array} \tag{12.1}\]
To represent expressions \(e\) in Scala, we declare the following types and caseclasses:
defined traitExpr
defined classN
defined classDivide
defined classMinus
We define a new type Expr (i.e., a trait). Each caseclass is a constructor for an expression \(e\) of type Expr corresponding to one of the productions defining the non-terminal \(e\).
If we rewrite the above grammar (Equation 12.1) to use these constructor names in each production, we get the following: \[\begin{array}{rrrl}
\text{expressions $\texttt{Expr}$} & e & \mathrel{::=}&
\texttt{N(} n \texttt{)} \\
& & \mid& \texttt{Divide(}e_1\texttt{,}\;e_2\texttt{)} \\
& & \mid& \texttt{Minus(}e_1\texttt{,}\;e_2\texttt{)} \\
\text{integers} & n
\end{array} \tag{12.2}\]
also corresponds to the sentence \(\texttt{10 - 10 / 10}\) in the first grammar (Equation 12.1) (with a different parse tree). Thus, while the first grammar is ambiguous, the second one is unambiguous.
In a language implementation, we do not want to be constantly worrying about the “grouping” or parsing of a string (i.e., resolving ambiguity), so we prefer to work with terms in this second grammar. We call this second grammar, abstract syntax, where the tree structure is evident. Observe that parentheses around each sub-expression avoids ambiguity.
Each instance of caseclass is a node in an n-ary tree, and each argument of a non-terminal type to a constructor is a sub-tree. For example, the term
And thus the first phase of language tool is the parser that converts the concrete syntax of strings into the abstract syntax of terms (i.e., trees).
Because the concrete syntax is more concise visually and human friendly, it is standard practice to give (ambiguous) grammars like the first grammar above (Equation 12.1) and treat them as the corresponding abstract syntax specification given in the second grammar (Equation 12.2). In other words, we give a grammar that define the strings of a language and leave it as an implementation detail of the parser to convert strings to the appropriate terms or abstract syntax trees. We even often draw abstract syntax trees using concrete syntax notation, such as in Figure 12.1 (a).
12.2 Parsing
Parsing is a large topic in terms of both deep theory and innovative tools. Thus, the theoretical and practical aspects are covered in more depth in theory of computation and compiler construction courses, respectively.
We use parsers daily, translating text that we can read and write into data structures tha machines can understand. The range of kinds of parsers is also enormous: from simple regular expression-based pattern matchers that run inside packet filters on the Internet to complex natural language parsers that take in extract structure out of natural language sentences. As such, parsing is arguably one of most successful applications of theoretical computer science into practice.
Parsers for programming languages are usually somewhere in between: they have inductive structure (e.g., matching parentheses) that require more than regular-expression parsers but not as complicated as natural language with all its inherent ambiguities and context-sensitivity (though the syntax of real-world programming languages do sometimes extend beyond context-free).
12.2.1 Top-Down Parsing
There are numerous parsing algorithms with different tradeoffs that are better studied in a compiler construction course. However, all tools or libraries for creating parsers for programming languages generally involve specifying a BNF-like grammar.
Building parsers can get complex quickly even with a parsing library, so we focus here simply on restricted uses of such libraries to build intuitions for context-free grammars.
We consider a kind of library for parsers called combinator parsers. A combinator is a kind of higher-order function (e.g., a function that takes another function as input). A combinator-parsing library is one where the user specifies the grammar simply as calls to the library to build a parser from other parsers, along with callback functions. Combinator parsing libraries generally help us implement some form of recursive-descent parsing, which we can think of simply automating the top-down leftmost parsing derivation and trying productions left-to-right until finding a prefix match.
Thus, they work best (1) when the grammar is unambiguous so that the top-down derivation is deterministic and (2) when there is no left recursion so that each top-down leftmost derivation step makes progress consuming some prefix of the input string.
Note that we gave this same grammar as comments in the Scala code above.
Observe that this grammar given above (Equation 12.3) for JavaScripty with number literals and + expressions is ambiguous. Recall that an ambiguous grammar means there is a sentence in the language described by the grammar with more than one parse tree (or equivalently, more than one parsing derivation). For example, the sentence 1+2+3 (i.e., \(n\)(1\()\)+\(n(\)2\()\)+\(n(\)3\()\) with lexical analysis) has two parse trees with this grammar. It also has left recursion in that the production \[
e\mathrel{::=}e \mathbin{\texttt{+}} e
\] has the \(e\) non-terminal expanding to a sentential form with itself on the left.
12.2.1.1 Left Recursion and Top-Down Parsing
From Section 11.2.3.1, we know how to refactor the grammar to disambiguate for associativity. However, to make the infix binary + operator left associative, the resulting grammar has still left recursion.
Left Recursion and Top-Down Parsing
To use simple recursive-descent parsing and the Scala Parser Combinator library, we cannot use a grammar with left recursion.
Intuitively, left recursion causes an infinite recursion in a simple recursive descent parser because we do not know how far we have to “lookahead” to choose between expanding with a recursive production or a base case production. \[\begin{array}{lcl}
e & \Longrightarrow& e \mathbin{\texttt{+}} e \\
& \Longrightarrow& e \mathbin{\texttt{+}} e \mathbin{\texttt{+}} e \\
& \Longrightarrow& e \mathbin{\texttt{+}} e \mathbin{\texttt{+}} e \mathbin{\texttt{+}} e \\
& \Longrightarrow& \ldots \\
\end{array}\]
12.2.1.2 Restricting the Concrete Syntax
We consider subsequentlyin ?sec-ebnf how we can parse left-associative operators with a recursive-descent parser. Here, we simply restrict the concrete syntax to simplify parsing (i.e., we “cheat” by changing the concrete syntax of the language). For example, for JavaScripty with number literals and addition, we consider the following grammar for the concrete syntax:
with \(t\) as the start symbol. Take special note that the parentheses here \(\texttt{(}\)\(\texttt{)}\) are part of the concrete syntax of the object language. Compare and contrast this restricted, unambiguous grammar with the ambiguous grammar corresponding directly to the abstract syntax (Equation 12.3). Observe that it is similar to the ambiguous grammar in Equation 12.3 but does not accept the same language. Essentially, we have eliminated ambiguity by “forcing” the JavaScripty programmer to write enough parentheses \(\texttt{(}\)\(\texttt{)}\) to state what “grouping” they want. But also note that there is no left recursion in this restricted grammar.
A \(t\) here is what’s called an s-expression (restricted to these particular terminal symbols). S-expressions are commonly used as a serialization format because it is easy to parse. They are used as the concrete syntax for Lisp and Lisp-derived programming languages (e.g., Scheme, Racket), though all operators are written in prefix notation.
12.2.1.3 Implementing a Parser
To connect this restricted grammar to a parser implementation, let us give alternative names to the non-terminal symbols:
We can now implement the restricted grammar (Equation 12.4) directly using the Scala Parsing Combinator Library:
object ExprParser extends scala.util.parsing.combinator.RegexParsers {def term:Parser[Expr]= num ^^{(n:String)=>N(n.toDouble)}|"("~ expr ~")"^^{case _ ~ e ~ _ => e }def expr:Parser[Expr]= term ~"+"~ term ^^{case e1 ~ _ ~ e2 =>Plus(e1, e2)}def num:Parser[String]="""-?(\d+(\.\d*)?|\d*\.\d+)([eE][+-]?\d+)?""".rdefparse(str:String): Either[String, Expr]=parseAll(term, str)match{caseSuccess(e, _)=>Right(e)caseFailure(msg, _)=>Left(s"Failure: $msg")caseError(msg, _)=>Left(s"Error: $msg")}}
defined objectExprParser
First, observe that we have translated each non-terminal \(\mathit{term}\), \(\mathit{expr}\), \(\mathit{num}\) into a method term, expr, and num that returns a value of type Parser[A] for some type A, respectively. This structure is indicative of a recursive-descent parser where we define a set of mutually-recursive parsing methods—one for each non-terminal in the grammar of interest.
Then, the scala.util.parsing.combinator.RegexParsers trait provides a number of library methods that we use to define the expr, term, and num methods. These methods, like | and ~, make it look like we are writing a BNF grammar. We see that the | method calls separate grammar productions and the ~ method calls correspond to sequencing symbols on the right-hand side of a production. Ignore the ^^ method calls with the function arguments in {\(\ldots\)} on the right for the moment.
The parameter A to the Parser[A] type is the result type of the parser. For example, the method term:Parser[Expr] returns a parser whose result type is an abstract syntax tree value Expr, while the num: Parser[String] returns a parser whose result type is a String value.
Let us focus on the implementation on the term method and the first part of the implementation:
num ^^{(n:String)=>N(n.toDouble)}|
corresponding to the first production:
\[
\mathit{term} \mathrel{::=}\mathit{num}
\]
To implement this production, this code calls the num method. The num method returns a Parser[String] that recognizes its input as a number, returning the matched String. The ^^ library method call enables specifying a semantic action that translates a Parser[String] to a Parser[Expr] using the function
{(n:String)=>N(n.toDouble)}
of type String=> Expr. In this case, the String given in n is converted to a Double using Scala’s toDouble method and then passed to the N constructor defined above (that is an Expr).
We define an “interface” method parse:String=> Either[String, Expr] that does the parsing by calling parseAll from the library, passing expr as the start symbol and str as the input string to parse.
ExprParser.parse("1")ExprParser.parse("<should not parse>")
We have defined the parse to return an Either[String, Expr]. A Either is used similarly to an Option except when we want to have something more in the None case. In this case, we either give a error message using the Left case or return resulting Expr using the Right case. It is a standard programming convention that the Left case of an Either is generally used for the “error case,” while Right is used for “success.”
12.2.1.5\(\mathit{term} \mathrel{::=}\texttt{(}\; \mathit{expr} \;\texttt{)}\) and \(\mathit{expr} \mathrel{::=}\mathit{term} \mathbin{\texttt{+}} \mathit{term}\)
Now focus on the second part of implementation of the term implementation:
The ~ method produces a Parser[A ~ B] sequencing a Parser[A] and a Parser[B]. In this case, we have a a Parser[String~ Expr ~String] that we translate to a Parser[Expr] using this function:
{case _ ~ e ~ _ => e }
Note that the ~ type constructor is a caseclass defined by the Scala Combinator Parser Library that we can see as simply a custom tuple type.
Also note that the "(" and ")" expressions in the above have type Parser[String]. The Scala Combinator Parsing Library (specifically in RegexParsers) defines implicit conversions that creates a Parser[String] that accepts a single String. Thus, this more specific pattern matching in the semantic action function would be behave the same:
"("~ expr ~")"^^{case"("~ e ~")"=> e }
The expr method definition is now relatively straightforward:
def expr:Parser[Expr]= term ~"+"~ term ^^{case e1 ~ _ ~ e2 =>Plus(e1, e2)}
in that it creates a Plus node out of two Exprs.
We can now test out our parser with + expressions in concrete syntax:
And we can see that a + expression without parentheses fails to parse (as expected):
ExprParser.parse("1 + 2 + 3")
res9: Either[String, Expr] = Left(value = "Failure: end of input expected")
12.2.1.6\(\mathit{num}\)
In the parser grammar from Equation 12.5, we consider \(\mathit{num}\) a terminal and never specified what sentences are in the language of \(\mathit{num}\) (i.e., \(\mathcal{L}(\mathit{num})\)). In this implementation, we use the following regular expression:
(i.e., a value of type Regex in Scala). The RegexParsers trait also defines an implicit conversion that creates a Parser[String] out of a Regex value, which is what we use here.
The following strings are matched by this regular expression: