The notion of a grammar


A GRAMMAR is a quadruple (V_T, V_N, S, P) such that

V_T is a finite set of symbols called terminals,
V_N is a finite set of symbols called non-terminals,
S is a distinguished non-terminal called the start symbol,
P is a finite set of couples ($\alpha$, $\beta$) called productions where
- $\alpha$ and $\beta$ belong to (V_T $\cup$ V_N)^* and
- $\alpha$ does not belong to (V_T)^*.

In other words a production is made of two words over the alphabet V_T $\cup$ V_N such that the first word contains at least one non-terminal symbol. The set V_T $\cup$ V_N is called the set of grammar symbols.
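
As an illustration, the definition can be transcribed into a short Python sketch. The names `Grammar` and `is_grammar`, the representation of words as tuples of symbols, and the sample grammar are all assumptions made for this example; they are not part of the notes.

```python
from typing import NamedTuple

class Grammar(NamedTuple):
    terminals: frozenset      # V_T
    nonterminals: frozenset   # V_N
    start: str                # S
    productions: tuple        # P, a tuple of pairs (alpha, beta)

def is_grammar(g: Grammar) -> bool:
    """Check the conditions of the definition on a candidate quadruple."""
    symbols = g.terminals | g.nonterminals
    if g.start not in g.nonterminals:
        return False
    for alpha, beta in g.productions:
        # alpha and beta must be words over the grammar symbols
        if any(x not in symbols for x in alpha + beta):
            return False
        # alpha must not be a word made of terminals only
        # (this also rejects the empty word as a left side)
        if all(x in g.terminals for x in alpha):
            return False
    return True

# Hypothetical sample grammar: S -> a S b | a b
sample = Grammar(
    terminals=frozenset({"a", "b"}),
    nonterminals=frozenset({"S"}),
    start="S",
    productions=((("S",), ("a", "S", "b")), (("S",), ("a", "b"))),
)
print(is_grammar(sample))  # True
```

A couple such as (a, b), whose first word contains no non-terminal, is rejected by the second test.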

Notation 1 Here are some usual notational conventions about grammars.

It is convenient to write a production as $\alpha$ $\longmapsto$ $\beta$ instead of ($\alpha$, $\beta$).
By default, the following symbols are terminals: lower-case letters, arithmetic operators, punctuation symbols, digits and boldface strings (if, id, ...).
By default, the following symbols are non-terminals: upper-case letters early in the alphabet such as A, B, C,..., the letter S (for the start-symbol), lower-case italic names such as stmt, expr.
By default, the following symbols are grammar symbols: upper-case letters late in the alphabet such as X, Y, Z.
By default, strings of terminals are denoted by lower-case letters late in the alphabet such as u, v, w.
By default, strings of grammar symbols are denoted by lower-case Greek letters early in the alphabet such as $\alpha$ , $\beta$ , $\gamma$ .
By default, the start symbol is the left side of the first production.
If A $\longmapsto$ $\alpha_1$, A $\longmapsto$ $\alpha_2$, ..., A $\longmapsto$ $\alpha_n$ are all the productions with A as their left side (these productions are called A-productions) then we can write A $\longmapsto$ $\alpha_1$ | $\alpha_2$ | ... | $\alpha_n$. The strings $\alpha_1$, $\alpha_2$, ..., $\alpha_n$ are called the alternatives of A.

Example 2 Using the conventions of Notation 1, the grammar of Example 1 can be stated as follows:

E	$\longmapsto$	E A E | (E) | -E | id
A	$\longmapsto$	+ | - | * | / | $\uparrow$

Remark 2 Grammars offer significant advantages to both language designers and compiler writers. Among them:

A grammar gives a precise and easy-to-understand syntactic specification of a programming language.
From certain classes of grammars we can automatically construct an efficient parser that determines whether a source program is syntactically well formed.
Evolving a language specified by a grammar (adding or removing constructs) is easy.

DERIVATIONS. Let G = (V_T, V_N, S, P) be a grammar and let $\alpha$ and $\beta$ be two strings of grammar symbols for G. We say that $\beta$ derives from $\alpha$ in one step and we write $\alpha$ $\Longrightarrow$ $\beta$ if there exist strings of grammar symbols $\alpha_1$, $\alpha_2$, $\alpha_3$, $\beta_2$ such that

$\alpha$ = $\alpha_1$ $\alpha_2$ $\alpha_3$,
$\beta$ = $\alpha_1$ $\beta_2$ $\alpha_3$,
$\alpha_2$ $\longmapsto$ $\beta_2$ is a production of G.
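
The three conditions above can be read as a recipe for computing every string that derives from a given one in one step: try each production at each position. The following sketch does this, assuming strings of grammar symbols are Python tuples and productions are pairs of tuples; the function name `one_step` and the two sample productions are hypothetical.

```python
def one_step(alpha, productions):
    """Return the set of all beta such that alpha => beta."""
    results = set()
    for lhs, rhs in productions:            # lhs -> rhs plays the role of alpha_2 -> beta_2
        n = len(lhs)
        for i in range(len(alpha) - n + 1):
            if alpha[i:i + n] == lhs:       # alpha = alpha_1 lhs alpha_3
                results.add(alpha[:i] + rhs + alpha[i + n:])
    return results

# Hypothetical productions: E -> E + E | id
P = [(("E",), ("E", "+", "E")), (("E",), ("id",))]

for beta in sorted(one_step(("E", "+", "E"), P)):
    print(" ".join(beta))
```

On the string E + E this yields three distinct results: E + E + E (obtained from either occurrence of E), E + id and id + E.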

The reflexive and transitive closure of the relation $\alpha$ $\Longrightarrow$ $\beta$ over the set of strings of grammar symbols is denoted by $\alpha$ $\;\stackrel{\ast}{\Longrightarrow}\;$ $\beta$. Thus for two strings of grammar symbols $\alpha$ and $\beta$ we have

$\displaystyle \alpha \;\stackrel{\ast}{\Longrightarrow}\; \beta \iff \left\{ \begin{array}{c} \alpha = \beta \\ {\rm or} \\ (\exists \, \gamma : \ \alpha \;\stackrel{\ast}{\Longrightarrow}\; \gamma \ {\rm and} \ \gamma \Longrightarrow \beta) \\ \end{array} \right.$

(4)

Intuitively, $\alpha$ $\;\stackrel{\ast}{\Longrightarrow}\;$ $\beta$ means that $\beta$ derives from $\alpha$ in a finite number of steps. A sequence of strings of grammar symbols ($\alpha_1$, $\alpha_2$, ..., $\alpha_n$) is a DERIVATION if we have

$\displaystyle \alpha_1 \Longrightarrow \alpha_2 \Longrightarrow \cdots \Longrightarrow \alpha_n$.

(5)

THE LANGUAGE GENERATED BY A GRAMMAR. Let G = (V_T, V_N, S, P) be a grammar. The language over V_T generated by G and denoted by L(G) is the set of the words w of V_T^* such that S $\;\stackrel{{\ast}}{{\Longrightarrow}}\;$ w. The words of L(G) are called the sentences of G. More generally any string of grammar symbols $\alpha$ such that S $\;\stackrel{{\ast}}{{\Longrightarrow}}\;$ $\alpha$ is called a sentential form of G.
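
For small grammars, the short sentences of L(G) can be listed by exploring the sentential forms reachable from S. The sketch below is one possible way to do so, under a loudly stated assumption: no production makes a string shorter, so sentential forms longer than the bound can be discarded safely. The function name `sentences` and the sample grammar S $\longmapsto$ a S b | a b are hypothetical.

```python
from collections import deque

def sentences(terminals, start, productions, max_len):
    """Sentences of L(G) with at most max_len symbols.

    Assumption: every production (lhs, rhs) satisfies len(rhs) >= len(lhs).
    """
    seen = {(start,)}
    queue = deque(seen)
    found = set()
    while queue:
        alpha = queue.popleft()             # alpha is a sentential form
        if all(x in terminals for x in alpha):
            found.add(alpha)                # a word of V_T^* derived from S
            continue
        for lhs, rhs in productions:
            n = len(lhs)
            for i in range(len(alpha) - n + 1):
                if alpha[i:i + n] == lhs:
                    beta = alpha[:i] + rhs + alpha[i + n:]
                    if len(beta) <= max_len and beta not in seen:
                        seen.add(beta)
                        queue.append(beta)
    return found

# Hypothetical grammar: S -> a S b | a b
P = [(("S",), ("a", "S", "b")), (("S",), ("a", "b"))]
for w in sorted(sentences({"a", "b"}, "S", P, 6), key=len):
    print("".join(w))
```

With the bound 6 the sketch lists ab, aabb and aaabbb, and every string it visits on the way is a sentential form of the sample grammar.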

A language L over the alphabet $\Sigma$ = V_T is said to be formal if there exists a grammar G such that L is generated by G. Observe that:

Some languages, such as natural languages, are not formal.
Two different sequences of derivations-in-one-step can lead to the same sentential form.
Moreover, two different grammars can generate the same language.

These last two observations are illustrated by Example 3.

Example 3 Let us consider again arithmetic expressions. For simplicity we consider only two operations + and *. Our language L can be generated by the grammar G₁:

expr	$\longmapsto$	expr + expr | expr * expr | (expr) | id

And also by the grammar G₂:

expr	$\longmapsto$	expr + term | term
term	$\longmapsto$	term * factor | factor
factor	$\longmapsto$	(expr) | id

Consider now the arithmetic expression $\bf id$ + $\bf id$ * $\bf id$ . Two different sequences of derivations-in-one-step of G₁ can lead to it:

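$\displaystyle \begin{array}{lll} expr & \Rightarrow & expr + expr \\ & \Rightarrow & {\bf id} + expr \\ & \Rightarrow & {\bf id} + expr * expr \\ & \Rightarrow & {\bf id} + {\bf id} * expr \\ & \Rightarrow & {\bf id} + {\bf id} * {\bf id} \\ \end{array}$

$\displaystyle \begin{array}{lll} expr & \Rightarrow & expr * expr \\ & \Rightarrow & expr + expr * expr \\ & \Rightarrow & {\bf id} + expr * expr \\ & \Rightarrow & {\bf id} + {\bf id} * expr \\ & \Rightarrow & {\bf id} + {\bf id} * {\bf id} \\ \end{array}$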

(6)

This is sketched by the trees of Figure 2, called parse trees.

**Figure 2:** Two parse trees for the same sentence. [Image: twoParseTrees.eps]

With G₂ several sequences of derivations-in-one-step can lead to $\bf id$ + $\bf id$ * $\bf id$. However, they all correspond to the same tree, shown in Figure 3.

**Figure 3:** The parse tree of $\bf id$ + $\bf id$ * $\bf id$ with G₂. [Image: oneParseTree.eps]

This notion of a parse tree is formalized later.
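
Pending that formal treatment, the difference between G₁ and G₂ can be checked by counting parse trees. The sketch below is only an illustration: it assumes that every production has a single non-terminal on its left side, that no right side is empty, and that unit productions form no cycle. The function name `count_parse_trees` and the dictionary encoding of the grammars are hypothetical, and G₂ is taken with the rule term $\longmapsto$ term * factor | factor.

```python
from functools import lru_cache

def count_parse_trees(grammar, start, tokens):
    """Number of parse trees of tokens rooted at start.

    grammar maps each non-terminal to a list of right sides (tuples).
    Any symbol that is not a key of grammar is treated as a terminal.
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count_symbol(symbol, i, j):
        # number of ways symbol can produce tokens[i:j]
        if symbol not in grammar:
            return 1 if j == i + 1 and tokens[i] == symbol else 0
        return sum(count_sequence(rhs, i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def count_sequence(rhs, i, j):
        # number of ways the symbols of rhs can produce tokens[i:j]
        if not rhs:
            return 1 if i == j else 0
        if len(rhs) == 1:
            return count_symbol(rhs[0], i, j)
        # the first symbol covers tokens[i:k], the others cover tokens[k:j];
        # both parts are non-empty, so the recursion works on shorter spans
        return sum(count_symbol(rhs[0], i, k) * count_sequence(rhs[1:], k, j)
                   for k in range(i + 1, j))

    return count_symbol(start, 0, len(tokens))

G1 = {"expr": [("expr", "+", "expr"), ("expr", "*", "expr"),
               ("(", "expr", ")"), ("id",)]}
G2 = {"expr": [("expr", "+", "term"), ("term",)],
      "term": [("term", "*", "factor"), ("factor",)],
      "factor": [("(", "expr", ")"), ("id",)]}

sentence = ["id", "+", "id", "*", "id"]
print(count_parse_trees(G1, "expr", sentence))  # 2
print(count_parse_trees(G2, "expr", sentence))  # 1
```

The two counts agree with Figures 2 and 3: two trees under G₁, a single tree under G₂.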

NON-TERMINALS can be seen as syntactic variables that denote the sets of strings that can be derived from them. They also impose a hierarchical structure on the language that is useful for both syntax analysis and translation. Moreover, Example 3 shows that a careful choice of non-terminals can reduce ambiguity.


Marc Moreno Maza
2004-12-02