Regular expressions

Next: Lexical Analysis Up: Finite Automata Previous: Languages recognized by finite automata

Regular expressions

REGULAR EXPRESSION. Let $\Sigma$ be an alphabet such that none of the symbols (, ), |, ^* belongs to $\Sigma$ . A regular expression over $\Sigma$

is a word r over the alphabet $\Sigma$ $\cup$ {(,),|,^*},
denoting (= associated with) a language L(r) over $\Sigma$ and
defined by the following inductive rules.
(R1)

The empty word $\varepsilon$ is a regular expression over $\Sigma$ denoting the language { $\varepsilon$ }.

(R2)

For every x $\in$ $\Sigma$ the letter x is a regular expression over $\Sigma$ denoting the language {x}.

(R3)
Let r and s be regular expressions over denoting languages L(r) and L(s) respectively. Then,
- (r)|(s) is a regular expression over $\Sigma$ denoting the language L(r) $\cup$ L(s),
- (r)(s) is a regular expression over $\Sigma$ denoting the language L(r).L(s),
- (r)^* is a regular expression over $\Sigma$ denoting the language (L(r))^*,
- (r) is a regular expression over $\Sigma$ denoting the language L(r) (i.e., one can add parentheses).

Remark 2 Unnecessary parentheses can be avoided in regular expressions if we adopt the following conventions:

1.: the unary operator ^* has the highest precedence and is left associative,
2.: concatenation has the second the highest precedence and is left associative,
3.: the operator | has the lowest precedence and is left associative.

With these conventions the regular expression (a)|((b)*(c)) is a| b*c.

REGULAR LANGUAGES. Let $\Sigma$ be an alphabet (such that none of the symbols (, ), |, ^* belongs to it). A language $\cal {L}$ over $\Sigma$ is regular if there exists a regular expression r over $\Sigma$ denoting $\cal {L}$ , that is if $\cal {L}$ = L(r).

It follows from Kleene's theorem that a language is recognized by FA iff it is regular.

FROM A REGULAR EXPRESSION TO A FINITE AUTOMATON. We present now a method that for any regular expression r constructs a FA $\cal {A}$ (r) recognizing L(r).

The FA $\cal {A}$ (r) is built inductively by following the structure of the regular expression r.
At each step of the algorithm a FA is produced from previous ones starting with FA of R₀( $\Sigma$ ) and ending with $\cal {A}$ (r).
Every intermediate FA $\cal {A}$ = ( $\Sigma$ , S, I, F, $\delta$ ) produced by the algorithm is NORMALIZED.

The method is quite natural.

1.: We first parse r into its constituent subexpressions.
2.: We construct NFAs for each letter or $\varepsilon$ in r by applying the above rules (R1) and (R2).
3.: Then guided by the syntaxic structure of the regular expression r, we combine the NFAs using rules (R3) and Figures 15, 16, 18.

This method is called Thompson's algorithm.

Remark 3 The number of states of the FA produced by the Thompson's algorithm is upper bounded by 2| r| where | r| is the number of symbols and operators in the regular expression r.

Next: Lexical Analysis Up: Finite Automata Previous: Languages recognized by finite automata

Marc Moreno Maza
2004-12-02