# Lexical Analysis

## Key notes

Token class: correspond to sets of strings, identifier(I), operator(O), integer(N), keyword(K) and whitespace(W).

Tokenizer:

1. recognize substrings corresponding to tokens(the lexemes).
2. identify the token class of each lexeme. ## Regular languages

#### Regular expressions

The regular expressions over $\Sigma$ are the smallest set of expressions including.

The regular expressions(the syntax) specify regular languages(the set of strings).

• single character: ‘c’
• epsilon: $\varepsilon$, the empty string
• union: $A+B$
• concatenation: $AB$
• iteration: $A^{*}$, ($A^{*} = \bigcup_{i\geq0}A^i$)

## Formal languages

Def. Let $\Sigma$ be a set of characters(an alphabet). A language over $\Sigma$ is a set of strings of characters drawn from $\Sigma$.

Meaning function $L$ maps syntax to semantics.

• Meaning is many to one. Many syntax to one semantics.

It’s the theoretical support to optimize the program code.

• Meaning function makes clear what is syntax, what is semantics. Because expressions and meanings are not 1-1.

## Lexical specifications

Examples of lexical specifications:

Keywords: “if” or “else” or “then” => ‘if’ + ‘else’ + ‘then’

Integer: a non-empty string of digits => digit = ‘0’+’1’+’2’+..+’9’, $digit\,digit^{*}=digit^{+}$

Identifier: strings of letters or digits, starting with a letter => letter = [a-zA-Z], $letter\, (letter+digit)^{*}$

Whitespace: a non-empty sequence of blanks, newlines and tabs => (‘ ‘+’\n’+’\t’)$^{+}$

Number: num = digits opt_fraction opt_exponent, digits = digit$^{+}$, opt_fraction = (‘E’(‘+’ + ‘-‘ + $\varepsilon$) digits) + $\varepsilon$ = (‘E’(‘+’ + ‘-‘ + $\varepsilon$) digits)?

#### For programming language

• At least one: $A^{+} \equiv AA^{*}$
• Union: $A \mid B \equiv A+B$
• Option: $A? \equiv A+\varepsilon$
• Range: ‘a’+’b’+…+’z’ $\equiv$ [a-z]
• Excluded range: complement of [a-z] $\equiv$ [^a-z]

After defining all notations, the algorithm to do lexical analysis on whole programming language is:

1. Write a rexp for the lexemes of each token class
2. Construct R, matching all lexemes for all tokens: R = Keyword + Identifier + Number + …
3. Let input be $x_1 x_2 \ldots x_n$, for $1 \leq i \leq n$, check $x_1 \ldots x_i \in L(R)$
4. If success, we know $x_1 \ldots x_i \in L(R_j)$ for some $j$.
5. Remove $x_1 \ldots x_i$ from the input and go to (3)

To resolve ambiguities, maches the input as long as possible and take care of the rexp priority. To mark up errors(no rule matches), add ERROR rexp, and put it last.

## Finite automata

A finite automaton consists of

• An input alphabet $\Sigma$
• A finite set of states $S$
• A start state $n$
• A set of accepting states $F \subseteq S$
• A set of transitions state $\to ^{input}$ state

$\varepsilon$-moves: Deterministic Finite Automata(DFA): One transition per input per state, and no $\varepsilon$-moves.

Nondeterministic Finite Automata(NFA): Can have multiple transitions for one input in a given state. Can have $\varepsilon$-moves.

DFAs are faster to execute. NFAs are, in general, smaller.

## Regular expressions into NFAs

Rexp can be equivalent to NFAs.

• For $\varepsilon$, • For input a, • For $AB$, • For $A+B$, • For $A^{*}$, ## NFA to DFA

$\varepsilon$-closure: all states that we can transist to only through $\varepsilon$-moves.

#### Why can we convert NFA to DFA The difference between NFA and DFA is that DFA only has one state, but NFA has a state set after each input. But NFA only has finite states, it cannot be larger than $2^N-1$.

#### How to convert NFA to DFA

NFA DFA
states S subsets of S
start s $\varepsilon$-closure(s)
final $F \subseteq S$ ${X \mid X \cap F \neq \varnothing }$
$a(x) =\{y \mid x \in X \cap x \overset{a}\to y \}$ $X \overset{a}\to Y$ if $Y=\varepsilon$-closure$(a(X))$

## Implementing finite automata

A DFA can be implemented by a 2D table (states-input symbol) T, for every tansition, $S_i \overset{a}\to S_k$ define $T[i,a]=k$