Lexical Analysis
Key notes
Token classes correspond to sets of strings, e.g. identifier (I), operator (O), integer (N), keyword (K), and whitespace (W).
Tokenizer:
 recognize the substrings corresponding to tokens (the lexemes).
 identify the token class of each lexeme.
The input is scanned left to right, so lookahead (sometimes unbounded lookahead) may be required.
Regular languages
Regular expressions
Regular expressions (the syntax) specify regular languages (the sets of strings). The regular expressions over $\Sigma$ are the smallest set of expressions including:
 single character: ‘c’
 epsilon: $\varepsilon$, the empty string
 union: $A+B$
 concatenation: $AB$
 iteration: $A^{*}$, ($A^{*} = \bigcup_{i\geq0}A^i$)
Formal languages
Def. Let $\Sigma$ be a set of characters (an alphabet). A language over $\Sigma$ is a set of strings of characters drawn from $\Sigma$.
Meaning function $L$ maps syntax to semantics.

Meaning is many-to-one: many syntactic expressions can map to one semantics. This is the theoretical justification for optimizing program code, since a compiler may replace an expression with a different one that has the same meaning.

The meaning function makes clear what is syntax and what is semantics, because expressions and meanings are not 1-1.
Lexical specifications
Examples of lexical specifications:
Keywords: “if” or “else” or “then” => ‘if’ + ‘else’ + ‘then’
Integer: a nonempty string of digits => digit = ‘0’ + ‘1’ + ‘2’ + … + ‘9’, $digit\,digit^{*}=digit^{+}$
Identifier: strings of letters or digits, starting with a letter => letter = [a-zA-Z], $letter\, (letter+digit)^{*}$
Whitespace: a nonempty sequence of blanks, newlines and tabs => (‘ ’ + ‘\n’ + ‘\t’)$^{+}$
Number: num = digits opt_fraction opt_exponent, where digits = digit$^{+}$, opt_fraction = (‘.’ digits) + $\varepsilon$ = (‘.’ digits)?, and opt_exponent = (‘E’ (‘+’ + ‘-’ + $\varepsilon$) digits) + $\varepsilon$ = (‘E’ (‘+’ + ‘-’ + $\varepsilon$) digits)?
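As a sketch, the number specification can be checked with Python's re module; the translation below (and the sample strings) are illustrative assumptions, using ‘.’ for the fraction and an optional sign in the exponent, matching the definitions above.

```python
import re

digits = r"[0-9]+"                         # digit+
opt_fraction = rf"(\.{digits})?"           # ('.' digits)?
opt_exponent = rf"(E[+-]?{digits})?"       # ('E' ('+' + '-' + eps) digits)?
num = re.compile(rf"{digits}{opt_fraction}{opt_exponent}$")

print(bool(num.match("42")))        # True
print(bool(num.match("3.14E+10")))  # True
print(bool(num.match(".5")))        # False: the digits part is required
```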
For programming language
 At least one: $A^{+} \equiv AA^{*}$
 Union: $A \mid B \equiv A+B$
 Option: $A? \equiv A+\varepsilon$
 Range: ‘a’+’b’+…+’z’ $\equiv$ [a-z]
 Excluded range: complement of [a-z] $\equiv$ [^a-z]
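Each shorthand has a direct counterpart in Python's re syntax; the mapping and sample strings below are illustrative assumptions.

```python
import re

print(bool(re.fullmatch(r"[0-9]+", "2024")))        # A+ : at least one
print(bool(re.fullmatch(r"if|else|then", "else")))  # A|B: union
print(bool(re.fullmatch(r"(ab)?", "")))             # A? : option (A or epsilon)
print(bool(re.fullmatch(r"[a-z]", "q")))            # range
print(bool(re.fullmatch(r"[^a-z]", "Q")))           # excluded range
```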
After defining these notations, the algorithm for lexical analysis of a whole program is:
1. Write a rexp for the lexemes of each token class.
2. Construct R, matching all lexemes for all tokens: R = Keyword + Identifier + Number + …
3. Let the input be $x_1 x_2 \ldots x_n$; for $1 \leq i \leq n$, check $x_1 \ldots x_i \in L(R)$.
4. If successful, we know $x_1 \ldots x_i \in L(R_j)$ for some $j$.
5. Remove $x_1 \ldots x_i$ from the input and go to step 3.
To resolve ambiguities, match the input as long as possible (maximal munch) and respect the priority of the rexps (e.g. keywords before identifiers). To report errors (no rule matches), add an ERROR rexp that matches any single character, and put it last in priority.
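The steps above, with maximal munch and rule priority, can be sketched in Python. The token classes and regular expressions below are illustrative choices, not a full lexical specification.

```python
import re

# Token classes in priority order: keywords before identifiers, ERROR last.
RULES = [
    ("KEYWORD",    r"if|else|then"),
    ("IDENTIFIER", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("NUMBER",     r"[0-9]+"),
    ("WHITESPACE", r"[ \n\t]+"),
    ("ERROR",      r"."),  # matches any single char: marks lexical errors
]

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        # Maximal munch: among all rules, take the longest match at i;
        # ties are broken by rule priority (the earlier rule wins).
        best = max(
            ((name, m.group()) for name, rx in RULES
             if (m := re.match(rx, text[i:]))),
            key=lambda t: len(t[1]),
        )
        tokens.append(best)
        i += len(best[1])
    return tokens

print(tokenize("if x1 then 42"))
```

Note that "if" ties between KEYWORD and IDENTIFIER at the same length, so priority decides; "ifx" is a longer IDENTIFIER match, so maximal munch makes it an identifier.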
Finite automata
A finite automaton consists of
 An input alphabet $\Sigma$
 A finite set of states $S$
 A start state $n$
 A set of accepting states $F \subseteq S$
 A set of transitions state $\overset{\text{input}}\to$ state
$\varepsilon$-moves: transitions that change state without consuming any input.
Deterministic Finite Automata (DFA): one transition per input per state, and no $\varepsilon$-moves.
Nondeterministic Finite Automata (NFA): can have multiple transitions for one input in a given state, and can have $\varepsilon$-moves.
DFAs are faster to execute; NFAs are, in general, smaller.
Regular expressions into NFAs
Every rexp can be converted to an equivalent NFA (Thompson's construction), built compositionally:
 For $\varepsilon$: a start state with an $\varepsilon$-move to an accepting state.
 For input a: a start state with an a-transition to an accepting state.
 For $AB$: an $\varepsilon$-move from the accepting state of A's NFA to the start state of B's NFA.
 For $A+B$: a new start state with $\varepsilon$-moves into both A's and B's NFAs, and $\varepsilon$-moves from their accepting states to a new accepting state.
 For $A^{*}$: $\varepsilon$-moves that loop from A's accepting state back to its start, plus $\varepsilon$-moves that skip A entirely.
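These cases can be sketched as code: each constructor returns an NFA fragment with one start and one accepting state. The class layout, the EPS marker, and the small simulator are illustrative assumptions.

```python
EPS = None  # label for epsilon-moves

class NFA:
    def __init__(self):
        self.trans = []  # list of (src, label, dst) triples
        self.n = 0       # number of states allocated so far
    def state(self):
        self.n += 1
        return self.n - 1

def char(nfa, c):                # for input c
    s, f = nfa.state(), nfa.state()
    nfa.trans.append((s, c, f))
    return s, f

def concat(nfa, a, b):           # AB: link A's accept to B's start
    nfa.trans.append((a[1], EPS, b[0]))
    return a[0], b[1]

def union(nfa, a, b):            # A+B: new start/accept, eps-moves to both
    s, f = nfa.state(), nfa.state()
    nfa.trans += [(s, EPS, a[0]), (s, EPS, b[0]),
                  (a[1], EPS, f), (b[1], EPS, f)]
    return s, f

def star(nfa, a):                # A*: loop back, and allow skipping A
    s, f = nfa.state(), nfa.state()
    nfa.trans += [(s, EPS, a[0]), (a[1], EPS, a[0]),
                  (a[1], EPS, f), (s, EPS, f)]
    return s, f

def accepts(nfa, frag, text):    # simulate the NFA using eps-closures
    def eclose(states):
        stack, seen = list(states), set(states)
        while stack:
            x = stack.pop()
            for p, lbl, q in nfa.trans:
                if p == x and lbl is EPS and q not in seen:
                    seen.add(q); stack.append(q)
        return seen
    cur = eclose({frag[0]})
    for ch in text:
        cur = eclose({q for p, lbl, q in nfa.trans if p in cur and lbl == ch})
    return frag[1] in cur

# Build (a+b)* and try it:
nfa = NFA()
ab_star = star(nfa, union(nfa, char(nfa, "a"), char(nfa, "b")))
print(accepts(nfa, ab_star, "abba"))  # True
print(accepts(nfa, ab_star, "abc"))   # False
```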
NFA to DFA
$\varepsilon$-closure: the set of all states reachable using only $\varepsilon$-moves.
Why can we convert an NFA to a DFA
A DFA is in exactly one state at a time, while an NFA is in a set of states after each input. But an NFA has only finitely many states, say $N$, so there are at most $2^N - 1$ distinct nonempty state sets; these sets become the DFA's states.
How to convert NFA to DFA
|  | NFA | DFA |
| --- | --- | --- |
| states | $S$ | subsets of $S$ |
| start | $s$ | $\varepsilon$-closure$(s)$ |
| final | $F \subseteq S$ | $\{X \mid X \cap F \neq \varnothing \}$ |
| transitions | $a(X) = \{y \mid x \in X \wedge x \overset{a}\to y \}$ | $X \overset{a}\to Y$ if $Y=\varepsilon$-closure$(a(X))$ |
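The table translates directly into a worklist algorithm. In this Python sketch, the NFA encoding as a dict (with None labeling $\varepsilon$-moves) and the example NFA are illustrative assumptions.

```python
from collections import deque

def eps_closure(nfa, states):
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, None), ()):   # None labels eps-moves
            if t not in closure:
                closure.add(t); stack.append(t)
    return frozenset(closure)

def nfa_to_dfa(nfa, start, finals, alphabet):
    d_start = eps_closure(nfa, {start})
    d_trans, d_finals = {}, set()
    work, seen = deque([d_start]), {d_start}
    while work:
        X = work.popleft()
        if X & finals:                     # X is final iff X intersects F
            d_finals.add(X)
        for a in alphabet:
            move = set().union(*(nfa.get((s, a), set()) for s in X))
            Y = eps_closure(nfa, move)     # X --a--> eps-closure(a(X))
            d_trans[(X, a)] = Y
            if Y not in seen:
                seen.add(Y); work.append(Y)
    return d_start, d_trans, d_finals

def run(dfa, text):
    start, trans, finals = dfa
    state = start
    for ch in text:
        state = trans[(state, ch)]
    return state in finals

# Example NFA for 'ab' with one eps-move: 0 --a--> 1 --eps--> 2 --b--> 3
nfa = {(0, "a"): {1}, (1, None): {2}, (2, "b"): {3}}
dfa = nfa_to_dfa(nfa, 0, {3}, {"a", "b"})
print(run(dfa, "ab"))  # True
print(run(dfa, "a"))   # False
```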
Implementing finite automata
A DFA can be implemented as a 2D table T indexed by (state, input symbol): for every transition $S_i \overset{a}\to S_k$, define $T[i,a]=k$.
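As a sketch, such a table can be stored with one row per state; the digit-string DFA and the -1 dead-state marker below are illustrative assumptions.

```python
DIGITS = "0123456789"

# DFA accepting a nonempty digit string.
# State 0 = start, state 1 = accepting (seen at least one digit).
T = [
    {c: 1 for c in DIGITS},   # row for S_0: any digit goes to S_1
    {c: 1 for c in DIGITS},   # row for S_1: stay on further digits
]
ACCEPTING = {1}

def matches(s):
    state = 0
    for ch in s:
        state = T[state].get(ch, -1)  # -1: dead state for missing entries
        if state == -1:
            return False
    return state in ACCEPTING

print(matches("123"))  # True
print(matches(""))     # False
print(matches("12a"))  # False
```

Each input character costs one table lookup, which is why DFAs are faster to execute than simulating an NFA.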