[Solved]: Prove that regular expression is unambiguous

Problem Detail: I’ve got following definition: Function $f$ is a valid mapping of word $w$ to regular expression $R$, if any of following conditions is true:

$R = w$ and $f$ is the identity or $R = epsilon$ and $w = epsilon$;
$R = R_1 + R_2$ and $f$ is a valid mapping of $w$ to $R_1$ or $f$ is valid mapping of $w$ to $R_2$;
$R = R_1R_2$, and $w$ can be written as $w_1w_2$ such that $f$ limited to $w_1$ is a valid mapping of $w_1$ to $R_1$, and $f$ limited to $w_2$ is a valid mapping of $w_2$ to $R_2$; or
$R = S^*$ and $w$ can be written $w = w_1w_2dots w_k$ for some $k geq 0$ and, for all $1leq ileq k$, $f$ limited to $w_i$ is a valid mapping of $w_i$ to $S$.

A regular expression $R$ is deterministic (unambiguous) if, for every $win L(R)$, there exists exactly one valid mapping from $w$ to $R$. I need to prove that the regular expression $R = 0^*(11^*000^*)^*11^*01(1+0)^*$ is deterministic. I think that first I should prove that there exists a valid mapping for every word $w$ to $R$ and then assume that there’s a different mappings and lead to a contradiction. But I’ve got problems with details (especially with the second part of the proof).

Asked By : ciechowoj

Answered By : babou

This answer has 3 parts:

I first answer the question as asked, sketching the beginning of a proof concerning the example given. A full proof is too long and tedious to be given here.
Then I give a general procedure to solve the problem and decide whether a regular expression is deterministic/unambiguous.
Finally I discuss some difficulties and problems with the statement of the question.

A direct answer to the question

What you may first prove, though it should be obvious, is that for any regular expression $R$, there is a valid mapping of any word $win L(R)$ to $R$. Then you consider the regular expression $R = 0^*(11^*000^*)^*11^*01(1+0)^*$ to be analyzed, and you procede to decompose it structurally, showing that at every step, there is only one way to decompose a string $win For example, you show that there is only one way to decompose a string $win L(R)$ as $w=uv$ with $uin L(0^*)$ and $vin L((11^*000^*)^*11^*01(1+0)^*)$. The reason is that $u$ contains only $0$, and $v$ begings necessarily with a $1$. So the string $w$ must be cut just before the first one. Then there is only one way to get a valid mapping for the whole expression, from valid mapping of the two subexpressions. If the subexpressions are themselves proved detrministic, i.e. with only one valid mapping fo any string in their language, then the whole expression must be deterministic. Hence you now have to prove only that, $0^*$ and $(11^*000^*)^*11^*01(1+0)^*$ are deterministic, since their concatenation has a unique decomposition. So you go on decomposing the expression into subexpression, such that you have a unique way of matching them with a string of the language. The next step is to show there is only one way to decompose a string $w$ in $L((11^*000^*)^*11^*01(1+0)^*)$, as $w=u01v$, with $uin L((11^*000^*)^*11^*)$ and $vin L((1+0)^*)$ And so on … For each subexpression, you must perform the analysis according to the type of the subexpression, following the definition given for each of the four types in the definition of valid mapping. In the case of $R=S+T$, where $S$ and $T$ are subexpressions of $R$, you only have to prove that $L(S)cap L(T)=emptyset$. This gives you the general idea of a proof on a given example, such as the one above. However, such a proof is very tedious and prone to errors. So you would be better off designing an algorithm that will decide mechanically whether a regular expression is ambiguous or not. The problem is decidable, and not so hard. I follows the same lines I sketched: recursive decomposition of the regular expression, which things to be checked depending on the operator (concatenation, union, Kleene star).

A procedure to decide ambiguousness of a regular expression.

To decide ambiguousness of a regular expression, one must recursively decompose the expression into 1 or 2 subexpressions dominated by an operator, down to the leaf level. Given a regular expression $R$,

If $R=win Sigma+$, then $R$ is unambiguous;
If $R=R_1+R_2$, then $R$ is unambiguous iff both $R_1$ and $R_2$ are unambiguous, and $R_1cap R_2=emptyset$. If the intersection is not empty, any valid mapping from string to $R_1$ could instead map validly to $R_2$; hence there would be two valid mappings. Also, if there were two distinct valid mappings to one daughter, say $R_1$, then by application of rule 2 of the definition, there would be two valid mappings to $R$.
If $R=R_1 R_2$, then $R$ is unambiguous iff both $R_1$ and $R_2$ are unambiguous, and there is only one way to decompose any string $win L(R)$ into 2 substrings as $w=w_1 w_2$ such that $w_1in L(R_1)$ and $w_2in L(R_2)$. $R_1$ and $R_2$ must be both unambiguous because, as before, if one were ambiguous, say $R_1$ for a string $w_1$, if would be possible to build two valid mappings for $R$, which would thus be ambiguous. But how does one check that string $win L(R)$ can be decomposed in only one way into two substrings $w=w_1 w_2$ with $w_iin L(R_i)text{ for }i=1,2$? This can be checked by attempting to simulate 2 simultaneous generations of a string in $L(R)$, such that 2 distinct prefix of the substring will be associated to $R_1$. Let $A_1$ and $A_2$ be two FA recognizing $L(R_1)$ and $L(R_2)$, having (to simplify) one start state and one accepting state. A FA for $L(R)$ can be obtained by linking then with an empty transition from the accept state of $A_1$ to the start state of $A_2$. Then one can build a new FA $A^2$ by doint a cross product of the state set of $A$ with itself to simulate 2 computations at the same time. Both computations must be synchronized so that they scan the input together, though not necessarily following the same computational paths for each simulated automaton. Both computations must at some point cross the linking transition. This is memorized in the final state so that the double computation ends in acceptance iff the both reach the final state of $A_2$, but after crossing the linking transition at different points in the input. This automaton $A^2$ recognizes the language ${win L(R)mid w=uv=u’v’ wedge uneq u’ wedge u,u’in L(R_1) wedge v,v’in L(R_2)$ Hence, strings in $L(R)$ can be decomposed unambiguously in two strings belonging respectively to $L(R_1)$ and $L(R_2)$ iff $L(A^2)=emptyset$.
If $R=S*$, then $R$ is unambiguous iff $S$ is unambiguous, and there is only one way to decompose any string $win L(R)$ into a concatenaton of substrings in $L(S)$. The technique is similar to the one used for concatenation. You take a FA $A_S$ recognizing $L(S)$, with a single accept state, and you connect the accept state to the start state with an empty “looping” transition, to get a FA $A$ for $L(R)$. Then, you buils a FA $A^2$ that simulates two computations of $A$ and accepts only when they both accept, but with the constraint that they must have crossed the looping transition at least once while being at different points of the input (or possibly a different number of times). Again, strings in $L(R)$ can be decomposed unambiguously in a sequence of strings in $L(S)$ iff $L(A^2)=emptyset$.

Caveats

Some care must be taken with the empty word $epsilon$. It is not clear that the problem is always well defined in this respect. The empty word $epsilon$ may be used in regular expressions but will have no apparent counterpart in strings of the language. They can be ignored, But then, what of a Kleene star $R=S^*$ when $epsilonin L(S)$. Should it be considered always ambiguous. Another point is that the concept of valid mapping is really a mapping of occurrences of words to occurrences of regular expressions, i.e., to the syntactic instances, rather than to regular expressions as abstract concepts. For example, consider the string $w=011001$ and the regular expression $R=(01)^*100(111)^*1$. There is a valid mapping $f$ of $w$ to $R$. However it associates the first instance of $01$ with $(01)^*$, and the second instance of $01$ with $0(111)^*1$. What is worse, even considering occurences, $f$ is not a mapping since it associates the first instance of $01$ with both $(01)^*$ and the regular expression $01$ to which the Kleene star is applied. Hence $f$, as it is defined, cannot be considered a mapping. Actually, it seems that $f$ should be defined as a mapping of $R$ and each subexpression occurrence in $R$ to substrings of $w$. But up to that rephrasing of the problem, everything holds.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/40297