Ambiguity
With hiking done for the season, I will be posting from time to time on other technical topics that interest me. Maybe they will interest you, as well.
One of these is ambiguity, a subject I’ve been confronted with in my studies of computer languages and of the likely evolution of such languages in the future. I’m going to start with an observation which I will not explain until much later in this little essay: If there actually was a planet Vulcan, C++ would be one of its street dialects.
Ambiguity arises when a phrase can be interpreted two different ways, both of which are consistent with standard syntax and semantics. For example, I was reading an article about a bar in Washington State that is suing the local city council, on First Amendment grounds, for banning its use of waitresses dressed in bikinis. At one point in the article, the author expressed the view that, while it was not to her taste, she saw no reason for the government to prevent a waitress from serving a drink in a brassiere. The thought ran through my head: “Why would customers want a drink in a brassiere?” The sentence “She saw no reason for the government to prevent a waitress from serving a drink in a brassiere” is ambiguous; the syntax and semantics can reasonably be understood either to mean that the waitress is in a brassiere when she serves the drink, or the drink is in the brassiere when the waitress serves it.
These sorts of things can produce a lot of misunderstanding, and can also be used to humorous effect. Puns rely on semantic ambiguity. A whole series of children’s books (Amelia Bedelia) has been written in which the main character is constantly misunderstanding instructions from her employers in a way that is syntactically and semantically plausible but is not the understanding that would normally occur to an English speaker. (Evil Kent: “And Lewis Carroll was even more fascinated with syntax and semantics than he was with photographing naked children.”)
I hate it when Evil Kent snarks like that, but he’s right. “Jabberwocky” is entertaining precisely because its syntax is perfectly sound English syntax, but with a host of new nouns, verbs, adjectives, and adverbs appearing where our brains expect them. There is a second level of wordplay in that most of the new words seem to share roots with existing words, roots that suggest a plausible meaning.
One of my favorite examples of a song parody is of “Achy Breaky Heart.” The original lyrics are:
But don’t tell my heart
My achy breaky heart
I just don’t think he’d understand
And if you tell my heart
My achy breaky heart
He might blow up and kill this man
The parody, which I like considerably better than the original (I’m no country music fan), goes:
But don’t touch my buns
My icky sticky buns
I just don’t think you understand
That if you touch my buns
My icky sticky buns
I might just up and bite your hand
There are many elements underlying the hilarity of this parody — including that the original was dreadful — but an important element is that the two versions are virtually identical in syntax, word for word.
I suppose, for the benefit of the reader who has not diagrammed a sentence in some time, that I should explain my terms. The syntax of a language is the structure of grammatically correct utterances in the language. The semantics is the actual meaning of the words appearing in the utterance; the particular words chosen do not affect the syntax, but they are what give the construct its meaning. For example,
The quick brown fox jumped over the light brown cat.
has a grammatical structure something like
sentence
  subject
    noun-clause
      ARTICLE (“The”)
      adjective-sequence
        ADJECTIVE (“quick”)
        ADJECTIVE (“brown”)
      NOUN (“fox”)
  object
    VERB (“jumped”)
    prepositional-phrase
      PREPOSITION (“over”)
      noun-clause
        ARTICLE (“the”)
        adjective-sequence
          ADJECTIVE (“light”)
          ADJECTIVE (“brown”)
        NOUN (“cat”)
  closing-punctuator
    PERIOD
The elements in capital letters are actual words or punctuation marks. These constitute what are called tokens in the grammar; they are the atoms out of which sentences are constructed. Every other construct is an intermediate construct in a grammar specification. The grammar specification we infer here is:
$accept: sentence $end
sentence: subject object closing-punctuator
subject: noun-clause
noun-clause: ARTICLE adjective-sequence NOUN
adjective-sequence: ADJECTIVE
adjective-sequence: ADJECTIVE adjective-sequence
object: VERB prepositional-phrase
prepositional-phrase: PREPOSITION noun-clause
closing-punctuator: PERIOD
These present a context-free grammar in what is called Backus-Naur form. It is composed of rules for constructing various elements of the grammar. The first rule merely says that a valid utterance in the language consists of a sentence. Other rules specify other grammatical constructs, all of which ultimately are built of tokens — words and punctuation marks. Note that there are two rules for constructing an adjective-sequence; either one is valid, and their effect is to formally specify that an adjective-sequence consists of one or more adjectives.
These are not, of course, the only rules for English grammar. A noun clause can omit the adjective sequence. A noun clause can also consist of a single PROPER_NOUN without any article. I’ve left out adverbs completely in this simplified grammar.
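In the same notation, the first two of those additional rules would read:

noun-clause: ARTICLE NOUN
noun-clause: PROPER_NOUN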
Furthermore, the Backus-Naur form assumes a context-free language. Every token has a specific meaning when it is encountered, which does not depend on the greater context in which it is found. English is not a context-free language, nor are any other natural languages, because all have words that can serve (for example) as a noun in one context or a verb in others. For example, finish can be a verb whose semantic meaning is “to complete”, or it can be a noun whose semantic meaning is “the completion point.” Generally we know which is meant from context.
Once you have a grammar, you can construct sentences using any particular tokens you wish, as long as they are of the right class. Take the previous sentence, and replace any article with any other, or any adjective with any other, or any noun with any other, or (almost) any verb with any other, and you still have a valid sentence, albeit with an entirely different meaning. This is the entire basis for the humor of Mad Libs. Notice my slight qualifier; I have glossed over the fact that there is more than one category of verb, and (for example) reflexive verbs function syntactically differently from transitive verbs.
Computer languages, such as assembler, BASIC, FORTRAN, C, C++, Eiffel, Python, Java, and so on, cannot be ambiguous. A program has to have a single meaning on a given computer. These artificial languages therefore must be both syntactically unambiguous and semantically unambiguous. This turns out to be a touch harder than you might imagine.
Assembler has a very simple syntax. BASIC is a bit more complicated; FORTRAN, more complicated still. But all these languages (at least in their earliest forms) can be parsed by a computer using a very fast algorithm called LALR(1), and are therefore described as LALR(1) grammars.
Let me, ahem, parse that acronym, LALR(1), for you. The L means that an utterance is scanned from left to right. The R means that constructs are reduced at the rightmost end of the sequence of constructs parsed so far (formally, the parser traces a rightmost derivation in reverse). The LA and the (1) mean that the parser decides what to do by looking ahead at just the next token in the sentence.
The algorithm is what computer scientists call a deterministic machine. It consists of a set of states; a stack showing states and constructs we’ve parsed so far; and a table of actions to take depending on what state the machine is in and what the next token (the lookahead token) is. Here’s a computer-generated description of this machine for the very simple English grammar I’ve just presented:
State 0

    0 $accept: . sentence $end

    ARTICLE  shift, and go to state 1

    sentence     go to state 2
    subject      go to state 3
    noun-clause  go to state 4

State 1

    3 noun-clause: ARTICLE . adjective-sequence NOUN

    ADJECTIVE  shift, and go to state 5

    adjective-sequence  go to state 6

State 2

    0 $accept: sentence . $end

    $end  shift, and go to state 7

State 3

    1 sentence: subject . object closing-punctuator

    VERB  shift, and go to state 8

    object  go to state 9

State 4

    2 subject: noun-clause .

    $default  reduce using rule 2 (subject)

State 5

    4 adjective-sequence: ADJECTIVE .
    5                   | ADJECTIVE . adjective-sequence

    ADJECTIVE  shift, and go to state 5

    $default  reduce using rule 4 (adjective-sequence)

    adjective-sequence  go to state 10

State 6

    3 noun-clause: ARTICLE adjective-sequence . NOUN

    NOUN  shift, and go to state 11

State 7

    0 $accept: sentence $end .

    $default  accept

State 8

    6 object: VERB . prepositional-phrase

    PREPOSITION  shift, and go to state 12

    prepositional-phrase  go to state 13

State 9

    1 sentence: subject object . closing-punctuator

    PERIOD  shift, and go to state 14

    closing-punctuator  go to state 15

State 10

    5 adjective-sequence: ADJECTIVE adjective-sequence .

    $default  reduce using rule 5 (adjective-sequence)

State 11

    3 noun-clause: ARTICLE adjective-sequence NOUN .

    $default  reduce using rule 3 (noun-clause)

State 12

    7 prepositional-phrase: PREPOSITION . noun-clause

    ARTICLE  shift, and go to state 1

    noun-clause  go to state 16

State 13

    6 object: VERB prepositional-phrase .

    $default  reduce using rule 6 (object)

State 14

    8 closing-punctuator: PERIOD .

    $default  reduce using rule 8 (closing-punctuator)

State 15

    1 sentence: subject object closing-punctuator .

    $default  reduce using rule 1 (sentence)

State 16

    7 prepositional-phrase: PREPOSITION noun-clause .

    $default  reduce using rule 7 (prepositional-phrase)
This was generated by a tool called bison, which can take a grammar specification in Backus-Naur form and turn it into a computer program to parse the specified grammar using the LALR(1) algorithm.
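For the curious, the bison input that produces this report (you get the state listing by running bison with its -v option) looks roughly like the following. This is a sketch only: the C prologue, the semantic actions, and the scanner are all omitted.

%token ARTICLE ADJECTIVE NOUN VERB PREPOSITION PERIOD
%%
sentence: subject object closing-punctuator;
subject: noun-clause;
noun-clause: ARTICLE adjective-sequence NOUN;
adjective-sequence: ADJECTIVE
                  | ADJECTIVE adjective-sequence;
object: VERB prepositional-phrase;
prepositional-phrase: PREPOSITION noun-clause;
closing-punctuator: PERIOD;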
The way the algorithm operates is fairly simple, though it takes some explanation. The algorithm starts with an initial stack (0, $begin). This means it is in state 0 and the associated construct is $begin, an artificial construct meaning the beginning of a sentence. The algorithm sees that the first token of the sentence we’re parsing is ARTICLE (“The”). The directions are to shift, and go to state 1. This means we pull ARTICLE off the front of the sentence and push it onto the stack along with the new state, 1. The stack now reads (0, $begin) (1, “The”).
The algorithm is now in state 1, the state at the end of the stack. The lookahead token is now ADJECTIVE (“quick”). The algorithm shifts 5 and ADJECTIVE onto the stack, which now reads (0, $begin) (1, “The”) (5, “quick”).
We’re now in state 5, and the lookahead is ADJECTIVE. We shift and remain in state 5; the stack is now (0, $begin) (1, “The”) (5, “quick”) (5, “brown”). The next token is NOUN; since this is not ADJECTIVE, which is the only lookahead for which we have a specific action, we do the default action of reducing the stack. This is the trickiest part of the algorithm. We pop the last state off the stack, which now reads (0, $begin) (1, “The”) (5, “quick”). We next convert the ADJECTIVE we just popped off the stack to an adjective-sequence, treat it temporarily as if it was the lookahead token, treat the state at the end of the stack (5) as if it was temporarily our state, and see that we should push (10, adjective-sequence) onto the stack. We are now in state 10 with the stack (0, $begin) (1, “The”) (5, “quick”) (10, adjective-sequence).
The lookahead token is still NOUN. We see that we are to pop the last two constructs off the stack, reduce these to adjective-sequence, and consult the stack, which now reads (0, $begin) (1, “The”). We treat the adjective-sequence we just constructed as if it was the next token, 1 as if it was our current state, and shift the adjective-sequence and the state 6 onto the stack: (0, $begin) (1, “The”) (6, adjective-sequence).
We’re in state 6 and the lookahead token is NOUN. We shift and go to state 11: (0, $begin) (1, “The”) (6, adjective-sequence) (11, NOUN). We’re now in state 11, and the lookahead is VERB. The default action is to reduce the last three items on the stack to noun-clause (stack is now (0, $begin)), treat noun-clause as our lookahead for state 0, and see that we should go to state 4. Our stack is now (0, $begin) (4, noun-clause) and the lookahead is still VERB.
I won’t belabor this further. I leave the rest as an exercise for the student. The key thing is that this algorithm very efficiently parses the tokens in a sentence into intermediate constructs, then puts these together into larger constructs, until the whole sentence has been reduced, the $end marker shifted, and the parse accepted. Or we get into a state where there is no applicable rule; that means there’s a syntax error in the sentence and we reject it.
Our grammar is unambiguous. Every valid sentence, and only valid sentences, will be accepted by our parser; and they will be parsed in exactly one way.
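For the programmers in the audience, here is a minimal sketch in C++ of the driver loop just described, with the action and goto tables transcribed by hand from the state listing above. It is an illustration of the algorithm, not the code bison actually generates (bison emits much denser packed tables); the state and rule numbers match the listing, and the stack holds just the state numbers, since the constructs are only along for the ride.

#include <cstddef>
#include <map>
#include <vector>

enum Tok { ARTICLE, ADJECTIVE, NOUN, VERB, PREPOSITION, PERIOD, END };
enum NT  { Sentence, Subject, NounClause, AdjSeq, Object, PrepPhrase, Closer };

struct Rule { NT lhs; int len; };  // what a reduction builds; how much it pops
const Rule rules[] = {
    {Sentence, 0},    // 0: $accept (never reduced; accepting is special-cased)
    {Sentence, 3},    // 1: sentence: subject object closing-punctuator
    {Subject, 1},     // 2: subject: noun-clause
    {NounClause, 3},  // 3: noun-clause: ARTICLE adjective-sequence NOUN
    {AdjSeq, 1},      // 4: adjective-sequence: ADJECTIVE
    {AdjSeq, 2},      // 5: adjective-sequence: ADJECTIVE adjective-sequence
    {Object, 2},      // 6: object: VERB prepositional-phrase
    {PrepPhrase, 2},  // 7: prepositional-phrase: PREPOSITION noun-clause
    {Closer, 1},      // 8: closing-punctuator: PERIOD
};

struct State {
    std::map<Tok, int> shift;  // "TOKEN shift, and go to state N"
    int reduce;                // "$default reduce using rule R" (-1: none)
    std::map<NT, int> go;      // "construct go to state N"
};
const std::map<int, State> states = {
    {0,  {{{ARTICLE, 1}},      -1, {{Sentence, 2}, {Subject, 3}, {NounClause, 4}}}},
    {1,  {{{ADJECTIVE, 5}},    -1, {{AdjSeq, 6}}}},
    {2,  {{{END, 7}},          -1, {}}},
    {3,  {{{VERB, 8}},         -1, {{Object, 9}}}},
    {4,  {{},                   2, {}}},
    {5,  {{{ADJECTIVE, 5}},     4, {{AdjSeq, 10}}}},
    {6,  {{{NOUN, 11}},        -1, {}}},
    {7,  {{},                  -1, {}}},  // the accept state
    {8,  {{{PREPOSITION, 12}}, -1, {{PrepPhrase, 13}}}},
    {9,  {{{PERIOD, 14}},      -1, {{Closer, 15}}}},
    {10, {{},                   5, {}}},
    {11, {{},                   3, {}}},
    {12, {{{ARTICLE, 1}},      -1, {{NounClause, 16}}}},
    {13, {{},                   6, {}}},
    {14, {{},                   8, {}}},
    {15, {{},                   1, {}}},
    {16, {{},                   7, {}}},
};

bool parse(const std::vector<Tok>& input) {  // input must end with END
    std::vector<int> stack{0};               // the parse stack, states only
    std::size_t pos = 0;                     // index of the lookahead token
    for (;;) {
        if (stack.back() == 7) return true;  // state 7: $default accept
        const State& st = states.at(stack.back());
        auto hit = pos < input.size() ? st.shift.find(input[pos])
                                      : st.shift.end();
        if (hit != st.shift.end()) {         // shift: consume the lookahead
            stack.push_back(hit->second);
            ++pos;
        } else if (st.reduce >= 0) {         // reduce: pop the rule's right-
            const Rule& r = rules[st.reduce];//   hand side, then take the goto
            stack.resize(stack.size() - r.len);
            stack.push_back(states.at(stack.back()).go.at(r.lhs));
        } else {
            return false;                    // no applicable action: reject
        }
    }
}

int main() {  // "The quick brown fox jumped over the light brown cat."
    return parse({ARTICLE, ADJECTIVE, ADJECTIVE, NOUN, VERB, PREPOSITION,
                  ARTICLE, ADJECTIVE, ADJECTIVE, NOUN, PERIOD, END}) ? 0 : 1;
}

Feed it the token sequence for our fox sentence, as main does, and it accepts; garble the tokens and it rejects.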
Now consider that “light” can be an adverb as well as an adjective. We can interpret “light brown cat” to mean either a cat that is light brown in color, or a cat that is light and is brown. This kind of thing comes up in every natural language, but is a problem for computer languages, which can be assigned only a single meaning. When we speak English, we almost always parse it the first way. In effect, our brains have a disambiguation rule built into them that tells us which of two alternate readings is acceptable. This rule is based on both syntax and semantics; we naturally prefer strings of adjectives to adverb-modified adjectives, and we also recognize that foxes can be quick, but brown that is quick makes no semantic sense.
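In the tree notation from earlier, the two readings parse the same words into different structures. Here is a sketch of just the adjective-sequence fragment, assuming an ADVERB token and a modified-adjective construct that our toy grammar does not have. A cat that is light brown in color:

adjective-sequence
  modified-adjective
    ADVERB (“light”)
    ADJECTIVE (“brown”)

A cat that is light, and brown:

adjective-sequence
  ADJECTIVE (“light”)
  adjective-sequence
    ADJECTIVE (“brown”)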
Consider this grammar:
$accept: a $end
a: b c
b: A
b: B
c: B
c: C
This is an ambiguous grammar. B can serve both as a b and as a c, either of which reduces to an a. Submit this to bison, and you will be told:
warning: 1 reduce/reduce conflict
This is an indication of possible ambiguity. The states generated by bison are:
State 0

    0 $accept: . a $end

    A  shift, and go to state 1
    B  shift, and go to state 2
    C  shift, and go to state 3

    a  go to state 4
    b  go to state 5
    c  go to state 6

State 1

    3 b: A .

    $default  reduce using rule 3 (b)

State 2

    4 b: B .
    5 c: B .

    $end      reduce using rule 4 (b)
    $end      [reduce using rule 5 (c)]
    $default  reduce using rule 4 (b)

State 3

    6 c: C .

    $default  reduce using rule 6 (c)

State 4

    0 $accept: a . $end

    $end  shift, and go to state 7

State 5

    1 a: b .

    $default  reduce using rule 1 (a)

State 6

    2 a: c .

    $default  reduce using rule 2 (a)

State 7

    0 $accept: a $end .

    $default  accept
Notice that state 2 has conflicting rules:
    $end  reduce using rule 4 (b)
    $end  [reduce using rule 5 (c)]
Only one action can be built into the parser tables; bison chooses the first, and the second (shown in brackets because it has been discarded) will never be applied.
This is a genuine ambiguity. A similar ambiguity exists in C++:
int main()
{
    A * B();
    /* more stuff */
}
The question becomes: Is the statement A * B() declaring a function, B, that returns a pointer to the named type, A; or is it telling us to multiply the variable A by the result of calling the function B with no arguments? From a syntactic point of view, both are valid interpretations. So C++ has a disambiguation rule: If a construct can be interpreted either as a declaration or as an expression, it is to be interpreted as a declaration. C++ also “cheats” by looking at what kinds of symbols A and B are, and discarding alternatives that don’t work semantically; this is done through something called the scanner hack, where tokens are identified based on previously parsed code rather than just the spelling of the token. That this is called a hack tells you that programming purists really don’t like it; it blurs the distinction between syntax and semantics.
Here’s the thing: In the original C language, you would have to write this as
int main()
{
    struct A * B();
    /* more stuff */
}
or as
int main()
{
    extern A * B();
    /* more stuff */
}
for it to be interpreted as a declaration. “struct” disambiguates A, making it clear that it is a type name. “extern” disambiguates the entire statement, by making it clear that it is a declaration. But this is wordy, and therefore cumbersome for programmers. When you add the scanner hack, in which an A that has previously been declared as a type is given a different token type from an A that has not, you end up with a grammar that is (or easily can be made) LALR(1). Formally, it isn’t, because the scanner hack makes the language context-sensitive; but this seems a small price to pay for sparing programmers from having to add lots of “struct” or “extern” keywords to disambiguate their code.
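To make the idea concrete, here is a sketch of the scanner hack in C++, with hypothetical names; the real thing lives inside a compiler’s scanner and symbol table:

#include <set>
#include <string>

enum TokenKind { IDENTIFIER, TYPE_NAME };

std::set<std::string> declared_types;  // filled in by the parser as it goes

// The scanner consults what the parser has already seen: the same spelling
// yields different tokens, which is what makes the language context-sensitive.
TokenKind classify(const std::string& name) {
    return declared_types.count(name) ? TYPE_NAME : IDENTIFIER;
}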
C++ is infinitely worse. It introduces constructors, which are invoked in expressions that look like function calls using the name of a type as the function name. This makes it that much harder to distinguish expressions from declarations. It introduces templates, which use < and > to delimit template arguments; these become ambiguous when an argument may be an expression, since both < and > are also used in expressions. The result is that C++ is an absolute mess to parse. Even today, the most popular compiler for C++ cannot correctly parse the program:
template <bool> class A {};
const int a = 0, b = 1;
A<a<b> c;
A<a>b> d;
The compiler chokes on the last line. This should be interpreted as declaring d to be of type A<false>. But the compiler mistakes the first > for the closing delimiter of the template argument list, looks for an initializer or closing ‘;’ after b, and dies when it fails to find one.
What’s required to parse C++? First, you need a way to parse all the possibilities; then a way to choose the correct one. The former may be accomplished a number of ways, but one is to use a GLR (generalized LR) algorithm. This algorithm can purportedly parse any context-free grammar and return every legal possibility when there is an ambiguity. It can also look ahead as far as necessary to rule out possibilities. Whereas LALR(1) looks ahead just one token, GLR can, in principle, look at the entire remainder of the program to sort things out.

GLR does this by parsing as if it was LALR(1) until it hits an ambiguity; then it splits its stack and parses the two possibilities in parallel. Each stack can be split again, if yet another ambiguity is reached. If a stack gets into a state where there is no rule for the lookahead token, it “dies.” Eventually either all the stacks die, in which case the program is in error and is rejected; or exactly one stack is left, and you accept its parse and carry on from there; or there are multiple stacks left but all have “come back together” in the same final state, in which case you have genuine ambiguity. The algorithm must then have been provided disambiguation rules for choosing the right parse, or it can pass all the possibilities to a function written by the programmer that examines each parse and decides which can be ruled out.
Consider this grammar:
$accept: a $end
a: b
a: c
b: d '{' '('
c: e '{' '{'
d: A
e: A
This is a contrived example, but useful to illustrate an LALR(1) conflict that is not a real ambiguity. Both A { ( and A { { are valid sentences in this grammar, and each has exactly one parse. But LALR(1) is not powerful enough to handle it. bison reports a single reduce/reduce conflict:
State 1

    5 d: A .
    6 e: A .

    '{'       reduce using rule 5 (d)
    '{'       [reduce using rule 6 (e)]
    $default  reduce using rule 5 (d)
What this is saying is that, when the parser has scanned the first A and sees the '{', it must reduce the A to either d or e. The correct choice depends on whether the '{' is followed by '{' or by '('. But that requires more than one token of lookahead (an LALR(2) parser, say), so the LALR(1) parser guesses d, which means the sentence is parsed incorrectly whenever it should have been e, even though the sentence itself is valid and unambiguous.
GLR can handle this. It splits the stack and parses on either assumption. The stack for d dies if the '{' is followed by '{', leaving only the other stack, which is then accepted by the algorithm as the correct parse.
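bison will build such a parser on request. Here is a sketch of the contrived grammar above as a GLR bison specification; the %expect-rr line tells bison that the single reduce/reduce conflict is expected:

%glr-parser
%expect-rr 1
%token A
%%
a: b | c;
b: d '{' '(';
c: e '{' '{';
d: A;
e: A;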
Could C++ be tweaked to be an LALR(1) language? It turns out the answer is yes, at least if the doctoral research of a couple of graduate students is to be believed. One approach is to keep the language as it is, but tweak the grammar to directly reflect ambiguities. This is apparently possible, but makes for an enormously complex and counterintuitive LALR(1) grammar, and some of the parsed elements require careful analysis to determine what they really mean. Thus, the A * B() ambiguity would be resolved by parsing this as a “declaration-or-expression” construct, then letting the next stage of the compiler (the semantic analyzer) decide which it is. It’s not clear this actually gains you anything.
The other approach is to change C++ itself into an easily parsed language. I’ve already mentioned one thing that would help: Requiring all declarations to start with a storage class, such as “extern.” This is unpleasantly wordy. You’d also have to use something other than < and > to delimit template arguments. The problem here is that there are only so many balanced delimiters, as we call them, on a standard keyboard: (), [], {}, and <>. The first is reserved for function calls and nesting of expressions. The second is reserved for array subscripting. Braces are taken by blocks and initializers. And < and > are used in expressions, to denote greater-than and less-than. You’d be forced to use a double delimiter, say, “<{” and “}>”. That would nicely eliminate the ambiguity, but it’s ugly.
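Under that hypothetical syntax, the troublesome declaration from earlier would read:

A<{ a > b }> d;  /* hypothetical C++ with <{ and }> as template delimiters */

and the parser could never confuse the greater-than with the closing delimiter of the template argument list.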
Bottom line? C++ is hard to parse, because it takes “shortcuts” that programmers like. The burden is shifted from C++ programmers, a relatively large group of people, to C++ compiler writers, a very small group of people. In many respects, the evolution of C++ is looking more and more like that of a natural language: new constructs are being invented and added, shoehorned in wherever they fit, in ways that are logical but add yet more burden to the compiler writers. And we all seem okay with this, because compiler writers are few and very good at what they do, and we like our friendly, backwards-compatible language.
And so, back to my comment at the very start of this article: If there really was a planet Vulcan, C++ would be one of its street dialects. It’s a perfectly logical language with no ambiguity that cannot be resolved by disambiguation rules, and its “natural” feel makes up for its messiness. If Vulcan has a high language, equivalent to high Latin (not spoken in the streets of Vulcan, but used in the Vulcan Senate and in official documents) it is a cumbersome but grammatically elegant (that is, LALR(1)) version of C++.
I see no reason to think the evolution of computer languages will go in any other direction. Were it so, I believe that Eiffel, an alternative to C++ with a rigorous (and LALR(1)) grammar, would have a far greater popularity than it does. In fact, I think this conclusion is unambiguous.
It’s fortunate the fox isn’t colored lime instead of brown.
Just so.
Is your compiler in C++0x mode? Because my understanding of the C++0x rules is that they require the first non-nested > to be read as the end of the template argument list; to do what you wanted would need parentheses:
A<(a>b)> d;
lol – and the comment parser blew up my code!! Read it as a tag!! Talk about ironic.
Basically use (a>b) rather than a>b.
I wondered if there was a rule that the first non-nested > was to be read as the end of the template argument list. Based on my experiments with C++ parsers, it would certainly make parsing easier for compiler writers. But I could not find such a rule in a quick skim of the C++11 standard — admittedly, a massive document.
https://stackoverflow.com/a/15785583
Thanks.