basetype : TYPE | IDENTIFIER ; type : basetype | basetype parameter_list ; arguments : arraytype IDENTIFIER | arraytype IDENTIFIER COMMA arguments ; argument_list : OPEN_PAREN CLOSE_PAREN | OPEN_PAREN arguments CLOSE_PAREN ; parameters : arraytype | arraytype COMMA parameters ; parameter_list : OPEN_PAREN CLOSE_PAREN | OPEN_PAREN parameters CLOSE_PAREN ; expressions : expression | expression COMMA expressions ; expression_list : OPEN_PAREN CLOSE_PAREN | OPEN_PAREN expressions CLOSE_PAREN ; // just a type that can be an array (this language does not support // multidimensional arrays) arraytype: : type | type OPEN_BRACKET CLOSE_BRACKET ; block : OPEN_BRACE CLOSE_BRACE | OPEN_BRACE statements CLOSE_BRACE ; function_expression : arraytype argument_list block | arraytype IDENTIFIER argument_list block ; call_expression : expression expressions ; index_expression : expression OPEN_BRACKET expression CLOSE_BRACKET ; expression : function_expression | call_expression | index_expression | OPEN_PAREN expression CLOSE_PAREN ; function_statement : arraytype IDENTIFIER argument_list block ; define_clause : IDENTIFIER | IDENTIFIER ASSIGN expression ; define_chain : define_clause | define_clause COMMA define_chain ; define_statement : arraytype define_chain ; statement : function_statement | define_statement SEMICOLON | expression SEMICOLON ; statements : | statement statements ;
Example parses:
// function 'fn' returns a reference to a void, parameterless function void() fn() {} // the parser doesn't know which of these are types and which are variables, // so it doesn't know until the end that this is a call_expression Object(var, var, var, var) // the parser only finds out at the end that this is a function declaration Object(var, var, var, var) fn2() {} (Object(var, var, var, var) ()) (Object(var, var, var, var) () {}) // the parser could possibly detect the "string" type and figure out that // this has to be a define statement or a function declaration, but it's // still ambiguous to the end Object(Object(string, Object), string[]) fn3() {} Object(Object(string, Object), string[]) fn4 = fn3;
My basic approach has been to write functions that could parse the unambiguous components of this subset of the grammar, and then flatten the more complex control flow into individual functional blocks to capture state in function calls. This has proved unsuccessful, what techniques can one use to solve this kind of problem? *There is likely a better word for this
Asked By : skeggse
Answered By : babou
U -> X Y V -> X Y id M -> U id foo N -> V bar Z -> M | N
where id is any identifier, while foo and bar are language keywords. and you have to parse … xxxx yyy very_long_identifier foo …. After reducing xxxx to X and yyy to Y, you have to decide whether to reduce X Y to U, or rather scan the coming identifier. It is quite visible that that depend on the keyword foo or bar that comes after the identifier. This can be easily dealt with with bounded lookahead, as the decisive keyword is only 2 tokens away when the decision is to be made. But this assumes that the identifier is considered as a single keyword. Some people assert that, since CF languages are closed under substitution by regular sets, lexical analysis is not really essential and CF parsers should be able to handle it with the rest of the CF syntax. This is true for a general CF parser (though disputable on practical grounds). But it requires some care and adaptation for classical deterministic parsing technology, because the theoretically unbounded length of identifiers raises lookahead problems. In the example above, if the identifier has not been recognized as such by a lexical analyzer, then it is just an arbitrarily long sequence of alphanumeric tokens, so that the decisive keywords, foo and bar are unboundedly far away. This shows that recognizing an intermediate structure through a pass with a simple finite state device can remove a problem such as unbounded lookahead, at least in some cases, which is all you ask for if you develop by hand a specific parser. If is for you to analyze precisely your grammar to see what can be done, but the above suggest the following approach: You state:
I cannot know definitively whether a given string of tokens represents a function statement, expression, or define statement until near the end of any of those structures in the worst case.
So the question is (up to some probably modifiable details): can you recognize that a sequence of symbols is one of these two constructs of your language, and identify which (without producing a parse-tree) by means of a simple finite state automaton (FSA)? If the answer is yes, than whenever you encounter the beginning of such a sequence, you activate the corresponding FSA to determine quickly which kind of sequence it is, and then you use the result to make your parsing decision.
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/44102 3.2K people like this