[Help-bash] Posix: 2.3 Token Recognition & 2.10 Shell Grammar

Discussion:

Michael Convey

2015-07-13 15:12:54 UTC

I've read these two sections -- more than once actually, but my
understanding of them is still unsatisfactory. Here are the sources:

- Token Recognition:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_03
- Shell Grammar:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_10

I find the following shell grammar excerpt particularly confusing:

"When a TOKEN is seen where one of those annotated productions could be
used to reduce the symbol, the applicable rule shall be applied to convert
the token identifier type of the TOKEN to a token identifier acceptable at
that point in the grammar. The reduction shall then proceed based upon the
token identifier type yielded by the rule applied."

Is there a book or some other source that provides a layman's exposition
of these two sections?

Eric Blake

2015-07-14 13:01:36 UTC

Not that I'm aware of. But I can at least give a layman's shot at
trying to explain the intent:

The shell allows:

case in in in ) echo yes;; esac

which means that the tokenizer cannot blindly treat 'in' as a keyword
everywhere, but only in the places where the keyword is expected (the
third token after seeing 'case' as the first token). So, reading the
grammar, we see (among others):

case_clause : Case WORD linebreak in linebreak case_list Esac

in : In /* Apply rule 6 */

%token In
/* 'in' */

6. [Third word of for and case]

a. [case only]

When the TOKEN is exactly the reserved word in, the token identifier
for in shall result. Otherwise, the token WORD shall be returned.
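
As a rough illustration of rule 6a only, the decision can be written as a small shell function. This is a sketch, not how any real shell implements its parser: the function name classify_after_case_word is made up here, and its argument is assumed to be the raw token text exactly as typed, before quote removal.

```shell
# Hypothetical helper, not part of any shell: mimics rule 6a for the token
# that follows "case WORD". $1 is assumed to be the raw token text.
classify_after_case_word() {
    case $1 in
        in) echo In ;;      # exactly the reserved word: token identifier In
        *)  echo WORD ;;    # anything else, including a quoted \in: WORD
    esac
}

classify_after_case_word 'in'     # prints In
classify_after_case_word '\in'    # prints WORD
classify_after_case_word 'inn'    # prints WORD
```

The single quotes in the calls only keep the backslash intact so the function sees the token as it would appear in the input.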

So the parser has seen 'case' as Case, the first 'in' as WORD, and is
trying to determine whether the second 'in' fits the rules for
"case_clause". Initially, 'in' is classified as TOKEN, and we are at
the rule for the "in" production, which says to use rule 6 to
disambiguate the token. Rule 6 says that the string "in" is recognized
as a reserved word in this context, so the tokenizer reclassifies it
from TOKEN to In, and the grammar then accepts the clause as a valid
sequence of tokens. If you do anything else, like:

case in \in in ) echo yes;; esac

you'll get "bash: syntax error near unexpected token `\in'". That is,
applying the same analysis as above, the "in" production applies Rule 6
to the TOKEN '\in'; since it is not the literal string 'in', it is not
recognized as a reserved word and is not reclassified, so the
"case_clause" rule is not satisfied and you have a syntax error.
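
Both command lines can be checked without risking the current shell by handing each one to a child shell. The following is a minimal sketch, assuming the sh found on PATH is a POSIX-conforming shell; the helper name try is made up for illustration, and stderr is discarded because the wording of the error message differs between shells.

```shell
#!/bin/sh
# Hypothetical helper: parse and run a snippet in a child shell, so that a
# syntax error in the snippet does not abort this script.
try() {
    if out=$(sh -c "$1" 2>/dev/null); then
        printf 'accepted: %s\n' "$out"
    else
        printf 'rejected\n'
    fi
}

try 'case in in in ) echo yes;; esac'     # prints "accepted: yes"
try 'case in \in in ) echo yes;; esac'    # prints "rejected"
```

The snippets are single-quoted so that the backslash in \in reaches the child shell unchanged.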

--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Michael Convey

2015-07-15 02:48:06 UTC

Eric, very helpful, thank you. I'm working on a write-up that summarizes
the token recognition and shell grammar sections at a slightly higher level
and in more understandable terms than the POSIX standard provides. Your
specific example has helped me better understand how to read and interpret
the shell grammar.

Michael Convey

2018-11-25 23:41:36 UTC

As a follow-up to this old thread, here's my write-up that summarizes token
recognition and shell grammar.

https://docs.google.com/document/d/14JO-CZBlcv3WQFh1998gGwD_XOz5kyhyAiqKGSJlFJI/edit?usp=sharing

Mike