% Copyright 2012-2024, Alexander Shibakov
% This file is part of SPLinT
%
% SPLinT is free software: you can redistribute it and/or modify
% it under the terms of the GNU General Public License as published by
% the Free Software Foundation, either version 3 of the License, or
% (at your option) any later version.
%
% SPLinT is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
% GNU General Public License for more details.
%
% You should have received a copy of the GNU General Public License
% along with SPLinT. If not, see <http://www.gnu.org/licenses/>.
\input limbo.sty
\input frontmatter.sty
\def\optimization{5}
\input yy.sty
\modenormal
% multi-column output
\input dcols.sty
\topskip=9pt % this is a purely aesthetic choice, also negating the \raggedbottom
% option in cwebmac.tex
% set the typesetting of various token groups
\let\currentparsernamespace\parsernamespace
\let\currenttokeneq\tokeneq
\let\parsernamespace\mainnamespace
\let\hostparsernamespace\mainnamespace % for \nameproc in tokeneqpretty
\let\tokeneq\tokeneqpretty
\let\optstrextra\optstrextraesc
%\traceprettytokenstrue
\input bo.tok % re-use token equivalence table to set the typesetting of tokens
\input btokenset.sty % adjust the typesetting of some tokens
\let\parsernamespace\flexnamespace
\let\hostparsernamespace\flexnamespace
\input fo.tok
\input ftokenset.sty % ... for the flex input lexer
\let\parsernamespace\flexpseudorenamespace
\let\hostparsernamespace\flexpseudorenamespace
\input fretokenset.sty % regular expression names
% index entries
\let\parsernamespace\indexpseudonamespace
\input yypretty.sty
\prettywordpair{TOKEN}{{\tt TOKEN} {\rm(example)}}%
\prettywordpair{token}{{\tt "token"} {\rm(example)}}%
\let\tokeneq\currenttokeneq
\let\parsernamespace\currentparsernamespace
\let\hostparsernamespace\mainnamespace % the namespace where tokens are looked up
% by \nameproc and friends for typesetting purposes
%
\immediate\openout\exampletable=\jobname.exl % file for parser output examples
\def\nontitle#1{{\ttl #1}}
\def\cite[#1]{%
\def\next{#1}\setbox0=\hbox{l}%
[\ifx\next\empty$\,$\hbox{\vrule width\wd0 height\ht0 depth\dp0}$\,$\else \locallink{#1bibref}#1\endlink\fi]%
}
\let\N\textN
\let\N\chapterN
\let\M\textM
\defreserved{Y}{\.{Y}}
\showlastactiontrue
\initauxstream
@**Introduction.
\splint\footnote{I was tempted to call the package {\tt ParLALRgram}
which stands for Parsing {\sc LALR} Grammars or {\tt PinT} for
`Parsing in \TeX' but both sounded too generic.} (Simple Parsing and
Lexing in \TeX, or, following the great GNU
tradition of creating recursive names, \splint\ Parses Languages
in \TeX) is a system (or
rather a m\'elange of systems) designed to
facilitate the development of parsing macros in \TeX\ and (to a lesser
degree) to assist one in documenting parsers written in other languages. As
an application, parsers for \bison\ and \flex\ input file syntax have been
developed, along with a macro collection that makes it possible to
design and pretty print\footnote{The term {\it pretty printing\/} is used here in
its technical sense as one might find that there is nothing pretty about
the output of the parsing routines presented in this document.}
\bison\ grammars and \flex\ automata using \CWEB. The \.{examples}
directory contains a few other parsers designed to pretty print
various languages (among them is \ld, the language of the \GNU\
linker).
@s TeX_ TeX
@s TeXa TeX
@s TeXb TeX
@s TeXf TeX
@s TeXfo TeX
@s TeXao TeX
@*1 \eatone{CWEB}\CWEB\ and literate programming.
Writing software in \CWEB\ involves two programs. The first of these is
\CTANGLE\ that outputs the actual code, intended to be in
\Cee. In reality, \CTANGLE\ cares very little about the language it
produces. Among the exceptions are \Cee\ comments and |@[#line@]| directives that might
confuse lesser software but \bison\ is all too happy to swallow them
(there are also some \Cee\ specific constructs that \CTANGLE\ tries to
recognize). \CTANGLE's main function is to rearrange the text of the
program as written by the programmer (in a way that, hopefully,
emphasizes the internal logic of the code) into an appropriate
sequence (e.g.~all variable declarations must textually precede their
use). All that is required to adapt \CTANGLE\ to produce \bison\
output is some very rudimentary post- and pre-processing.
Our main concern is thus \CWEAVE\ that not only pretty prints the
program but also creates an index, cross-references all the
sections, etc. Getting \CWEAVE\ to pretty print a language other than
\Cee\ requires some additional effort. A true digital warrior would
probably try to decipher \CWEAVE's output `in the raw' but, alas, my
(C)WebFu is not that strong. The loophole comes in the form of a rarely
(for a good reason) used \CWEB\ command: the verbatim (\.{@@=...@@>})
output. The material to be output by this construct undergoes minimal
processing and is put inside \.{\\vb\{}$\ldots$\.{\}}. All that is
needed now is a way to process this virtually straight text inside \TeX.
This manual, as well as nearly every other document that accompanies
\splint, is itself a source for a computer program (or, as is the case
with this document, several programs) that is extracted using
\CTANGLE. We refer an interested reader to \cite[CWEB] for a detailed
description of the syntax and use patterns of \CWEB. The following is
merely a brief overview of the approach.
Every \CWEB\ document is split into {\em sections}, each divided into
three parts (any one of which can be empty): the \TeX\ part, the middle part, and
the \Cee\ part (which should more appropriately be called the {\em code\/} or
the {\em program\/} part). The \Cee\ part of each\footnote{With the exception of the
nameless \.{@@c} (or \.{@@p}) sections.} section carries a name for cross referencing
purposes. The sections themselves are automatically numbered by \CWEAVE\ and
their code parts may be referenced from other sections, as well
as included in other sections' code parts using \CWEB's cross referencing
syntax (such as \.{@@<}$\ldots$\.{@@>}). Using the same name for the \Cee\
portion in several sections has the effect of merging the
corresponding code fragments. When the section with such a name is
used (included) later, all of the concatenated fragments are included
as well, even the ones that appear after the point in the \CWEB\
document where such inclusion takes place.
The original \CWEB\ macros (from \.{cwebmac.tex}) used the numbers
generated by \CWEAVE\ to refer to specific sections. This was true
for the table of contents, as well as the index entries. The macros
used by \splint\ adopt a different convention, proposed by N.~Ramsey
for his literate programming software \noweb. In the new system
(which will be referred to as the \noweb\ style of cross referencing),
each section is labelled by the page number where it starts and an
alphabetic character that indicates the order of appearance of the
section on the page. Also following \noweb, the new macros display
links between the fragments of the same section in the margins. This
allows for quicker navigation between sections of the code and lets
the reader get a quick overview of what gets `collected' in a given
section.
The top level (\.{@@**}) sections, introducing
major portions of the code, have also been given a more prominent
appearance. They display the chapter number using a large typeface
and omit the marginal section reference. References to such sections
are typeset as {\it cnnn\/} where {\it nnn\/} represents the chapter
number. While such references obscure the page number, hopefully one keeps
the number of chapters, as well as such references, small. This
modification improves the appearance of the first chapter pages.
\CWEB\ also generates an {\em index\/} of all the identifiers (with
some exceptions, such as single letter names) appearing in the
\Cee\ portion of each section, {\em except\/} those that appear inside
the {\em verbatim@^verbatim block@>\/} portions of the code
(i.e.~between \.{@@=} and \.{@@>}). Since \splint\ uses the verbatim blocks
extensively, additional indexing facilities have been implemented to
provide indexing for the non-\Cee\ languages handled by various
\splint\ parsers.
@*1 Pretty (and not so pretty) printing.
Pretty-printing can be narrowly defined as a way to organize the
presentation of the program's text. The range of visual devices used
for this purpose is usually limited to indentation and discrete line
skips, to mimic the capabilities of an old computer terminal. Some
authors (see~\cite[ACM]) have replaced the term pretty printing with
{\em program visualization\/} to refer to a much broader range of
graphic tools for translating the code (and its meaning) into a richer
medium. This manual uses the terms {\em pretty printing\/} and {\em
program visualization\/} interchangeably.
Pretty printing in the broader sense above has been the subject of
research for some time. The monograph~\cite[ACM] develops a methodical
(if not formalized) approach to the design of visualization frameworks
for programming languages (although the main focus is on procedural
\Cee-like languages).
A number of papers about pretty printing have appeared since,
extending the research to new languages, and suggesting new
visualization rules.
Unfortunately, most of this research is driven by rules of
thumb and anecdotes (the approach fully embraced by this manual),
although there have been a few rigorous studies investigating
isolated visualization techniques (see, for example, the discussion
of variable declaration placement in~\cite[Jo]).
Perhaps the only firm conclusion one can draw from this discussion is
that {\em writing\/} the code and {\em reading\/} it are very different
activities so facilitating the former may in turn make the latter more
difficult and vice versa. Some well known languages try to arrive at a
compromise where the syntax forces a certain style of
presentation on the programmer. An example of a successful language in
this group is Python with its meaningful white space. The author does
not share the enthusiasm some programmers express for this approach.
On the other hand, a language like \Cee\ does not enforce any
presentation format\footnote{The `feature' so masterfully exploited by
the International Obfuscated \Cee\ Code Contest ({\sc IOCCC})
participants.}. The authors of
\Cee\ even remarked that semicolons and braces were merely a nod to
the compiler (or, one might add, static analysis software,
see~\cite[KR]). It may thus seem reasonable that such redundant
syntax elements may be replaced by different typographic devices (such
as judiciously chosen skips and indentation, or the choice of fonts)
when (pretty) printing the code.
Even the critics of pretty printing
usually concede that well indented code is easier to read. The practice
of using different typefaces to distinguish between various
syntactic elements (such as reserved words and general identifiers) is
a subject of some controversy, although not as pronounced as some of
the more drastic approaches (such as completely replacing the brace
pairs with indentation as practiced by \splint\ for \bison\ input or
by the authors of~\cite[ACM] for the control statements in \Cee).
The goal of \splint\ was not to force any particular `pretty printing
philosophy' on the programmer (although, if one uses the macros `as
is', some form of quiet approval is assumed $\ldots$) but rather to
provide one with the tools necessary to implement one's own vision of
making the code readable.
One tacit assumption made by the author is that an integral part of
any pretty printing strategy is extracting (some) meaning from the raw
text. This is done by {\em parsing\/} the program, the subject we
discuss next. It should be said that it is the parser design in \TeX\
that \splint\ aims to facilitate, with pretty printing being merely an
important application.
@*1 Parsing and parsers.
At an abstract level, a {\em parser@^parser@>\/} is just a routine
that transforms text. Naturally, not every possible transformation is
beneficial, so, informally, the value of a parser lies in its ability
to expose some {\em meaning\/} in the text. If valid texts are reduced
to a small finite set (while each text can be arbitrarily long) one
can conceivably write a primitive string matching algorithm that
recognizes whether any given input is an element of such set, and if
it is, which one. Such `parsers' would be rather limited and are only
mentioned to illustrate the point that, in general, the texts being
parsed are not required to follow any particular specification.
In practice, however, real world parsers rely on the
presence of some structure in the input to do their work. The latter
can be introduced by supplying a formal (computable) description of
every valid input. The `rigidity' of this specification directly
affects the sophistication of the parsing algorithm required to
process a valid input (or reject an invalid one).
Parsing algorithms normally follow a model where the text is processed
a few symbols at a time and the information about the symbols
already seen is carried in some easily accessible form. `A few symbols at a time'
often translates to `at most one symbol', while `easily accessible'
reduces to using a stack-like data structure for bookkeeping.
A popular way of specifying {\em structure\/} is by using a {\em
formal grammar@^grammar@>}\footnote{While popular, formal grammars are
not the only way of describing a language. For example, `powers of $2$
presented in radix $3$' is a specification that cannot be defined by a
context-free grammar, although it is possible to write a (very complex)
grammar for it.} that essentially expresses how some (preferably
meaningful) parts of the text relate to other parts. Keeping with the
principle of making the information about the seen portions of the
input easily accessible, practical grammars are normally required to
express the meaning of a fragment in a manner that does not depend
on the input that surrounds the fragment (i.e.~to be {\em
context-free@^context-free@>}). Real-world languages rarely satisfy
this requirement\footnote{Processing |typedef|'s in \Cee\ is a well
known case of such a language defect.} thus presenting a challenge to
parser generating software that assumes the language is context-free.
Even the task of parsing all context-free languages is too ambitious
in most practical scenarios, so further limitations on the grammar are
normally imposed. One may require that the next action of the parsing
algorithm must depend exclusively on the next symbol seen and one of
the finitely many {\em states\/} the parser may be in. The action here
simply refers to the choice of the next state, as well as the possible
decision to consume more input or output a portion of the {\em
abstract syntax tree@^abstract syntax tree@>\/} which is discussed
below.
The same language may have more than one grammar and the choice of the
latter normally has a profound effect on the selection of
the parsing algorithm. Without getting too deep into the parsing
theory, consider the following simple sketch.
\medskip
\beginprod
\format{\inline\flatten}
pexp:
'(' pexp ')' \
` astring \
;
%
astring:
\
` '*' astring \
;
\endprod
\medskip
\noindent Informally, the language consists of `strings of $n$ \.{*}'s
nested $m$ parentheses deep'. After parsing such a string, one might
be interested in the values of $m$ and $n$.
The three
states the parser may be in are `start', `parsing \prodstyle{pexp}' and
`parsing \prodstyle{astring}'. A quick glance at the grammar above
shows that switching between the states is straightforward (we omit
the discussion of the `start' state for brevity):
if the next symbol is \.{(}, parse the next~\prodstyle{pexp},
otherwise, if the next symbol is \.{*}, parse~\prodstyle{astring}.
Finally, if the next symbol is \.{)} and we
are parsing~\prodstyle{pexp}, finish parsing it and look for the next
input, otherwise, we are parsing~\prodstyle{astring}, finish parsing
it, make it a~\prodstyle{pexp}, finish parsing a~\prodstyle{pexp}
started by a parenthesis, and look for more input. This unnecessarily
long (as well as incomplete and imprecise) description serves
to present the simple fact that the
parsing states are most naturally represented by individual {\em
functions\/} resulting in what is known as a {\em recursive descent
parser@^recursive descent parser@>\/} in which the call stack is the
`data structure' responsible for keeping track of the parser's
state. One disadvantage of the algorithm above is that the maximal
depth of the call stack reaches $m+n$ which may present a problem for
longer strings.
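To make this sketch more tangible, here is a toy recognizer for the
grammar above, written directly in \TeX\ macros (it is {\it not\/} part
of \splint; every macro and counter name in it is made up for this
example, there is no error checking to speak of, and the counting is
done on the way in, for brevity). Each nonterminal becomes a macro
that examines the next character and hands control to the macro for the
next state, in the tail recursive style discussed later in this manual.
\begindemo
^\newcount\pdepth \newcount\scount
^\def\pexp#1{%
^  \if#1(\advance\pdepth by 1\relax
^    \expandafter\pexp        % a nested pexp follows
^  \else
^    \expandafter\astring\expandafter#1%
^  \fi}
^\def\astring#1{%
^  \if#1*\advance\scount by 1\relax
^    \expandafter\astring     % more stars may follow
^  \else
^    \expandafter\closep\expandafter#1%
^  \fi}
^\def\closep#1{%
^  \if#1)\expandafter\closep  % match the closing parentheses
^  \else
^    \expandafter\done\expandafter#1%
^  \fi}
^\def\done#1{\message{m=\the\pdepth, n=\the\scount}}
^\pexp((*)).                  % the final `.' stops the scan; reports m=2, n=1
\enddemo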
Computing $m$ and $n$ above now reduces to incrementing an appropriate
variable upon exiting the corresponding function. More important,
however, is the observation that this parser specification can be
extracted from the grammar in a very straightforward fashion. To
better illustrate the r\^ole of the grammar in the choice of the
parsing algorithm, consider the following syntax for the same
language:
\medskip
\beginprod
\format{\inline\flatten}
pexp:
'(' pexp ')' \
` astring \
;
%
astring:
\
` astring '*' \
;
\endprod
\medskip
\noindent While the language is unchanged (so the algorithm
above still works), the lookahead tokens are not {\em immediately\/}
apparent upon looking at the productions. Some preprocessing must take
place before one can decide on the choice of the parser states and the
appropriate lookahead tokens. Such parser generating algorithms
indeed exist and will produce what is known as an {\sc LR} parser
for the fragment above (actually, a simpler {\sc LALR} parser may be built for this
grammar\footnote{Both of these algorithms will use the parser stack
more efficiently, effectively resolving the `call stack depth' issue
mentioned earlier.}). One can see that some grammar types may make
the selection of the parsing algorithm more involved. Since \splint\ relies
on \bison\ for the generation of the parsing algorithm, one must
ensure that the grammar is {\sc LALR}$(1)$\footnote{The newest versions of
\bison\ are capable of processing a {\em much\/} wider set of
grammars, although \splint\ can only handle the \bison\ output for
{\sc LALR}$(1)$ parsers.}.
@*1 Using the \eatone{bison}\bison\ parser.
The process of using \splint\ for writing parsing macros in \TeX\ is
treated in considerable detail later in this document. A shorter
(albeit somewhat outdated but still applicable) version of this
process is outlined in \cite[Sh], included as part of \splint's documentation.
We begin, instead, by explaining how one such parser can be used to pretty print a
\bison\ grammar. Following the convention mentioned above and putting
all non-\Cee\ code inside \CWEAVE's verbatim blocks, consider the
following (meaningless) code fragment\footnote{The software included in the package
contains a number of preprocessing scripts that reduce the necessity of using
the verbatim blocks for every line of the \bison\ code so the snippet above can
instead be presented without the distraction of \.{@@=...@@>}, looking more
like the `native' \bison\ input.}. The fragment contains a mixture
of \Cee\ and \bison\ code, the former appears outside of the verbatim blocks.
\begindemo
^@@= non_terminal: @@>
^@@= term.1 term.2 {@@> a = b; @@=}@@>
^@@= **H term.3 other_term {@@> $$ = $1; @@=}@@>
^@@= **H still more terms {@@> f($1); @@=}@@>
^@@= ; @@>
\enddemo
The fragment above will appear as (the output of \CTANGLE\ can be
examined in \.{sill.y})
@=
@G
non_terminal:
term.1 term.2 {@> a = b; @=}
| term.3 other_term {@> $$ = $1; @=}
| still more terms {@> f($1); @=}
;
@g
@ $\ldots$ if the syntax is correct.
In case it is a bit off (note the missing colon after \.{whoops} below), the parser will give up and
you will see a different result. The code in the fragment below is easily
recognizable, and some parts of it (all of \Cee\ code, in fact) are
still pretty printed by \CWEAVE. Only the verbatim portion is left
unprocessed. The layout of the original code is reproduced unchanged, including the braces and production separators (i.e.\
\.{\yl}) normally removed by the parser for presentation purposes.
@=
@G
whoops
term.1 term.2 {@>@+ a = b; @+@=}
| term.3 other_term {@>@+ $$ = $1; @+@=}
| still more terms {@>@+ f($1); @+@=}
;
@g
@ The \TeX\ header that makes such output possible is quite plain. In the case
of this document it begins as
\begindemo
^\input limbo.sty
^\input frontmatter.sty
^\def\optimization{5}
^\input yy.sty
\nooutput
\enddemo
The first two lines are presented here merely for completeness: there is
no parsing-relevant code in them. The third line
(\.{\\def\\optimization\{5\}}) may be ignored for now (we discuss some
ways the parser code may be sped up
\locallink{optimization}later\endlink). The line that
follows loads the macros that implement the parsing and scanning
machinery.
This is enough to set up all the basic
mechanisms used by the parsing and lexing macros. The rest of the header
provides a few definitions to fine tune the typesetting of
grammar productions. It starts with
\begindemo
^\let\currentparsernamespace\parsernamespace
^ \let\parsernamespace\mainnamespace
^ \let\currenttokeneq\tokeneq
^ \def\tokeneq#1#2{\prettytoken{#1}}
^ \input bo.tok % re-use token equivalence table to set the typesetting of tokens
^ \let\tokeneq\currenttokeneq
^ \input btokenset.sty
\nooutput
\enddemo
We will have a chance to discuss all the \.{\\}$\ldots$\.{namespace}
macros later, at this point it will suffice to say that the lines
above are responsible for controlling the typesetting of term names. The
file \.{bo.tok} consists of a number of lines like the ones below:
\begindemo
^\tokeneq {STRING}{{34}{115}{116}{114}{105}{110}{103}{34}}
^\tokeneq {PERCENT_TOKEN}{{34}{37}{116}{111}{107}{101}{110}{34}}
\nooutput
\enddemo
The cryptic looking sequences of integers above are strings of {\sc ASCII}
codes of the letters that form the name that \bison\ uses when it needs to
refer to the corresponding token (thus, the second one is
\toksa{}\numberstochars{34}{37}{116}{111}{107}{101}{110}{34}\end
\.{\the\toksa} which might help explain why such an indirect scheme
has been chosen). The macro \.{\\tokeneq} is defined in
\.{yymisc.sty}, which in turn is input by \.{yy.sty}, but what about
the token names themselves? In this case they were extracted
automatically from the \CWEB\ source file by the
\locallink{bootstrapping}{\em bootstrapping parser\/} \endlink during the
\CWEAVE\ processing stage. All of these definitions can be
overwritten to get the desired output (say, one might want to typeset
\.{ID} in a roman font, as `identifier'; all that needs to be done to
make this possible is to provide a macro that says \.{\\prettywordpair\{ID\}\{\{\\rm
identifier\}\}} in an appropriate namespace (usually
\.{\\hostparsernamespace})). The file \.{btokenset.sty} input above
contains a number of such definitions.
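For instance, a couple of entries in the same style (the entries below
are invented purely for illustration; see \.{btokenset.sty} for the real
ones) might read
\begindemo
^\prettywordpair{ID}{{\rm identifier}}%
^\prettywordpair{INT}{{\rm an integer}}%
\enddemo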
@ To round off this short overview, I must mention a caveat associated
with using the macros in this collection: while one of the greatest
advantages of using \CWEB\ is its ability to rearrange the code in a
very flexible way, the parser will either give up or produce
unintended output if this feature is abused while describing the
grammar. For example, in the code below
@=
@G
next_term:
stuff @> @ @={@> a = f( x ); @=}
@g
@@;
@ the line titled |@| is intended to be a rule defined
later. Notice that while it seems that the parser was able to recognize
the first code fragment as a valid \bison\ input, it misplaced the
|@|, having erroneously assumed it to be a part of
the action code for this grammar (later on we will go into the details of
why it is necessary to collect all the non-verbatim output of \CWEAVE,
even that which contains no interesting \Cee\ code; hint: it has
something to do with money (\.{\$}), also known as math and the way
\CWEAVE\ processes the `gaps' between verbatim sections). The production
line that follows did not fare as well: the parser gave up. There
is simply no point in including such a small language fragment as a
valid input for the grammar the parser uses to process the verbatim
output.
@=
@G
more stuff in this line {@> @[b = g(y);@]@=}
@g
@ Finally, if you forget that only the verbatim part of the output is
looked at by the parser you might get something unrecognizable, such
as
@=
but not all of it
@ To correct this, one can provide a more complete grammar fragment to
allow the parser to complete its task successfully. In some cases,
this imposes too strict a constraint on the programmer. Instead, the
parser that pretty prints \bison\ grammars allows one to add {\it
hidden context\/} to the code fragments above. The context is added
inside \.{\\vb} sections using \CWEB's \.{@@t}$\ldots$\.{@@>} facility. The \CTANGLE\
output is not affected by this while the code above can now be typeset as:
@=
@G
next_term:
stuff @> @t}\vb{\formatlocal{\let\peekstash\stashtoterm}}{@> @ @t}\vb{FAKE}{@> @={@> a = f( x ); @=}
@g
@@;
@ $\ldots$ even a single line can now be displayed properly.
@=
@G
@t}\vb{\formatlocal{\skipheader} FAKE:}{@>
more stuff in this line {@> b = g( y ); @=}
@g
@ With enough hidden context, even a small rule fragment can be
typeset as intended. The `action star' was inserted to reveal some of
the context.
@=
@G
@t}\vb{\formatlocal{\skipheader} FAKE:}{@>
but not all of it
@t}\vb{\{\stashed{$\star$}\}}{@>
@g
@ What makes all of this even more confusing is that \CTANGLE\ will
have no trouble outputting this as a(n almost, due to the
intentionally bad \.{whoops} production above) valid \bison\ file
(as can be checked by looking into \.{sill.y}). The author
happens to think that one should not fragment the software into pieces
that are too small: \bison\ is not \Cee\ so it makes sense to write
\bison\ code differently. However, if the logic behind your code
organization demands such fine fragmentation, hidden context provides
you with a tool to show it off. A look inside the source of this
document shows that adding hidden context can be a bit ugly so it is
not recommended for routine use. The short example above is output in
the file below.
@(sill.y@>=
@@;
@*1 On debugging. This concludes a short introduction to the \bison\
grammar pretty printing using this macro collection. It would be
incomplete, however, without a short reference to debugging\footnote{At
the moment we are discussing debugging the output produced by \CWEAVE\ when
the included \bison\ parser is used, {\it not\/} debugging parsers
written with the help of this software: the latter topic is covered in more
detail later on.}. There is a
fair amount of debugging information that the macros can output;
unfortunately, very little of it is tailored to the {\em use\/} of the
macros in the \bison\ parser. Most of it is designed to help build a
{\em new\/} parser. If you find that the \bison\ parser gives up too often
or even crashes (the latter is most certainly a bug in the \splint\
version of the \bison\ parser itself), the first approach is to make
sure that your code {\em compiles}, i.e.\ forget about the printed
output and try to see if the `real' \bison\ accepts the code (just the
syntax, no need to worry about conflicts and such).
If this does not shed any light on why the macros seem to fail, turn
on the debugging output by saying \.{\\trace$\ldots$true} to activate the
appropriate trace macros. This may produce {\it a lot\/} of output, even for
small fragments, so turn it on for only a section at a time. If you
need still {\it more\/} details of the inner workings of the parser
and the lexer, various other debugging conditionals are available. For
example, \.{\\yyflexdebugtrue} turns on the debugging output for the
scanner. There are a number of such conditionals that are discussed in
the commentary for the appropriate \TeX\ macros. Most of these
conditionals are documented in \.{yydebug.sty}, which provides a
number of handy shortcuts for a few commonly encountered
situations, as well.
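For example, to watch a single section only, one might bracket it with a
pair of such switches, as in the fragment below (the two switch names
appear elsewhere in this document; the \.{$\ldots$false} forms assume
the usual \.{\\newif} convention).
\begindemo
^\yyflexdebugtrue        % detailed scanner diagnostics
^\traceprettytokenstrue  % trace the typesetting of tokens
^  % ... the troublesome fragment goes here ...
^\traceprettytokensfalse
^\yyflexdebugfalse
\enddemo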
Remember, what you are seeing at this point is the parsing process of
the \bison\ input file, not the one for {\it your\/} grammar (which
might not even be complete at this point). However, if all of the
above fails, you are on your own: drop me a line if you figure out how
to fix any bugs you find.
@** Terminology. \namedspot{terminology}This short chapter is an informal
listing of a few loose definitions of
the concepts used repeatedly in this documentation. Most of this terminology is
rather standard. Formal precision is not the goal here, instead, intuitive
explanations are substituted whenever possible.
{%
\def\aterm#1{\item{\sqebullet}{\ttl #1}: \ignorespaces}%
\setbox0=\hbox{\sqebullet\enspace}
\parindent=0pt
\advance\parindent by \wd0
\smallskip
\aterm{bison {\rm(as well as} flex{\rm)} parser{\rm(}s{\rm)}}
while, strictly speaking, not a formally defined
term, this combination will always stand for one of the parsers generated
by this package designed to parse a subset of the `official' grammar for
\bison\ or \flex\ input files. All of these parsers are described later in
this documentation. The term {\it main parser\/} will be
used as a substitute in example documentation for the same purpose.
\aterm{driver} a generic but poorly defined concept. In this
documentation it is used predominantly to mean both the \Cee\ code and
the resulting executable that outputs the \TeX\ macros that contain the
parser tables, token values, etc., for the parsers built by the user. It
is understood that the \Cee\ code of the `driver' is unchanged and the
information about the parser itself is obtained by {\it including\/} the \Cee\
file produced by \bison\ in the `driver' (see the examples supplied
with the package).
\aterm{lexer} a synonym for {\it scanner}, a subroutine that performs the {\it
lexical analysis\/} phase of the parsing process, i.e.\ groups various
characters from the input stream into parser {\it tokens}.
\aterm{namespace} this is an overused bit of terminology meaning a
set of names grouped together according to some relatively
well defined principle. In a language without a well developed type
system (such as \TeX) it is usually accompanied by a specially designed
naming scheme. {\it Parser namespaces\/} are commonly used in this
documentation to mean a collection of all the data structures describing a
parser and its state, including tables, stacks, etc., named by using the
`root' name (say \.{\\yytable}) and adding the name of the parser (for
example, \.{[main]}). To support this naming scheme, a number of
macros work in unison to create and rename the `data macros'
accordingly\footnote{To be precise, the {\em namespaces\/} in this
manual, would more appropriately be referred to as {\em named
scopes}. The {\em tag namespace\/} in \Cee\ is an example of a
(built-in) language namespace where the {\em grammatical r\^ole\/} of the
identifier determines its association with the appropriate set.}.
\aterm{parser stack}
a collection of parsers, usually derived from a common set of
productions, and sharing a common lexer. As the name suggests, the
parsers in the collection are tried in order until the input is parsed
successfully or every parser has been tried. This terminology may become a
source of some confusion, since each parsing algorithm used by \bison\
maintains several stacks. We will always refer to them by naming a specific
task the stack is used for (such as the {\em value stack\/} or the
{\em state stack}, etc.).
\aterm{pretty printing {\rm or} program visualization}
The terms above are used interchangeably in this manual to mean
typesetting the program code in a way that emphasizes its meaning as
seen by the author of the program\footnote{Or the person
typesetting the code.}. It is usually assumed that such
meaning is extracted by the software (a specially designed {\em
parser\/}) and translated into a suitable visual representation.
\aterm{symbolic switch} a macro (or an associative array of macros)
that lets the \TeX\ parser generated by the package associate {\it
symbolic term names\/} (called {\it named references\/} in the official
\bison\ documentation) with the terms. Unlike the `real' parser, the
parser created with this suite requires some extra setup as explained
in the included examples (one can also consult the source for this
documentation which creates but does not use a symbolic switch).
\aterm{symbolic term name} (also referred to as a {\it named reference\/}
in the \bison\ manual): a (relatively new) way to refer to stack
values in \bison. In addition to using the `positional' names such as
\.{\$}$n$ to refer to term values, one can utilize the new syntax:
\.{\$}\.{[}\\{name}\.{]} (or even \.{\$}\\{name} when the \\{name}
has a tame enough syntax). The `\\{name}' can be assigned by the
user or can be the name of the nonterminal or token used in the
productions.
\aterm{term} in a narrow sense, an `element' of a grammar. Instead of
a long winded definition, an example, such as \prodstyle{ID} should
suffice. Terms are further classified into {\it terminals\/} (tokens)
and {\it nonterminals\/} (which may be intuitively thought of as
composite terms).
\aterm{token} in short, an element of a set. Usually encoded as an
integer by most parsers, a {\em token\/} is an indivisible {\em term\/}
produced for the parser by the scanner. \TeX's scanner uses a more
sophisticated token classification, for example, $($character code,
character category$)$ pairs, etc.
}
@** Languages, scanners, parsers, and \TeX. % Or $\ldots$
\vtop{\halign to\hsize{\kern-1.5pt\it#\hfil\tabskip0pt plus1fil\cr
Tokens and tables keep macros in check.\cr
Make 'em with \bison, use \.{WEAVE} as a tool.\cr
Add \TeX\ and \CTANGLE, and \Cee\ to the pool.\cr
Reduce 'em with actions, look forward, not back.\cr
Macros, productions, recursion and stack!\cr
\noalign{\vskip2pt}
\omit\hfil\eightpoint Computer generated (most likely)\cr}}
\bigskip
\def\recount#1{${}^{(#1)}$}%
\noindent In order to understand the parsing routines in this collection,
it would help to gain some familiarity with the internals of the
parsers produced by \bison\ for its intended target: \Cee. A person
looking inside a parser delivered by \bison\ would
quickly discover that the parsing procedure itself (|yyparse|)
occupies a rather small portion of the file. If (s)he were to further
reduce the size of the file by removing all the preprocessor
directives intended to anticipate every conceivable combination of the
operating system, compiler, and \Cee\ dialect, and various reporting
and error logging functions, it would become very clear that the most
valuable product of \bison's labor is a collection of integer {\it
tables\/} that control the actions of the parser routine. Moreover,
the routine itself is an extremely concise and well-structured loop
composed of |goto|'s and a number of numerical conditionals. If one
could think of a way of accessing arrays and processing conditionals
in the language of one's choice, once the tables produced by \bison\
have been converted into a form suitable for the consumption by the
appropriate language engine, the parser implementation becomes
straightforward. Or nearly so.
The {\it scanning\/} (or {\it lexing\/}) step of this process---a way
to convert a stream of symbols into a stream of integers---deserves
some attention, as well. There are a number of excellent programs
written to automate this step in much the same fashion as \bison\
automates the generation of parsers. One such tool, \flex, though
(in the opinion of this author) slightly lacking in the simplicity and
elegance when compared to \bison, was used to implement the lexer for
this software suite. Lexing in \TeX\ will be discussed in considerable
detail later in this manual.
The language of interest in our case is, of course, \TeX, so our
future discussion will revolve around the five elements mentioned
above: \recount{1}data structures (mainly arrays and stacks),
\recount{2}converting
\bison's output into a form suitable for \TeX's consumption,
\recount{3}processing raw streams of \TeX's tokens and converting them into
streams of parser tokens, \recount{4}the implementation of \bison's
|yyparse| in \TeX, and, finally, \recount{5}producing \TeX\ output via {\it
syntax-directed translation} (which requires an appropriate
abstraction to represent \bison's actions inside \TeX). We shall
begin by discussing the parsing process itself.
@*1 Arrays, stacks, and the parser.
Let us briefly examine the programming environment offered by \TeX.
Designed for typesetting, \TeX's remarkable language
provides a layer of macro processing atop a set of commands that
produce the output fulfilling its primary mission: delivering page
layouts. In The \TeX book, the macro {\it expansion\/} is likened to
mastication, whereas \TeX's main product, the typographic output is the
result of its `digestion' process. Not everything that goes through
\TeX's digestive tract ends up leaving a trace on the final page: a
file full of \.{\\relax}'s will produce no output, even though
\.{\\relax} is not a macro, and thus would have to be processed by
\TeX\ at the lowest level.
It is time to describe the details of defining suitable data structures
in \TeX. At first glance, \TeX\ provides rather standard means of
organizing and using the memory. At the core of its generic
programming environment is an array of \.{\\count}$\,n$ {\it
registers\/}, which may be viewed as general purpose integer variables
that are randomly accessible by their indices. The integer arithmetic
machinery offered by \TeX\ is spartan but is very adequate for the sort of
operations a parser would perform: mostly additions and
comparisons.
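A made up fragment showing the sort of bookkeeping involved (the
register names below are invented and do not appear in \splint) might
look like this:
\begindemo
^\newcount\mystate \newcount\mysymbol
^\mystate=4 \mysymbol=17
^\advance\mystate by \mysymbol
^\ifnum\mystate>20
^  \message{take one branch of the table}%
^\else
^  \message{take the other}%
^\fi
\enddemo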
Is the \.{\\count} array a good way to store tables in \TeX? Probably
not. The first factor is the {\it size\/} of this array: only 256
\.{\\count} registers exist in a standard \TeX\ (the actual number of
such registers on a typical machine running \TeX\ is significantly
higher but this author is a great believer in standards, and to his
knowledge, none of the standardization efforts in the \TeX\ world has
resulted in anything even close to the definitive masterpiece that is
The \TeX book). The issue of size can be mitigated to some extent by
using a number of other similar arrays used by \TeX\ (\.{\\catcode},
\.{\\uccode}, \.{\\dimen}, \.{\\sfcode} and others can be used for
this purpose as long as one takes care to restore the `sane' values
before the control is handed off to \TeX's typesetting mechanisms). If a
table has to span several such arrays, however, the complexity of
accessing code would have to increase significantly, and the issue of
size would still haunt the programmer.
The second factor is the utilization of several registers by \TeX\ for special
purposes (in addition, some of these registers can only store a
limited range of values). Thus, the first 10 \.{\\count} registers are
used by the plain \TeX\ for (well, {\it intended\/} for, anyway) the
purposes of page accounting: their values would have to be carefully
saved and restored before and after each parsing call,
respectively. Other registers (\.{\\catcode} in particular) have even
more disrupting effects on \TeX's internal mechanisms. While all of
this can be managed (after all, using \TeX\ for an arithmetic task
such as parsing suspends the need for any typographic or other
specialized functions controlled by these arrays), the added
complexity of using several memory banks simultaneously and the speed penalty
caused by the need to save and restore register values make this
approach much less attractive.
What other means of storing arrays are provided by \TeX? Essentially,
only three options remain: \.{\\token} registers, macros holding whole
arrays, and associative arrays accessed through
\.{\\csname}$\,\ldots\,$\.{\\endcsname}. In the first two cases if care
is taken to store such arrays in an
appropriate form one can use \TeX's \.{\\ifcase} primitive to access
individual elements. The trade-off is the speed of such
access: it is {\it linear\/} in the size of the array for most
operations, and worse than that for others, such as removing the last
item of an array. Using clever ways
of organizing such arrays, one can improve the linear access time to
$O(\log n)$ by simply modifying the access macros but at the moment, a
straightforward \.{\\ifcase} is used after expanding a list macro or
the contents of a \.{\\token}$\,n$ register in an {\it un\/}optimized
parser. An {\it optimized\/} parser uses associative arrays.
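As a toy illustration of both flavors (the macro names below are
invented, and the real \splint\ access macros are organized
differently), the `linear' variety keeps the whole array in a macro
behind an \.{\\ifcase}, while the `associative' one devotes a control
sequence to every cell:
\begindemo
^\newcount\tempval
^\def\smalltable#1{\ifcase#1 12\or 7\or 0\or 42\else 0\fi}
^\tempval=\smalltable{2}\relax     % entry 2 (counting from 0): \tempval is now 0
^\def\setcell#1#2{\expandafter\def\csname cell[#1]\endcsname{#2}}
^\def\getcell#1{\csname cell[#1]\endcsname}
^\setcell{1000}{42}
^\tempval=\getcell{1000}\relax     % `random' access: \tempval is now 42
\enddemo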
The array discussion above is just as applicable to {\it stacks\/}
(indeed, an array is the most common form of stack
implementation). Since stacks pop up and disappear frequently (what
else are stacks to do?), list macros are usually used to store
them. The optimized parser uses a separate \.{\\count} register to
keep track of the top of the stack in the corresponding associative
array\footnote{Which means, unfortunately, that making such a fully optimized
parser {\em reentrant\/} would take an extraordinary amount of effort. Hence,
if reentrancy is a requirement, stacks are better kept inside list macros.}.
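A toy version of the list macro approach might look as follows (every
name below is made up, there is no check for an empty stack, and the
items are assumed to be plain token lists; \.{\\poptail} serves only as
an argument delimiter):
\begindemo
^\let\stackbottom\relax                   % an inert end of stack marker
^\def\mystack{\stackbottom}
^\def\push#1{\toks0={{#1}}\toks2=\expandafter{\mystack}%
^  \edef\mystack{\the\toks0 \the\toks2}}
^\def\pop#1{\expandafter\popoff\mystack\poptail#1}
^\def\popoff#1#2\poptail#3{\def#3{#1}\def\mystack{#2}}
^\push{12}\push{7}      % the stack now holds {7}{12}\stackbottom
^\pop\topvalue          % \topvalue is 7, the stack holds {12}\stackbottom
\enddemo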
Let us now switch our attention
to the code that implements the parser and scanner {\it functions\/}.
If one has spent some time writing \TeX\ macros of any sophistication
(or any macros, for that matter) (s)he must be familiar with the general
feeling of frustration and the desire to `just call a function here and move
on'. Macros\footnote{Formally defined as `$\ldots$ special
compile-time functions that consume and produce {\em syntax objects}'
in~\cite[DHB].} produce {\it tokens\/}, however, and tokens must either
expand to nothing or stay and be contributed to your input, or worse,
be out of place and produce an error. One way to sustain a stream
of execution with macros is {\it tail recursion\/} (i.e.~always expanding the
{\it last token left standing}).
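The fragment below (a made up example, not \splint\ code) shows this
plain \TeX\ idiom in its simplest form: every macro involved makes sure
that the last token it leaves behind is the one that continues the
computation.
\begindemo
^\newcount\iterations
^\def\startloop{\iterations=0 \nextstep}
^\def\nextstep{%
^  \advance\iterations by 1\relax
^  \ifnum\iterations<5\relax
^    \let\next\nextstep      % keep going
^  \else
^    \let\next\relax         % the `loop' ends here
^  \fi
^  \next}                    % the last token left standing
^\startloop
\enddemo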
As we have already discussed, \bison's
|yyparse()| is a well laid out loop organized as a sequence of
|goto|'s (no reason to become religious about structured programming
here). This fact, and the following well known trick, make \Cee\ to \TeX\
translation nearly straightforward. The macro \TeX niques employed by the
sample code below are further discussed elsewhere in this manual.
% The macro mess below looks painful but this is the only place such layout is used
% The approach can be easily generalized and put in limbo.sty but it seems
% a bit redundant at this point.
\newcount\piccount
\newdimen\lasthsize
\setbox5=\vtop{
\demomargin=0pt
\let\demoastyle\empty
\begindemo
^label A: ...
\nooutput
^ if**L**Krm(condition)**N
^ goto C;
\nooutput
^label B: ...
\nooutput
^ goto A;
\nooutput
^label C: ...
\nooutput
\enddemo
}
\dp5=\z@@
\setbox3=\vtop{
\demomargin=0pt
\let\demoastyle\empty
\begindemo
^\if**L**Krm(condition)**N
^ \let\next=\labelC
^\else
^ \let\next=\labelAtail
\enddemo
}
\dp3=\z@@
\newdimen\lastdepth
\def\startfitpar{%
\bgroup
\lasthsize=\hsize
\advance\lasthsize-1.5in
\vsize=\baselineskip
\topskip=\z@@
\setbox0\box2 % empty it
% this sounds good at first but there is no good way to pull the insertions out after the
% box manipulations that follow;
% insertions will thus be contributed to whatever page was being worked on when the
% picture insertions {\it started}; hence, if these happen to start at the very top of the page,
% any insertion that follows will be contributed to the previous page; we correct this for footnotes
% below
% \holdinginserts=1
\output{%
\global\setbox2=\vbox{
\ifvoid2
\else
\prevdepth=\dp2
\unvbox2
\fi
\lastdepth=\dp255
\unvbox255
% this would be tempting, however, the \eject that follows should disappear
% in addition, one really should not be playing with page breaking in the middle of
% such tricky insertions
% \penalty\outputpenalty
% \kern-\lastdepth % to make sure \baselineskip is accounted for
}%
}\eject
\output{%
\setbox0=\vbox{%
\unvbox255%
}% \lastbox would almost work ... if not for insertions
\global\advance\piccount1
\global\setbox2=\vbox{%
\prevdepth=\dp2 \unvbox2
\hbox to\hsize{%
\ifnum\piccount<15
\hbox to1.5in{%
\ifnum\piccount=1
\ \box5
\fi
\hfill}%
\fi
\box0 \hfill
\ifnum\piccount=1
\box3 \ %
\fi
\ifvoid\footins % reinsert footnotes
\else
\insert\footins{\unvbox\footins}%
\fi
}%
}%
}%
\parshape=15
0pt 2.7in
0pt 2.7in
0pt 2.7in
0pt 2.7in
0pt 2.7in
0pt 2.7in
0pt 2.7in
0pt \lasthsize
0pt \lasthsize
0pt \lasthsize
0pt \lasthsize
0pt \lasthsize
0pt \lasthsize
0pt \lasthsize
0pt \hsize
}
\def\endfitpar{%
\par
\eject
\egroup
% see the comment above
% \holdinginserts=0
\prevdepth=\dp2
\unvbox2
}
\startfitpar
\noindent Given the code on the left (where |goto|'s
are the only means of branching but can appear inside conditionals),
one way to translate it into \TeX\ is to define a set of macros (call
them \.{\\labelA}, \.{\\labelAtail} and so forth for clarity) that end in
\.{\\next} (a common name for this purpose). Now, \.{\\labelA} will
implement the code that comes between \.{label A:} and \.{goto C;},
whereas \.{\\labelAtail} is responsible for the code after \.{goto C;}
and before \.{label B:}
(provided no other |goto|'s intervene which can always be
arranged). The conditional which precedes \.{goto C;} can now be written in
\TeX\ as presented on the right, where (condition) is an appropriate
translation of the corresponding condition
in the code being translated (usually, one of `$=$' or `$\not=$'). Further
details can be extracted from the \TeX\ code that implements these
functions where the corresponding \Cee\ code is presented alongside
the macros that mimic its functionality%
\footnote{Running the risk of overloading the reader with details, the author
would like to note that the actual implementation follows a {\it slightly\/} different
route in order to avoid any \.{\\let} assignments or changing the
meaning of \.{\\next}.}.
This concludes the overview of the general approach.
It is time to consider the way characters get consumed
on the lower levels of the macro hierarchy and the interaction between the different
layers of the package.
\endfitpar
@*1 \TeX\ into tokens.
Thus far we have covered the ideas
behind items \recount{1} and \recount{4} on our list. It is time to
discuss the lowest level of processing performed by these macros:
converting \TeX's tokens into the tokens consumed by the parser,
i.e.\ part \recount{3} of the plan. Perhaps, it would be most appropriate
to begin by reviewing the concept of a {\it token}.
As commonly defined, a token is simply an element of a set (see the section
on \locallink{terminology}terminology\endlink\ earlier in this manual).
Depending on
how much structure the said set possesses, a token can be represented by
an integer or a more complicated data structure. In the discussion
below, we will be dealing with two kinds of tokens: the tokens
consumed by the parsers and the \TeX\ tokens seen by the input
routines. The latter play the r\^ole of {\it characters\/} that combine
to become the former. Since \bison's internal representation for its tokens
is non-negative integers, this is what the scanner must produce.
\TeX's tokens are a good deal more sophisticated: they can be
either pairs $(c_{\rm ch}, c_{\rm cat})$, where $c_{\rm ch}$ is the
character code and $c_{\rm cat}$ is \TeX's category code ($1$ and $2$ for
group characters, $5$ for end of line, etc.), or {\it control
sequences\/}, such as \.{\\relax}. Some of these tokens (control
sequences and {\it active}, i.e.~category~13 characters) can have
complicated internal structure (expansion). The situation is further
complicated by \TeX's \.{\\let} facility, which can create
`character-like' control sequences, and the lack of conditionals
to distinguish them from the `real' characters. Finally, not all pairs
can appear as part of the input (say, there is no $(n, 0)$ token for
any $n$, in the terminology above).
The scanner expects to see {\it characters} in its input, which are
represented by their {\sc ASCII} codes, i.e.~integers between $0$ and
$255$ (actually, a more general notion of the Unicode character is
supported but we will not discuss it further). Before character codes
appear as the input to the scanner, however, and make its integer
table-driven mechanism `tick', a lot of work must be done to collect
and process the stream of \TeX\ tokens produced after \CWEAVE\ is done
with your input. This work becomes even more complicated when the
typesetting routines that interpret the parser's output must sneak
outside of the parsed stream of text (which is structured by the
parser) and insert the original \TeX\ code produced by \CWEAVE\ into
the page.
\splint\ comes with a customizable input routine of
moderate complexity (\.{\\yyinput}) that classifies all \TeX\ tokens
into seven categories: `normal' spaces (i.e.~category~10 tokens,
skipped by \TeX's parameter scanning mechanism),
`explicit' spaces (includes the control sequences \.{\\let} to \.{\ },
as well as \.{\\\ }), groups ({\it avoid} using \.{\\bgroup} and \.{\\egroup} in
your input but `real', \.{\{}$\ldots$\.{\}} groups are fine), active
characters, normal characters (of all character categories that can
appear in \TeX\ input, including \.{\$}, \.{\^}, \.{\#}, \.{a}--\.{Z},
etc.), single letter control sequences, and multi-letter control
sequences. Each of these categories can be processed separately to
`fine-tune' the input routine to the problem at hand. The input
routine is not very fast; instead, flexibility was the main
goal. Therefore, if speed is desirable, a customized input routine
is a great place to start. As an example, a minimalistic
\.{\\yyinputtrivial} macro is included.
When \.{\\yyinput} `returns' by calling \.{\\yyreturn} (which is a
macro you design), your lexing routines have access to three
registers: \.{\\yycp@@}, that holds the character value of the
character just consumed by \.{\\yyinput}, \.{\\yybyte}, that most of
the time holds the token just removed from the input,
and \namedspot{yybytepure-discussion}\.{\\yybytepure}, that (again, with very few
exceptions) holds a `normalized' version of the read character (i.e.~a
character of the same character code as \.{\\yycp@@}, and category~12
(to be even more precise (and to use nested parentheses), `normalized'
characters have the same category code as that of `\.{.}' at the point
where \.{yyinput.sty} is read)).
Most of the time it is the character code one needs (say, in the case
of \.{\\\{}, \.{\\\}}, \.{\\\&} and so on) but under some circumstances the
distinction is important (outside of \.{\\vb\{}$\ldots$\.{\}}, the sequence
\.{\\1} has nothing to do with the digit `\.{1}'). This mechanism
makes it easy to examine the consumed token. It also forms
the foundation of the `hidden context' passing mechanism described later.
The remainder of this section discusses the internals of \.{\\yyinput}
and some of the design trade-offs one has to make while working on
processing general \TeX\ token streams. It is typeset in `small print'
and can be skipped if desired.
\smallskip
\begingroup
\abovedisplayskip=5pt%
\abovedisplayshortskip=2pt%
\belowdisplayskip=5pt%
\belowdisplayshortskip=2pt%
\fnotesstart=1
\fnotesspan=2
\noofcolumns=2
\icgap=1em%
\eightpoint
\linecount=73
\setmcparams
\def\.#1{{\chardef\\=`\\\chardef\&=`\&\tt #1}}%
\dsskip=0pt%
\begindoublecols
To examine every token in its path (including spaces that are easy to
skip), the input routine uses one of the two well-known {\sc \TeX}nologies:
\.{\\futurelet\\next\\examinenext} or its equivalent
\.{\\afterassignment\\examinenext\\let\\next=}\hbox{\tt\char"20}.
Recursively inserting one of these sequences, \.{\\yyinput} can go
through any list of tokens, as long as it knows where to stop
(i.e.~return an end of file character). The
signal to stop is provided by the \.{\\yyeof}
sequence, which should not appear in any `ordinary' text
presented for parsing, other than for the purpose of providing such a
stop signal. Even the dependence on \.{\\yyeof} can be eliminated if
one is willing to invest the time in writing macros that juggle \TeX's
\.{\\token} registers and only limit oneself to input from such
registers (which is, aside from an obvious efficiency hit, a strain on
\TeX's memory, as you have to store multiple (3 in the general case)
copies of your input to be able to back up when the lexer makes a
wrong choice). Another approach to avoid the use of stop tokens is
to store the whole input as a {\it parameter\/} for the appropriate macro.
This scheme is remarkably powerful and can produce {\it expandable\/} versions
of very complicated routines, although the amount of effort required to
write such macros grows at a frightening rate. As the text inside
\.{\\vb\{}$\ldots$\.{\}} is nearly always well structured, the care that
\.{\\yyinput} takes in processing such character lists is an
overkill. In a more `hostile' environment (such as the one encountered
by the now obsolete \.{\\Tex} macros), however, this extra attention to detail pays
off in the form of a more robust input mechanism.
One subtlety deserves a special mention here, as it can be important
to the designer of `higher-level' scanning macros. Two types of tokens
are extremely difficult to deal with whenever \TeX's own lexing
mechanisms are used: (implicit) spaces and even more so, braces. We
will only discuss braces here, however, almost everything that follows
applies equally well to spaces (category 10 tokens to be precise), with
a few simplifications (or complications, in a couple of places). To
understand the difficulty, let's consider one of the approaches above:
$$
\.{\\futurelet\\next\\examinenext}.
$$
The macro \.{\\examinenext}
usually looks at \.{\\next} and inserts another macro (usually also called
\.{\\next}) at the very end of its expansion list. This macro usually
takes one parameter, to consume the next token. This mechanism works
flawlessly, until the lexer encounters a \.{\{}br\.{,}sp\.{\}}ace. The \.{\\next}
sequence, seen by \.{\\examinenext} contains a lot of information
about the brace ahead: it knows its category code (left brace, so $1$), its
character code (in case there was, say a \.{\\catcode`\\[=1{\tt\char`\ }}
earlier) but not whether it is a `real' brace (i.e.\ a character
\.{\{}$_1$) or an implicit one (a \.{\\bgroup}). There is no way to find
that out until the control sequence `launched' by \.{\\examinenext}
sees the token as a parameter.
If the next token is a `real' brace, however,
\.{\\examinenext}'s successor will never see the token itself: the
braces are stripped by \TeX's scanning mechanism. Even if it finds a
\.{\\bgroup} as the parameter, there is no guarantee that the actual
input was not \.{\{\\bgroup\}}. One way to handle this is by applying
\.{\\string} before consuming the next token. If prior to
expanding \.{\\string} care has been taken to set the \.{\\escapechar}
appropriately (remember, we know the character code of the next
token in advance), as soon as one sees a character with
\.{\\escapechar}'s character code,
(s)he knows that an implicit brace has just been seen. One added
complication to all this is that a very determined programmer can
insert an {\it active\/} character (using, say, the \.{\\uccode}
mechanism) that has the {\it same\/} character code as the {\it
brace\/} token that it has been \.{\\let} to! Even setting this
disturbing possibility aside, the \.{\\string} mechanism
(or, its cousin, \.{\\meaning}) is
far from perfect: both produce a sequence of category 12 and 10 tokens
that are mixed into the original input. If
it is indeed a brace character that we just saw, we can consume the next
token and move on but what if this was a control sequence? After all,
just as easily as \.{\\string} makes a control sequence into characters, a
\.{\\csname}$\,\ldots\,$\.{\\endcsname} pair will make any sequence of
characters into a control sequence, so determining the end of the character
sequence produced by \.{\\string} may prove impossible.
$$
\hbox{Huh~$\ldots$}
$$
What we need is a backup mechanism: keeping a copy of the
token sequence ahead, one can use \.{\\string} to see whether
the next token is a real
brace first, and if it is, consume it and move on (the active character
case can be handled as the implicit case below, with one extra backup
to count how many tokens have been consumed). At this point the brace has to be {\it
reinserted\/} in case, at some point, a future `back up'
requires that the rest of the tokens are removed from the output (to
avoid `\.{Too many \}'s}' complaints from \TeX). This can be done by using
the \.{\\iftrue\{\\else\}\\fi} trick (and a generous sprinkling of \.{\\expandafter}s).
Of course, some bookkeeping is needed to keep track of how deep inside the
braced groups we are.
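For instance (a sketch of the trick just mentioned), instead of a bare
open brace one can reinsert the textually balanced sequence
$$
\.{\\iftrue\{\\else\}\\fi}
$$
which contains one \.{\{} and one \.{\}} and can therefore be stored in
a macro or a token register, or absorbed as part of a macro argument,
without upsetting \TeX's brace accounting, yet leaves a single
begin-group character behind once the conditional is expanded.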
For an implicit brace, more work is needed: read all the
characters that \.{\\string} produced (and maybe more), then remember
the number of characters consumed. Remove the rest of the input using
the method described above and restart the scanning from the same point
knowing that the next token can be scanned as a parameter.
Another strategy is to design a general enough macro that counts
tokens in a token register and simply to recount the tokens after every
brace has been consumed.
Either way, it takes a lot of work. If anyone would
like to pursue the counting strategy, simple counting macros
are provided in \.{/examples/count/count.sty}.
The macros in this example
supply a very general counting mechanism that does not depend on
\.{\\yyeof} (or {\it any\/} other token) being `special' and can count the
tokens in any token register, as long as none of those tokens is an
\.{\\outer} control sequence. In other words, if the macro is used
immediately after the assignment to the token register, it should
always produce a correct count.
Needless to say, if such a general mechanism is desired, one has to
look elsewhere. The added complications of treating spaces (\TeX\
tends to ignore them most of the time) make this a torturous exercise
in \TeX's macro wizardry.
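For contrast, a na\"\i ve counting loop (all the names below are made
up for this illustration and are {\it not\/} the macros of
\.{count.sty}) might look like
$$
\vbox{\hbox{\.{\\newcount\\tokcnt}}
\hbox{\.{\\def\\cnt\#1\{\\ifx\\endmark\#1\\else}}
\hbox{\.{\\advance\\tokcnt1 \\expandafter\\cnt\\fi\}}}
\hbox{\.{\\tokcnt=0\\relax}}
\hbox{\.{\\expandafter\\cnt\\the\\toks0 \\endmark}}}
$$
and it fails in exactly the ways described above: a braced group is
absorbed (with its braces stripped) as a single \.{\#1}, and implicit
spaces are skipped while the undelimited parameter is scanned.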
The included \.{\\yyinput} has two ways of
dealing with braces: strip them or view the whole group as a
token. Pick one or write a different \.{\\yyinput}. Spaces, implicit
or explicit, are reported as a specially selected character code and
consumed with a likeness of
\.{\\afterassignment\\moveon\\let\\next={\tt\char`\ }}. This behavior
can be adjusted if needed.
Now that a steady stream of character codes is arriving at \.{\\yylex}
after \.{\\yyreturn}, the job of converting it into numerical tokens
is performed by the {\it scanner} (or {\it lexer\/}, or {\it tokenizer\/},
or even {\it tokener}), discussed in the next section.
\enddoublecols
\endgroup
@*1 Lexing in \TeX. In a typical system that uses a parser to process
text, the parsing pass is usually split into several stages: the raw
input, the lexical analysis (or simply {\it lexing}), and the parsing
proper. The {\it lexing\/} pass (also called {\it scanning}, we use these
terms interchangeably) clumps various sequences of characters into
{\it tokens\/} to facilitate the parsing stage. The reasons for this
particular hierarchy are largely pragmatic and are partially historic
(there is no reason that {\it parsing\/} cannot be done in multiple
phases, as well, although it usually isn't).
If one recalls a few basic facts from formal language theory, it
becomes obvious that a lexer, which parses {\it regular\/} languages,
can (in theory) be replaced by an {\sc LALR} parser, which parses {\it
context-free\/} ones (or some subset thereof, which is
still a superset of all regular languages). A common justification given for
creating specialized lexers is efficiency and speed. The
reality is somewhat more subtle. While we do care about the efficiency of
parsing in \TeX, having a specialized scanner is important for
a number of different reasons.
The real advantage of having a dedicated scanner is the ease with which it
can match incomplete inputs and back up. A parser can, of course,
{\it recognize\/} any valid input that is also acceptable to a lexer, as well
as {\it reject\/} any input that does not form a valid token. Between
those two extremes, however, lies a whole realm of options that a
traditional parser will have great difficulty exploring. Thus, to
mention just one example, it
is relatively easy to set up a {\sc DFA}\footnote{Which stands for
Deterministic Finite Automaton, a common (and mathematically unique)
way of implementing a scanner for regular languages. Incidentally, {\sc
LALR}, mentioned above, is short for Look Ahead Left to Right.}
so that the {\it longest\/}
matching input is accepted. The only straightforward way to do this
with a traditional parser is to parse longer and longer inputs again
and again. While this process can be optimized to a certain degree,
the fact that a parser has a {\it stack\/} to maintain limits its
ability to back up\footnote{It should also be mentioned that the fact that
the lexing pass takes place prior to the parser consuming the resulting tokens
allows one to process some grammars that are not context free. See, for example,
the {\em parser hack\/} used to process |typedef|s in \Cee.}.
As an aside, the mechanism by which \CWEB\ assembles its `scraps'
into chunks of recognized code is essentially iterative lexing,
very similar to what a human does to make sense of complicated
texts. Instead of trying to match the longest running piece of text,
\CWEB\ simply looks for patterns to combine inputs into larger
chunks, which can later be further combined. Note that this is not
quite the same as the approach taken by, say, {\sc GLR} parsers, where
the parser must match the {\it whole\/} input or declare a
failure. Where a \CWEB-type parser may settle for the first available
match (or the longest available) a {\sc GLR} parser must try {\it
all\/} possible matches or use an algorithm to reject the majority of
the ones that are bound to fail in the end.
This `\CWEB\ way' is also different from a traditional `strict' {\sc
LR} parser/scanner approach and certainly deserves serious
consideration when the text to be parsed possesses some rigid
structure but the parser is only allowed to process it one small
fragment at a time.
Returning to the present macro suite, the lexer produced by \flex\
uses integer tables similar to those employed by \bison\ so the
usual {\sc\TeX}niques used in implementing \.{\\yyparse} are fully
applicable to \.{\\yylex}.
An additional advantage provided by having a \flex\ scanner implemented
as part of the suite is the availability of the original \bison\ scanner written
in \Cee\ for use by the macro package.
This said, the code generated by \flex\ contains a few idiosyncrasies
not present in the \bison\ output. These `quirks' mostly involve
handling of end of input and error conditions. A quick glance at the
\.{\\yylex} implementation will reveal a rather extensive collection of
macros designed to deal with end of input actions.
Another difficulty one has to face in translating \flex\ output into
\TeX\ is a somewhat unstructured namespace delivered in the final
output (this is partially due to the \POSIX\ standard that \flex\
strives to follow). One consequence of this `messy' approach is that the
writer of a \flex\ scanner targeted to \TeX\ has to declare \flex\
`states' (more properly called {\it subautomata}) twice: first for the
benefit of \flex\ itself, and then again, in the {\it \Cee\ preamble\/}
portion of the code to output the states to be used by the action code
in the lexer. The \.{Define\_State($\ldots$)} macro is provided for this
purpose. This macro can be used explicitly by the programmer or be
inserted by a specially designed parser.
Using \CWEB\ helps to keep these declarations together.
The `hand-off' from the scanner to the parser is implemented
through a pair of registers: \.{\\yylval}, a token register
containing the value of the returned token, and \.{\\yychar}, a
\.{\\count} register that contains the numerical value of the
token to be returned.
Upon matching a token, the scanner passes one crucial piece of
information to the programmer: the character sequence representing the token
just matched (\.{\\yytext}). This is not the whole story,
though, as there are three more token sequences that are made available
to the parser writer whenever a token is matched.
The first of these is simply a `normalized' version of
\.{\\yytext} (called \.{\\yytextpure}). In most cases it
is a sequence of \TeX\ tokens with the same character codes as the one
in \.{\\yytext} but with their category codes set to 12
(see the discussion of \.{\\yybytepure}
\locallink{yybytepure-discussion}above\endlink). In
cases when the tokens in \.{\\yytext} are {\it not}
$(c_{\rm ch}, c_{\rm cat})$ pairs, a few simple
conventions are followed, some of which will be explained below. This
sequence is provided merely for convenience and its typical use is to
generate a key for an associative array.
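For instance (a sketch only; the array name and \.{\\somevalue} are
invented, and the normalized sequence is assumed to have been copied
into \.{\\toks0}), such a key can be used to create an entry of a
\.{\\csname}-based associative array:
$$
\.{\\expandafter\\def\\csname myarray[\\the\\toks0]\\endcsname\{\\somevalue\}}
$$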
The other two sequences are special `stream pointers' that provide
access to the extended scanner mechanism in order to implement the passing
of the `formatting hints' to the parser, as well as incorporate
\CWEAVE\ formatted code into the input, without introducing any changes to
the original grammar. As the mechanism itself and the motivation
behind it are somewhat subtle, let us spend a few moments discussing
the range of formatting options desirable in a generic pretty-printer.
Unlike strict parsers employed by most compilers, a parser designed
for pretty printing cannot afford to be too picky about the structure
of its input (\cite[Go] calls such parsers `loose'). To provide
a simple illustration, an isolated identifier, such as `\.{lg\_integer}'
can be a type name, a variable name, or a structure tag (in a language like
\Cee\ for example). If one expects the pretty printer to typeset this
identifier in a correct style, some context must be supplied, as
well. There are several strategies a pretty printer can employ to get
a hold of the necessary context. Perhaps the simplest way to handle
this, and to reduce the complexity of the pretty printing algorithm, is
to insist on the programmer providing enough context for the parser to do
its job. For short examples like the one above, this may be an acceptable
strategy. Unfortunately, it is easy to come up with longer snippets of
grammatically deficient text that a pretty printer should be expected
to handle. Some pretty printers, such as the one employed by \CWEB\
and its ilk (the original \.{WEB}, \.{FWEB}), use a very flexible
bottom-up technique that tries to make sense of as large a portion of
the text as it can before outputting the result (see also \cite[Wo],
which implements a similar algorithm in \LaTeX).
The expectation is that this algorithm will handle the majority (about
90\%? it would be interesting to carry out a study in the spirit of
the ones discussed in \cite[Jo] to find out) of the
cases with the remaining few left for the author to correct. The
question is, how can such a correction be applied?
\CWEB\ itself provides two rather different mechanisms for handling
these exceptions. The first uses direct typesetting commands (for
example, \.{@@/} and \.{@@\#} for canceling and
introducing a line break, resp.) to change the typographic output.
The second (preferred) way is to supply {\it hidden context\/} to the
pretty-printer. Two commands, \.{@@;} and
\.{@@[}$\ldots$\.{@@]} are used for this purpose. The
former introduces a `virtual semicolon' that acts in every way like a
real one except it is not typeset (it is not output in the source file
generated by \CTANGLE\ either but this has nothing to do with pretty
printing, so I will not mention \CTANGLE\ anymore). For
instance, from the parser's point of view, if the preceding text was
parsed as a `scrap' of type {\it exp}, the addition of \.{@@;}
will make it into a `scrap' of type {\it stmt\/} in \CWEB's
parlance. The second construct (\.{@@[}$\ldots$\.{@@]}),
is used to create an {\it exp\/} scrap out of whatever happens to be
inside the brackets.
This is a powerful tool at the author's disposal. Stylistically,
such context hints are the right way to handle exceptions,
since using them forces the writer to emphasize the {\it logical\/}
structure of the formal text. If the pretty printing style is changed
later on, the texts with such hidden contexts should be able to
survive intact in the final document (as an example, using a break
after every statement in \Cee\ may no longer be considered
appropriate, so any forced break introduced to support this convention
would now have to be removed, whereas \.{@@;}'s would simply
quietly disappear into the background).
The same hidden context idea has another important advantage: with
careful grammar fragmenting (facilitated by \CWEB's or any other
literate programming tool's `hypertext' structure) and a more diverse
hidden context (or even arbitrary hidden text) mechanism, it is
possible to use a strict parser to parse incomplete language
fragments. For example, the productions that are needed to parse
\Cee's expressions form a complete subset of the grammar. If the
grammar's `start' symbol is changed to \prodstyle{expression} (instead of
the \prodstyle{translation-unit} as it is in the full \Cee\ grammar), a
variety of incomplete \Cee\ fragments can now be parsed and
pretty-printed. Whenever such granularity is still too `coarse',
carefully supplied hidden context will give the pretty printer enough
information to adequately process each fragment. A number of such {\it
sub}-parsers\namedspot{parser.stacks} can be tried on each fragment (this may sound
computationally expensive; in practice, however, a carefully chosen
hierarchy of parsers will finish the job rather quickly) until a
correct parser produces the desired output (this approach is similar
to, although not quite the same as the one employed by the {\it General LR
parsers}).
This somewhat lengthy discussion brings us to the question directly
related to the tools described in this manual: how does one provide
typographical hints or hidden context to the parser?
One obvious solution is to build such hints directly into the
grammar. The parser designer can, for instance, add new tokens
(say, \.{BREAK\_LINE}) to the grammar and extend the
production set to incorporate the new additions. The risk of
introducing new conflicts into the grammar is low (although not
entirely non-existent, due to the lookahead limitations of {\sc LR}($1$)
grammars) and the changes required are easy, although very tedious, to
incorporate.
In addition to being labor intensive, this solution has two other
significant shortcomings: it alters the original grammar and hides its
logical structure; it also `bakes in' the pretty-printing conventions
into the language structure (making the `hidden' context much less
`stealthy'). It does avoid the `synchronicity problem' mentioned
below.
A marginally better technique is to introduce a new regular expression
recognizable by the scanner which will then do all the necessary
bookkeeping upon matching the sequence. All the difficulties with
altering the grammar mentioned above apply in this case, as well, only
at the `lexical analysis level'. At a minimum, the set of tokens
matched by the scanner would have to be altered.
A much more satisfying approach, however, involves inserting the hints at the input stage and
passing this information to the scanner and the parser as part of the token
`values'. The hints themselves can masquerade as characters ignored by the scanner
(white space\footnote{Or even the `intercharacter space', to make the
hints truly invisible to the scanner.}, for example) and preprocessed by a specially designed
input routine. The scanner then simply passes on the values to the
parser. This makes hints, in effect, invisible.
The difficulty now lies in synchronizing the token production with the
parser. This subtle complication is very familiar to anyone who has
designed \TeX's output routines: the parser and the lexer are not
synchronous, in the sense that the scanner might be reading several
(in the case of the general {\sc LR}$(n)$ parsers) tokens\footnote{Even if
one were to somehow mitigate the effects of the lookahead {\it in the
parser\/}, the scanner would still have to read the characters of the
current token up to (and, in some cases, beyond) the (token's)
boundary which, in most cases, is
the whitespace, possibly hiding the next hint.} ahead of the
parser before deciding on how to proceed (the same way \TeX\ can
consume a whole paragraph's worth of text before exercising its page
builder).
If we simple-mindedly let the scanner return every hint it has encountered
so far, we may end up feeding the parser the hints meant for the token
that appears {\it after\/} the fragment the parser is currently working
on. In other words, when the scanner `backs up' it must correctly back
up the hints as well.
This is exactly what the scanner produced by the tools in this package
does: along with the main stream of tokens meant for the parser, it
produces two\footnote{There would be no difficulty in splitting either
of these streams into multiple `substreams' by modifying the stream
extraction macros accordingly.} hidden streams\namedspot{parser.streams}
(called the \.{\\yyformat} stream and
the \.{\\yystash} stream) and provides the parser with two
strings (currently only strings of digits are used although arbitrary
sequences of \TeX\ tokens can be used as pointers) with the promise
that {\it all the `hints' between the beginning of the corresponding
stream and the point labeled by the current stream pointer appeared
among the characters up to and, possibly, including the ones matched
as the current token}. The macros to extract the relevant parts of the
streams (\.{\\consumelistupto} and its cousins) are provided for the
convenience of the parser designer.
What the parser does with these pointers and the information they provide is
up to the parser designer (the parsers for \flex\ and \bison\ syntax
in this package use somewhat different strategies).
The \.{\\yystash} stream currently collects all the typesetting commands inserted by
\CWEB\ to be possibly used in displaying the action code in \bison\
productions, for example. Because of this, it may appear in somewhat
unexpected places, introducing spaces where the programmer did not
necessarily intend them (such as at the beginning of a line). To
mitigate this problem, the \.{\\yystash} stream macros are implemented
to be entirely invisible to the lexer. Making them produce spaces is
also possible, and some examples are provided in \.{symbols.sty}.
The interested reader can consult the input routine macros in
\.{yyinput.sty} for the
details of the internal representation of the streams.
In the interest of full disclosure, it should be pointed out that this simple
technique introduces a significant strain on \TeX's
computational resources: the lowest level macros, the ones that handle
character input and are thus executed (in some cases multiple times) for
{\it every\/} character in the input stream, are rather complicated and
therefore slow. Whenever the use of such streams is not desired, a simpler
input routine can be written to speed up the process (see
\.{\\yyinputtrivial} for a working example of such a macro).
Finally, while probably not directly related to the present
discussion, this approach has one more interesting feature: after the
parser is finished, the parser output and the streams exist
`statically', fully available for any last minute postprocessing or for
debugging purposes, if necessary\footnote{One may think of the parser output
as an {\it executable abstract syntax tree\/} (\AST\ or \EAST\ if one
likes cute abbreviations, or even e\AST\ for an extra touch of modernity).
This parser feature is used to implement the facility that allows easy referencing of
productions in the section that implements the action code for one of them. See \.{yyunion.sty}
and the source of this file (\.{splint.w}) for details.}. Under most circumstances, the parser
output is `executed' and the macros in the output are the ones reading
the various streams using the pointers supplied at the parsing stage
(at least, this is the case for all the parsers supplied with the
package).
@*1 Inside semantic actions: switch statements and `functions' in \TeX.
So far we have looked at the lexer for your input and a grammar ready to be put into
action (we will talk about actions in a few moments). It is time to discuss
how the tables produced by \bison\ get converted into \TeX\ {\it macros\/}
that drive the parser in {\it \TeX}.
The tables that drive the \bison\ input parsers
are collected in \.{\{b,d,f,g,n\}yytab.tex}, \.{small\_tab.tex} and other similarly named
files\footnote{Incidentally, one of the names above is not used anymore. See the \.{cweb}
directory after a successful build to find out which. Aren't footnotes great?!}. Each
one of these files contains the tables that implement a specific parser
used during different stages of processing.
Their exact function is well explained
in the source file produced by \bison\ ({\it how} this is done is
detailed elsewhere, see \cite[Ah] for a good reference). It would
suffice to mention here that there are three types of tables in this
file: \recount{1}numerical tables such as \.{\\yytable} and
\.{\\yycheck} (both are either \TeX's token registers in an
unoptimized parser or associative arrays in an optimized version, as
discussed below),
\recount{2}a string array \.{\\yytname}, and \recount{3}an action
switch. The action switch is what gets called when the parser does a
{\it reduction}. It is easy to notice that the numerical tables come
`premade' whereas the string array consisting of token names
is difficult to recognize. This is intentional: this form of initialization
is designed to allow the widest range of
characters to appear inside names. The macros that do this reside in
\.{yymisc.sty}. The generated table files also contain
constant and token declarations used by the parser.
The description of the process used to output \bison\ tables in an
appropriate form continues in the section about
\locallink{bsfile}outputting \TeX\ tables\endlink, we pick it up here
with the description of the syntax-directed translation and the
actions. The line
$$
\.{\\switchon\\next\\in\\currentswitch}
$$
is responsible for calling an appropriate action in the current
switch, as is easy to infer. A {\it switch\/} is also a macro that
consists of strings of \TeX\ tokens intermixed with \TeX\ macros
inside braces. Each group of macros
gets executed whenever the character or the group of characters in
\.{\\next} matches a substring preceding the braced group. If there
are two different substrings
that match, only the earliest group of macros gets expanded.
Before a state is
used, a special control sequence,
\.{\\setspecialcharsfrom\\switchname} can be used to put the \TeX\
tokens in a form suitable for consumption by the \.{\\switchon} macros. The
most important step it performs is that it {\it turns every token in the
list into a character with the same character code and category
12\/}. Thus \.{\\\{} becomes \.{\{}$_{12}$. There are other ways of
inserting tokens into a state: enclosing a token or a string of tokens in
\.{\\raw...\\raw} adds it to the state macro unchanged. If you have
a sequence of category 12 characters you want to add to the state, put
it after \.{\\classexpand} (such sequences are usually prepared by the
\.{\\setspecialchars} macro that uses the token tables generated by
\bison\ from your grammar).
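To get a feel for how these pieces fit together, here is a sketch
(the action macros \.{\\dodigit} and \.{\\doletter} are invented for
this example, \.{\\next} is assumed to hold the characters just read,
and the precise format of the entries should be checked against the
switch macros themselves):
$$
\vbox{\hbox{\.{\\def\\aswitch\{0123456789\{\\dodigit\}\\raw letter\\raw\{\\doletter\}\}}}
\hbox{\.{\\setspecialcharsfrom\\aswitch}}
\hbox{\.{\\switchon\\next\\in\\aswitch}}}
$$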
You can give a case a readable label (say, \.{brackets}) and enclose
this label in \.{\\raw}$\ldots$\.{\\raw}. A word of caution: an
`\.{a}' inside of \.{\\raw}$\ldots$\.{\\raw} (which is most likely an
\.{a}$_{11}$ unless you played with the category codes before loading
the \.{\\switchon} macros) and the one outside it are two different
characters, as one is no longer a letter (category 11) in the eyes of
\TeX\ whereas the other one still is. For this reason one should not
use characters other than letters in h\.{\{}is\.{,}er\.{\}} state {\em
names\/} if such characters can form tokens by themselves:
the way a state picks an action does not distinguish between,
say, a `\.{(}' in `\.{(letter)}' and a standalone `\.{(}' and may
pick an action that you did not intend\footnote{One way to mitigate
this is by putting such named states at the end of the switch, {\em
after\/} the actions labelled by the standalone characters. There is nothing
special about non-letter characters, of course. To continue the \.{letter}
example, placing a state named \.{let} {\em after\/} the \.{letter} one
will make it essentially invisible to the switch macros for the reasons
explained before this footnote.}. This applies even if `\.{(}' is not among
the characters explicitly
inserted in the state macro: if an action for a given character is not
found in the state macro, the \.{\\switchon} macro will insert a
current \.{\\default} action instead, which most often you would want
to be \.{\\yylex} or \.{\\yyinput} (i.e.\ skip this token). If a
single `\.{(}' or `\.{)}' matches the braced group that follows
`\.{(letter)}' chaos may ensue (most likely \TeX\ will keep reading
past the \.{\\end} or \.{\\yyeof} that should have terminated the
input). Make the names of character categories as unique as possible:
the \.{\\switchon} is simply a string matching mechanism, with the
added differentiation between characters of different categories.
Finally, the construct \.{\\statecomment}{\it
anything\/}\.{\\statecomment} allows you to insert comments in the
state sequence (note that the state {\it name\/} is put at the
beginning of the state macro (by \.{\\setspecialcharsfrom})
in the form of a special control sequence
that expands to nothing: this elaborate scheme is needed because
another control sequence can be \.{\\let} to the state macro which
makes the debugging information difficult to decipher). The debugging
mode for the lexer implemented with these macros is activated by
\.{\\tracedfatrue}.
The functionality of the \.{\\switchon} (as well as the
\.{\\switchonwithtype}, which is capable of some rudimentary type
checking) macros has been implemented in a number of other macro
packages (see \cite[Fi] that discusses the well-known and widely used
\.{\\CASE} and \.{\\FIND} macros). The macros in this collection have
the additional property that the only assignments that persist after
the \.{\\switchon} completes are the ones performed by the user code
inside the selected case.
This last property of the switch macros is implemented using another
mechanism that is part of this macro suite: the `subroutine-like'
macros, \.{\\begingroup}$\ldots$\.{\\tokreturn}. For examples, the
interested reader can take a look at the macros included with the
package. A typical use is
\.{\\begingroup}$\ldots$\.{\\tokreturn\{\}\{\\toks0 \}\{\}} which will
preserve all the changes to \.{\\toks0} and have no other side effects
(if, for example, in typical \TeX\ vernacular, \.{\\next} is used
to implement tail recursion inside the group, after the
\.{\\tokreturn}, \.{\\next} will still have the same value it
had before the group was entered). This functionality comes at the
expense of some computational efficiency.
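As an illustration (the macro name \.{\\rememberthis} is made up and
the arguments to \.{\\tokreturn} are used exactly as in the idiom
quoted above), a `subroutine' that records its argument in \.{\\toks0}
and has no other lasting effects might read
$$
\vbox{\hbox{\.{\\def\\rememberthis\#1\{\\begingroup}}
\hbox{\.{\\def\\next\{discarded\}}}
\hbox{\.{\\toks0\{\#1\}}}
\hbox{\.{\\tokreturn\{\}\{\\toks0 \}\{\}\}}}}
$$
the group-local redefinition of \.{\\next} vanishes after the
\.{\\tokreturn} while the new contents of \.{\\toks0} survive.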
This covers most of the routine computations inside semantic actions,
all that is left is a way to `tap' into the stack automaton
built by \bison\ using an interface similar to the special
\.{\$$n$} variables utilized by the `genuine' \bison\ parsers
(i.e.\ written in \Cee\ or any other target language supported by
\bison).
This r\^ole is played by several varieties of \.{\\yy$\,p$} command
sequences (for the sake of completeness, $p$ stands for one of \.{($n$)},
\.{[{\rm name}]}, \.{]{\rm name}[} or $n$, where $n$ is a
string of digits, and a `name' is any name acceptable as a symbolic
name for a term in \bison). Instead
of going into the minutiae of the various flavors of \.{\\yy}-macros, let me
just mention that one can get by with only two `idioms' and still
be able to write parsers of arbitrary sophistication:
\.{\\yy($n$)} can be treated as a token register containing the
value of the $n$-th term of the rule's right hand side, $n>0$. The left
hand side of a production is accessed through \.{\\yyval}. A
convenient shortcut is \.{\\yy0\{{\rm \TeX\space material}\}} which
will expand (as in \.{\\edef}) the `\TeX\ material' inside the braces and store the result
in \.{\\yyval} (note that this only works if a sequence of \.{0}s, possibly followed or preceded
by spaces, is all that appears between \.{\\yy} and the opening brace; see the discussion of
\.{\\bb} macros below for a description of what happens in other cases). Thus, a simple way
to concatenate the values of the first two production terms is
\.{\\yy0\{\\the\\yy(1)\\the\\yy(2)\}}. The included \bison\
parser can also be used to provide support for `symbolic names',
analogous to \bison's \.{{\$}[{\rm name}]} but a
bit more effort is required on the user's part to initialize such support.
Using symbolic names can make the parser more readable and maintainable,
however.
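Assuming such support has been initialized and the two terms above have
been given the symbolic names \.{left} and \.{right} (the names are
invented for this illustration, and the exact spelling of the reference
should be checked against the macros and the examples in the package),
the same concatenation might be written as
$$
\.{\\yy0\{\\the\\yy[left]\\the\\yy[right]\}}
$$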
There is also a \.{\\bb$\,n$} macro, which has no analogue in the
`real' \bison\ parsers, and provides access to the term
values in the `natural order' (e.g.~\.{\\bb1} is the last term in the part of
the production preceding the action). Its intended use is with the `inline'
rules (see the main parser for
such examples). As of version \.{3.0} \bison\ no longer outputs
|yyrhs| and |yyprhs|, which makes it impossible to produce the
|yyrthree| array necessary for processing such rules in the `left to right'
order. One might also note that the new notation is better suited for
the inline rules since the value that is pushed on the stack is that
of \.{\\bb0}, i.e.~the term implicitly inserted by \bison. Be aware
that there are no \.{\\bb[$\cdot$]} or \.{\\bb($\cdot$)} versions of
these macros, for obvious reasons\footnote{The obvious reason is the mechanism
used by \.{\\yy[$\cdot$]} and \.{\\yy($\cdot$)} to make them expandable.
These macros are basically references to the appropriate token registers. Since
the \.{\\bb} macros were designed for the environment where \.{\\yylen} is unknown,
establishing such references before the action is executed is not possible. A
less obvious reason is the author's lazy approach.}. A less obvious feature of these
macros is their `nonexpandable' nature. This means they cannot be used
inside \.{\\edef} (just like their \.{\\yy$\,n$} counterparts, they make several
assignments which will not be executed by \.{\\edef}).
Thus, the most common use pattern is \.{\\bb$\,n$\{\\toks$\,m$\}} (where $n>0$)
with a subsequent expansion of \.{\\toks$\,m$}\footnote{Similar to how \.{\\yy$\,n$} macros
work, whenever $n>0$, this pattern simply puts the contents of the braced group
that follows in front of a (braced) single expansion of the appropriate
value on the stack. If, as in the example above, the contents of the
braced group are \.{\\toks$\,m$}, this effectively stores the value of the braced group in
the token register. In the case $n=0$ the meaning is different:
the stack value {\em corresponding to the implicit term\/} becomes the expanded (by \.{\\edef})
contents of the braced group following \.{\\bb$\,n$}.}.
Making these macros expandable is certainly possible
but does not seem crucial for the intended limited use pattern.
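For illustration (a sketch; the choice of scratch registers is
arbitrary), an inline action that concatenates the two terms
immediately preceding it and pushes the result as the value of the
implicit term might read
$$
\vbox{\hbox{\.{\\bb2\{\\toks2 \}\\bb1\{\\toks1 \}}}
\hbox{\.{\\bb0\{\\the\\toks2 \\the\\toks1 \}}}}
$$
in keeping with the `natural order' convention, \.{\\bb2} here refers
to the term {\it preceding\/} the one referred to by \.{\\bb1}.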
Finally, the scripts included with \splint\ include a postprocessor
(see the appropriate~\.{Makefile} for further details) that allows the
use of the `native' \bison\ term references (i.e.\ of the form
\.{\char`\$}$\ldots$) to access the value stack\footnote{Incidentally,
\bison\ itself uses a preprocessor (\.{M4}) to turn its term references
into valid \Cee.}. Utilizing the postprocessor allows any syntax for
term references used by \bison\ to be used inside the \.{TeX\_(}$\,\ldots\,$\.{)} \Cee\ macros.
In this case a typical
idiom is \.{\\the\char`\$[term\_name]} to get the value of
\prodstyle{term_name}. While storing a new value for the term (i.e.\
modifying the value stack) is also possible, it is not very
straightforward and thus not recommended. This applies to all forms of
term references discussed above.
Naturally, a parser writer may need a number of other data
abstractions to complete the task. Since these are highly dependent on
the nature of the processing the parser is supposed to provide, we
refer the interested reader to the parsers included in the package as
a source of examples of such specialized data structures.
One last remark about the parser operation is worth making here:
the parser automaton itself does not make any \.{\\global}
assignments. This (along with some careful semantic action writing)
can be used to `localize' the effects of the parser operation and,
most importantly, to create `reentrant' parsers that can, e.g.\ call
{\it themselves\/} recursively.
@*1 `Optimization'.
\namedspot{optimization}By default, the generated parser and scanner
keep all of their tables
in separate token registers. Each stack is kept in a single macro (this
description is further complicated by the support for parser {\it
namespaces\/} that exists even for unoptimized parsers but this
subtlety will not be mentioned again---see the macros in the package
for further details). Thus, every time a table
is accessed, it has to be expanded, making the table access latency
linear in {\it the size of the table}. The same holds for stacks and
the action `switches', of
course. While keeping the parser tables (which are immutable) in token
registers does not have any better rationale than saving the control
sequence memory (the most abundant memory in \TeX), this way of
storing {\it stacks} does have an advantage when multiple parsers get
to play simultaneously. All one has to do to switch from one parser to
another is to save the state by renaming the stack control sequences.
When the parser and scanner are `optimized', all these control
sequences are `spread over' appropriate associative arrays. One caveat
to be aware of: the action switches for both the parser and the scanner
have to be output differently (a command line option is used to
control this) for optimized and unoptimized parsers. While it is
certainly possible to optimize only some of the parsers (if your
document uses more than one) or even only some {\it parts\/} of a given
parser (or scanner), the details of how to do this are rather
technical and are left for the reader to discover by reading the
examples supplied with the package. At least at the beginning it is
easier to simply set the highest optimization level and use it
consistently throughout the document.
@*1 {\it \TeX\/} with a different {\sl slant} or do you C an escape?.
Some \TeX\ productions below probably look like alien script.
The authors of \cite[Er] cite a number of reasons to view pretty printing of
\TeX\ in general as a nearly impossible task. The macros included with
the package follow a very straightforward strategy and do not try to
be very comprehensive. Instead, the burden of presenting \TeX\ code in
a readable form is placed on the programmer. Appropriate hints can be
supplied by means of indenting the code, using assignments ($=$) where
appropriate, etc. If you would rather look at straight \TeX\
instead, the line \.{\\def\\texnspace\{other\}} at the beginning of
this section can be uncommented and
{\let\writetexidxentry\writetextxtidxentry
|TeX_( "/noexpand/inmath{/yy0{/yy1{}}}" );|} becomes {%
\def\texnspace{other}%
\def\texispace{other}% for the index
\let\writetexidxentry\writetextxtidxentry % for the index appearance
|TeX_( "/noexpand/inmath{/yy0{/yy1{}}}" );|}.
There is, however, more to this story. A look at the actual file will
reveal that the line above was typed as
$$
\.{TeX\_( "/noexpand/inmath\{/yy0\{/yy1\{\}\}\}" );}
$$
The `escape character' is leaning the other way!
The lore of \TeX\ is uncompromising: `\.{\\}' is {\it the\/} escape
character. What is the reason to avoid it in this case?
The mystery is not very deep: `\.{/}' was chosen as an escape character
by the parser macros (a quick glance at \.{?yytab.tex} will reveal as
much). There is, of course, nothing sacred (other than tradition,
which this author is trying his hardest to follow) about what character code
the escape character has. The reason to look for an alternative is straightforward: `\.{\\}' is
a special character in \Cee, as well (also an `escape', in fact). The line
\.{TeX\_( "..." );} is a {\it macro call\/}, but one written in \Cee. This
macro simply prints out (almost `as-is') the line in
parentheses. An attempt at \.{TeX\_( "\\noexpand" );} would result in
\numberlinestrue
\begindemo
^
^oexpand
\enddemo
\numberlinesfalse
Other escape combinations\footnote{Here is a full list of {\it
defined\/} escaped characters in \Cee: \.{\\a}, \.{\\b}, \.{\\f}, \.{\\n},
\.{\\r}, \.{\\t}, \.{\\v}, \.{\\}{$[$\it octal digit\/$]$}, \.{\\'},
\.{\\"}, \.{\\?}, \.{\\\\}, \.{\\x}, \.{\\u}, \.{\\U}. Note that the
last three combinations must be followed by a specific string of
characters to appear in the input without generating errors.} are
even worse: most are simply undefined. If anyone feels trapped without
an escape, however, the same line can be typed as
$$
\.{TeX\_( "\\\\noexpand\\\\inmath\{\\\\yy0\{\\\\yy1\{\}\}\}" );}
$$
Twice the escape!
If one were to look even closer at the code, another oddity stands
out: there are no \.{\$}'s anywhere in sight.
The big money, \.{\$}, is a beloved character in
\bison. It is used in action code to reference the values of the
appropriate terms in a production. If mathematics pays your bills, use
\.{\\inmath} instead.
@i bo.x
@i lo.x
@i fo.x
@i so.x
@i np.x
@i common.w
@i bs.w
@i fk.w
@i philosophy.w
@i checklists.w
@i references.w
\let\hostparsernamespace\mainnamespace % to typeset examples in the text
% properly
@**Index. \global\let\secrangedisplay\empty% do not show the current section range anymore
This section is, perhaps, the most valuable product of
\CWEB's labors. It lists references to definitions (set in {\it
italic}) as well as uses for each \Cee\ identifier used in the
source. Special facilities have been added to extend the indexing to
\bison\ grammar terms, \flex\ regular expression names and state names, as well as
\flex\ options,
and \TeX\ control sequences encountered in
\bison\ actions. Definitions of tokens (via \prodstyle{\%token},
\prodstyle{\%nterm} and \prodstyle{\%type} directives) are
%$\underline{\hbox{underlined}}$
typeset in {\bf bold}.
The \bison\ and \TeX\ entries are put
in distinct sections of the index in order to keep the separation
between the \Cee\ entries and the rest. It may be worth noting that
the {\it definition\/} of the symbol is listed under both its `macro
name' (such as \.{CHAR}, typeset as \prodstyle{CHAR} in the case of
the grammar below), as well as its `string' name if present (to
continue the previous example, \.{"char"} is synonymous with
\prodstyle{CHAR} after a declaration such as `\prodstyle{\%token}
\prodstyle{CHAR} \.{"char"}'), while the {\it use\/} of the term lists
whichever token form was referenced at the point of use (both forms
are accessible when the entry is typeset for the index and a macro can
be written to mention the other form as well). The original syntax of
\bison\ allows the programmer to declare tokens such as
\prodstyle{'\{'} and \prodstyle{'\}'}, and the indexing macros honor
this convention, even though such declarations are redundant in a
typeless environment such as the one in which the present typesetting
parser operates. The indexing macros also record the use of such
character tokens. The quotes indicate
that the `string' form of the token's name was used. A section set in
{\it italic\/} references the point where the corresponding term
appeared on the left hand side of a production. A production:
\let\TeXx\TeXxi
\def\gatoks{%
\omit\hfil&\omit\hfil&\omit\hfil\hbox to2em{\hfil}&\omit\hfil\cr
\noalign{\vskip-\baselineskip}%
}%
\beginmprod
left_hand_side:
term.1 term.2 term.3 \{\stashed{|TeX_("/do/something/with/yy(1)");|}\}
\endmprod
inside the \TeX\ part of a \CWEB\ section will generate several
index entries, as well, including the entries for any
material inside the action, mimicking \CWEB's behavior for the
{\it inline \Cee\/} (\.{\yl}$\ldots$\.{\yl}). Such entries (except for
the references to the \Cee\ code inside actions) are labeled with $^\circ$,
to provide a reminder of their origin.
This parser collection, as well as the indexing facilities therein, has been
designed to showcase the broadest range of options available to the user
and thus it does not always exhibit the most sane choices one could make (for
example, using a full blown parser for term {\it names\/} is poor
design but it was picked to demonstrate multiple parsers in one
program). The same applies to the way the index is constructed (it
would be easy to only use the `string' name of the token if
available, thus avoiding referencing the same token twice).
\TeX\ control sequences are listed following the index of all \bison\ and \flex\
entries. The different sections of the index are separated by a {\it dinkus\/}
(\dinkus)@^dinkus (\dinkus)@>. Since it is nearly impossible to determine at what point a
\TeX\ macro is defined (and most of them are defined outside of the
\CWEB\ sources), only their uses are listed (to be more precise, {\it
every\/} appearance of a macro is assumed to be its use). In a few cases, a
`graphic' representation for a control sequence appears in the index (for
example, \texref{/getfirst} represents
{\let\writetexidxentry\writetextxtidxentry
\def\texnspace{other}\def\texispace{other}\inlineTeXx{/getfirst}$\!$}).
The index entries are ordered alphabetically. This
ordering may not be entirely obvious in cases where the `graphical
representation' of the corresponding token manifests a significant
departure from its string version (such as \texref{/yy(1)}
instead of {\def\texnspace{other}\def\texispace{other}%
\let\writetexidxentry\writetextxtidxentry
|TeX_("/yy(1)");|$\!$}). Incidentally, for the examples on this page
(as well as an example in the section about \TeX\ pretty-printing) both
the `graphic' as well as `text' versions of the control sequence are
indexed. It is instructive to verify that their location in the index
corresponds to the `spelling' of their visual representation (thus,
\texref{/getfirst} appears under `p'). One should also be aware that
the indexing of some terms has been suppressed, since they appear too often.
% TODO: explain the visibility system. Note the anomalous order of \prodstyle{term.1}
% vs.~\prodstyle{term0} due to the dot in \.{term.1}, which is otherwise invisible. Underscore the
% importance of following a consistent naming scheme, including the `stringized' versions
% of token names.
@q Include the list of index section markers; this is a hack to get around @>
@q the lack of control over the generation of \CWEB's index; the correct order @>
@q of index entries depends on the placement of this inclusion @>
@i alphas.hx
\makeunindexablex{{\csstring\the}{\csstring\nx}{\csstring\yy}{\csstring\yylexnext}%
{\csstring\else}{\csstring\fi}{\csstring\yyBEGIN}{\csstring\next}}