or

    biblex bibfile1 bibfile2 bibfile3 ... >outfile
The companion bibunlex(1) program can be used to recombine such a token stream back into a BibTeX file.
Scribe-format bibliography files can be handled as well if they are first converted to BibTeX form by bibclean(1).
Only minimal checks are made on the correctness of the input stream, and biblex will happily carry out a lexical analysis of nonsensical input, without issuing warnings or errors of any kind, other than possible internal string buffer overflow. To verify that biblex's output token stream is meaningful, the input files can be given to bibparse(1) for parsing analysis according to a proposed grammar for BibTeX.
<token-number><tab><token-name><tab>"<token-value>"
Each output line contains a single complete token, identified by a small integer number for use by a computer program, a token type name for human readers, and a string value in quotes.
Special characters in the token value string are represented with ANSI/ISO Standard C escape sequences, so all characters other than NUL are representable, and multi-line values can be represented in a single line.
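As an illustration of the line format just described, the following Python sketch splits one output line at its tabs and expands the escape sequences in the quoted value. It is not part of biblex; the names decode_value and parse_token_line are hypothetical, and the set of escapes handled is the standard C set, which is an assumption about what biblex actually emits.

```python
import re

# Simple escapes defined by ANSI/ISO Standard C (assumed, not confirmed,
# to be a superset of what biblex writes).
_SIMPLE = {"a": "\a", "b": "\b", "f": "\f", "n": "\n", "r": "\r",
           "t": "\t", "v": "\v", "\\": "\\", '"': '"', "'": "'", "?": "?"}

# A backslash followed by hex digits, 1 to 3 octal digits, or any one character.
_ESCAPE = re.compile(r'\\(x[0-9A-Fa-f]+|[0-7]{1,3}|.)', re.DOTALL)


def _unescape(match):
    body = match.group(1)
    if body[0] == "x" and len(body) > 1:
        return chr(int(body[1:], 16))      # \xhh hexadecimal escape
    if body[0] in "01234567":
        return chr(int(body, 8))           # \ooo octal escape
    return _SIMPLE.get(body, body)         # unknown escapes keep the character


def decode_value(quoted):
    """Strip the surrounding quotes and expand C escape sequences."""
    if len(quoted) >= 2 and quoted[0] == '"' and quoted[-1] == '"':
        quoted = quoted[1:-1]
    return _ESCAPE.sub(_unescape, quoted)


def parse_token_line(line):
    """Return (token_number, token_name, decoded_value) for one token line."""
    number, name, value = line.rstrip("\n").split("\t", 2)
    return int(number), name, decode_value(value)
```

Splitting at most twice on the tab character keeps the value field intact even if it were to contain further tabs.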
Here are the token numbers and token type names that can appear in the output:
     0  UNKNOWN       7  FIELD        14  PREAMBLE
     1  ABBREV        8  INCLUDE      15  RBRACE
     2  AT            9  INLINE       16  SHARP
     3  COMMA        10  KEY          17  SPACE
     4  COMMENT      11  LBRACE       18  STRING
     5  ENTRY        12  LITERAL      19  VALUE
     6  EQUALS       13  NEWLINE
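A program that consumes the token stream may find it convenient to hold these numbers in a single table. The following Python sketch is one way to do so; the class name Token is hypothetical, and the numbers and names are copied from the list above.

```python
from enum import IntEnum


class Token(IntEnum):
    """Token numbers and type names as listed in this manual page."""
    UNKNOWN = 0
    ABBREV = 1
    AT = 2
    COMMA = 3
    COMMENT = 4
    ENTRY = 5
    EQUALS = 6
    FIELD = 7
    INCLUDE = 8
    INLINE = 9
    KEY = 10
    LBRACE = 11
    LITERAL = 12
    NEWLINE = 13
    PREAMBLE = 14
    RBRACE = 15
    SHARP = 16
    SPACE = 17
    STRING = 18
    VALUE = 19
```

With such a table, Token(10).name gives the type name for a token number, and Token["KEY"] gives the number for a type name.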
Programs that parse such output should also be prepared for lines beginning with the warning prefix, %%, or the error prefix, ??, and for ANSI/ISO Standard C line number directives of the form
    # line 273 "texbook1.bib"

which record the line number and file name of the current input file.
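The kinds of line described above can be told apart by their leading characters. The following Python sketch shows one possible approach; the function name classify and the tuple layout it returns are hypothetical, and the exact spacing inside a line number directive is an assumption based on the single example shown.

```python
import re

# Matches directives such as:  # line 273 "texbook1.bib"
_LINE_DIRECTIVE = re.compile(r'^#\s*line\s+(\d+)\s+"(.*)"\s*$')


def classify(line):
    """Classify one line of biblex output by its prefix."""
    line = line.rstrip("\n")
    if line.startswith("%%"):
        return ("warning", line[2:].strip())
    if line.startswith("??"):
        return ("error", line[2:].strip())
    match = _LINE_DIRECTIVE.match(line)
    if match:
        return ("line", int(match.group(1)), match.group(2))
    fields = line.split("\t", 2)
    if len(fields) == 3 and fields[0].isdigit():
        # Token line: number, type name, and the still-quoted value.
        return ("token", int(fields[0]), fields[1], fields[2])
    return ("other", line)
```

Anything that fits none of the known shapes is returned as "other" rather than rejected, so a caller can decide how strict to be.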
As an example of the use of biblex, the UNIX command pipeline

    biblex mylib.bib | \
        awk '$2 == "KEY" {print $3}' | \
        sed -e 's/"//g' | \
        sort

will extract a sorted list of all citation keys in the file mylib.bib.
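The same extraction can be sketched in Python for programs that already read the token stream. The function name citation_keys is hypothetical, and the input is assumed to be an iterable of biblex output lines, for example a file object or a pipe.

```python
def citation_keys(lines):
    """Return the sorted citation keys found in biblex output lines."""
    keys = []
    for line in lines:
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) == 3 and fields[1] == "KEY":
            # Drop the quotation marks, as the sed step does.
            keys.append(fields[2].replace('"', ""))
    return sorted(keys)
```

As with the pipeline, duplicate keys are kept. Python's sorted() orders strings by code point, so the order may differ from that of sort(1) under some locales.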
The LITERAL token type is used for arbitrary text that biblex does not examine further, such as the contents of a @Preamble{...} or a @Comment{...}.
The UNKNOWN token type should never appear in the output stream. It is used internally to initialize token type variables.
Older versions of lex(1) are not 8-bit clean; they will not reliably handle characters 128-255. This deficiency is being remedied by the X/Open Consortium activities to internationalize and standardize UNIX applications.
Nelson H. F. Beebe
University of Utah
Department of Mathematics, 110 LCB
155 S 1400 E RM 233
Salt Lake City, UT 84112-0090
USA
Email: beebe@math.utah.edu, beebe@acm.org, beebe@computer.org (Internet)
WWW URL: http://www.math.utah.edu/~beebe
Telephone: +1 801 581 5254
FAX: +1 801 581 4148