%% /u/sy/beebe/tex/bibcheck/README, Sat Dec 17 16:42:04 1994
%% Edit by Nelson H. F. Beebe <beebe@plot79.math.utah.edu>

========
CONTENTS
========

This directory contains bibcheck, a tool for applying heuristic checks
to BibTeX bibliography files.  The contents are:

CHANGELOG	revision history log
Makefile	UNIX makefile
README		this file
bibcheck.awk	bibcheck prototype in awk
bibcheck.c	bibcheck program in C (also compilable with C++)
bibcheck.hlp	ASCII text file with formatted manual pages in VAX VMS
		HELP format
bibcheck.man	manual pages (nroff/troff input)
bibcheck.ps	PostScript version of typeset manual pages
bibcheck.sh	bibcheck shell script template to run bibcheck.awk
		(automatically customized to local site by "make install")
bibcheck.sok	spelling exception dictionary for "make spell"
bibcheck.txt	ASCII text file with formatted manual pages
biblex.c	lex output from biblex.l
biblex.l	Lex program for parsing BibTeX files rigorously
bibyydcl.h	Function prototypes for lex and biblex functions
hash.c		hash table support
hash.h		hash table support header file
regexp/*	regular-expression support
strdup.c	string primitive
stricm.c	string primitive
strnic.c	string primitive
typedefs.h	support header file
unixlib.h	support header file
xalloc.c	support source file
xalloc.h	support header file
xctype.h	support header file
xerrno.h	support header file
xstat.h		support header file
xstddef.h	support header file
xstdlib.h	support header file
xstring.h	support header file
xtypes.h	support header file
man2ps		UNIX shell script for directing conversion of nroff/troff
		files to PostScript
rofvms.awk	awk script to convert .txt file to .hlp file

Since several public and commercial implementations of nawk are
available for UNIX, IBM PC DOS, and DEC OpenVMS, this code should be
readily usable on most of the world's computers.


============
INSTALLATION
============

To build bibcheck, it is IMPERATIVE to use one of the machine-specific
targets in the Makefile, e.g. hp-9000-hpux-c++; on most systems,
particular compile-time options are required for successful
compilation.  If you want to use compiler optimization, set the
variable OPT on the command line, e.g.

	make hp-9000-hpux-c++ OPT='+O3'

To build a specific target, you can set the TARGETS variable, like this:

	make hp-9000-hpux-c++ TARGETS=biblex.i

[The name used to be TARGET, but this caused problems with Cray's
make, where it is a predefined name used to select target
architectures.]

On SGI IRIX 4.0.x, with "make sgi-mips-irix-cc", I had to manually
compile biblex.c without the -ansiposix switch, because with it, the
compiler complained about code generated by lex that is hard-coded
into the lex executable and is therefore immutable.  In general, the
conformance of lex-produced C code to the 1989 ANSI/ISO C Standard is
low on most UNIX systems, and numerous warnings are to be expected
from compilation of biblex.c.

On DECstation ULTRIX 4.3, "make dec-mips-ultrix-g++" failed because
the lex output in biblex.c is not C++-conformant.  ULTRIX development
is frozen by DEC, so this problem will never be fixed.  Use
"make dec-mips-ultrix-gcc" or "make dec-mips-ultrix-cc" instead.

The version 0.08 release has been successfully built and minimally
tested on UNIX systems using these targets:

	cdc-mips-epix-cc
	cray-el94
	dec-alpha-osf1-c++
	dec-alpha-osf1-cc
	dec-alpha-osf1-g++
	dec-alpha-osf1-gcc
	dec-mips-ultrix-cc
	dec-mips-ultrix-gcc
	hp-9000-hpux-c++
	hp-9000-hpux-cc
	ibm-rs6000-aix-cc
	ibm-rs6000-aix-c++
	ibm-rs6000-aix-g++
	ibm-rs6000-aix-gcc
	next-motorola-mach-cc
	next-motorola-mach-gcc
	sgi-mips-irix-c++
	sgi-mips-irix-cc
	sgi-mips-irix-g++
	sgi-mips-irix-gcc
	sun-sparc-solaris2-g++
	sun-sparc-solaris2-gcc
	sun-sparc-solaris2-c++
	sun-sparc-solaris2-cc
	sun-sparc-solaris2-lcc
	sun-sparc-sunos4-cc
	sun-sparc-sunos4-gcc
	sun-sparc-sunos4-g++
	sun-sparc-sunos4-lcc

Compilation on the Cray EL94 with C++ (CC) fails because UNICOS' lex
does not produce code that is acceptable to C++.


===========
PERFORMANCE
===========

The C implementation was based on the version 0.07 prototype in awk.
bibcheck.c is 3.5 times as long as bibcheck.awk.  When the hash table
and regular expression support code and the header files are included,
the C code total rises to 8482 lines, compared to 378 lines of awk, a
factor of 22.4.

The C version is faster: on jacm.bib (a bibliography of the Journal of
the ACM), it runs 3.04 times faster than the stream

	biblex <../bib/jacm.bib | time nawk -f bibcheck.awk  >foo.old

on a Sun SPARCstation LX entry-level workstation running Solaris 2.3,
using Sun C++ compilation with -O4 (highest) optimization.  On a
high-end HP 9000/735 with C++ +O3 compilation on HP-UX 9.0, the
speedup is only 1.51.  On an entry-level DEC Alpha 3000/300LX system
with C++ -O2 compilation on OSF/1 3.0, the speedup is 1.85.

Profiling of the C implementation shows that major portions of time
are spent in regexec() (and its descendants) and strchr(), neither of
which can be sped up much.  The author of the regexp package used in
bibcheck has already spent a good deal of effort optimizing the code,
particularly for the common cases of simple regular expressions.

Here is part of a profile from the HP 9000/735 compilation using
jacm.bib as test input:

%time cumsecs seconds   calls   msec/call  name
 34.0   15.10   15.10                     _mcount
 26.5   26.87   11.78  5329777       0.00 regmatch(char*)
 10.5   31.54    4.67  5238794       0.00 _strchr
  8.9   35.49    3.96 12323351       0.00 regnext(char*)
  7.9   38.98    3.48  4324771       0.00 regtry(regexp*,const char*)
  2.5   40.11    1.13   842028       0.00 _strlen
  1.5   40.79    0.68   226627       0.00 regexec(regexp*,const char*)
  1.5   41.45    0.67   186771       0.00 yylook
  0.8   41.80    0.35   368481       0.00 stricmp(const char*,const char*)
  0.6   42.08    0.28   695279       0.00 regrepeat(char*)
  0.5   42.28    0.20   470573       0.00 next_char(void)
  0.4   42.47    0.19    65304       0.00 hash_lookup(const char*,hash_table*)
  0.3   42.61    0.14    65304       0.00 hash(const char*,const hash_table*)
  0.3   42.73    0.12    27152       0.00 _doprnt
  0.3   42.85    0.12    15750       0.01 out_string(void)
  0.2   42.96    0.11   186771       0.00 yylex
  0.2   43.05    0.09   360056       0.00 __toupper
  0.2   43.12    0.07      122       0.57 read
...

The _mcount function is part of the profiling software; it usually
accounts for the largest fraction of time.
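
Because _mcount is profiling overhead rather than program work, the
remaining percentages can be rescaled to estimate each function's
share of an unprofiled run.  The sketch below does that arithmetic for
the largest entries in the listing above.  The function name
rescale_without_mcount is purely illustrative, and the result is only
an estimate: the listing is truncated, and profiling overhead is not
necessarily spread evenly over the profiled functions.

```python
# %time column from the HP 9000/735 profile above (largest entries only).
PROFILE_PERCENT = {
    "_mcount": 34.0,
    "regmatch": 26.5,
    "_strchr": 10.5,
    "regnext": 8.9,
    "regtry": 7.9,
    "_strlen": 2.5,
    "regexec": 1.5,
    "yylook": 1.5,
    "regrepeat": 0.6,
}

# regexec() and its descendants, as they appear in the listing.
REGEXP_FUNCTIONS = ("regmatch", "regnext", "regtry", "regexec", "regrepeat")


def rescale_without_mcount(percent):
    """Return each function's share of the time not spent in _mcount."""
    remaining = 100.0 - percent["_mcount"]
    return {
        name: 100.0 * value / remaining
        for name, value in percent.items()
        if name != "_mcount"
    }


if __name__ == "__main__":
    shares = rescale_without_mcount(PROFILE_PERCENT)
    regexp_total = sum(shares[name] for name in REGEXP_FUNCTIONS)
    print("regexec and descendants: %.1f%%" % regexp_total)
    print("_strchr:                 %.1f%%" % shares["_strchr"])
```

On these figures, regexec() and its descendants account for roughly
69% of the non-profiling time, and strchr() for about 16%.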

=================
CODE OPTIMIZATION
=================

An experiment was made with five different C and C++ compilers on a
Sun SPARCstation LX running Solaris 2.3, to see what the effect of
code optimization might be.  All compilers are recent releases (late
fall, 1994):

	gcc	2.6.0
	g++	2.6.0
	cc	3.0.1
	c++	3.0.1
	lcc	3.1

Here are the results, sorted in order of increasing CPU time, using
two large bibliographies (you need a display 150 characters wide to
view these tables).  Where possible, procedure inlining was requested
for functions known from profiling to be important, and for all but
lcc, code generation was requested for the more recent SPARC Version 8
architecture, which added integer multiply and divide instructions.

Two additional tests, indicated below with *********, were made by
removing the -mv8 option from the fastest case; the loss of integer
multiply and divide instructions slows the code by about 7%.

Finally, two tests, indicated below with #########, were made by
removing function inlining from the fastest case; this slows the code
by about 2%.

======================================================================================================================================================
----------Time (sec)----------			cacm.bib (1548KB, 43807 lines, 2699 bibliographic entries)
real	user	sys   user+sys  make command
======================================================================================================================================================
157.7	156.7	0.3	157	make OPT=-O2\ -finline-functions\ -mv8 CC='gcc -D__solaris'
157.8	156.7	0.3	157	make OPT=-O3\ -finline-functions\ -mv8 CC='gcc -D__solaris'
160.2	159.1	0.4	159.5	make OPT=-O1\ -finline-functions\ -mv8 CC='gcc -D__solaris'
160.1	159.0	0.5	159.5	make OPT=-O2\ -mv8 CC='gcc -D__solaris' #########
168.8	167.3	0.3	167.6	make OPT=-O1\ -finline-functions\ -mv8 CC='g++ -D__solaris -D__EXTERN_C__'
168.1	167.1	0.5	167.6	make OPT=-O2\ -finline-functions CC='gcc -D__solaris' *********
169.4	168.2	0.4	168.6	make OPT=-xO2\ -xcg92\ -xinline=regmatch,regnext,regtry,strchr,strlen,yylook,stricmp CC='cc -Xc -D__ACC__ -D__solaris'
175.4	174.2	0.3	174.5	make OPT=-xO3\ -xcg92\ -inline=regmatch,regnext,regtry,strchr,strlen,yylook,stricmp CC='CC -D__solaris -D__EXTERN_C__'
176.5	175.2	0.4	175.6	make OPT=-xO2\ -xcg92\ -inline=regmatch,regnext,regtry,strchr,strlen,yylook,stricmp CC='CC -D__solaris -D__EXTERN_C__'
182.5	181.4	0.3	181.7	make OPT=-O3\ -finline-functions\ -mv8 CC='g++ -D__solaris -D__EXTERN_C__'
182.6	181.4	0.3	181.7	make OPT=-O2\ -finline-functions\ -mv8 CC='g++ -D__solaris -D__EXTERN_C__'
189.4	188.5	0.3	188.8	make OPT=-xO3\ -xcg92\ -xinline=regmatch,regnext,regtry,strchr,strlen,yylook,stricmp CC='cc -Xc -D__ACC__ -D__solaris'
213	211.9	0.3	212.2	make CC='lcc -A -A -D__solaris'
250.7	249.3	0.3	249.6	make OPT=-g\ -finline-functions\ -mv8 CC='gcc -D__solaris'
259.8	258.4	0.4	258.8	make OPT=-xO1\ -xcg92\ -xinline=regmatch,regnext,regtry,strchr,strlen,yylook,stricmp CC='cc -Xc -D__ACC__ -D__solaris'
292.4	291.1	0.3	291.4	make OPT=-g\ -finline-functions\ -mv8 CC='g++ -D__solaris -D__EXTERN_C__'
356.2	354.7	0.4	355.1	make OPT=-xO1\ -xcg92\ -inline=regmatch,regnext,regtry,strchr,strlen,yylook,stricmp CC='CC -D__solaris -D__EXTERN_C__'
364.6	361.3	0.3	361.6	make OPT=-g\ -xcg92\ -xinline=regmatch,regnext,regtry,strchr,strlen,yylook,stricmp CC='cc -Xc -D__ACC__ -D__solaris'
426.3	424.1	0.4	424.5	make OPT=-g\ -xcg92\ -inline=regmatch,regnext,regtry,strchr,strlen,yylook,stricmp CC='CC -D__solaris -D__EXTERN_C__'
======================================================================================================================================================

======================================================================================================================================================
----------Time (sec)----------			jacm.bib (990KB, 30548 lines, 2045 bibliographic entries)
real	user	sys   user+sys  make command
======================================================================================================================================================
79.1	78.4	0.2	78.6	make OPT=-O2\ -finline-functions\ -mv8 CC='gcc -D__solaris'
79.3	78.5	0.2	78.7	make OPT=-O3\ -finline-functions\ -mv8 CC='gcc -D__solaris'
80.8	79.8	0.3	80.1	make OPT=-O1\ -finline-functions\ -mv8 CC='gcc -D__solaris'
81.0	80.2	0.3	80.5	make OPT=-O2\ -mv8 CC='gcc -D__solaris' #########
83.9	83.2	0.2	83.4	make OPT=-xO2\ -xcg92\ -xinline=regmatch,regnext,regtry,strchr,strlen,yylook,stricmp CC='cc -Xc -D__ACC__ -D__solaris'
85.8	84.3	0.2	84.5	make OPT=-O1\ -finline-functions\ -mv8 CC='g++ -D__solaris -D__EXTERN_C__'
85.2	84.3	0.3	84.6	make OPT=-O2\ -finline-functions CC='gcc -D__solaris' *********
87.4	86.2	0.3	86.5	make OPT=-xO2\ -xcg92\ -inline=regmatch,regnext,regtry,strchr,strlen,yylook,stricmp CC='CC -D__solaris -D__EXTERN_C__'
89.8	88.6	0.3	88.9	make OPT=-xO3\ -xcg92\ -inline=regmatch,regnext,regtry,strchr,strlen,yylook,stricmp CC='CC -D__solaris -D__EXTERN_C__'
92.9	92	0.2	92.2	make OPT=-O3\ -finline-functions\ -mv8 CC='g++ -D__solaris -D__EXTERN_C__'
94.9	92	0.3	92.3	make OPT=-O2\ -finline-functions\ -mv8 CC='g++ -D__solaris -D__EXTERN_C__'
95.1	94.2	0.2	94.4	make OPT=-xO3\ -xcg92\ -xinline=regmatch,regnext,regtry,strchr,strlen,yylook,stricmp CC='cc -Xc -D__ACC__ -D__solaris'
105.8	104.9	0.3	105.2	make CC='lcc -A -A -D__solaris'
121.9	121.1	0.2	121.3	make OPT=-g\ -finline-functions\ -mv8 CC='gcc -D__solaris'
135.2	133	0.3	133.3	make OPT=-xO1\ -xcg92\ -xinline=regmatch,regnext,regtry,strchr,strlen,yylook,stricmp CC='cc -Xc -D__ACC__ -D__solaris'
138.9	137.7	0.3	138	make OPT=-g\ -finline-functions\ -mv8 CC='g++ -D__solaris -D__EXTERN_C__'
176.6	175.6	0.3	175.9	make OPT=-g\ -xcg92\ -xinline=regmatch,regnext,regtry,strchr,strlen,yylook,stricmp CC='cc -Xc -D__ACC__ -D__solaris'
178.6	177.1	0.3	177.4	make OPT=-xO1\ -xcg92\ -inline=regmatch,regnext,regtry,strchr,strlen,yylook,stricmp CC='CC -D__solaris -D__EXTERN_C__'
211.3	209	0.3	209.3	make OPT=-g\ -xcg92\ -inline=regmatch,regnext,regtry,strchr,strlen,yylook,stricmp CC='CC -D__solaris -D__EXTERN_C__'
======================================================================================================================================================
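
The slowdown figures quoted before the tables (about 7% without -mv8,
about 2% without inlining) can be rechecked from the user+sys column.
The following is a small sketch of that arithmetic.  The helper name
slowdown_percent is illustrative only; the times are copied from the
fastest gcc row of each table and from the rows marked ********* and
#########.

```python
# user+sys times (seconds) copied from the two tables above.
TIMES = {
    "cacm.bib": {"fastest": 157.0, "no_mv8": 167.6, "no_inline": 159.5},
    "jacm.bib": {"fastest": 78.6, "no_mv8": 84.6, "no_inline": 80.5},
}


def slowdown_percent(baseline, variant):
    """Percentage by which `variant` is slower than `baseline`."""
    return 100.0 * (variant - baseline) / baseline


if __name__ == "__main__":
    for name, t in TIMES.items():
        print("%s: no -mv8 %+.1f%%, no inlining %+.1f%%" % (
            name,
            slowdown_percent(t["fastest"], t["no_mv8"]),
            slowdown_percent(t["fastest"], t["no_inline"]),
        ))
```

This gives roughly 6.8% and 7.6% for the loss of -mv8, and 1.6% and
2.4% for the loss of inlining, consistent with the rounded figures in
the text.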
