/* ================================================================= *\

   bibindex -- a program to index bibtex files, used in conjunction
	       with biblook

   This program was specifically developed for use with the
   computational geometry bibliographic database, and assumes the
   rules thereof.  The database may be obtained by anonymous ftp
   from cs.usask.ca in the file "pub/geometry/geombib.tar.Z".

   Version 1.0 written by Jeff Erickson <jeff@ics.uci.edu>, 27 Mar 92
   Version 2.0 written by Jeff Erickson <jeff@ics.uci.edu>, 17 Jun 92

   This program is in the public domain.  You may use it or modify
   it to your heart's content, at your own risk.

   %Make% gcc -O -o bibindex bibindex.c

   Usage: bibindex bibfile [-i field ...]

   -----------------------------------------------------------------

   HOW IT WORKS:

   The bibtex file is read field by field.  The file offset beginning
   each record and each record's citation key are recorded.  A list of
   words is extracted from each field.  These words are placed into
   tables, which remember which records contain them in their
   respective fields.  Once the file has been completely read, the
   hash tables are compacted and sorted.

   The hash tables are extendible, since we have to maintain one for
   each possible field type, and static tables would be way too big.
   Initially, each table holds 1K entries, and the tables are doubled
   whenever they reach 50% capacity.  Each table entry is at least 24
   bytes.  If the resulting hash tables use too much memory, the
   entries should be changed to pointers, allocated on the fly.

   The entry lists associated with each word are implemented as
   extendible arrays.  Initially, each list holds eight entries.  If a
   new entry is inserted into a full list, the list is doubled first.

   The index file has the following format (loosely):

	version info
	# entries
	array of offsets into bib file	-- one per entry
	# field types
	array of field names		-- one per field type
	array of			-- one per field type
	    # words
	    array of			-- one per word
		word			-- in alphabetical order
		# locations
		array of entry #s	-- one per location

   There are advantages and disadvantages of having multiple hash
   tables instead of a single table.  I am starting with the premise
   that the lookup program should be very fast.  Consequently, I can't
   make it determine which fields contain a given word.  Doing so
   would require putting ALL of the entry-parsing code into the lookup
   program.  It would also mean potentially parsing a lot of
   extraneous entries to find relatively common words in relatively
   uncommon places (eg, "title edelsbrunner").

   If there were a single word table, each word list would have to
   include bitmasks to indicate the appropriate fields along with the
   entry numbers.  Assuming between 16 and 32 field types (the CG bib
   uses about 24), this would triple the size of each entry.  On the
   average, each word occurs in less than two field types.  The
   bitmask approach would also require knowledge of the field names in
   advance; the multiple table approach does not.

   -----------------------------------------------------------------
   VERSION HISTORY:

   1.0 <jge> 3/26/92	Initial version completed
   1.1 <jge> 3/27/92	Words restricted to letters only; special
			rules added for apostrophes, so that words
			like "didn't" and "O'Rourke" can be handled
			correctly.
   1.2 <jge> 3/30/92	Faster hash function; now compressing hash
			tables before sorting them.  Execution time on
			the CG bib reduced by a factor of thirty-five.
   1.3 <jge> 4/2/92	Toyed with the hash function some more, trying
			to reduce collisions, with limited success.
   1.4 <jge> 4/17/92	Added exit(0) at the end of main()  [I thought
			that was supposed to be automatic!]

   2.0 <jge> 6/12/92	First major revision completed.
	1. Major change in file format -- word tables for every
	   field instead of just author, title, and keywords.
	2. Word hash tables made extendible.
	3. Fixed bug in GetNextWord that would cause the function
	   to return inside nested braces.
	4. Fixed bug in MungeField that would kill everything in an
	   entry after an abbreviation.  Abbreviations now go into
	   the appropriate hash table with the other words.
	5. Made GetNextWord consider numbers and math expressions
	   to be words.
	6. Added WriteWord, resulting in 40% savings in disk space.
	7. Added -i flag and black holes.  Ignoring "oldlabel"
	   gives another 15% savings (6/92 version of CGbib).
   2.1 <jge> 7/9/92	Minor bug fixes.
   2.2 Nelson H. F. Beebe <beebe@math.utah.edu> 03-Oct-1992
        Testing with >100K lines of .bib files led to these changes:
	1. Add support for complete BibTeX keyword name syntax with
	   iskeychar() function.
	2. Add support for trailing comma after last key = "value" entry.
        3. Record input line number for display in error messages.
        4. Record initial line number of each BibTeX entry for
           display in error messages to better localize error.
	5. Add test for word buffer overflow in MungeField() to prevent
	   run-time core dumps, and increase supported word size from
	   15 to 31 (a word-length histogram of a 116921-word dictionary
	   found words up to 28 characters long, with 1% longer than 15).
	   File version increased to 2 because of word size change.
	6. Add typecasts in calls to qsort() and comparisons of
	   unsigned short with short, change main() from void to int,
	   remove variable from initializer of msg[2], and add void to
	   IndexBibFile() definition to allow compilation with C++ as
	   well as C for better compile-time checking.
	7. In MungeEntry(), do an ungetc() after key name collection.
	   Otherwise, a key="value" pair without a blank before the =
	   will not be recognized because the = read on lookahead has
	   not been put back.
	8. Revise table compression code in OutputTables() and
	   code in FreeTables() to avoid duplicate frees, which is
	   a fatal error on many systems, and specified to result
	   in undefined behavior by the 1989 ANSI/ISO C Standard.
	9. Define bcopy() as a macro to invoke standard memcpy()
	   instead.
       10. Include time.h, unistd.h, and malloc.h to get proper
           function declarations for library routines needed here.
       11. Add DEBUG_MALLOC support.
       12. Change several char* types to const char*.
       13. Correct some typographical errors in comment.
       14. Write UNIX manual pages for bibindex and biblook.
       15. Allow command-line to specify a filename with .bib extension.
       16. Add help support to biblook.
       17. Correct error in FreeTables() in biblook.c; inner loop 
           incremented i instead of j.

\* ================================================================= */


#include <stdio.h>
#if (__STDC__ || __cplusplus || c_plusplus)
#include <stdlib.h>
#endif /*  (__STDC__ || __cplusplus || c_plusplus) */
#include <string.h>
#include <ctype.h>
#include <time.h>
#if !__NeXT__
#include <unistd.h>
#include <malloc.h>
#endif /* !__NeXT__ */

#if ardent
#include <sys/types.h>
#endif

#if (__STDC__ || __cplusplus || c_plusplus)
#define VOID	void
#else /* NOT (__STDC__ || __cplusplus || c_plusplus) */
#define VOID
#endif /* (__STDC__ || __cplusplus || c_plusplus) */

#undef bcopy
#define bcopy(source,target,length) \
	memcpy((char*)(target),(const char*)(source),length)

#if sun
int	_filbuf(FILE *);		/* missing from stdio.h */
int	_flsbuf(unsigned char,FILE *);	/* missing from stdio.h */
#endif /* sun */

#if DEBUG_MALLOC
			/* For dynamic memory debugging. */
			/* Note that the parens around free and malloc */
			/* are essential to prevent macro expansion in */
			/* ANSI/ISO Standard C and C++.  Erroneous */
			/* preprocessors will go into an infinite loop */
			/* (e.g. SunOS /usr/CC/sun4/cpp) */
#if sun
int malloc_debug(int level);
int malloc_verify(void);
#else /* NOT sun */
#define malloc_debug(level)	level
#define malloc_verify()		1
#endif /* sun */
#undef free
#undef malloc
#define free(ptr) (malloc_verify(), \
		fprintf(stderr,"line %d: free(%p)\n",(int)__LINE__,(ptr)), \
		(free)(ptr))
static void *p__;			/* for malloc() debugging */
#define malloc(s) (malloc_verify(),\
		   p__ = (malloc)(s),\
		   fprintf(stderr,"line %d: malloc(%ld) -> %p\n",\
			   (int)__LINE__,(s),(p__)),\
		   p__)
#endif /* DEBUG_MALLOC */

#define FILE_VERSION  	2	/* file format version */
#define MAJOR_VERSION 	2	/* program version     */
#define MINOR_VERSION 	2

static long line_number = 1L;		/* for debug messages */
static long initial_line_number = 1L;


/* ======================= UTILITY FUNCTIONS ======================= */

/* ----------------------------------------------------------------- *\
|  void die(const char *msg1, const char *msg2) -- print an error message and die
\* ----------------------------------------------------------------- */
void die(const char *msg1, const char *msg2)
{
    fprintf(stderr,
    "\nError:\tin BibTeX entry starting at line %ld, \
error detected at line %ld:\n\t%s %s\n",
	    initial_line_number, line_number, msg1, msg2);
    exit(1);
}

/* ----------------------------------------------------------------- *\
|  char safegetc(FILE *fp, const char *where)
|
|  Get the next character safely.  Used by routines that assume that
|  they won't run into the end of file.
\* ----------------------------------------------------------------- */
char safegetc(FILE *fp, const char *where)
{
    char c;

    if (feof(fp))
	die("Unexpected end of file in", where);

    c = getc(fp);
    if (c == '\n')
	line_number++;
    return (c);
}

/* ----------------------------------------------------------------- *\
|  void *safemalloc(unsigned howmuch, const char *msg1, const char *msg2)
|
|  Allocate memory safely.  Used by routines that assume they won't
|  run out of memory.
\* ----------------------------------------------------------------- */
void *safemalloc(unsigned howmuch, const char *msg1, const char *msg2)
{
    register void *tmp = NULL;

    tmp = malloc(howmuch);
    if (tmp == NULL)
	die(msg1, msg2);

    return tmp;
}


/* ====================== HASH TABLE FUNCTIONS ===================== *\

   The hash tables start small and double whenever they reach 50%
   capacity.  Hashing is performed by going through the string one
   character at a time, multiplying by a constant and adding in the
   new character value each time.  The constant is defined to give
   the same even spread (about size/sqrt(2)) between successive
   multiples, as long as the hash table size is a power of two.

   Collisions are resolved by double hashing.  Since the hash table
   size is always a power of two, the secondary hash value has to be
   odd to avoid loops.

   The field tables are non-extendible hash tables, otherwise handled
   the same way.  It is probably well worth the effort to fine tune
   the field table hash function in order to avoid collisions.

   The field tables associated with ignored fields are black holes.
   Everything is the same, except that InsertEntry doesn't actually
   DO anything.

\* ================================================================= */

#define MAXWORD	  	31		/* maximum length of word indexed */
#define MAXFIELDS	64	 	/* intentionally way too much */

#define INIT_HASH_SIZE	256
#define HASH_CONST   	1482907		/* prime close to 2^{20.5} */

typedef char Word[MAXWORD+1];

typedef struct		/* Hash table entry */
{
    Word    theword;	/* the hashed word */
    short   number;	/* number of references in the list */
    short   size;	/* real size of reference list */
    short   *refs;	/* actual list of references */
} HashCell, *HashPtr;

typedef struct		/* Extendiable hash table */
{
    Word    thefield;	/* the field type */
    short   number;	/* number of words in the hash table */
    short   size;	/* real size of the hash table */
    HashPtr words;	/* the actual hash table */
} ExHashTable;

static ExHashTable fieldtable[MAXFIELDS];	/* the field tables */
static char        numfields;			/* number of fields */


/* ----------------------------------------------------------------- *\
|  void InitTables(void)
|
|  Initialize the field tables
\* ----------------------------------------------------------------- */
void InitTables(VOID)
{
    register unsigned short i;

    numfields = 0;
    for (i=0; i<MAXFIELDS; i++)
    {
	fieldtable[i].thefield[0] = 0;
	fieldtable[i].number = 0;
	fieldtable[i].size = 0;
	fieldtable[i].words = NULL;
    }
}

/* ----------------------------------------------------------------- *\
|  void InitOneField(ExHashTable *htable)
|
|  Initialize one field's hash table
\* ----------------------------------------------------------------- */
void InitOneField(register ExHashTable *htable)
{
    register unsigned short i;

    htable->number = 0;
    htable->size = INIT_HASH_SIZE;
    htable->words = (HashPtr) safemalloc(INIT_HASH_SIZE*sizeof(HashCell),
					 "Can't create hash table for",
					 htable->thefield);
    for (i=0; i<INIT_HASH_SIZE; i++)
    {
	htable->words[i].theword[0] = 0;
	htable->words[i].number = 0;
	htable->words[i].size = 0;
	htable->words[i].refs = NULL;
    }
}

/* ----------------------------------------------------------------- *\
|  void FreeTables(void)
|
|  Free the tables
\* ----------------------------------------------------------------- */
void FreeTables(VOID)
{
    register unsigned short i,j;

    for (i=0; i < (unsigned short)numfields; i++)
    {
	if (fieldtable[i].words)
	{
	    for (j=0; j < (unsigned short)(fieldtable[i].number); j++)
	    {
		if (fieldtable[i].words[j].refs)
		    free(fieldtable[i].words[j].refs);
	    }
	    free(fieldtable[i].words);
	}
    }
}


/* ----------------------------------------------------------------- *\
|  ExHashTable *GetHashTable(char *field)
|
|  Get the hash table associated with the given field.  If the table
|  is unclaimed, claim it and initialize it.
\* ----------------------------------------------------------------- */
ExHashTable *GetHashTable(char *field)
{
    register unsigned long hash=0;	/* primary hash value	*/
    register unsigned long skip=1;	/* secondary hash value */
    register int i;

    for (i=0; field[i]; i++)
    {
	hash = (hash*HASH_CONST + field[i]) % MAXFIELDS;
	skip += 2*hash;
    }

    while (fieldtable[hash].thefield[0] &&	     /* cell not empty and  */
	   strcmp(fieldtable[hash].thefield, field)) /* not the right field */
    {
	hash = (hash+skip) % MAXFIELDS;
    }

    if (!fieldtable[hash].thefield[0])
    {
	strcpy(fieldtable[hash].thefield, field);
	InitOneField(fieldtable+hash);
	numfields++;
	if (numfields > MAXFIELDS)
	    die("too many field names",field);
    }

    return fieldtable+hash;
}

/* ----------------------------------------------------------------- *\
|  void InitBlackHole(char *field)
|
|  Intitialize a black hole for the given field.
\* ----------------------------------------------------------------- */
void InitBlackHole(char *field)
{
    ExHashTable *hole;

    hole = GetHashTable(field);
    free(hole->words);
    hole->words = NULL;
}

/* ----------------------------------------------------------------- *\
|  HashPtr GetHashCell(ExHashTable *table, char *word)
|
|  Get the hash table cell associated with the given word.  If the cell
|  is unclaimed, claim it, initialize it, and update the table's word
|  count.
\* ----------------------------------------------------------------- */
HashPtr GetHashCell(ExHashTable *htable, char *word)
{
    register HashPtr table, cell;
    register unsigned long hash=0;	/* primary hash value	*/
    register unsigned long skip=1;	/* secondary hash value */
    register int i;

    table = htable->words;

    for (i=0; word[i]; i++)
    {
	hash = (hash*HASH_CONST + word[i]) % htable->size;
	skip += 2*hash;
    }

    while (table[hash].theword[0] &&		/* cell not empty and */
	   strcmp(table[hash].theword, word))	/* not the right word */
    {
	hash = (hash+skip) % htable->size;
    }
    cell = table+hash;

    if (!cell->theword[0])
    {
	strcpy(cell->theword, word);
	cell->size = 8;
	cell->refs = (short *) safemalloc(cell->size * sizeof(short),
					     "Can't create entry list for",
					     word);
	htable->number++;
    }
    return cell;
}

/* ----------------------------------------------------------------- *\
|  void ExtendHashTable(ExHashTable *htable)
|
|  Double the size of the hash table and rehash everything.
\* ----------------------------------------------------------------- */
void ExtendHashTable(ExHashTable *htable)
{
    register unsigned short i;
    register HashPtr newcell;
    register HashPtr oldtable;
    short oldsize;

    oldsize  = htable->size;
    oldtable = htable->words;

    htable->number = 0;
    htable->size *= 2;
    htable->words = (HashPtr) safemalloc(sizeof(HashCell)*htable->size,
					 "Can't extend hash table for",
					 htable->thefield);

    for (i=0; i < (unsigned short)(htable->size); i++)
    {
	htable->words[i].theword[0] = 0;
	htable->words[i].number = 0;
	htable->words[i].size = 0;
	htable->words[i].refs = NULL;
    }

    for (i=0; i< (unsigned short)oldsize; i++)
    {
	if (oldtable[i].theword[0])
	{
	    newcell = GetHashCell(htable, oldtable[i].theword);
	    *newcell = oldtable[i];
	}
    }

    free(oldtable);
}

/* ----------------------------------------------------------------- *\
|  void InsertEntry(ExHashTable *htable, char *word, short entry)
|
|  Insert the word/entry pair into the hash table, unless it's
|  already there.
\* ----------------------------------------------------------------- */
void InsertEntry(ExHashTable *htable, char *word, short entry)
{
    register HashPtr cell;
    short *newlist;

    if (htable->words == NULL)		/* Is it a black hole? */
	return;

    if (htable->number * 2 > htable->size)
	ExtendHashTable(htable);

    cell = GetHashCell(htable, word);

    if (cell->number && (cell->refs[cell->number-1] == entry))
	return;

    if (cell->number == cell->size)	/* expand the array */
    {
	cell->size *= 2;
	newlist = (short *) safemalloc(cell->size * sizeof(short),
				       "Can't extend entry list for",
				       word);

	bcopy(cell->refs, newlist, cell->number * sizeof(short));
	free(cell->refs);
	cell->refs = newlist;
    }
    cell->refs[cell->number++] = entry;
}


/* ================================================================= */

/* ----------------------------------------------------------------- *\
|  void WriteWord(FILE *ofp, char *word)
|
|  Output the word in "Pascal" string format -- length byte followed
|  by characters.  This saves some disk space over writing 16 bytes
|  in all cases.
\* ----------------------------------------------------------------- */
void WriteWord(FILE *ofp, char *word)
{
    unsigned char length = strlen(word);
    fwrite((void *) &length, sizeof(char), 1, ofp);
    fwrite((void *) word, sizeof(char), length, ofp);
}

/* ----------------------------------------------------------------- *\
|  void OutputTables(FILE *ofp)
|
|  Compress and output the tables, with lots of user feedback.
|  KLUDGE -- Passing strcmp to qsort assumes intimate knowledge of
|  the HashCell and ExhashTable structs.
\* ----------------------------------------------------------------- */
void OutputTables(FILE *ofp)
{
    register HashPtr words;
    register int i, j, k, count;

    printf("Writing index tables...");
    fflush(stdout);

    numfields = 0;		/* recount, ignoring black holes */
    for (i=0; i<MAXFIELDS; i++)
    {
	if (fieldtable[i].words)
	{
	    if (i > numfields)
	    {
		fieldtable[numfields] = fieldtable[i]; /* copy i-th table */
	        fieldtable[i].number = 0; /* then clear i-th table */
	        fieldtable[i].size = 0; /* to avoid duplicate free() later */
	        fieldtable[i].words = NULL;
	    }
	    numfields++;
	}
    }
    qsort(fieldtable, (size_t)numfields, sizeof(ExHashTable),
	  (int (*)(const void*,const void*))strcmp);

    fwrite((void *) &numfields, sizeof(char), 1, ofp);
    for (i=0; i<numfields; i++)
	WriteWord(ofp, fieldtable[i].thefield);

    printf("[%d fields]\n", numfields);

    for (k=0; k<numfields; k++)
    {
	printf("%2d: %s...", k+1, fieldtable[k].thefield);
	fflush(stdout);

	words = fieldtable[k].words;

	for (i=0, j=0; i<fieldtable[k].size; i++)
	{
	    if (words[i].theword[0])
	    {
		if (i > j)
		{
		    words[j] = words[i]; /* copy i-th table to j-th */
		    words[i].number = 0; /* then clear i-th table */
		    words[i].size = 0;	/* to avoid duplicate free() later */
		    words[i].refs = (short*)NULL;
		}
		j++;
	    }
	}
	qsort(words, (size_t)fieldtable[k].number, sizeof(HashCell),
	      (int (*)(const void*,const void*))strcmp);

	fwrite((void *) &(fieldtable[k].number), sizeof(short), 1, ofp);

	count = 0;
	for (i=0; i<fieldtable[k].number; i++)
	{
	    WriteWord(ofp, words[i].theword);
	    fwrite((void *) &(words[i].number), sizeof(short), 1, ofp);
	    fwrite((void *) words[i].refs, sizeof(short),
		   words[i].number, ofp);
	    count += words[i].number;
	}

	printf("[%d words, %d references]\n", fieldtable[k].number, count);
    }
}


/* ============================= INPUT ============================= *\

   I'd like this to work with more than just the CG bib, so I can't
   assume very much about the input.  In particular, all white space
   (blanks, tabs, and newlines) is identical most of the time.  On the
   other hand, it would be nice to include any "comments" that
   obviously go with an entry as part of that entry.  Consequently, two
   newlines in a row (possibly separated by other white space) is
   considered a break between entries.  This will give us bogus entries
   for stray "comments", but we can take care that easily enough -- an
   entry is only real if it contains a @ character.

\* ================================================================= */

#define NUMBAD 9
char badwords[NUMBAD][5] = {		/* words not to index */
    "an", "and", "for", "in", "of",
    "on", "the", "to", "with"
};

#define NONABBREV ",\n\t \"#%'()={}"	/* See LaTeX book, p.143 */

/* ----------------------------------------------------------------- *\
|  unsigned long FindNextEntry(FILE *ifp)
|
|  Return the file offset to the next entry in the bib file.  On exit,
|  the file pointer is left at the first comma.  Entries containing no
|  commas, such as @string, are skipped.  The entry begins after the
|  most recent blank line, the end of the last entry, or the beginning
|  of the file.  Returns "-1" if there is no next entry.
\* ----------------------------------------------------------------- */
unsigned long FindNextEntry(FILE *ifp)
{
    char ch;
    char blank = 0;	  	/* 1 if current line is blank so far */
    char braces = 0;
    char quotes = 0;
    unsigned long offset;

    offset = ftell(ifp);
    ch = getc(ifp);
    if (ch == '\n')
	line_number++;

    while (1)
    {
	if (feof(ifp))
	    return (unsigned long) -1;

	if (ch == '@')			/* got a potential entry */
	{
	    initial_line_number = line_number; /* record start of entry */
	    blank = 0;
	    while ((ch != '{') && (ch != '('))
		ch = safegetc(ifp, "FindNextEntry 1");	/* EOF is an */
							/* error here */
	    quotes = 0;
	    braces = 0;
	    ch = safegetc(ifp, "FindNextEntry 2");

	    while (quotes || braces ||
		   ((ch != ',') && (ch != '}') && (ch != ')')))
	    {
		if (ch == '{')
		    braces++;
		else if (ch == '}')
		    braces--;
		else if ((ch == '"') && !braces)
		    quotes = !quotes;

		ch = safegetc(ifp, "FindNextEntry 3");
	    }
	    if (ch == ',')
	    {
		ungetc(ch, ifp);
		return offset;
	    }
	    else
		offset = ftell(ifp);
	}
	else if (ch == '\n')
	{
	    if (blank)
		offset = ftell(ifp);
	    blank = 1;
	}
	else if (!isspace(ch))
	    blank = 0;
	ch = getc(ifp);
	if (ch == '\n')
	    line_number++;
    }
}

/* ----------------------------------------------------------------- *\
|  int GetNextWord(FILE *ifp, char *word)
|
|  Get the next word in the current field.  A word is any contiguous
|  set of letters and numbers, AFTER the following steps:
|	1. Letters are folded to lower case.  Thus, "Voronoi" is
|	   returned as "voronoi"
|	2. All TeX commands, except those in math expressions, are
|	   removed, but their arguments are left behind.  Thus,
|	   "Erd{\H o}ss" is returned as "erdos".
|	3. All other non-word characters are removed.  Non-word
|	   characters inside {{possibly} nested} braces or dollar
|          signs do not delimit words, so they may cause unexpected
|	   results.  Thus, "{this example}" is returned as
|	   "thisexample".
|	4. TeX commands in math expressions are considered normal
|	   text.  Thus, "$O(n\log^2 n)$" is returned as "onlog2n"
|	   instead of "onn".  This occasionally gives unexpected or
|	   unreadable results.  For example, "$\bigcup_1^n[a_i,b_i]$"
|	   is returned as "bigcup1naibi".
|	5. Apostrophes do not delimit words.  Thus, "didn't" is
|	   returned as "didnt", and "{\'O}'D{\'u}nlaing" is returned
|	   as "odunlaing".
|
|  Returns 1 if a word was actually found, 0 otherwise.
|
|  The file pointer is left at the character immediately following the
|  word.  The input is assumed to be syntactically correct: unbalanced
|  braces, math delimiters, and quotation marks will cause errors.
\* ----------------------------------------------------------------- */
int GetNextWord(FILE *ifp, char *word)
{
    char braces = 0;		/* levels of indented braces */
    char math = 0;		/* 1 if in math expression */
    char inword = 0;		/* 1 if word has started */
    char incmd = 0;		/* 1 if reading TeX command */
    char done = 0;		/* 1 if word is complete */
    char ch = ' ';
    int nchars = 0;		/* how many characters stored in word */

    while (!done)
    {
	ch = safegetc(ifp, "GetNextWord 1");

	if (isalpha(ch))		/* letters */
	{
	    if (!incmd)			/* may be part of TeX command */
	    {
		inword = 1;
		if (++nchars <= MAXWORD) /* ignore overflow in word buffer */
		    *word++ = tolower(ch);
	    }
	}
	else if (isdigit(ch))
	{
	    incmd = 0;
	    inword = 1;
	    if (++nchars <= MAXWORD) /* ignore overflow in word buffer */
	        *word++ = ch;
	}
	else if (ch == '\'')		/* ignore apostrophes entirely */
	    incmd = 0;
	else if (ch == '\\') 		/* beginning of TeX command */
	{
	    if (!math)			/* if math, pretend it's text */
	    {
		ch = safegetc(ifp, "GetNextWord 2");
		if (isalpha(ch))
		    incmd = 1;
	    }
	}
	else if (ch == '{')		/* left brace */
	{
	    incmd = 0;
	    braces++;
	}
	else if (ch == '}')		/* right brace */
	{
	    incmd = 0;
	    if (!braces)
	    {
		ungetc(ch, ifp);
		done = 1;
	    }
	    else
		braces--;
	}
	else if (ch == '"')		/* double quote */
	{
	    incmd = 0;
	    if (!braces)
	    {
		ungetc(ch, ifp);
		done = 1;
	    }
	}
	else if (ch == '$')		/* math */
	{
	    incmd = 0;
	    math = !math;
	    if (math)		/* treat $s as braces */
		braces++;
	    else
		braces--;
	}
	else if (incmd)			/* other characters */
	    incmd = 0;
	else if (inword && !braces)
	    done = 1;
    }

    *word = 0;
    return inword;
}

/* ----------------------------------------------------------------- *\
|  void MungeField(FILE *ifp, ExHashTable *htable, short entry)
|
|  Munge the current field.  Dump all words into the given hash table,
|  with the given entry number.  On exit, the file pointer is on the
|  comma or closing character for the entry.
\* ----------------------------------------------------------------- */
void MungeField(FILE *ifp, ExHashTable *htable, short entry)
{
    char ch, bad, i;
    char nextword[MAXWORD+1];

    ch = safegetc(ifp, "MungeField");
    while (isspace(ch))
	ch = safegetc(ifp, "MungeField");

    if (ch != '=')
	die("= expected in bibliography field", htable->thefield);

    ch = safegetc(ifp, "MungeField");
    while (isspace(ch))
	ch = safegetc(ifp, "MungeField");

    if (ch != '{' && ch != '"') 	/* handle abbreviations */
    {
	for (i=0; !strchr(NONABBREV, ch); i++)
	{
	    if (i >= MAXWORD)
		die("word buffer overflow", htable->thefield);
	    nextword[i] = tolower(ch);
	    ch = safegetc(ifp, "MungeField");
	}
	nextword[i] = 0;
	nextword[MAXWORD] = 0;
	InsertEntry(htable, nextword, entry);
    }
    else
    {
	while (GetNextWord(ifp, nextword))
	{
	    if (nextword[1] == 0)	/* ignore single letters */
		continue;
	    for (bad=0, i=0; !bad && i<NUMBAD; i++)
		if (!strcmp(nextword, badwords[i]))
		    bad = 1;
	    if (!bad)
	    {
		nextword[MAXWORD] = 0;		/* truncate to fit */
		InsertEntry(htable, nextword, entry);
	    }
	}
	ch = safegetc(ifp, "MungeField");	/* close quote/brace */
	ch = safegetc(ifp, "MungeField");	/* comma (maybe) */
    }

    while (isspace(ch))
	ch = safegetc(ifp, "MungeField");
    ungetc(ch, ifp);	  	       /* comma or close paren/brace */
}

/* ----------------------------------------------------------------- *\
|  int iskeychar(char c,int first_char)
|
|  Return 1 if c is a valid keyword character, and 0 otherwise.
|  The rules are different for the first character, so first_char
|  must be set non-zero to select that case.
|
|  The code in bibtex.web in the section on character set dependencies
|  creates a character classification array, id_class[], which permits
|  the following characters (verified by experiment too) in field
|  names:
|
|      A-Z a-z 0-9
|      ! $ & + - . / : ; < > ? @ [ \ ] ^ _ ` | ~ <DELete>
|
\* ----------------------------------------------------------------- */

int iskeychar(char c,int first_char)
{
    if (first_char)
        return (isalpha(c) ? 1 : 0);
    else if (isalpha(c) || isdigit(c))
	return (1);
    else if (strchr("!$&+-./:;<>?@[\\]^_`|~\177",(int)c) != (char*)NULL)
	return (1);
    else
	return (0);
}

/* ----------------------------------------------------------------- *\
|  void MungeEntry(FILE *ifp, short entry)
|
|  Wander through the entry, extracting words and depositing them in
|  the appropriate hash tables.
\* ----------------------------------------------------------------- */
void MungeEntry(FILE *ifp, short entry)
{
    register char ch;
    Word thefield;
    short i;

    ch = safegetc(ifp, "MungeEntry");
    while (ch == ',')
    {
	ch = safegetc(ifp, "MungeEntry");
	while (isspace(ch))
	    ch = safegetc(ifp, "MungeEntry");

	if ((ch == '}') || (ch == ')'))	/* allow trailing comma after */
	    return;			/* last key = "value" entry */

	if (!isalpha(ch))
	{
	    char msg[2];
	    msg[0] = ch;
	    msg[1] = 0;
	    die("Nonalpha character in field descriptor:", msg);
	}

	for (i=0; iskeychar(ch,i==0); i++)
	{
	    if (i >= MAXWORD)
	    {
		thefield[i] = 0;
		die("field name buffer overflow", thefield);
	    }
	    thefield[i] = tolower(ch);
	    ch = safegetc(ifp, "MungeEntry");
	}
	ungetc(ch, ifp);		/* put back lookahead char */
	thefield[i] = 0;

	MungeField(ifp, GetHashTable(thefield), entry);

	ch = safegetc(ifp, "MungeEntry");
    }
}


/* ============================= OUTPUT ============================ */

/* ----------------------------------------------------------------- *\
|  IndexBibFile(FILE *ifp, FILE *ofp, char *filename)
|
|  Index a bibtex file.  Input comes from ifp; output goes to ofp.
|  Filename is the name of the bibliography, with no prefix.
\* ----------------------------------------------------------------- */
void
IndexBibFile(FILE *ifp, FILE *ofp, char *filename)
{
    short count=0;
    long curoffset;
    long *offsets;
    long *oldoff;
    short offsize;
    time_t now = time(0);

    printf("Indexing %s.bib...", filename);
    fflush(stdout);

    offsize = 128;		/* MINIMUM OFFSET LIST SIZE */
    offsets = (long *) malloc(offsize * sizeof(long));

    while (!feof(ifp))
    {
	curoffset = FindNextEntry(ifp);
	if (curoffset == (unsigned long) -1)
	    break;

	offsets[count] = curoffset;
	MungeEntry(ifp, count++);

	if (count == offsize)		/* expand full offset array */
	{
	    oldoff = offsets;
	    offsize *= 2;
	    offsets = (long *) malloc(offsize * sizeof(long));
	    bcopy(oldoff, offsets, count * sizeof(long));
	    free(oldoff);
	}

	if (!(count % 500))		/* feedback */
	{
	    printf("%d...", count);
	    fflush(stdout);
	}
    }

    printf("done.\n");

    fprintf(ofp, "bibindex %d %d %d %s", FILE_VERSION, MAJOR_VERSION,
	    MINOR_VERSION, ctime(&now));

    printf("Writing file offsets...");
    fflush(stdout);
    fwrite((void *) &count, sizeof(short), 1, ofp);
    fwrite((void *) offsets, sizeof(long), count, ofp);
    free(offsets);
    printf("[%d entries]\n", count);

    OutputTables(ofp);

    printf("All done!\n");
}

/* ----------------------------------------------------------------- *\
|  The main program
\* ----------------------------------------------------------------- */

int main(int argc, char **argv)
{
    FILE *ifp;
    FILE *ofp;
    char infile[256];
    char outfile[256];
    char *p;
    int i;

#if DEBUG_MALLOC
    malloc_debug(2);
#endif /* DEBUG_MALLOC */

    if (argc < 2)
	die("Usage: bibindex bib [-i field...]", "");

    if (((p = strrchr(argv[1],'.')) != (char*)NULL) &&
	(strcmp(p, ".bib") == 0))
	*p = '\0';			/* remove any .bib extension */

    sprintf(infile, "%s.bib", argv[1]);
    sprintf(outfile, "%s.bix", argv[1]);

    ifp = fopen(infile, "r");
    if (!ifp)
	die("Can't read", infile);

    ofp = fopen(outfile, "w");
    if (!ofp)
	die("Can't write", outfile);

    InitTables();

    if ((argc > 2) && (!strcmp(argv[2], "-i")))
	for (i=3; i<argc; i++)
	    InitBlackHole(argv[i]);

    IndexBibFile(ifp, ofp, argv[1]);

    FreeTables();
    fclose(ifp);
    fclose(ofp);

    exit(0);		/* Argh! */
    return (0);		/* keep compilers happy */
}
