bibjoin should be applied to a bibliography file only after its entries have been ordered so that candidates for joining appear consecutively. This can be done mostly automatically: first generate standardized citation labels, perhaps with biblabel(1) and citesub(1), or with the GNU emacs(1) bibtex-insert-standard-BibNet-citation-label function from the bibtools library, and then sort the bibliography by citation label, for example with bibsort(1).
Only a human reader can reliably decide when two bibliography entries are truly the same. bibjoin can help automate much of this work, but manual editing will almost certainly still be necessary. For two entries to be joined, all of these conditions must be satisfied:
- identical citation labels;
- identical year;
- if CODENs are given in both entries, the CODEN lists must be identical;
- if ISBNs are given in both entries, the ISBN lists must be identical;
- if ISSNs are given in both entries, the ISSN lists must be identical;
- for journal article entries, identical volume and, if both entries have page numbers, identical initial page numbers.
An empty value, or a value containing only space and/or question marks, is equivalent to an omitted value for the purposes of these comparisons. The reason for this choice is that question marks have proved to be useful indicators of unknown values, distinguished from omitted values.
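The join conditions above, together with the rule for empty and question-mark values, can be sketched in Python as follows. This is only an illustration, not bibjoin's actual code: the dictionary layout, the key names (label, type, year, and so on), and the function names are all assumptions made for the example, and CODEN, ISBN, and ISSN lists are simply compared as whole strings.

```python
import re

def known(value):
    """Return the stripped value, or None if it is omitted, empty,
    or contains only spaces and question marks."""
    if value is None or re.fullmatch(r"[\s?]*", value):
        return None
    return value.strip()

def same_if_both_given(a, b):
    """True unless both values are known and differ."""
    a, b = known(a), known(b)
    return a is None or b is None or a == b

def initial_page(pages):
    """'123--127' -> '123'; unknown page values -> None."""
    pages = known(pages)
    return None if pages is None else re.split(r"-+", pages)[0].strip()

def may_join(e1, e2):
    # Entries are hypothetical dicts such as
    # {"type": "article", "label": "...", "year": "1994", ...}.
    if e1["label"] != e2["label"]:
        return False
    if known(e1.get("year")) != known(e2.get("year")):
        return False
    for key in ("coden", "isbn", "issn"):
        if not same_if_both_given(e1.get(key), e2.get(key)):
            return False
    if e1.get("type") == "article" and e2.get("type") == "article":
        if known(e1.get("volume")) != known(e2.get("volume")):
            return False
        if not same_if_both_given(initial_page(e1.get("pages")),
                                  initial_page(e2.get("pages"))):
            return False
    return True
```

Treating two unknown years (or volumes) as identical is a choice of this sketch; the text above does not say how bibjoin handles that case.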
When two `equal' value strings are found for the same key, one of them is normally deleted. Otherwise, both key/value pairs are output. Manual editing will then be required to choose between them.
Special handling is supplied for `author' and `editor' fields. When a personal name appears in two forms, one with initials, and one without, such as `P. D. Q. Bach' and `Philippe D. Q. Bach', the names are considered to match, and the longer form is retained. In addition, to deal with the UnCover database practice of omitting authors 3, 4, ..., N-1, two author/editor personal name lists are considered to match if one has 3 names and the other more than 3, and the first, second, and last match as above; the longer form is retained.
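One way to picture the name matching just described is the following Python sketch. It is an approximation under stated assumptions, not the program's real algorithm: names are assumed to be written in "First Middle Last" order, name lists are assumed to be separated by " and " as in BibTeX, and the function names are hypothetical.

```python
def is_initial(token):
    """True for a one-letter name part such as 'P.' or 'P'."""
    return len(token.rstrip(".")) == 1

def names_match(name1, name2):
    """Match name parts pairwise; an initial matches any part
    that starts with the same letter."""
    parts1, parts2 = name1.split(), name2.split()
    if len(parts1) != len(parts2):
        return False
    for a, b in zip(parts1, parts2):
        if is_initial(a) or is_initial(b):
            if a[0].lower() != b[0].lower():
                return False
        elif a.lower() != b.lower():
            return False
    return True

def lists_match(list1, list2):
    """Match two author/editor lists, allowing a 3-name list to stand
    for a longer one whose first, second, and last names agree."""
    a = [n.strip() for n in list1.split(" and ")]
    b = [n.strip() for n in list2.split(" and ")]
    if len(a) == len(b):
        return all(names_match(x, y) for x, y in zip(a, b))
    short, long_ = sorted((a, b), key=len)
    return (len(short) == 3 and len(long_) > 3
            and names_match(short[0], long_[0])
            and names_match(short[1], long_[1])
            and names_match(short[2], long_[-1]))

def retained(value1, value2):
    """Keep the longer of two matching forms."""
    return value1 if len(value1) >= len(value2) else value2
```

For example, names_match("P. D. Q. Bach", "Philippe D. Q. Bach") is true, and retained() then returns the form with the full first name.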
Special handling is supplied for `bibdate' fields, provided they are in either of the forms
Wed Jul 6 15:27:50 1994
Wed Jul 6 15:27:50 MDT 1994
If either of the values is unrecognized, then separate key/value pairs are preserved. Otherwise, only the more recent of the two dates is kept.
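The bibdate rule can be illustrated with a short Python sketch. It assumes that the time zone name in the second form is simply ignored when comparing dates, and that the English day and month names are parsed in the C locale; both are simplifications of this example, and the function names are hypothetical.

```python
from datetime import datetime

def parse_bibdate(value):
    """Parse either recognized bibdate form; return None otherwise."""
    parts = value.split()
    if len(parts) == 6:
        del parts[4]          # drop the time zone name (assumption)
    if len(parts) != 5:
        return None
    try:
        return datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")
    except ValueError:
        return None

def join_bibdates(value1, value2):
    """Return the list of bibdate values to keep."""
    d1, d2 = parse_bibdate(value1), parse_bibdate(value2)
    if d1 is None or d2 is None:
        return [value1, value2]   # unrecognized: preserve both
    return [value1 if d1 >= d2 else value2]
```

A call such as join_bibdates("Wed Jul 6 15:27:50 1994", "Thu Jul 7 09:00:00 MDT 1994") keeps only the second, more recent value.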
Special handling is supplied for `pages' entries. If entries are found with identical initial page numbers, but one of them has question marks in place of the final page number, or has no final page number at all, such as "123--127", "123--??", and "123", then the ones with the question marks or no final page numbers will be dropped. This facilitates merging in data from library databases that do not record final page numbers.
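The pages rule might be sketched in Python like this. The sketch assumes page ranges are written with a double hyphen, as in the examples above; the function names are hypothetical and the real program may recognize more variants.

```python
def split_pages(value):
    """Return (initial_page, final_page); final_page is None when it
    is missing or consists only of question marks."""
    first, _, last = value.partition("--")
    last = last.strip()
    if last == "" or set(last) <= {"?"}:
        last = None
    return first.strip(), last

def drop_incomplete_pages(values):
    """Drop values lacking a final page when another value with the
    same initial page supplies one."""
    parsed = [split_pages(v) for v in values]
    complete_firsts = {first for first, last in parsed if last is not None}
    return [v for v, (first, last) in zip(values, parsed)
            if last is not None or first not in complete_firsts]
```

With the three example values "123--127", "123--??", and "123", only "123--127" survives.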
Value strings are considered equal if they match after all characters other than letters, digits, and plus are removed, and letter case is ignored. (The default set of retained characters can be redefined via the -ignore-characters regexp option described later.) For `title' entries, leading words `A', `An', `On', and `The' are ignored, because some library databases drop them. Value strings are also considered to match if one is an exact prefix of the other, because truncation of author lists and titles is a common problem in journal databases. This fuzzy equality helps to eliminate many match failures that arise from minor variations in punctuation, spacing, and capitalization. bibjoin has no way of determining which of the two strings should be preserved, so it uniformly discards the shorter one (which presumably has less `information'): this choice will frequently be wrong! The shorter string will be preserved if the -keep-duplicate-values option described later is used.
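The fuzzy equality test can be approximated by the Python sketch below. It uses the default retained character set (letters, digits, and plus) and does not model the -ignore-characters option. Stripping only a single leading title word, doing so without regard to case, and refusing to treat an empty reduced string as a prefix are assumptions of this sketch; the function names are hypothetical.

```python
import re

def squeeze(value, key=None):
    """Reduce a value to lowercase letters, digits, and plus signs."""
    if key == "title":
        # Ignore one leading A, An, On, or The (assumption: one word only).
        value = re.sub(r"^\s*(A|An|On|The)\s+", "", value, flags=re.IGNORECASE)
    return re.sub(r"[^A-Za-z0-9+]", "", value).lower()

def fuzzy_equal(value1, value2, key=None):
    """True if the reduced strings are equal or one is a prefix of the other."""
    s1, s2 = squeeze(value1, key), squeeze(value2, key)
    if not s1 or not s2:
        return s1 == s2
    return s1.startswith(s2) or s2.startswith(s1)

def survivor(value1, value2):
    """Keep the longer string, discarding the shorter one."""
    return value1 if len(value1) >= len(value2) else value2
```

Because plus signs are retained, "C++ Primer" and "C Primer" do not compare equal, while two values that differ only in punctuation, spacing, or case do.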
If two title or booktitle strings have the same length, and match when letter case is ignored, then the one with more capitalized words is saved. The reason for this choice is that some library databases arbitrarily downcase titles, losing information that should be preserved.
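As a rough illustration of this capitalization rule, the following Python sketch counts words that begin with an uppercase letter. How bibjoin actually counts capitalized words is not stated here, so the counting method, the tie-breaking in favor of the first string, and the function names are assumptions.

```python
def capitalized_words(title):
    """Count the words that start with an uppercase letter."""
    return sum(1 for word in title.split() if word[:1].isupper())

def pick_title(title1, title2):
    """For equal-length titles that match apart from letter case,
    return the one with more capitalized words; otherwise None."""
    if len(title1) == len(title2) and title1.lower() == title2.lower():
        if capitalized_words(title1) >= capitalized_words(title2):
            return title1
        return title2
    return None
```

Given "The art of computer programming" and "The Art of Computer Programming", the second form is the one saved.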
Syntax errors in the input stream will cause abrupt termination with a fatal error message and a non-zero exit code. The output will be incomplete, so you should always examine the output file before assuming that you can replace the input file with the output file.
If the -keep-duplicate-values option has been specified, then key/value pairs in output entries are sorted alphabetically by key name, so that duplicate keys arising from the join operation appear consecutively, simplifying the subsequent manual editing task. Otherwise, keys are ordered according to the conventions of biborder(1).
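The effect of alphabetical key ordering on duplicate keys can be seen in a tiny Python sketch. The list-of-pairs representation and the case-insensitive comparison are assumptions made only for this illustration.

```python
def sort_pairs(pairs):
    """Sort (key, value) pairs by key name; Python's sort is stable,
    so duplicate keys keep their relative order and end up adjacent."""
    return sorted(pairs, key=lambda pair: pair[0].lower())

entry = [("title", "X"), ("author", "A"), ("title", "Y"), ("year", "1994")]
ordered = sort_pairs(entry)
```

After sorting, the two title pairs sit next to each other, which is what makes the later manual choice between them easier.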
After completion of manual corrections, it is recommended that the bibliography be processed by biborder(1) to standardize key/value order (if the -keep-duplicate-values option was used), and to check for any remaining duplicate keys or citation labels.
To avoid confusion with options, if a filename begins with a hyphen, it must be disguised by a leading absolute or relative directory path, e.g. /tmp/-foo.bib or ./-foo.bib.
OPTvolume = "??",
The OPT prefix ensures that the key is ignored by BibTeX, so that the question marks will not appear in an output .bbl file. The GNU Emacs bibtex-mode editing support has functions for removing the OPT prefixes, and so does bibclean(1).
The doubled question marks are distinguished from single ones that might legitimately appear in value strings, and also serve as a convenient regular-expression pattern for bibextract(1), allowing easy preparation of a printed listing of just those entries that have incomplete bibliographic data:
bibextract '' '[?][?]' BibTeXfiles | lpr
Nelson H. F. Beebe, Ph.D.
Center for Scientific Computing
Department of Mathematics
University of Utah
Salt Lake City, UT 84112
Tel: +1 801 581 5254
FAX: +1 801 581 4148
Email: <email@example.com>
WWW URL: http://www.math.utah.edu/~beebe