.\" ====================================================================
.\"  @Troff-man-file{
.\"     author          = "Nelson H. F. Beebe",
.\"     version         = "0.08",
.\"     date            = "16 January 1999",
.\"     time            = "15:05:06 MST",
.\"     filename        = "bibjoin.man",
.\"     address         = "Center for Scientific Computing
.\"                        University of Utah
.\"                        Department of Mathematics, 322 INSCC
.\"                        155 S 1400 E RM 233
.\"                        Salt Lake City, UT 84112-0090
.\"                        USA",
.\"     telephone       = "+1 801 581 5254",
.\"     FAX             = "+1 801 585 1640, +1 801 581 4148",
.\"     checksum        = "44328 417 1877 13139",
.\"     email           = "beebe@math.utah.edu, beebe@acm.org,
.\"                        beebe@ieee.org (Internet)",
.\"     codetable       = "ISO/ASCII",
.\"     keywords        = "bibliography, BibTeX, ordering",
.\"     supported       = "yes",
.\"     docstring       = "This file contains the UNIX manual pages
.\"                        for the bibjoin utility, a program for
.\"                        joining duplicate or similar entries in
.\"                        BibTeX bibliography files.
.\"
.\"                        The checksum field above contains a CRC-16
.\"                        checksum as the first value, followed by the
.\"                        equivalent of the standard UNIX wc (word
.\"                        count) utility output of lines, words, and
.\"                        characters.  This is produced by Robert
.\"                        Solovay's checksum utility.",
.\"  }
.\" ====================================================================
.if t .ds Bi B\s-2IB\s+2T\\h'-0.1667m'\\v'0.20v'E\\v'-0.20v'\\h'-0.125m'X
.if n .ds Bi BibTeX
.if t .ds Te T\\h'-0.1667m'\\v'0.20v'E\\v'-0.20v'\\h'-0.125m'X
.if n .ds Te TeX
.TH BIBJOIN 1 "16 January 1999" "Version 0.08"
.\"======================================================================
.SH NAME
bibjoin \- join duplicate or similar entries in a BibTeX bibliography file
.\"======================================================================
.SH SYNOPSIS
.B bibjoin
.RB [ \-author ]
.RB [ \-check-missing ]
.RB [ \-copyleft ]
.RB [ \-copyright ]
.RB [ \-ignore-characters
.IR regexp ]
.RB [ \-keep-duplicate-values ]
.RB [ \-version ]
[BibTeXfile(s) or < infile] > outfile
.\"======================================================================
.SH DESCRIPTION
.B bibjoin
filters one or more \*(Bi\& bibliographies, or
bibliography fragments, from the specified files,
or from its standard input if no filenames are
provided, printing on standard output a
bibliography in which adjacent duplicate, or
similar, entries have been joined into one entry.
Such action may be necessary when bibliography
entries are collected from many sources.
.PP
.B bibjoin
should be applied to a bibliography file only
after entries have been suitably ordered so that
candidates for joining appear consecutively.  This
can be done mostly automatically if standardized
citation labels are first generated, perhaps by
.BR biblabel (1)
and
.BR citesub (1),
or by the GNU
.BR emacs (1)
.I bibtex-insert-standard-BibNet-citation-label
function from the
.I bibtools
library, and the bibliography is then sorted by
citation label, for example with
.BR bibsort (1).
.PP
Only a human reader can reliably decide when two
bibliography entries are truly the same.
.B bibjoin
can help automate much of this work, but manual
editing will almost certainly still be necessary.
Two entries are joined only if all of these
conditions are satisfied:
.RS
.TP \w'\(bu'u+1n
\(bu
identical citation labels;
.TP
\(bu
identical year;
.TP
\(bu
if CODENs are given in both entries, the
CODEN lists must be identical;
.TP
\(bu
if ISBNs are given in both entries, the
ISBN lists must be identical;
.TP
\(bu
if ISSNs are given in both entries, the
ISSN lists must be identical;
.TP
\(bu
if a journal article entry, identical volume, and
if both have page numbers, identical initial page
numbers.
.RE
.PP
An empty value, or a value containing only spaces
and/or question marks, is equivalent to an omitted
value for the purposes of these comparisons.  The
reason for this choice is that question marks have
proved to be useful indicators of
.I unknown
values, distinguished from
.I omitted
values.
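.PP
The joining conditions above can be sketched in
Python as follows.  This is an illustration only,
not the
.B bibjoin
implementation: the representation of an entry as
a dictionary, and the names
.IR known ,
.IR initial_page ,
and
.IR may_join ,
are assumptions made for the example.  The sketch
also assumes that a year or volume given in only
one entry counts as a mismatch, and that the
article test applies only when both entries are
articles; the text above does not settle either
point.
.PP
.nf
```python
# Illustrative sketch only; the entry layout and all names are assumptions.

def known(value):
    """Return the stripped value, or None if it is absent or unknown."""
    if value is None or value.strip(" ?") == "":
        return None
    return value.strip()


def initial_page(pages):
    """Return the text before the first hyphen of a pages value."""
    return pages.split("-")[0].strip()


def may_join(a, b):
    """Apply the listed conditions to two entries held as dictionaries."""
    if a["label"] != b["label"]:
        return False
    if known(a.get("year")) != known(b.get("year")):
        return False
    for key in ("coden", "isbn", "issn"):
        x, y = known(a.get(key)), known(b.get(key))
        if x and y and x != y:          # compared only when both are given
            return False
    if a["type"] == "article" and b["type"] == "article":
        if known(a.get("volume")) != known(b.get("volume")):
            return False
        p, q = known(a.get("pages")), known(b.get("pages"))
        if p and q and initial_page(p) != initial_page(q):
            return False
    return True
```
.fi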
.PP
When two `equal' value strings are found for the
same key, one of them is normally deleted.
If the values are not `equal', both key/value
pairs are output, and manual editing will then be
required to choose between them.
.PP
Special handling is supplied for `author' and
`editor' fields.  When a personal name appears in
two forms, one with initials, and one without,
such as `P. D. Q. Bach' and `Philippe D. Q. Bach',
the names are considered to match, and the longer
form is retained.  In addition, to deal with the
UnCover database practice of omitting authors 3,
4, ..., N-1, two author/editor personal name lists
are considered to match if one has 3 names and the
other more than 3, and the first, second, and last
match as above; the longer form is retained.
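.PP
The name-matching rules above can be sketched in
Python as follows.  This is a guess at one
reasonable reading, not the
.B bibjoin
implementation: the word-by-word comparison, the
treatment of an initial as a single letter followed
by a period, and the names
.IR words_match ,
.IR names_match ,
and
.I lists_match
are all assumptions made for the example.
.PP
.nf
```python
# Illustrative sketch only; the matching details and names are assumptions.

def words_match(u, v):
    """True if two name words are equal, or one is an initial of the other."""
    u, v = u.lower(), v.lower()
    if u == v:
        return True
    short, full = sorted((u, v), key=len)
    return len(short) == 2 and short[1] == "." and full.startswith(short[0])


def names_match(a, b):
    """Compare two personal names word by word."""
    x, y = a.split(), b.split()
    return len(x) == len(y) and all(words_match(u, v) for u, v in zip(x, y))


def lists_match(a, b):
    """Compare two lists of names, allowing the three-name abbreviation."""
    if len(a) == len(b):
        return all(names_match(u, v) for u, v in zip(a, b))
    short, full = sorted((a, b), key=len)
    if len(short) != 3:
        return False
    # Compare against the first, second, and last names of the longer list.
    pairs = zip(short, (full[0], full[1], full[-1]))
    return all(names_match(u, v) for u, v in pairs)
```
.fi
.PP
When a match is found, the longer name or list
would be the one kept, for example by selecting
with
.IR "max(a, b, key=len)" .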
.PP
Special handling is supplied for `bibdate' fields,
provided they are in either of the forms
.RS
.nf
Wed Jul 6 15:27:50 1994
Wed Jul 6 15:27:50 MDT 1994
.fi
.RE
If either of the values is unrecognized, then
separate key/value pairs are preserved.
Otherwise, only the more recent of the two dates
is kept.
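.PP
The `bibdate' rule above can be sketched in Python
as follows.  This is an illustration only: the
names
.I parse_bibdate
and
.I join_bibdates
are assumptions, the time-zone name is simply
discarded rather than interpreted, and the day and
month names are assumed to be the English
abbreviations shown above.
.PP
.nf
```python
# Illustrative sketch only; names are assumptions, time zones are ignored.
from datetime import datetime


def parse_bibdate(value):
    """Parse either documented bibdate form; return None if unrecognized."""
    fields = value.split()
    if len(fields) == 6:            # drop the time-zone name, e.g. MDT
        del fields[4]
    try:
        return datetime.strptime(" ".join(fields), "%a %b %d %H:%M:%S %Y")
    except ValueError:
        return None


def join_bibdates(a, b):
    """Return the list of bibdate values to keep after a join."""
    x, y = parse_bibdate(a), parse_bibdate(b)
    if x is None or y is None:
        return [a, b]               # unrecognized: preserve both values
    return [a] if x >= y else [b]   # otherwise keep the more recent one
```
.fi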
.PP
Special handling is supplied for `pages' entries.
If entries are found with identical initial page
numbers, but some of them have question marks in
place of the final page number, or no final page
number at all (for example, "123--??" and "123"
alongside "123--127"), then the ones with the
question marks or no final page number are
dropped.  This facilitates merging in data from
library databases that do not record final page
numbers.
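.PP
The `pages' rule above can be sketched in Python
as follows.  This is an illustration only: the
names
.I split_pages
and
.I join_pages
are assumptions, and the sketch assumes that page
ranges are written with a doubled hyphen, as in
the examples above.
.PP
.nf
```python
# Illustrative sketch only; names and the "--" range separator are assumptions.

def split_pages(value):
    """Split a pages value into (initial, final); final may be empty."""
    first, _, last = value.partition("--")
    return first.strip(), last.strip(" ?")


def join_pages(values):
    """Drop values lacking a final page when a fuller value exists."""
    complete = {split_pages(v)[0] for v in values if split_pages(v)[1]}
    return [v for v in values
            if split_pages(v)[1] or split_pages(v)[0] not in complete]
```
.fi
.PP
With this sketch, the three example values above
reduce to the single value "123--127", while
values with no complete counterpart are all kept.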
.PP
Value strings are considered equal if they match
after all characters other than letters, digits,
and plus signs are removed, and letter case is
ignored.  (The set of ignored characters can be
redefined via the
.BI \-ignore-characters " regexp"
option described later.)  For `title' entries,
leading words `A', `An', `On', and `The' are
ignored, because some library databases drop them.
Value strings are also considered to match if one
is an exact prefix of the other, because
truncation of author lists and titles is a common
problem in journal databases.  This fuzzy equality
helps to eliminate many match failures that arise
from minor variations in punctuation, spacing, and
capitalization.
.B bibjoin
has no way of determining which of the two strings
should be preserved, so it uniformly discards the
shorter one (which presumably has less
`information'): this choice will frequently be
.IR wrong !
The shorter string will be preserved if the
.B \-keep-duplicate-values
option described later is used.
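.PP
The fuzzy comparison above can be sketched in
Python as follows.  This is an illustration only,
not the
.B bibjoin
implementation: the names
.IR normalize ,
.IR fuzzy_equal ,
and
.I survivor
are assumptions, as are the choices to strip only
one leading title word and to treat an empty
normalized string as matching nothing.
.PP
.nf
```python
# Illustrative sketch only; names and edge-case choices are assumptions.
import re

IGNORED = "[^A-Za-z0-9+]"            # documented default for -ignore-characters
LEADING = ("a", "an", "on", "the")   # leading title words that are ignored


def normalize(value, key=""):
    """Reduce a value string to the form used for comparison."""
    words = value.split()
    if key == "title" and words and words[0].lower() in LEADING:
        words = words[1:]            # assumption: only one word is dropped
    return re.sub(IGNORED, "", "".join(words)).lower()


def fuzzy_equal(a, b, key=""):
    """True if the normalized strings are equal or one prefixes the other."""
    x, y = normalize(a, key), normalize(b, key)
    # Assumption: an empty normalized string never counts as a match.
    return bool(x) and bool(y) and (x.startswith(y) or y.startswith(x))


def survivor(a, b):
    """Return the value kept by default: the longer of the two."""
    return max(a, b, key=len)
```
.fi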
.PP
If two
.I title
or
.I booktitle
strings have the same length, and match when
letter case is ignored, then the one with more
capitalized words is saved.  The reason for this
choice is that some library databases arbitrarily
downcase titles, losing information that should be
preserved.
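.PP
The capitalization tie-break above can be sketched
in Python as follows.  This is an illustration
only: the names
.I capitalized_words
and
.I keep_title
are assumptions, and a word is counted as
capitalized only if its first character is an
uppercase letter, so a word that begins with a
brace is not counted.
.PP
.nf
```python
# Illustrative sketch only; names and the counting rule are assumptions.

def capitalized_words(value):
    """Count the words that begin with an uppercase letter."""
    return sum(1 for word in value.split() if word[:1].isupper())


def keep_title(a, b):
    """Of two same-length titles equal apart from case, keep the more capitalized."""
    return a if capitalized_words(a) >= capitalized_words(b) else b
```
.fi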
.PP
Syntax errors in the input stream will cause
abrupt termination with a fatal error message and
a non-zero exit code.  The output will be
incomplete, so you should always examine the
output file before assuming that you can replace
the input file with the output file.
.PP
If the
.B \-keep-duplicate-values
option has been specified, then key/value pairs in
output entries are sorted alphabetically by key
name, so that duplicate keys arising from the join
operation appear consecutively, simplifying the
subsequent manual editing task.  Otherwise, keys
are ordered according to the conventions of
.BR biborder (1).
.PP
After completion of manual corrections,
it is recommended that the bibliography be
processed by
.BR biborder (1)
to standardize key/value order (if the
.B \-keep-duplicate-values
option was used), and to check for any remaining
duplicate keys or citation labels.
.\"======================================================================
.SH OPTIONS
Command-line options may be abbreviated to a
unique leading prefix.  The leading hyphen that
distinguishes an option from a filename may be
doubled, for compatibility with GNU and POSIX
conventions.  Thus,
.B \-author
and
.B \-\-author
are equivalent.
.PP
To avoid confusion with options, if a filename
begins with a hyphen, it must be disguised by a
leading absolute or relative directory path, e.g.
.I /tmp/-foo.bib
or
.IR ./-foo.bib .
.\"-----------------------------------------------
.TP \w'\-ignore-characters-regexp'u+3n
.B \-author
Print author information on
.I stderr
and exit immediately with a successful status code.
.\"-----------------------------------------------
.TP
.B \-check-missing
If this option is specified, missing expected key
fields will be supplied, with the key field name
prefixed with OPT, and the value string set to a
pair of question marks, e.g.
.nf
  OPTvolume =    "??",
.fi
The
.I OPT
prefix ensures that the key is ignored by \*(Bi\&,
so that the question marks will not appear in an
output
.I .bbl
file.  The GNU Emacs
.I bibtex-mode
editing support has functions for removing the OPT
prefixes, and so does
.BR bibclean (1).
.IP
The doubled question marks are distinguished from
single ones that might legitimately appear in
value strings, and also serve as a convenient
regular-expression pattern for
.BR bibextract (1),
allowing easy preparation of a printed listing of
just those entries that have incomplete
bibliographic data:
.nf
.BI "     bibextract" " '' '[?][?]' BibTeXfiles " "|  lpr"
.fi
.\"-----------------------------------------------
.TP
.B \-copyleft
Print copyright information on
.I stderr
and exit immediately with a successful status code.
.\"-----------------------------------------------
.TP
.B \-copyright
Print copyright information on
.I stderr
and exit immediately with a successful status code.
.\"-----------------------------------------------
.TP
.BI \-ignore-characters " regexp"
Specify a regular expression to define the set of
characters to be ignored in value string
comparisons.  The default is
.IR '[^A-Za-z0-9+]' .
.\"-----------------------------------------------
.TP
.B \-keep-duplicate-values
Instead of discarding the shorter of two value
strings that are considered `equal', preserve the
shorter of them using the key suffixed with the
letter `z',
e.g.,
.I title
and
.IR titlez .
If such a key already exists, further `z' letters
are appended until the key is unique.
.\"-----------------------------------------------
.TP
.B \-version
Display the
.B bibjoin
version number and date on
.I stderr
and exit immediately with a successful status code.
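.PP
The key renaming performed under
.B \-keep-duplicate-values
can be sketched in Python as follows.  This is an
illustration only: the name
.I duplicate_key
and the use of a set to hold the keys already
present in an entry are assumptions made for the
example.
.PP
.nf
```python
# Illustrative sketch only; the name and the set argument are assumptions.

def duplicate_key(key, existing):
    """Return key with enough trailing z letters to be absent from existing."""
    candidate = key + "z"
    while candidate in existing:
        candidate += "z"
    return candidate
```
.fi
.PP
For an entry that already has both
.I title
and
.I titlez
keys, this sketch yields
.IR titlezz .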
.\"======================================================================
.SH "WARNING AND ERROR MESSAGES"
.B bibjoin
will issue warning messages in the following
cases:
.TP \w'\(bu'u+1n
\(bu
With
.BR \-check-missing ,
for unrecognized \*(Bi\& entry types.  The entry
will be output without checking for missing key
names.
.TP
\(bu
For duplicate key names.  Such key/value pairs are
sorted together by name, preserving their original
order.
.TP
\(bu
When identical key/value pairs are reduced to a
single pair by discarding duplicates.
.PP
.B bibjoin
will issue an error message and terminate with
exit code 1, leaving
.IR "incomplete output" ,
in the following cases:
.TP \w'\(bu'u+1n
\(bu
when a command-line argument is unrecognized (only
the minimal unique prefix of each option is
currently examined);
.TP
\(bu
when end-of-file is reached while collecting an
entry or value;
.TP
\(bu
when a line beginning with `@' is encountered
while collecting an entry, before balanced braces
have been found.
.\"======================================================================
.SH CAVEATS
\*(Bi\& accepts a looser syntax than the current
simple implementation of
.B bibjoin
supports.  In particular, outer
parentheses may
.I not
be used in place of braces following ``@keyword''
patterns.  If you have such a file, you can use
.BR bibclean (1)
to prettyprint it into a form that
.B bibjoin
can handle successfully.
.\"======================================================================
.SH "SEE ALSO"
.BR bibcheck (1),
.BR bibclean (1),
.BR bibdup (1),
.BR bibextract (1),
.BR biblabel (1),
.BR biblex (1),
.BR biborder (1),
.BR bibparse (1),
.BR bibsearch (1),
.BR bibsort (1),
.BR bibtex (1),
.BR bibunlex (1),
.BR citesub (1),
.BR emacs (1).
.\"======================================================================
.SH AUTHOR
.nf
Nelson H. F. Beebe, Ph.D.
Center for Scientific Computing
University of Utah
Department of Mathematics, 322 INSCC
155 S 1400 E RM 233
Salt Lake City, UT 84112-0090
Tel: +1 801 581 5254
FAX: +1 801 585 1640, +1 801 581 4148
Email: <beebe@math.utah.edu>, <beebe@acm.org>, <beebe@ieee.org>
WWW URL: http://www.math.utah.edu/~beebe
.fi
.\"==============================[The End]==============================
.\" This is for GNU Emacs file-specific customization:
.\" Local Variables:
.\" fill-column: 50
.\" End:
